Introduction

Devin has ~7,000 tweets from 2013 to 2018 (6 years). This project analyzes those tweets after removing retweets and replies. Each tweet is split into a bag of words, and a term frequency analysis is performed to find Devin's top words. The same analysis is performed on 23 other Twitter users, for a total of 50,000 analyzed tweets. We trained a model to predict which user Devin tweets most similarly to, in order to test whether that similarity can predict related goals, aspirations, interests, etc. We also built a digram Markov chain from Devin's tweets in order to generate predicted future tweets.

  • Twitter
  • Markov
  • Python

Problem Statement and Motivation

We wanted to see whether it was possible to detect similar people based on their tweets. We also wanted to see whether it was possible to accurately guess someone's online personality from their IRL personality.
We thought it would be possible to do so, and we hypothesized that Devin would tweet most similarly to Chrissy Teigen and Typical Girl.

Data Exploration / Cleaning and Machine Learning

We explored different ways to analyze all 50,000 tweets. Ultimately, we used unigrams, bigrams, and a combination of both to detect similarities in language. We found that they produced very different results. Our bigram graph had only 8 comparisons in total, suggesting that Devin's tweets had no similarity with the other 15 Twitter users, which is different from what the unigram graph produced. Our unigram and bigram combination had more comparisons, which was expected.
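As a rough illustration of the unigram, bigram, and combined counts described above, here is a minimal sketch using only the Python standard library. The cleaning rules (lowercasing, dropping URLs and @mentions) and the names tokenize, ngram_counts, and combined_counts are assumptions made for this example, not the project's actual code.

```python
import re
from collections import Counter

def tokenize(tweet):
    """Lowercase a tweet, drop URLs and @mentions, and split it into words.

    These cleaning rules are an assumption for illustration only.
    """
    text = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return re.findall(r"[a-z0-9']+", text)

def ngram_counts(tweets, n):
    """Count every run of n consecutive words across a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        # zip over n shifted copies of the token list yields the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

def combined_counts(tweets):
    """Unigram and bigram counts merged into one term-frequency table."""
    return ngram_counts(tweets, 1) + ngram_counts(tweets, 2)

# Made-up sample tweets, not taken from the dataset
sample = [
    "i love my dog",
    "i love pizza https://t.co/x",
    "@friend my dog loves pizza",
]
unigrams = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

Bigrams are much sparser than unigrams (two users must share an exact two-word sequence to overlap), which is one plausible reason the bigram graph showed far fewer comparisons.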
For extraction, we used Tweepy, a Python library for Twitter's API, to pull the tweets. After extracting and cleaning the tweets, we ran k-nearest neighbors with an optimal k of 9 to find Devin's most similar tweeters.
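The k-nearest neighbors step could look something like the sketch below. It assumes each tweet is a bag-of-words vector, that distance is measured by cosine similarity, and that each of Devin's tweets "votes" for the user who owns most of its k nearest neighbors. The writeup does not spell out any of those details, so treat them (and the names cosine, knn_predict, most_similar_users, user_a, user_b) as assumptions. The toy data uses k=3 because it has only six labeled tweets; the project reports k=9.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(query, labeled, k=9):
    """Return the majority user among the k labeled tweets most similar to query.

    labeled is a list of (bag_of_words, user) pairs.
    """
    ranked = sorted(labeled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(user for _, user in ranked[:k])
    return votes.most_common(1)[0][0]

def most_similar_users(query_tweets, labeled, k=9):
    """Tally, across all query tweets, which user each one was assigned to."""
    tally = Counter(knn_predict(q, labeled, k) for q in query_tweets)
    return tally.most_common()

def bag(text):
    """Whitespace-split bag of words (hypothetical helper for the toy data)."""
    return Counter(text.split())

# Made-up labeled tweets from two hypothetical users
labeled = [
    (bag("love my dog so much"), "user_a"),
    (bag("my dog is cute"), "user_a"),
    (bag("dog park today"), "user_a"),
    (bag("new album out now"), "user_b"),
    (bag("tour tickets on sale"), "user_b"),
    (bag("album tour dates"), "user_b"),
]
queries = [bag("my dog loves the park"), bag("cute dog"), bag("album out on tour")]
print(most_similar_users(queries, labeled, k=3))
```

Ranking users by how many of the query tweets land on them gives an ordered list of "most similar tweeters" rather than a single label.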
For the visuals, we used the Google Visualization API to format and graph the data.


Tweet Language Comparison - Unigram



Tweet Language Comparison - Bigram



Tweet Language Comparison - Unigram and Bigram



Tweet Language 2013-2018


Markov Chain

To predict some tweets that Devin may make in the future, we built a digram Markov chain from thousands of Devin's tweets.
Here are some of our favorites.

  • Romance

    "of course right when i saw u hitting on a final to pass the class instead of studying ur the most beautiful person 2 me~"

  • Water Vehicles

    "idiot if you mix it with water. i tried calling an uber off a yacht"

  • Relatable

    "yo i relate to how k selected organisms are iteroparous really it's fine you can be a blues fan today"

  • Dreamer

    "most people have dreams of them flying or travelling, etc. i dream about the person who set my phone alarm to 4:45 without me noticing... alright that was so trippy"

  • Mortal Kombat > Friendships

    "kinda feels like i'm not ignoring u, i'm just tryna play mortal kombat"

  • Fortune Teller

    "12 hours until i met you. worst decision of my life"

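The generator behind tweets like these can be sketched in a few lines. This version assumes "digram" means the chain's state is a pair of consecutive words, with the next word drawn at random from the words that followed that pair in the source tweets. The function names, the 30-word cap, and the sample tweets are all made up for illustration.

```python
import random
from collections import defaultdict

def build_chain(tweets):
    """Map each pair of consecutive words to the list of words that followed it."""
    chain = defaultdict(list)
    starts = []  # opening word pairs, used to begin generated tweets
    for tweet in tweets:
        words = tweet.split()
        if len(words) < 3:
            continue  # too short to contribute a (pair -> next word) transition
        starts.append((words[0], words[1]))
        for a, b, c in zip(words, words[1:], words[2:]):
            chain[(a, b)].append(c)
    return chain, starts

def generate(chain, starts, max_words=30, rng=random):
    """Walk the chain from a random opening pair until it dead-ends or hits max_words."""
    state = rng.choice(starts)
    out = list(state)
    while len(out) < max_words and state in chain:
        nxt = rng.choice(chain[state])
        out.append(nxt)
        state = (state[1], nxt)
    return " ".join(out)

# Made-up sample tweets, not taken from the dataset
tweets = [
    "i dream about the beach every night",
    "i dream about the person who set my alarm",
    "about the beach house we rented last summer",
]
chain, starts = build_chain(tweets)
print(generate(chain, starts))
```

Because every step only looks at the previous two words, the output can stitch the start of one tweet onto the end of another, which is where the odd but grammatical-sounding combinations come from.
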
Conclusion / Reflection

This project was a lot harder than we thought it would be. Cleaning the data ultimately left much less language to work with (due to images, videos, etc.), making it more difficult to accurately detect similarities between tweets.
Overall, our original hypothesis was incorrect. We initially thought that Devin would tweet similarly to Chrissy Teigen and Typical Girl because of the way she talks, but it appears that her online presence more closely resembles Justin Bieber's. We also found that it is very difficult to determine related goals, aspirations, interests, etc. through Twitter language because many people's online presence is very different from who they are in person.