TwitWords (Scraping data using the Twitter API and Python)

Last update: 6/20/2018

This is an ongoing project. It starts from scratch at the top of this post, and the progress is documented chronologically. To view the most recent GitHub version of the project, click here.

I figured it would be a good idea to showcase some coding ability on this website as well. Because web scraping has been my most recent endeavor, I've decided to carry that over into my first project. Currently, I'm thinking of a script that pulls Trump's tweets for the day and stores each word and how many times it was used, then adds that to a running total of all the words he has used. This data could be an interesting way to see what the POTUS's most popular vocabulary is!

This is what I've started so far. I've decided on using the Tweepy library (Docs link) to interact with Twitter's API. The script currently logs into the Twitter API and pulls the POTUS's 10 most recent tweets, then splits each tweet into a list of strings. I plan to count each word in each tweet to get a running total of words used.
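Here's roughly what that looks like so far. This is a minimal sketch rather than my exact script: the credentials are placeholders, and the Tweepy calls (OAuthHandler, user_timeline) are just the standard way the library handles login and timeline pulls.

```python
import tweepy

# Placeholder credentials from a Twitter developer account.
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_SECRET = "..."

# Log in to the Twitter API via OAuth.
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth)

# Pull the 10 most recent tweets and split each one into a list of words.
tweets = api.user_timeline(screen_name="realDonaldTrump", count=10)
tweet_words = [tweet.text.split() for tweet in tweets]
```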

It’s a start to the project, but a long way to go.

Collections Counter Library

I forgot I’m working nights this week, so I continued working on the program. 

After looking around on Google I came across the built-in collections module, which provides "Counter" objects. The module also supports adding new Counters to old Counters... so it looks like most of the hard work is done for me. I assume the library compares each string in the list against the others and tallies up the duplicates, but I did not delve much into its documentation. It can be viewed here.

Cool. So here I've loaded up Trump's 10 most recent tweets and scanned each tweet's contents for individual words and the frequency at which each word appears. As seen in the picture below to the right, there are a TON of uninteresting words in the Counter object being printed.
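In sketch form, the counting works something like this (building on the tweet_words list from the snippet above):

```python
from collections import Counter

# Running total of word usage across all tweets.
total_counts = Counter()

for words in tweet_words:
    # Count one tweet's words, then fold it into the running total;
    # Counter supports "+" between two Counters.
    total_counts = total_counts + Counter(words)

# Mostly uninteresting words at this stage.
print(total_counts.most_common(20))
```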

I think the next step would be to create a list of pronouns and 50 of the most common English prepositions, and work on a way to exclude them from being passed into the Counter object. This would leave me with some more interesting data, such as the words underlined in red.

Removing unwanted data

The next day I began wrestling with whether it was necessary to strip the special characters out of the words. I opted not to, because it kept causing me issues with the Counter objects. I think it has something to do with how collections stores its keys; stripping the special characters seemed to break the way the Counter delimits words.

I ultimately scrapped doing that at this stage because of the issue. It was a preemptive measure anyway, and it doesn't appear to have much of an effect on the data I'm seeing.

Anyway, to end up with some mildly interesting data I needed to remove the uninteresting words from his tweets. I googled the 70 most common English prepositions, made a list, and added a few pronouns and other non-essential words.

So to conclude, all I had to do was loop through each element in the Counter object that contained the words from the POTUS's tweets, and if any of those words were in my omitList, simply delete that word from the Counter.
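Something like this, where the omitList shown here is just a shortened stand-in for my real 70-plus-word list:

```python
# Shortened stand-in for the full list of prepositions and pronouns.
omitList = ["the", "of", "to", "in", "for", "on", "with", "at", "by",
            "a", "an", "and", "i", "you", "he", "she", "it", "we", "they"]

# Iterate over a copy of the keys (list(...)) so we can delete entries
# from the Counter mid-loop without a RuntimeError.
for word in list(total_counts):
    if word.lower() in omitList:
        del total_counts[word]
```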

Re-writing

After working through that, I decided to explore what other options were available with built-in natural language processing. Through that search I came across a really neat library called NLTK. After doing a few of the example projects using NLTK, I decided to utilize it in my project. This simplified getting the data I was already getting, thanks to NLTK's built-in functions (I no longer need to create my own omitList; I can use, and append to, the stopword list that ships with NLTK). I cleaned up my code, began breaking everything into functions, and also added a way to keep track of which tweets have already been parsed. The result is pretty similar to the data from the program I had been working on (as seen above), but its readability has increased greatly and its possible capabilities are much more exciting.
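The stopword swap looks roughly like this (a sketch, not my exact code; the extra words I append are just examples of Twitter noise I'd want to drop):

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords

# The stopword corpus only needs to be downloaded once.
nltk.download("stopwords")

# Start from NLTK's built-in English stopword list and append to it,
# instead of maintaining a hand-made omitList.
stop_words = set(stopwords.words("english"))
stop_words.update(["amp", "rt", "https"])  # example Twitter noise words

filtered_counts = Counter(
    {word: n for word, n in total_counts.items()
     if word.lower() not in stop_words}
)
```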

Re-writing again, with graphing functionality

Setting Goals

I've mashed together some code snippets and made the necessary configuration changes to serve this information with the Dash web framework at app.ryanpskiles.com. It works as a rudimentary demo, but I quickly realized that without a particular goal and plan my code was becoming difficult to understand. So I spent some time laying out a path to "completing" the web app I had set out to develop.

Read tweets of a user-input handle (input from a text field in Dash)
– return a string variable holding all the tweets as one value

Read the string of tweets with NLTK, sorting by simple part-of-speech tags
– return a dictionary of dictionaries: {key: POS type, value: {key: word, value: frequency}}

Graph with Dash based on radio-button user input (single graph at a time for now; a rough sketch of these last two goals follows this list)
– read the radio-button input to determine which word/frequency dictionary to grab from the dictionary of POS types
– graph using Dash
  title = POS type for (user input twitter handle)
  x axis = words
  y axis = frequency
– refresh the page content when the radio button is switched
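Here's a rough sketch of where those last two goals are headed. The component IDs and function names are just illustrative, get_all_tweets_as_string is a placeholder stub for the Tweepy fetch from the first goal, and NLTK's "universal" tagset is one simple way to get coarse POS tags:

```python
from collections import Counter, defaultdict

import dash
import dash_core_components as dcc
import dash_html_components as html
import nltk
from dash.dependencies import Input, Output

# One-time NLTK resource downloads.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("universal_tagset")

def get_all_tweets_as_string(handle):
    # Placeholder stub: the real version would join the handle's
    # tweets (pulled with Tweepy) into one big string.
    return "sample tweet text goes here"

def pos_frequencies(all_tweets_text):
    """Return {POS type: {word: frequency}} for one string of tweets."""
    tokens = nltk.word_tokenize(all_tweets_text)
    # The "universal" tagset gives simple tags like NOUN, VERB, ADJ.
    tagged = nltk.pos_tag(tokens, tagset="universal")
    pos_dict = defaultdict(Counter)
    for word, tag in tagged:
        pos_dict[tag][word.lower()] += 1
    return pos_dict

pos_dict = pos_frequencies(get_all_tweets_as_string("some_handle"))

app = dash.Dash(__name__)
app.layout = html.Div([
    dcc.RadioItems(
        id="pos-choice",
        options=[{"label": tag, "value": tag} for tag in sorted(pos_dict)],
        value="NOUN",
    ),
    dcc.Graph(id="word-graph"),
])

# Redraw the bar chart whenever the radio button changes.
@app.callback(Output("word-graph", "figure"), [Input("pos-choice", "value")])
def update_graph(pos_type):
    words, freqs = zip(*pos_dict[pos_type].most_common(25))
    return {
        "data": [{"type": "bar", "x": list(words), "y": list(freqs)}],
        "layout": {
            "title": "%s for (user input twitter handle)" % pos_type,
            "xaxis": {"title": "words"},
            "yaxis": {"title": "frequency"},
        },
    }

if __name__ == "__main__":
    app.run_server(debug=True)
```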
