HW5: Tweet Wrangling (20 Points)

Due Friday 4/3/2020

Overview / Logistics

The purpose of this assignment is to give you practice with Python dictionaries using a very relevant example. You can start with the Twitter.py that we wrote last week and add methods to it. You will be loading in and examining the file trumpSinceElection.dat, which holds a list of Donald Trump's tweets since 2016 in dictionary form.

What to submit: When you are finished, you should submit a file Twitter.py to Canvas with the methods for each task, along with answers to the following as a comment on Canvas:

  • Did you work with a buddy on this assignment? If so, who?
  • Are you using up any grace points to buy lateness days? If so, how many?
  • Approximately how many hours did it take you to finish this assignment? (I will not judge you for this at all. I am simply using it to gauge if the assignments are too easy or hard.)
  • Your overall impression of the assignment. Did you love it, hate it, or were you neutral? One-word answers are fine, but if you have any suggestions for the future, let me know.
  • Any other concerns that you have. For instance, if you have a bug that you were unable to solve but you made progress, write that here. The more clearly you articulate the problem, the more partial credit you will receive (fine to leave this blank).

JSON Alternative To Pickle

Some students have reported issues loading the list of dictionaries with pickle. Since it is just a list of dictionaries with text and numeric keys/values only, it is possible to use a simpler, more universal encoding known as JSON. Click here to download the JSON file. This link will likely open the JSON file in your browser, where you can explore the tweets. To save it, switch to "RAW", then right click and choose "save file as" to save it to your hard drive as trumpSinceElection.json. You can then load the file with Python's built-in json module.
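
A minimal sketch of loading the JSON version, assuming the file was saved as trumpSinceElection.json in the same folder as Twitter.py. The helper name load_tweets is made up for illustration and is not part of the provided code.

```python
import json
import os

def load_tweets(path):
    """Load a list of tweet dictionaries from a JSON file."""
    with open(path, encoding="utf-8") as fin:
        return json.load(fin)

# Only try to load if the file has actually been downloaded
if os.path.exists("trumpSinceElection.json"):
    tweets = load_tweets("trumpSinceElection.json")
    print(len(tweets), "tweets loaded")
```

Once loaded this way, tweets should be an ordinary Python list of dictionaries, so the rest of the assignment should not need to change.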

The Problem

In class, we showed how to process Python dictionaries, and that the Twitter API organizes tweets in dictionary form. In this assignment, you will be digging into Donald Trump's tweets since November 2016 to answer a few questions.

Part 1: The kth Most Popular Tweet (6 Pts)

In the video from last week, we showed how to find Trump's most popular tweet by using numpy's argmin function (Click here to review that example). Numpy also has a function called argsort. Look at the documentation for this function, and use it to come up with Trump's kth most popular tweet, as measured by the number of retweets. Put your code in a method called find_kth_popular_tweet(tweets, k). This method should find and print out the dictionary for this tweet. For example, the code should output


  • You should play around with the argsort function using simple examples that you design by hand before you apply it to the more complicated scenario with tweets. By default, this method sorts things in ascending order. Somehow, you will need to get them in descending order.
  • Be careful with zero-indexing. The 5th most popular tweet would really be at index 4 in a sorted list.
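
As one way to experiment by hand, here is a small sketch with made-up numbers (the array name retweets and its values are purely illustrative, not taken from the tweet data). It shows what argsort returns, one possible way to reverse the order, and how the zero-indexing works out.

```python
import numpy as np

# Made-up retweet counts for five imaginary tweets
retweets = np.array([12, 95, 3, 40, 77])

# argsort returns the indices that would sort the array in ascending order
ascending = np.argsort(retweets)
print(ascending)  # [2 0 3 4 1]

# Reversing the index array is one way to get descending order
descending = ascending[::-1]
print(descending)  # [1 4 3 0 2]

# The kth most popular item sits at position k - 1 because of zero-indexing
k = 2
kth_index = descending[k - 1]
print(kth_index, retweets[kth_index])  # 4 77
```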

Note for the curious

Since we only need the kth most popular tweet, sorting everything is technically overkill. For those familiar, sorting N items takes O(N log N) steps at best with a comparison-based sort. However, an operation known as a k-partition can separate out the smallest k elements of a list in only O(N) time. One can use numpy's argpartition method to separate out the largest k in this fashion. That said, getting comfortable with argsort will help you in the next task.
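
For the curious, a small sketch of the partition idea on made-up numbers (the array and variable names are illustrative only). It assumes numpy's documented behavior that, after np.argpartition(a, kth), the entry at position kth is the index of the element that would land there in a full ascending sort.

```python
import numpy as np

# Made-up retweet counts for five imaginary tweets
retweets = np.array([12, 95, 3, 40, 77])
k = 2
n = len(retweets)

# Partition around position n - k: everything before it is smaller or equal,
# everything after it is larger or equal, but neither side is fully sorted
partitioned = np.argpartition(retweets, n - k)

# Position n - k in ascending order is the kth largest value
kth_index = partitioned[n - k]
print(kth_index, retweets[kth_index])  # 4 77
```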

Part 2: Top k Most Used Words (7 Pts)

Your next task is to loop through all of the tweets and print out the top k most commonly used words. Create a method get_k_most_popular_words(tweets, k) to do this. For instance, with k = 120, it should print out the following words in order:

1 the
2 to
3 and
4 of
5 a
6 in
7 is
8 rt
9 for
10 on
11 that
12 are
13 i
14 will
15 with
16 our
17 be
18 great
19 we
20 have
21 &
22 they
23 it
24 this
25 was
26 you
27 at
28 has
29 he
30 not
31 by
32 president
33 all
34 very
35 as
36 my
37 no
38 just
39 so
40 who
41 from
42 people
43 -
44 thank
45 their
46 democrats
47 but
48 his
49 trump
50 do
51 been
52 an
53 about
54 now
55 new
56 more
57 fake
58 big
59 or
60 what
61 get
62 would
63 many
64 news
65 if
66 than
67 never
68 out
69 there
70 american
71 should
72 up
73 your
74 u.s.
75 @realdonaldtrump
76 want
77 when
78 much
79 united
80 one
81 even
82 @realdonaldtrump:
83 time
84 america
85 being
86 me
87 make
88 were
89 like
90 going
91 good
92 can
93 only
94 which
95 must
96 house
97 impeachment
98 after
99 border
100 had
101 country
102 other
103 doing
104 don’t
105 because
106 media
107 back
108 nothing
109 over
110 into
111 vote
112 how
113 dems
114 state
115 am
116 republican
117 did
118 states
119 working
120 why

To help you out, you should have a loop somewhere that splits the text in each tweet into a list of its individual words and puts the words into lowercase, so that lowercase and uppercase versions count as the same word.
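
A minimal sketch of such a loop on two made-up tweets, assuming each tweet dictionary stores its words under the 'text' key. The sample tweets are invented for illustration, and the counting step inside the loop is one possible approach rather than the required one.

```python
# Two made-up tweets standing in for the real list of dictionaries
tweets = [
    {"text": "Great day in Ohio"},
    {"text": "A GREAT rally in Ohio"},
]

word_counts = {}
for tweet in tweets:
    # Lowercase first, then split on whitespace into individual words
    words = tweet["text"].lower().split()
    for word in words:
        # Start at 0 the first time a word is seen, then add one
        word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts)
```

Note that split() only breaks on whitespace, so punctuation stays attached to words. That is consistent with entries such as "@realdonaldtrump:" appearing in the list above.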


  • Let's say, for the sake of argument, that I have a word_counts dictionary whose keys are words and whose values are their counts. If I put its keys into a list called words and its values into a numpy array called counts, I then have a list of all words and a corresponding numpy array of all of the counts. You can then argsort counts and use that to pick out the top k words.
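
The hint above could be sketched as follows on a tiny made-up dictionary. The particular words and counts are invented, and negating the counts before argsort is just one of several ways to get descending order.

```python
import numpy as np

# A made-up word_counts dictionary for illustration
word_counts = {"the": 5, "great": 3, "wall": 1, "to": 4}

# Keys and values come out in matching order
words = list(word_counts.keys())
counts = np.array(list(word_counts.values()))

# Sorting the negated counts ascending gives the original counts descending
order = np.argsort(-counts)

k = 2
top_k = [words[i] for i in order[:k]]
for rank, word in enumerate(top_k, start=1):
    print(rank, word)
```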

Part 3: COVID Tweets (7 Pts)

Make a function plot_coronavirus_timeline(tweets) that loops through all of the tweets in the database and picks out all of the tweets that mention "corona", "virus", or "covid" in the lowercase version of the 'text' key. Then, it should create a bar chart that shows a bar for each date on which these words were mentioned, with the height of the bar equal to the number of matching tweets on that particular day.

Since plotting labeled bar charts in matplotlib is not obvious, you may use the starter code below. You simply need to fill in the counts dictionary. You should use the provided get_tweet_date(tweet) to create the key for this dictionary. This function puts the dates into Year/MM/DD format, which ensures that alphabetical order matches chronological order.


  • To check if a string is contained in another string, use Python's in operator.
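
A small sketch of the substring check and of filling a counts dictionary, using made-up (date, text) pairs. In the real assignment the date key would come from the provided get_tweet_date(tweet); the helper mentions_keyword and the sample data here are hypothetical and exist only for illustration.

```python
KEYWORDS = ("corona", "virus", "covid")

def mentions_keyword(text):
    """Return True if the lowercased text contains any keyword."""
    lowered = text.lower()
    # "x in y" is True when string x appears anywhere inside string y
    return any(keyword in lowered for keyword in KEYWORDS)

# Made-up samples: (date in Year/MM/DD format, tweet text)
samples = [
    ("2020/03/01", "The CoronaVirus is under control"),
    ("2020/03/01", "Great meeting on COVID today"),
    ("2020/03/02", "Big rally tonight"),
    ("2020/03/02", "Virus task force briefing at 5"),
]

counts = {}
for date, text in samples:
    if mentions_keyword(text):
        counts[date] = counts.get(date, 0) + 1

print(counts)  # {'2020/03/01': 2, '2020/03/02': 1}
```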