
Date: 2021-03-29 09:36

Homework Assignment #3 (Capturing and Wrangling Twitter Data)

Your next assignment is to become something of an expert on acquiring, saving, and wrangling large volumes of social media data using Twitter APIs and Tweets as an example.

Get Ready...

In addition to PostgreSQL (again) and the mainstream components of Anaconda, you will be making use of the Twitter developer APIs and a popular supporting Python library called Tweepy (http://www.tweepy.org/) for this assignment. Grab Tweepy if you don’t already have it and read up on it a bit. You will also be using some of the techniques covered in class for doing non-query things with PostgreSQL, so make sure that you are comfortable with those. Lastly, to use Tweepy, you will need to have a Twitter account - so it’s time to get yourself one if you don’t already have one. (Since all of the CS170A course projects will likely include a social media component, consider this another investment in project preparation as well - so even if getting a Twitter account hasn’t been high on your to-do list in the past, it’s time for you to join the leaders of countries all over the world and start Tweeting!)

Get (Data) Set...

For this assignment, you are going to be acquiring, archiving, and analyzing Tweets. Because your boss envisions doing this for a LOT of Tweets, you’re going to do this the scalable way - you will be storing and manipulating your Tweets in PostgreSQL, using Python and Tweepy just as a means to that end. Your boss is looking for ways to help Wholoo.com overtake Netflix.com, and wants you to explore the online buzz about two recent Netflix shows, Bridgerton and The Queen's Gambit.


To get started, think of a few keywords that might appear in Tweets about these shows, and play around with Twitter’s interactive search interface (https://twitter.com/search-home) to see what those words turn up (e.g., https://twitter.com/search?q=bridgerton%20netflix). Iterate a bit until you find a search query that yields Tweets you would like to analyze. This is also a great way to get familiar with the query syntax for Twitter searches, so feel free to continue exploring the search capabilities this way; more information about Twitter’s search operators can be found in Twitter’s search documentation. Note also that Twitter has many Tweeters that are not human (bots); you may want to mitigate that by refining your search method (e.g., changing the ‘result_type’ from recent to popular or vice versa).
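Because the search operators compose as plain text in the query string, you can also build queries programmatically. A minimal sketch (the helper name and its defaults are illustrative, not part of any API; only the `-filter:retweets` and `lang:` operators are Twitter's own):

```python
def build_search_query(terms, exclude_retweets=True, language=None):
    """Compose a Twitter search query string from standard search operators.

    `-filter:retweets` and `lang:` are documented Twitter search operators;
    the helper itself is just an illustration.
    """
    parts = list(terms)
    if exclude_retweets:
        parts.append("-filter:retweets")  # cut down on duplicated content
    if language:
        parts.append("lang:" + language)  # restrict results to one language
    return " ".join(parts)

print(build_search_query(["bridgerton", "netflix"], language="en"))
# bridgerton netflix -filter:retweets lang:en
```

The same string works both in the interactive search page and later as the `q` argument to the search API.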

Go..!

It’s time to get to work. Tackle each of the following questions and problems using Tweepy to access the Twitter APIs and using PostgreSQL as your social data storage and query platform:

1. Design a schema, and a set of CREATE TABLE statements, for storing the data that you are going to gather. Tweets contain a number of different kinds of information, both at the top level and nested within, including information about a given Tweet (e.g., its text), information about the Tweet’s content (e.g., hashtags), static info about the Tweet’s user (e.g., their screen name), and dynamic info about the user (e.g., how many times they have Tweeted). (See https://developer.twitter.com/en/docs/tweets/data-dictionary). You will be capturing your Tweet data in a set of 1NF tables by collecting the Tweets with Tweepy and then “splitting them up” by writing their pieces into these tables. Create interrelated tables to hold the following information:

a. Tweet information tables BridgertonTweetInfo and QueenTweetInfo, including columns with names created_at, tweetid, userId, text, and in_reply_to_status_id. Also include a raw_tweet field in your design that captures the original Tweet (in case you realize later on that there’s more information that you’d like to have from the Tweets).

b. One static user information table StaticUserInfo, including columns with names userid, username, screen_name, and userdescription.

c. One dynamic user information table DynamicUserInfo, including columns with names userid, statuses_count and followers_count.

d. One tweet entity information table TweetEntity, with columns tweetid and hashtag. The hashtag column contains the hashtags in that tweet. Note that the table should be in 1NF.
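A hedged sketch of what these CREATE TABLE statements could look like (the column types, keys, and the jsonb choice for raw_tweet are assumptions to adapt, not the required answer; QueenTweetInfo mirrors BridgertonTweetInfo):

```sql
-- Sketch only: adjust types and constraints to your data.
CREATE TABLE BridgertonTweetInfo (
    tweetid               bigint PRIMARY KEY,
    created_at            timestamptz,
    userid                bigint,   -- Postgres folds unquoted "userId" to userid
    text                  text,
    in_reply_to_status_id bigint,
    raw_tweet             jsonb     -- the full original Tweet, kept for later
);
-- QueenTweetInfo: same structure, for the second show.

CREATE TABLE StaticUserInfo (
    userid          bigint PRIMARY KEY,
    username        text,
    screen_name     text,
    userdescription text
);

CREATE TABLE DynamicUserInfo (
    userid          bigint,
    statuses_count  integer,
    followers_count integer
);

-- 1NF: one row per (tweet, hashtag) pair, never a list-valued column.
CREATE TABLE TweetEntity (
    tweetid bigint,
    hashtag text
);
```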


2. Write a small Python program that uses your search query to gather two sets of at least 5,000 Tweets and stores them in your PostgreSQL tables. One set should be Tweets about the Bridgerton show, and the other set should be Tweets about the Queen show. Start by testing your program on small sets of Tweets (say 10-100 or so) and then, once everything seems to be working, run it “at scale” to gather your two sets of 5,000 Tweets. Beware of Twitter’s rate limiting as you do this - your program will need to be respectful of that and will have to run for a while. You will want to get this step of the assignment done 1-2 days before the deadline to leave yourself sufficient time for the social analytics part of this HW assignment! (Seriously!)
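The rate-limit-respecting collection loop has a standard shape: page backwards through results, and when the API pushes back, sleep and retry. The sketch below shows that shape with a stubbed fetch function so it runs without credentials; the names `gather_tweets`, `fetch_page`, and `RateLimitError` are ours, not Tweepy's. In real Tweepy you would instead construct the API object with `wait_on_rate_limit=True` and let `tweepy.Cursor(...).items(5000)` do this waiting for you.

```python
import time

class RateLimitError(Exception):
    """Stand-in for a rate-limit error; carries a retry delay in seconds."""
    def __init__(self, retry_after=0.0):
        super().__init__("rate limited")
        self.retry_after = retry_after

def gather_tweets(fetch_page, target):
    """Collect at least `target` tweets by paging backwards through results.

    `fetch_page(max_id)` returns a list of tweet dicts (newest first, each
    with an 'id' key) or raises RateLimitError. This sketch only shows the
    shape of the loop that Tweepy's cursoring implements for you.
    """
    tweets, max_id = [], None
    while len(tweets) < target:
        try:
            page = fetch_page(max_id)
        except RateLimitError as e:
            time.sleep(e.retry_after)  # back off until the window resets
            continue
        if not page:
            break                      # search results exhausted
        tweets.extend(page)
        max_id = page[-1]["id"] - 1    # only ask for strictly older tweets
    return tweets
```

Once a page of statuses is in hand, each one would be split into INSERTs against the tables from Part 1 (tweet info, static user info, dynamic user info, and one hashtag row per entity).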


3. Working (only) with SQL and PostgreSQL, do some simple analysis on your collection of Tweets - record your work and its results (e.g., by creating a script file that you can run again when you are done exploring your data).

a. What are the top 100 most frequent words appearing in each of your collections of Tweets, and what is each word’s occurrence rate in each collection? What sentiment can you infer just by looking at the words (negative, positive, neutral)? (Hint: recall that you can tokenize a text field in Postgres using string_to_array.)
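Following the hint, the word-count query could take a shape like the following (a sketch: splitting on single spaces is crude, and "rate" here means occurrences per collected Tweet, which is one reading of "occurrence rate"; the table name follows Part 1):

```sql
-- Sketch: split each Tweet's text on spaces, then count the words.
SELECT word,
       count(*) AS occurrences,
       round(count(*)::numeric
             / (SELECT count(*) FROM BridgertonTweetInfo), 4) AS rate
FROM BridgertonTweetInfo,
     unnest(string_to_array(lower(text), ' ')) AS word
GROUP BY word
ORDER BY occurrences DESC
LIMIT 100;
```

Running the same query against QueenTweetInfo gives the second collection's list.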

b. How many different hashtags do your Tweet collections have overall?

c. What is the average number of hashtags per Tweeting user in each Tweet collection?

d. What hashtags are common to Tweets about both of the Netflix shows?

e. For the top 10 hashtags common to both shows, what are the associated occurrence frequencies for each show?

f. What users (if any) are common to both of your datasets?

g. Who are the top 10 Tweeters (include their user id, user name, screen name, and description) by Tweet volume within your combined dataset, and what are their Tweet counts and last known follower counts? You might want to create indexes on tweet(user_id) in both datasets to accelerate your query execution time.
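One way such a query could be structured (a sketch that assumes DynamicUserInfo holds a single row per user; if you store repeated snapshots, select the latest one first, and note that a multi-snapshot join would inflate count(*)):

```sql
-- Sketch: pool both Tweet tables, then join in the user information.
WITH all_tweets AS (
    SELECT userid FROM BridgertonTweetInfo
    UNION ALL
    SELECT userid FROM QueenTweetInfo
)
SELECT s.userid, s.username, s.screen_name, s.userdescription,
       count(*)               AS tweet_count,
       max(d.followers_count) AS followers_count
FROM all_tweets a
JOIN StaticUserInfo  s ON s.userid = a.userid
JOIN DynamicUserInfo d ON d.userid = a.userid
GROUP BY s.userid, s.username, s.screen_name, s.userdescription
ORDER BY tweet_count DESC
LIMIT 10;

-- The suggested per-table user-id indexes:
CREATE INDEX bridgerton_userid_idx ON BridgertonTweetInfo (userid);
CREATE INDEX queen_userid_idx      ON QueenTweetInfo (userid);
```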

h. To speed up your work, you should create a GIN index (generalized inverted index, used for text indexing) on the text field of your main Tweets tables.
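In PostgreSQL this can be done with an expression index over a tsvector of the text column; a sketch (the index name and the keywords in the sample query are placeholders):

```sql
-- Expression GIN index supporting full-text search over Tweet text.
CREATE INDEX bridgerton_text_gin
    ON BridgertonTweetInfo
    USING GIN (to_tsvector('english', text));

-- A query of this shape can use the index ('wedding | duke' is a placeholder):
SELECT tweetid, text
FROM BridgertonTweetInfo
WHERE to_tsvector('english', text) @@ to_tsquery('english', 'wedding | duke');
```

The WHERE clause must use the same `to_tsvector('english', text)` expression as the index definition, or the planner cannot match the query to the index.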

i. One (or both) of the shows may have sparked multiple events that are reflected in social media. Iteratively query your data to look for an event that went viral, and form a set of keywords unique to that event. Use your keyword set to find Tweets containing one or more of those keywords. (That’s where the index can help.) Briefly explain your findings. (Note: this is an open-ended question with no right or wrong answer; it is, however, a step closer to what you will do in your project.)



What To Turn In

When you have finished your assignment you should use Gradescope to turn in one PDF file that shows your Jupyter notebook work for Part 2 as well as your SQL code and output for Parts 1 and 3. When submitting to Gradescope, be sure to mark which pages correspond to which questions; you may lose credit if this is not done correctly.

Additionally, you should submit both your Jupyter and SQL code files to Canvas.

