
STAT 7008 - Assignment 2

Due date: 31 Oct 2018

Question 1 (hashtag analysis)

1. The file tweets1.json corresponds to tweets received before a Presidential debate in 2016, and all other data files correspond to tweets received immediately after the same debate. Please download the files from the following link:

https://transfernow.net/912g42y1vs78

Please write code to read the data files tweets1.json to tweets5.json, and combine tweets2.json through tweets5.json into a single file named tweets2.json. Determine the number of tweets in tweets1.json and tweets2.json.
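A minimal sketch of the reading-and-combining step. It assumes each file stores one tweet per line (line-delimited JSON), with a fallback to a single JSON array, since the exact layout is not specified here:

    import json

    def read_tweets(path):
        """Read a tweet file: one JSON object per line, with a
        fallback to a single JSON array."""
        with open(path, encoding='utf-8') as f:
            text = f.read()
        try:
            return [json.loads(line) for line in text.splitlines() if line.strip()]
        except json.JSONDecodeError:
            return json.loads(text)

    tweets1 = read_tweets('tweets1.json')
    combined = []
    for i in range(2, 6):
        combined.extend(read_tweets('tweets%d.json' % i))
    with open('tweets2.json', 'w', encoding='utf-8') as f:
        for tw in combined:
            f.write(json.dumps(tw) + '\n')
    print(len(tweets1), len(combined))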

2. In order to clean the tweets in each file with a focus on extracting hashtags, we observe that 'retweeted_status' is another tweet nested within a tweet. We select tweets using the following criteria:

- a non-empty list of hashtags, either in 'entities' or in the 'entities' of the 'retweeted_status';
- there is a timestamp;
- there is a legitimate location;
- extract hashtags that were written in English, or convert hashtags that were written partially in English (ignore the non-English characters).

Write a function to return a dictionary of acceptable tweets, locations and hashtags for tweets1.json and tweets2.json respectively.
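One possible shape for this function, assuming the standard Twitter JSON fields ('entities' -> 'hashtags' -> 'text', 'created_at', 'user' -> 'location') and treating any non-empty user location as legitimate; both choices should be checked against the actual data:

    import re

    def clean_tweets(tweets):
        """Return acceptable tweets plus their locations and hashtags."""
        non_english = re.compile(r'[^A-Za-z0-9]')
        kept, locations, hashtags = [], [], []
        for tw in tweets:
            tags = tw.get('entities', {}).get('hashtags', [])
            if not tags and 'retweeted_status' in tw:
                tags = tw['retweeted_status'].get('entities', {}).get('hashtags', [])
            time = tw.get('created_at')
            loc = (tw.get('user') or {}).get('location')
            if not (tags and time and loc):
                continue
            # keep only the English letters and digits of each hashtag
            texts = [non_english.sub('', t['text']).lower() for t in tags]
            texts = [t for t in texts if t]
            if not texts:
                continue
            kept.append(tw)
            locations.append(loc.strip().lower())
            hashtags.append(texts)
        return {'tweets': kept, 'locations': locations, 'hashtags': hashtags}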

3. Write a function to extract the top n tweeted hashtags from a given hashtag list. Use the function to find the top n tweeted hashtags in tweets1.json and tweets2.json respectively.
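With the per-tweet hashtag lists from the clean_tweets sketch above, collections.Counter does the counting:

    from collections import Counter

    def top_hashtags(hashtag_lists, n):
        """Return the n most common (hashtag, count) pairs."""
        counts = Counter(tag for tags in hashtag_lists for tag in tags)
        return counts.most_common(n)

    # e.g. top_hashtags(clean1['hashtags'], 10), where clean1 = clean_tweets(tweets1)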

4. Write a function to return a data frame which contains the top n tweeted hashtags of a given hashtag list. The columns of the returned data frame are 'hashtag' and 'freq'.
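A thin pandas wrapper around the top_hashtags sketch above:

    import pandas as pd

    def top_hashtags_df(hashtag_lists, n):
        """Top n hashtags as a data frame with columns 'hashtag' and 'freq'."""
        return pd.DataFrame(top_hashtags(hashtag_lists, n),
                            columns=['hashtag', 'freq'])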

5. Use the function to produce a horizontal bar chart of the top n tweeted hashtags of tweets1.json and tweets2.json respectively.
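One way to draw the chart with pandas' built-in plotting, reusing the helpers sketched above (clean1 is assumed to be the cleaned tweets1.json):

    import matplotlib.pyplot as plt

    df = top_hashtags_df(clean1['hashtags'], 10)
    df.sort_values('freq').plot.barh(x='hashtag', y='freq', legend=False)
    plt.xlabel('frequency')
    plt.tight_layout()
    plt.show()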

6. Find the max time and min time of tweets1.json and tweets2.json respectively.

7. Divide each interval defined by (min time, max time) into 10 equally spaced periods.
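Parts 6 and 7 together, sketched for tweets1.json; pandas parses Twitter's created_at strings directly, and 11 equally spaced edges define 10 equal periods:

    times1 = pd.to_datetime([tw['created_at'] for tw in clean1['tweets']])
    tmin, tmax = times1.min(), times1.max()        # part 6
    edges = pd.date_range(tmin, tmax, periods=11)  # part 7: 10 equal periods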

8. For a given collection of tweets, write a function to return a data frame with two columns: hashtags and their time of creation. Use the function to produce data frames for tweets1.json and tweets2.json. Using pandas.cut or otherwise, create a third column 'level' in each data frame which cuts the time of creation by the corresponding interval obtained in part 7.
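A sketch of this step, again built on the clean_tweets output; each hashtag occurrence becomes one row:

    def hashtag_times(cleaned):
        """One row per (hashtag, creation time) pair."""
        rows = [(tag, tw['created_at'])
                for tw, tags in zip(cleaned['tweets'], cleaned['hashtags'])
                for tag in tags]
        df = pd.DataFrame(rows, columns=['hashtag', 'created_at'])
        df['created_at'] = pd.to_datetime(df['created_at'])
        return df

    ht1 = hashtag_times(clean1)
    ht1['level'] = pd.cut(ht1['created_at'], bins=edges,
                          labels=list(range(1, 11)), include_lowest=True)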

9. Using pandas.pivot (or otherwise), create a numpy array or a pandas data frame whose rows are the time periods defined in part 7 and whose columns are hashtags. The entry for the ith time period and jth hashtag is the number of occurrences of the jth hashtag in the ith time period. Fill entries without data with zero. Do this for tweets1.json and tweets2.json respectively.
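With the 'level' column from part 8, a pivot_table with a count aggregation produces this table directly:

    table1 = (ht1.pivot_table(index='level', columns='hashtag',
                              values='created_at', aggfunc='count')
                 .fillna(0).astype(int))

A lookup such as table1.loc[6, 'trump'] then answers part 10 for tweets1.json (and table2.loc[8, 'trump'] for tweets2.json, if table2 is built the same way).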

10. Following part 9, what is the number of occurrences of the hashtag 'trump' in the sixth period in tweets1.json? What is the number of occurrences of the hashtag 'trump' in the eighth period in tweets2.json?

11. Using the tables obtained in part 9, we can also find the total number of occurrences of each hashtag. Rank these hashtags in decreasing order and obtain a time plot of the top 20 hashtags in a single graph. Rescale the graph so that it is neither too small nor too large. Do this for both tweets1.json and tweets2.json respectively.
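Column sums give the totals, and figsize handles the rescaling. A sketch for tweets1.json:

    top20 = table1.sum().sort_values(ascending=False).head(20).index
    ax = table1[top20].plot(figsize=(12, 6))
    ax.set_xlabel('time period')
    ax.set_ylabel('count')
    plt.show()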

12. The file zip_codes_states.csv contains the city, state, county, latitude and longitude of US locations. Read the file.

13. Select the tweets in tweets1.json and tweets2.json whose locations appear in zip_codes_states.csv. Also remove the location 'london'.

14. Find the top 20 tweeted locations in tweets1.json and tweets2.json respectively.

15. Since there are multiple (lon, lat) pairs for each location, write a function to return the average lon and the average lat of a given location. Use the function to generate the average lon and the average lat for every location in tweets1.json and tweets2.json.
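A sketch of the averaging step; the column names ('city', 'longitude', 'latitude') are an assumption about zip_codes_states.csv to be checked against the actual header:

    zips = pd.read_csv('zip_codes_states.csv')
    zips['city'] = zips['city'].str.strip().str.lower()
    avg_coords = zips.groupby('city')[['longitude', 'latitude']].mean()

    def avg_lon_lat(location):
        """Average (lon, lat) of a location name, or None if unknown."""
        if location in avg_coords.index:
            row = avg_coords.loc[location]
            return row['longitude'], row['latitude']
        return None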

16. Combine tweets1.json and tweets2.json. Then create data frames which contain the locations, counts, longitude and latitude for tweets1.json and tweets2.json.

17. Using the shapefile of US states, st99_d00, and the help of the website https://stackoverflow.com/questions/39742305/how-to-usebasemap-python-to-plot-us-with-50-states, produce the following graphs.
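A hedged outline of the Basemap step, following the approach in the linked answer; it assumes the st99_d00 shapefile components (.shp, .dbf, .shx) are in the working directory and that df is a location data frame from part 16 with 'longitude', 'latitude' and 'counts' columns:

    from mpl_toolkits.basemap import Basemap
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(12, 8))
    m = Basemap(llcrnrlon=-119, llcrnrlat=22, urcrnrlon=-64, urcrnrlat=49,
                projection='lcc', lat_1=33, lat_2=45, lon_0=-95, ax=ax)
    m.readshapefile('st99_d00', name='states', drawbounds=True)
    x, y = m(df['longitude'].values, df['latitude'].values)  # map coordinates
    m.scatter(x, y, s=df['counts'], alpha=0.5, zorder=5)     # size by count
    plt.show()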

18. (Optional) Using polygon patches and the help of the website https://stackoverflow.com/questions/39742305/how-to-usebasemap-python-to-plot-us-with-50-states, produce the following graph.

Question 3 (extract hurricane paths)

The website http://weather.unisys.com provides hurricane path data from 1850 onward. We work to extract the hurricane paths for a given year.

1. Since the link that contains the hurricane information varies with the year, and the information is spread over multiple pages, we need to know the starting page and the total number of pages for a given year. What is the appropriate starting page for year = '2017'?

2. Since the number of pages is not known in advance, we try inputting a large number as the number of pages for a given year. Using an appropriate number, write a function to extract all links, each of which holds information on one of the hurricanes of '2017'.
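A rough sketch of the link-collection loop. The URL template below is purely hypothetical; the real pagination pattern on weather.unisys.com must be found by inspecting the site:

    import requests
    from bs4 import BeautifulSoup

    def hurricane_links(year, max_pages=50):
        """Collect hurricane links across paginated index pages.
        BASE is a made-up template, not the site's real URL scheme."""
        BASE = 'http://weather.unisys.com/hurricane/?year={}&page={}'
        links = []
        for page in range(1, max_pages + 1):
            resp = requests.get(BASE.format(year, page))
            if resp.status_code != 200:
                break  # past the last page
            soup = BeautifulSoup(resp.text, 'html.parser')
            links.extend(a['href'] for a in soup.find_all('a', href=True))
        return links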

3. Some of the collected links provide summaries of hurricanes and do not lead to the correct tables. Remove those links.

4. Each valid hurricane link contains four sets of information:

- Date
- Hurricane classification
- Hurricane name
- A table of hurricane positions over dates

Since the entire information is contained in a text file provided on the webpage defined by the link, write a function to download the text file and read it without saving it to a local directory (at this moment, you don't need to convert the data to another format).
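The in-memory download is short with requests:

    def fetch_text(url):
        """Download the hurricane text file without writing it to disk."""
        resp = requests.get(url)
        resp.raise_for_status()
        return resp.text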

5. With the downloaded contents, write a function to convert the contents into a list of dictionaries. Each dictionary in the list contains the following keys: the date, the category of the hurricane, the name of the hurricane, and a table of information for the hurricane path. Convert the date in each dictionary to a datetime object. Since the recorded times for the hurricane paths use Z-time, convert them to datetime objects with the help of http://www.theweatherprediction.com/basic/ztime/.
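Z-time (Zulu time) is simply UTC, so conversion amounts to attaching the UTC timezone. A sketch that assumes time tokens of the form MM/DD/HHZ, which should be verified against the actual tables:

    from datetime import datetime, timezone

    def parse_ztime(token, year):
        """Parse a token such as '08/25/09Z' into an aware UTC datetime."""
        month, day, hour = token.rstrip('Z').split('/')
        return datetime(int(year), int(month), int(day), int(hour),
                        tzinfo=timezone.utc)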

6. We find some missing data in the Wind column of some tables. Since the classification of a hurricane at a given moment can be found in the Status column of the same table, and the classification also relates to the wind speed at that moment, use the classification to impute the missing wind data. You may want to read the following website: https://en.wikipedia.org/wiki/Tropical_cyclone_scales.
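One simple imputation maps each status to a representative wind speed. The values below are rough midpoints of the Saffir-Simpson ranges in knots, and both the numbers and the status labels are assumptions to be adjusted against the data:

    # representative wind speeds (knots); labels and values are assumptions
    STATUS_WIND = {
        'TROPICAL DEPRESSION': 30,
        'TROPICAL STORM': 50,
        'HURRICANE-1': 73,
        'HURRICANE-2': 89,
        'HURRICANE-3': 104,
        'HURRICANE-4': 124,
        'HURRICANE-5': 145,
    }

    def impute_wind(df):
        """Fill missing Wind values from the Status column."""
        fill = df['Status'].str.upper().map(STATUS_WIND)
        df['Wind'] = pd.to_numeric(df['Wind'], errors='coerce').fillna(fill)
        return df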

7. Plot the hurricane paths of year '2017', sized by the wind speed and colored by the classification status. If you can produce your graph in a creative way, bonus marks will be given.
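A bare-bones version of the plot without a base map, assuming each parsed hurricane is a dict whose 'table' is a data frame with Lat, Lon, Wind and Status columns as in the earlier sketches:

    def plot_paths(storms):
        statuses = sorted({s for st in storms for s in st['table']['Status']})
        colors = {s: plt.cm.tab10(i % 10) for i, s in enumerate(statuses)}
        fig, ax = plt.subplots(figsize=(10, 8))
        for st in storms:
            t = st['table']
            ax.plot(t['Lon'], t['Lat'], color='grey', lw=0.5)  # the path
            ax.scatter(t['Lon'], t['Lat'], s=t['Wind'],        # size by wind
                       c=[colors[s] for s in t['Status']],     # color by status
                       alpha=0.7)
        ax.set_xlabel('longitude')
        ax.set_ylabel('latitude')
        plt.show()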

8. (Optional) Turn the above functions into functions of the year, so that when the year changes you can easily generate a plot of the hurricane paths for that year.

