
CS/INFO 5304 Assignment 1: Data Preparation

Credit: 94 points

Grade: 20% of final grade

Submission: Files that need to be submitted are listed at the end of each question.

Please submit the assignment on Gradescope

Tip: Read the document fully before you begin working on the assignment

Due date: March 28th, 11:59PM

Question 1: Extract, Transform Load (ETL) (34 points)

Often, we are given different datasets that represent the same population, but each dataset contains different information about this population. ETL pipelines can be used to extract each data source, transform the data sources so we can map one dataset onto the other, and then load them into a single data source that we can use for analytics.

In this question, we will create an ETL pipeline to combine two raw datasets from a sampled population, and estimate the value of a parameter that exists in one dataset but not the other. The datasets we are looking at come from the U.S. Centers for Disease Control and Prevention (CDC). The first dataset is called the Behavioral Risk Factor Surveillance System, or BRFSS, and the second dataset is called the National Health Interview Survey, or NHIS. Within this question you will be asked to perform the following data analysis:

● Extract the raw datasets into a Spark SQL DataFrame
● Read the codebooks for each of these datasets to understand how to map similar features to each other
● Perform an exact join on these features, and approximate a disease prevalence statistic within the US population

For this question, you are required to use Spark and will heavily use the Spark SQL API. You may use any other Python functions/objects when performing map/reduce on sections of the DataFrame. We recommend reading this entire section before you begin coding, so you can get an idea of the direction of the problem.

Step 1: Load data (3 points)

Download the data from Google Drive (a smaller dataset than the real one). There should be two data files, called brfss_input.json and nhis_input.csv. We have also provided starter code for you, in p1.py. Complete the function called create_dataframe that takes in a path to a file, the type of file to upload (e.g. "csv"), and a Spark session, and returns the Spark DataFrame for that file.
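
A minimal sketch of what create_dataframe might look like (the exact signature in the starter code may differ, and the header/schema options for the CSV read are assumptions):

    from pyspark.sql import SparkSession, DataFrame

    def create_dataframe(filepath: str, format: str, spark: SparkSession) -> DataFrame:
        # Sketch only: read a file of the given type into a Spark DataFrame.
        if format == "csv":
            # Assumption: the CSV has a header row; let Spark infer column types.
            return spark.read.csv(filepath, header=True, inferSchema=True)
        elif format == "json":
            return spark.read.json(filepath)
        raise ValueError(f"unsupported format: {format}")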

Step 2: Make test example (6 points)

When working with large datasets, it is often helpful to create a small subset as a test example that can be used to validate whether our code works, within a short runtime.

Analyze the columns of the BRFSS and NHIS data you downloaded, and identify which columns represent similar information (for example, is there a column in BRFSS, and a respective column in NHIS, that represents a person's age?). To do this, you will need the codebooks for each dataset, which can be found here:

● 2017 BRFSS codebook
● 2017 Sample Adult File NHIS codebook (Here is a copy on Drive in case you have any issues.)

After analyzing the columns, prepare three files that can be used as a test case:

● Download the 5-row BRFSS "dummy" dataset and the 5-row NHIS "dummy" dataset from Google Drive. They are named test_xxx.
● Manually create a joined_test.csv file that represents the expected output of your program with the above dummy input. Only exact matches should be kept, and all nulls should be dropped. This file should essentially have:
a. All the BRFSS columns
b. The column(s) of the NHIS dataset that are not initially within the BRFSS dataset

You may use multiple columns to join. Do read the codebooks to know what each value represents.

You will need to map the values of the NHIS data onto the BRFSS data. For example, IMPRACE = 4 means American Indian/Alaskan Native, non-Hispanic, and MRACBPI2 = 3 means Indian (American) (includes Eskimo, Aleut); map these to the same value before joining.

As another example, both datasets have a column for age. If I had a three-row BRFSS dataset with one column representing age, and a two-row NHIS dataset with two columns representing age and gender, the expected joined DataFrame would look like the following:

BRFSS (one age column):

    BRFSS Age
    ---------
    18-24
    85+
    60-64

NHIS (age and gender columns):

    NHIS Age   NHIS Gender
    --------   -----------
    63         1
    20         2

Joined result (BRFSS age plus created BRFSS gender):

    BRFSS Age   Created BRFSS Gender
    ---------   --------------------
    18-24       2
    60-64       1


Step 3: Map data (12 points)

Next, based on the practice in Step 2, you will create a function that maps the NHIS data onto the BRFSS data. In other words, if you have columns in each dataset that represent similar information, but with potentially different column names and values, we want you to write a function that transforms the matching columns within the NHIS DataFrame to resemble the BRFSS DataFrame.

Please complete the function transform_nhis_data(nhis_df), which takes the NHIS DataFrame as input and returns a DataFrame where matching NHIS/BRFSS columns have been mapped to BRFSS formatting.
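
A sketch of the shape such a function might take. The column names (AGE_P, MRACBPI2, _AGEG5YR, _IMPRACE) and every code mapping below are illustrative placeholders; derive the real correspondences from the two codebooks:

    from pyspark.sql import functions as F

    def transform_nhis_data(nhis_df):
        # Sketch: bucket a continuous NHIS age into BRFSS-style categories.
        # Bracket boundaries and output codes here are assumptions.
        age = F.col("AGE_P")
        nhis_df = nhis_df.withColumn(
            "_AGEG5YR",
            F.when(age.between(18, 24), 1)
             .when(age.between(25, 29), 2)
             .otherwise(None),  # ...extend for the remaining age brackets
        )
        # Sketch: recode an NHIS race value to the matching BRFSS code,
        # e.g. MRACBPI2 = 3 -> _IMPRACE = 4 as in the example above.
        nhis_df = nhis_df.withColumn(
            "_IMPRACE",
            F.when(F.col("MRACBPI2") == 3, 4).otherwise(F.col("MRACBPI2")),
        )
        return nhis_df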

Step 4: Join data (3 points)

Now that you have transformed the columns in NHIS, we can join the two DataFrames together into one entire dataset. Please join your two Spark DataFrames together in the corresponding starter-code function, such that only exact matches remain. (Hint: the final number of columns should be |# of BRFSS columns| + 1.) Please also drop any rows containing null values (this is not necessarily correct to do in practice, but since we're focused on ETL, we'll save missing-data issues for the next problem).

To clarify, you will have mapped a new column onto your BRFSS dataset. This column represents the prevalence of a specific disease throughout the U.S. (look at the codebook to find out what disease!). You can test and compare with the small data you created in Step 2.

(Hint: there are 3 columns in both NHIS and BRFSS representing the same demographic features. Join using all of them, and assume each combination is unique while joining, i.e. there is no duplication. Is the no-duplication assumption correct? You do not need to write an answer for this, but thinking about it may be helpful for answering Step 5.)
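
A sketch of the join under those assumptions (the three join-column names are the same placeholders used in the Step 3 sketch):

    def join_data(brfss_df, nhis_df):
        # Inner join keeps only exact matches; the column names below are
        # placeholders for the three shared demographic features.
        join_cols = ["_AGEG5YR", "SEX", "_IMPRACE"]
        joined = brfss_df.join(nhis_df, on=join_cols, how="inner")
        # Drop any rows containing nulls, as the assignment asks.
        return joined.dropna()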

Step 5: Report prevalence (10 points)

Based on the joined data, report the prevalence of this disease by:

● Race and ethnic background
● Gender
● BRFSS categorical age

Complete the function report_summary_stats(joined_df) that finds these summary statistics. The function should take your joined DataFrame as input, and does not need to have a return value (you can just print the stats and record them).
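
One possible shape for this function (the demographic column names are the placeholders from the earlier sketches, and DISEASE stands for whichever NHIS indicator column you joined on):

    from pyspark.sql import functions as F

    def report_summary_stats(joined_df):
        # For each demographic grouping, print the fraction of respondents
        # whose (placeholder) disease indicator equals 1.
        for group_col in ["_IMPRACE", "SEX", "_AGEG5YR"]:
            (joined_df.groupBy(group_col)
                      .agg(F.avg((F.col("DISEASE") == 1).cast("double"))
                            .alias("prevalence"))
                      .orderBy(group_col)
                      .show())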

In your write-up:

- record the prevalence you calculated for each category
- research what the actual prevalence is
- write a short paragraph comparing the prevalence you found within the joined dataset to the actual prevalence, by gender, race/ethnic background, and age
- assess how you might improve the prevalence you calculated
- Note: does each row in the BRFSS dataset correspond to a single person in the U.S.? Check out the BRFSS documentation.

Turning in:

For this question, please turn in the following:

● Your code, in a file called p1.py
● The joined_test.csv of your created test case in "csv" format
● Your p1_write_up.pdf, including the answer to Step 5 as well as any documentation needed to run your code
● Please only submit the above files, and do not zip your files

Note: You are not submitting the joined DataFrame for the actual BRFSS/NHIS data, just reporting the prevalence statistics.

We should be able to run your p1.py file from the command line using the command:

    spark-submit p1.py /path/to/brfss.json /path/to/nhis.csv -o path_to_output

The arguments are as follows:

● /path/to/brfss.json: the path to an input BRFSS file
● /path/to/nhis.csv: the path to an input NHIS file
● path_to_output: the path where the resulting joined DataFrame is saved
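
A sketch of a main block consistent with that command line (it reuses the create_dataframe sketch from Step 1; the argument-parsing details are an assumption, not the starter code's actual layout):

    import argparse
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("brfss_path")
        parser.add_argument("nhis_path")
        parser.add_argument("-o", "--output", dest="output_path", required=True)
        args = parser.parse_args()

        spark = SparkSession.builder.appName("p1").getOrCreate()
        brfss_df = create_dataframe(args.brfss_path, "json", spark)
        nhis_df = create_dataframe(args.nhis_path, "csv", spark)
        # ...transform, join, report, then save the joined DataFrame:
        # joined_df.write.csv(args.output_path, header=True)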

Please document any specific information we need to run your code or understand your outputs in your written report. If you use a different coding language, document how to run your code.


Question 2: Dealing with messy and missing data (40 points)

In this question, we will perform data cleaning and data filling with human-generated sensor data. The intention is to let you practice exploratory data analysis using real-world data, making sense of the data while cleaning it. The form of the submission is a report containing your results and your reasoning.

We will be using a modified version of the Extrasensory dataset from UCSD. Here is the link to it. In short, this dataset is from a study to perform context recognition with smartphone and smartwatch sensors. Filling in data often requires domain expertise, and we chose this dataset because we are certain all of you are domain experts at using a smartphone. :)

In this question, you can use Pandas or the Pandas API on Spark to load the data into a Pandas DataFrame or a Spark DataFrame, respectively. All functions in the pandas library are allowed; NumPy, scipy.stats, etc. are also allowed.

A Jupyter Notebook template for answering this question is also in the data folder.

Understanding data structure

You are provided a dict of the modified dataset. There are 60 participants in this dataset. We give you two pickle files: Extrasensory_individual_data.p and Extrasensory_sensor_data.p. Please refer here for how to open pickle files.

Extrasensory_individual_data.p contains a DataFrame with the participant ID and the following data:

    Sensor        Feature
    -----------   ---------------------------------
    demographic   age
                  gender
                  phone_system
    Phone usage   self_reported_average_screen_time
                  actual_average_screen_time
                  hours_of_study

Extrasensory_sensor_data.p contains a dict with participant ID as the key. The value for each key is a matrix of ~2000-6000 recorded samples of 10 features; the number of records corresponds to the number of minutes that the participant underwent the study. It contains the data from the following sensors:


    Sensor          Feature
    -------------   --------------------------------
    accelerometer   raw_acc:3d:mean_x
                    raw_acc:3d:mean_y
                    raw_acc:3d:mean_z
    location        location:raw_latitude
                    location:raw_longitude
    phone state     discrete:app_state:is_active
                    discrete:app_state:is_inactive
                    discrete:app_state:is_background
                    discrete:app_state:missing
                    lf_measurements:battery_level

As with all studies that are deployed to humans in the wild, there's a lot of missing data. In this dataset, some of the missing values are recorded as NaN or are left blank. Some other missing values are recorded as -1 for non-negative values.

Please use one Jupyter Notebook (template given) to write your answers to all the questions of Q2. Please include all the figures and code in the notebook.
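
A sketch of loading the two pickle files (file names from above; that the individual file unpickles to a DataFrame and the sensor file to a dict follows from the descriptions, but verify on your copy):

    import pickle

    with open("Extrasensory_individual_data.p", "rb") as f:
        individual_df = pickle.load(f)   # per-participant DataFrame

    with open("Extrasensory_sensor_data.p", "rb") as f:
        sensor_data = pickle.load(f)     # dict: participant ID -> sample matrix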

Case 1: Actual screen time (15 points)

This study recorded the average screen time every 24 hours for each participant. Some data are missing due to app permissions. Because this is an important reference for later analysis, we need to fill in those missing values.

A) Observe this data or make a simple plot. How are missing values represented for this feature?

B) Ignore the missing values for now (i.e. mute them or temporarily remove them, so that they don't affect this analysis). Provide a histogram to illustrate the distribution of actual_average_screen_time.

a) Does it have outliers? If so, how many? An outlier is defined as any data point that lies more than 1.5 IQRs below the first quartile (Q1) or above the third quartile (Q3) in a data set:

    High = Q3 + 1.5 IQR
    Low  = Q1 - 1.5 IQR

b) Is it skewed? If so, is it left skewed or right skewed? What's the skewness?
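
A sketch of the outlier count and skewness under that definition (individual_df and the column name come from the tables above; exactly how missing values are coded is part A's question, so the dropna here is an assumption):

    import matplotlib.pyplot as plt
    import scipy.stats as stats

    screen = individual_df["actual_average_screen_time"].dropna()

    q1, q3 = screen.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = screen[(screen < q1 - 1.5 * iqr) | (screen > q3 + 1.5 * iqr)]
    print("number of outliers:", len(outliers))
    print("skewness:", stats.skew(screen))

    screen.plot.hist(bins=30, title="actual_average_screen_time")
    plt.show()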

C) Fix the missing-data problem using these methods: 1) filling with the median, 2) filling with the mean, 3) filling with a random value within a range of your choice. Reproduce the distribution figure for this feature as filled by all three methods: overlay a figure with the original distribution and the 3 new distributions. How did you choose the random value for method 3)? What do the distributions look like after you implement the three filling methods? (Compare them.)
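
A sketch of the three fills and the overlay (drawing the random fill uniformly between the observed min and max is one choice among many):

    import numpy as np
    import matplotlib.pyplot as plt

    raw = individual_df["actual_average_screen_time"]

    filled_median = raw.fillna(raw.median())
    filled_mean = raw.fillna(raw.mean())

    rng = np.random.default_rng(0)
    filled_random = raw.copy()
    mask = filled_random.isna()
    filled_random[mask] = rng.uniform(raw.min(), raw.max(), mask.sum())

    plt.hist([raw.dropna(), filled_median, filled_mean, filled_random],
             bins=20, label=["original", "median", "mean", "random"])
    plt.legend(); plt.xlabel("average screen time (hours)")
    plt.show()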


D) We can compare your filled distributions with the real distribution. Some research shows that American adults have an average screen time of 3.85 hours per day, with a standard deviation of 1.25 hours. You can generate this distribution with:

    np.random.normal(mean, std, number_of_samples)

Perform a t-test between each of the three distributions you filled in part C and this population distribution. Report the three p-values. Which of the filling methods reconstructs this feature to be closest to the research distribution? Why do you think this is the case?
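
A sketch of the three tests, using a two-sample t-test from scipy (the filled series come from the part C sketch above):

    import numpy as np
    import scipy.stats as stats

    population = np.random.normal(3.85, 1.25, len(filled_median))

    for name, filled in [("median", filled_median),
                         ("mean", filled_mean),
                         ("random", filled_random)]:
        t, p = stats.ttest_ind(filled, population)
        print(f"{name} fill: p = {p:.4f}")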

Case 2: Perceived average screen time (10 points)

In this study, the participants were asked to self-report how many hours they spend on their phone. Participants could choose not to report if they did not know the number of hours, or if they did not want to.

Let's find out whether these missing values are MAR (missing at random) or MNAR (missing not at random). Said another way, let's see whether participants did not report these values because they do not know their screen time, or did not report because they are self-conscious that they might have spent too much time.

Let's see if there's a correlation between actual_average_screen_time and perceived_average_screen_time.

Recap:

actual_average_screen_time is the screen time recorded by the phone, and can be missing for permission or technical reasons.

perceived_average_screen_time is the screen time reported by the user, and can be missing or wrongly recorded for various reasons.

A) Temporarily ignore the missing values. Plot a histogram of the perceived screen time.

a) Does it have outliers? If so, how many? An outlier is defined as any data point that lies more than 1.5 IQRs below the first quartile (Q1) or above the third quartile (Q3) in a data set:

    High = Q3 + 1.5 IQR
    Low  = Q1 - 1.5 IQR

b) Is it skewed? If so, is it left skewed or right skewed? What's the skewness?

B) Let's define an intense phone user as someone whose average screen time is at least one standard deviation larger than the mean screen usage time. How many participants are intense phone users? (Note: do not remove outliers.) (Note: think about which data you should use, actual_average_screen_time or perceived_average_screen_time.)

C) Create two binomial distributions: A) missing perceived_average_screen_time and B) intense phone users. In other words, generate a boolean array for each of A) and B). Perform a chi-square test on these two distributions. What is the p-value? Do you think they are correlated? What does this mean? Do you think this feature is MAR or MNAR? (Note and hint: if a user's actual screen time is missing, you should not count that user as either intensive or non-intensive; filter those users out.)
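
A sketch of the test (the perceived column is named self_reported_average_screen_time in the individual-data table above; users with missing actual screen time are filtered out first, per the hint):

    import pandas as pd
    import scipy.stats as stats

    df = individual_df.dropna(subset=["actual_average_screen_time"])

    missing_perceived = df["self_reported_average_screen_time"].isna()
    actual = df["actual_average_screen_time"]
    intense = actual >= actual.mean() + actual.std()

    contingency = pd.crosstab(missing_perceived, intense)
    chi2, p, dof, expected = stats.chi2_contingency(contingency)
    print("p-value:", p)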

Case 3: Location (15 points)

Now let's look at the stream of location data. Each participant has two sets of location data, location:raw_latitude and location:raw_longitude. They are very similar, so for simplicity we will only look at location:raw_latitude.

There are two thoughts on why location data might be missing: 1) app recording errors/technical errors, which are random, or 2) some people turn off location services to save battery when the battery level is low. We assume 2) is a consistent behavior. Let's say people configure their phones to turn off location services when the battery is below 20%.

A) (8 points) Identify the people who behave as in 2) and report their uuids. Their characteristic should be that their location data are consistently lost when their battery level is below 20%. We need to find these people because, if this were a real-world analysis, their location traces would add noise to downstream classification.

Explain how you found these people in the notebook. For each of these individuals, how many minutes of location data have we lost due to the turning-off of location services?

Note: this is a half open-ended question. There's no standard solution; your solution can be loose, but you need to explain it.

Now, we will only proceed with people with uninterrupted location traces. Find subject F50235E0-DD67-4F2A-B00B-1F31ADA998B9. We'll start to do some filling for time series here.

We introduce the following naive methods:

● Forward filling: assume the missing value occurs at t_n; fill in the value using t_(n-1).
● Backward filling: assume the missing value occurs at t_n; fill in the value using t_(n+1).
● Linear interpolation: for each chunk of missing values, perform linear interpolation between the value of one sample before the missing chunk and one value after the missing chunk. (If you want to implement more sophisticated linear interpolation, note it in your write-up.)

B) (7 points) Perform forward filling, backward filling, and linear interpolation on the location:raw_latitude trace. Produce a figure with 4 traces overlaid on each other (the original, unfilled trace and the traces produced by your three filling methods). Compare the 4 traces. What do you see? If you were to use this dataset for further analysis, which filling method would you choose?
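
A sketch of the three fills on that subject's latitude trace (that sensor_data[uuid] can be wrapped as a DataFrame carrying the feature names from the sensor table is an assumption; verify the structure on your copy):

    import pandas as pd
    import matplotlib.pyplot as plt

    uuid = "F50235E0-DD67-4F2A-B00B-1F31ADA998B9"
    lat = pd.DataFrame(sensor_data[uuid])["location:raw_latitude"]

    traces = {
        "original": lat,
        "forward fill": lat.ffill(),
        "backward fill": lat.bfill(),
        "linear interpolation": lat.interpolate(method="linear"),
    }
    for name, trace in traces.items():
        plt.plot(trace.index, trace.values, label=name, alpha=0.6)
    plt.xlabel("minute"); plt.ylabel("raw latitude"); plt.legend()
    plt.show()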


Turning it in:

● Your write-up, including answers to the above questions and figures to help support your answers. One Jupyter Notebook file (q2.ipynb) that includes the write-up, code, and figures is required.

Question 3: Outlier Detection (10 points)

Your task is to analyze the data and detect the outliers in the dataset. For this activity, use the dataset given below. You do not need to use Spark for this question; you can use NumPy, Pandas, etc. Please wrap your code and output graphs in one Jupyter Notebook.

Input dataset:

● Top 270 Computer Science / Programming Books: high-rated book information in the field of computer science and programming
● Download link

Task 1: Univariate Outlier detection (4 points)

Box plots are efficient at identifying outliers in numerical values. For this dataset, use the IQR method to identify any univariate outliers. Plot the outliers in the context of the feature's distribution for all four numerical features (Rating, Reviews, Number of Pages, Price). Your output should be the box plots of each feature.
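
A sketch of the box plots (the CSV file name and exact column headers are guesses; match them to the downloaded file):

    import pandas as pd
    import matplotlib.pyplot as plt

    books = pd.read_csv("books.csv")  # hypothetical file name
    numeric_cols = ["Rating", "Reviews", "Number_Of_Pages", "Price"]

    fig, axes = plt.subplots(1, 4, figsize=(16, 4))
    for ax, col in zip(axes, numeric_cols):
        books.boxplot(column=col, ax=ax)  # whiskers at 1.5 * IQR by default
    plt.show()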

Task 2: Multivariate Outlier detection (6 points)

Let's use the DBSCAN method for multivariate outlier detection. For this exercise, use the numerical and categorical columns to fit DBSCAN.

Your first task is to perform a bivariate analysis on all possible pairs of the above features and identify any outliers. The output should be all the DBSCAN plots and the outlier data rows (index and value) for each combination.

Your second task is to look at all combinations of three variables in the above dataset to identify multivariate outliers. The output should again be all the DBSCAN plots (3D) and the outlier data rows for all combinations. A sketch of the bivariate pass appears after the hints below.


Use the following hints for answering this question:

● For categorical variables ("Types"), you can convert them into numerical features for the analysis.
● Try scaling the data first.
● Tune eps and min_samples.
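
A sketch of the bivariate DBSCAN pass (column names, file name, and the eps/min_samples values are assumptions to tune; points labeled -1 by DBSCAN are the outliers):

    import itertools
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler

    books = pd.read_csv("books.csv")  # hypothetical file name
    # Encode the categorical "Types" column as integer codes.
    books["Types"] = books["Types"].astype("category").cat.codes
    features = ["Rating", "Reviews", "Number_Of_Pages", "Price", "Types"]

    for x, y in itertools.combinations(features, 2):
        X = StandardScaler().fit_transform(books[[x, y]])
        labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
        outlier_rows = books.index[labels == -1]
        print(f"{x} vs {y}: outlier rows {list(outlier_rows)}")
        plt.scatter(books[x], books[y], c=(labels == -1), cmap="coolwarm", s=10)
        plt.xlabel(x); plt.ylabel(y); plt.show()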

Turning in

1) A Jupyter Notebook (q3.ipynb) with all code, output, and graphs.
2) Please export your q3.ipynb to a q3.py and upload this q3.py as well; we need it to run a plagiarism check on the code. (Do not zip these two files.)

Question 4: Data Visualization (10 points)

Your task is to design a visualization that you believe effectively communicates the data, and to provide a short write-up describing your design. You can use any data visualization tool you want.

Start with the dataset weather.csv, which contains weather measurements nearest to Cornell Tech for every day from 1950 to the present.

Notice that the temperature (Ktemp) is in Kelvin rather than in Fahrenheit. Define a variable that expresses the temperature in Fahrenheit, using the following formula:

    Ftemp = (Ktemp - 273.15) * (9/5) + 32
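
For example, with pandas:

    import pandas as pd

    weather = pd.read_csv("weather.csv")
    weather["Ftemp"] = (weather["Ktemp"] - 273.15) * (9 / 5) + 32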

Part A) For every month of the year, plot the average temperature (in Fahrenheit) using a scatter plot or line plot. Your visual should be configurable so that a user can easily look at a specific year's weather curve (use a sliding-scale filter). (6 points)

Part B) Based on all of the historical data, what is the first day on which the historical average temperature passes 60 degrees (when will Cornell Tech finally be warm again)? (2 points)

Part C) Create a new sheet where you do something creative through data visualization. Express something about the temperature over time using this dataset (e.g. look at the cycle of temperature over the seasons), or add a new dataset (any dataset available online is fine) and find some correlation with the temperature (e.g. the number of some kind of fish in the ocean: does it go up and down following the temperature trend?). (2 points)

You will be graded based on the quality of completion. A simple plot of temperature in Celsius will not get you full points.


Note: Your visualization should be interpretable and accompanied by a short write-up (you do not need to write more than a sentence or two). Do not forget to include titles, axis labels, or legends as needed!

Turning it in:

1. Export everything to one PDF file, q4.pdf, and submit it on Gradescope. If there's any dynamic visualization, please include the URL to the visualization in the PDF. (Consider graders as newspaper readers; do not expect graders to have all the tools installed. For example, if using Tableau, do not submit the .twb file, but publish the visualization to Tableau Public and include a URL.)

