Date: 2018-09-12 03:15


Fall 2018

AD699: Data Mining for Business Analytics

Topics: Data Exploration and Visualization

A note about submissions: Unlike Olympic figure skating and ski jumping, AD699 does not award style points. There are some fancy tools within R for generating reports, such as RMarkdown, but learning them is not within the scope of this course. The most important thing here is to answer the questions that ask for written answers, and to show screenshots where screenshots are asked for.

This assignment is due by 11:59 p.m. on Monday, September 17th.

Step 1:

Download this file from our course Blackboard site:

a) athlete_events.csv

Part I: Data Exploration

1. Bring this file into your R environment. Assign the name athletes to this file. Show the code that you used to do this. (Remember to first set your working directory to the folder that contains your files).

2. How many rows and how many columns does athletes contain? How do you know this?

3. Are there any missing values in the athletes data set? If so, how do you know this? (Note: There are MANY ways that you could answer this question, and any valid way is completely fine).

4. Remove all rows in the athletes data set that contain any missing values, and store the results of this operation in a new variable called athletes2. What are the dimensions of athletes2?

5. Based on the data in athletes2, what is the mean age of an Olympic medalist? What is the median age? Show the code that you used to find this out, along with a screenshot of your results.

6. How many Olympic medalists were male, and how many were female? (Hint: Use the table function to help you with this). Show the code that you used to find this out, along with a screenshot of your results.

7. How old was the youngest Olympic medal winner in the dataset? How old was the oldest Olympic medal winner in the dataset? Show the code that you used to find this out, along with a screenshot of your results.
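The base R functions that Part I points toward can be tried out on a small toy data frame before touching the real file. The sketch below is only an illustration: the data frame toy, its values, and the column names Sex, Age, and Medal are assumptions made for the example, not the contents of athlete_events.csv.

```r
# Toy stand-in for the course file; every value here is invented.
# On the real file the import step would look something like:
#   setwd("path/to/your/folder")            # hypothetical path
#   athletes <- read.csv("athlete_events.csv")
toy <- data.frame(
  Sex   = c("M", "F", "M", "F", "M"),
  Age   = c(24, 31, NA, 19, 28),
  Medal = c("Gold", NA, "Silver", "Bronze", "Gold"),
  stringsAsFactors = FALSE
)

dim(toy)              # number of rows, then number of columns
anyNA(toy)            # TRUE if any value is missing
colSums(is.na(toy))   # count of missing values per column

# Drop every row that has at least one missing value.
toy2 <- na.omit(toy)
dim(toy2)

mean(toy2$Age)        # mean age of the remaining rows
median(toy2$Age)      # median age
table(toy2$Sex)       # counts per category
range(toy2$Age)       # youngest and oldest
```

The same calls should carry over to athletes and athletes2 once the real file is loaded, provided the column names match.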

Part II: Data Visualization

1. Filter the dataset so that it only contains information for your particular Olympiad. Student Olympiad assignments can be found in Blackboard, in the same folder that contains this assignment prompt. Assign a new variable name to this dataset that only contains your Olympiad. Show the code that you used to accomplish this, along with a screenshot of your results.

2. Using ggplot, create a histogram that depicts the distribution of medal winners for your Olympiad by age. Show the code that you used to accomplish this, along with a screenshot of your results.

3. Now, modify your histogram by specifying a bin width or number of bins of your choosing (i.e. not the default). Specify a color for the bins in your histogram, and specify another color to use for the borders of the bins. Give your histogram a descriptive title. Show the code that you used to accomplish this, along with a screenshot of your results.

4. Imagine that your boss is a smart person, but has no idea what a histogram is -- how would you explain this plot to your boss? Write a one or two sentence description of what your histogram shows you.

5. Which six NOCs received the greatest numbers of medals? Show the code that you used to find this out, along with a screenshot of your results. Create a filtered dataset that only contains medalists from these six NOCs. Show the code that you used to accomplish this, along with a screenshot of your results.

6. Using ggplot, create a scatterplot that depicts the heights (on the x-axis) and the weights (on the y-axis) of the athletes from the six NOCs with the most medals. Give your plot a descriptive title. Show the code that you used to accomplish this, along with a screenshot of your results. Write a one or two sentence description of what this scatterplot shows you (again, explain it to your boss).

7. Now, add to the scatterplot that you just created by including a categorical variable (gender). Show the code that you used to accomplish this, along with a screenshot of your results. Write a one or two sentence description of what this scatterplot shows you (again, explain it to your boss).

8. Include yet another categorical variable on your scatterplot -- NOC. Use shape to represent NOC. Show the code that you used to accomplish this, along with a screenshot of your results. Write one or two sentences about something that this plot tells you (you don’t need to summarize the entire plot for this -- you can just pick a couple of data points and describe them here).

9. Again using ggplot, create a barplot that compares the total number of bronze, silver, and gold medals among the top six NOCs. What do you notice about these totals? If every Olympic competition generates one gold, one silver, and one bronze, why might your bars be different heights? (Hint: think about how you created this subset of the original dataset).
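The ggplot2 layers that Part II calls for can be rehearsed on toy data first. The sketch below is an illustration under stated assumptions: the data frame medalists, all of its values, the column names (NOC, Sex, Age, Height, Weight, Medal, Games), the Olympiad label "2000 Summer", and the choice of the top two NOCs instead of six are made up to keep the example small. It assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Toy stand-in for the medalist data; every value here is invented.
medalists <- data.frame(
  NOC    = c("USA", "USA", "USA", "GER", "GER", "FRA"),
  Sex    = c("M", "F", "M", "F", "M", "F"),
  Age    = c(22, 27, 31, 24, 29, 35),
  Height = c(185, 170, 190, 168, 182, 165),
  Weight = c(80, 60, 92, 58, 78, 55),
  Medal  = c("Gold", "Silver", "Gold", "Bronze", "Silver", "Gold"),
  Games  = c("2000 Summer", "2000 Summer", "1996 Summer",
             "2000 Summer", "2000 Summer", "2000 Summer"),
  stringsAsFactors = FALSE
)

# Keep one Olympiad only (the label is a hypothetical example).
my_games <- subset(medalists, Games == "2000 Summer")

# Histogram with a chosen bin width, fill color, border color, and title.
p_hist <- ggplot(my_games, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black") +
  ggtitle("Toy data: ages of medalists")

# Rank NOCs by medal count and keep the top n_top (6 in the assignment).
n_top    <- 2
counts   <- sort(table(medalists$NOC), decreasing = TRUE)
top_nocs <- names(head(counts, n_top))
top      <- medalists[medalists$NOC %in% top_nocs, ]

# Scatterplot: color encodes one categorical variable, shape another.
p_scatter <- ggplot(top, aes(x = Height, y = Weight,
                             color = Sex, shape = NOC)) +
  geom_point() +
  ggtitle("Toy data: height vs. weight for the top NOCs")

# Bar plot of medal counts, one bar per medal type within each NOC.
p_bar <- ggplot(top, aes(x = NOC, fill = Medal)) +
  geom_bar(position = "dodge")

# print(p_hist); print(p_scatter); print(p_bar)   # draw the plots
```

Building each plot into a variable and printing it separately makes it easier to add one layer at a time, as questions 6 through 8 ask.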

