Fall 2018
AD699: Data Mining for Business Analytics
Topics: Data Exploration and Visualization
A note about submissions: Unlike Olympic figure skating and ski jumping, AD699 does not
award style points. There are some fancy tools within R for generating reports, such as
RMarkdown, but learning them is not within the scope of this course. The most important thing
here is to answer the questions that ask for written answers, and to show screenshots where
screenshots are asked for.
This assignment is due by 11:59 p.m. on Monday, September 17th.
Step 1:
Download this file from our course Blackboard site:
a) athlete_events.csv
Part I: Data Exploration
1. Bring this file into your R environment. Assign the name athletes to this file. Show the
code that you used to do this. (Remember to first set your working directory to the folder
that contains your files).
2. How many rows and how many columns does athletes contain? How do you know this?
3. Are there any missing values in the athletes data set? If so, how do you know this?
(Note: There are MANY ways that you could answer this question, and any valid way is
completely fine).
4. Remove all rows in the athletes data set that contain any missing values, and store the
results of this operation in a new variable called athletes2. What are the dimensions of
athletes2?
5. Based on the data in athletes2, what is the mean age of an Olympic medalist? What is
the median age? Show the code that you used to find this out, along with a screenshot
of your results.
6. How many Olympic medalists were male, and how many were female? (Hint: Use the
table function to help you with this). Show the code that you used to find this out, along
with a screenshot of your results.
7. How old was the youngest Olympic medal winner in the dataset? How old was the oldest
Olympic medal winner in the dataset? Show the code that you used to find this out,
along with a screenshot of your results.
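For reference, the kinds of base R calls that Part I relies on can be sketched on a small made-up data frame. This is only an illustration, not a required approach: the toy values are invented, and the column names (Sex, Age, Medal) are assumed to match those in athlete_events.csv. With the real file, the first step would be something like read.csv("athlete_events.csv") after setting the working directory with setwd().

```r
# Toy stand-in for athlete_events.csv; every value here is made up.
# Column names are an assumption about the real file.
athletes <- data.frame(
  Name  = c("A", "B", "C", "D", "E"),
  Sex   = c("M", "F", "F", "M", "F"),
  Age   = c(24, 31, NA, 19, 27),
  Medal = c("Gold", NA, "Silver", "Bronze", "Gold"),
  stringsAsFactors = FALSE
)

dim(athletes)             # number of rows, then number of columns
anyNA(athletes)           # TRUE if at least one value is missing
colSums(is.na(athletes))  # count of missing values in each column

# Keep only the rows with no missing values in any column.
athletes2 <- na.omit(athletes)
dim(athletes2)

mean(athletes2$Age)       # mean age
median(athletes2$Age)     # median age
table(athletes2$Sex)      # counts per category
range(athletes2$Age)      # minimum and maximum age
```

Note that a row with a missing Medal value is dropped by na.omit() along with rows missing any other field, which is why every row left in athletes2 belongs to a medalist.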
Part II: Data Visualization
1. Filter the dataset so that it only contains information for your particular Olympiad.
Student Olympiad assignments can be found in Blackboard, in the same folder
that contains this assignment prompt. Assign a new variable name to this
dataset that only contains your Olympiad. Show the code that you used to accomplish
this, along with a screenshot of your results.
2. Using ggplot, create a histogram that depicts the distribution of medal winners for
your Olympiad by age. Show the code that you used to accomplish this, along
with a screenshot of your results.
3. Now, modify your histogram by specifying a number of bins (or a binwidth) that you
chose (i.e., not the default). Specify a color for the bins in your histogram, and
specify another color to use for the borders of the bins. Give your histogram a
descriptive title. Show the code that you used to accomplish this, along with a
screenshot of your results.
4. Imagine that your boss is a smart person, but has no idea what a histogram is --
how would you explain this plot to your boss? Write a one or two sentence
description of what your histogram shows you.
5. Which six NOCs received the greatest numbers of medals? Show the code that
you used to find this out, along with a screenshot of your results. Create a
filtered dataset that only contains medalists from these six NOCs. Show the code
that you used to accomplish this, along with a screenshot of your results.
6. Using ggplot, create a scatterplot that depicts the heights (on the x-axis) and the
weights (on the y-axis) of the athletes from the six NOCs with the most medals.
Give your plot a descriptive title. Show the code that you used to accomplish
this, along with a screenshot of your results. Write a one or two sentence
description of what this scatterplot shows you (again, explain it to your boss).
7. Now, add to the scatterplot that you just created by including a categorical
variable (gender). Show the code that you used to accomplish this, along with a
screenshot of your results. Write a one or two sentence description of what this
scatterplot shows you (again, explain it to your boss).
8. Include yet another categorical variable on your scatterplot -- NOC. Use shape
to represent NOC. Show the code that you used to accomplish this, along with a
screenshot of your results. Write one or two sentences about something that this
plot tells you (you don’t need to summarize the entire plot for this -- you can just
pick a couple of data points and describe them here).
9. Again using ggplot, create a barplot that compares the total number of bronze,
silver, and gold medals among the top six NOCs. What do you notice about
these totals? If every Olympic competition generates one gold, one silver, and
one bronze, why might your bars be different heights? (Hint: think about how you
created this subset of the original dataset).
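As a rough guide to the ggplot2 calls that Part II draws on, here is a minimal sketch on invented data. It assumes the ggplot2 package is installed and that the real file uses column names like Games, NOC, Sex, Age, Height, Weight, and Medal. The toy data frame, the "2016 Summer" label, and the cutoff of two NOCs (rather than six) are all hypothetical choices made to keep the example short.

```r
library(ggplot2)

# Made-up medalist records; column names are an assumption about the real file.
medalists <- data.frame(
  Games  = c(rep("2016 Summer", 6), rep("2012 Summer", 2)),
  NOC    = c("USA", "USA", "GER", "GER", "CHN", "CHN", "USA", "GER"),
  Sex    = c("M", "F", "M", "F", "M", "F", "F", "M"),
  Age    = c(22, 25, 30, 27, 21, 24, 28, 33),
  Height = c(185, 170, 190, 168, 178, 160, 172, 188),
  Weight = c(80, 60, 92, 58, 70, 50, 63, 85),
  Medal  = c("Gold", "Silver", "Bronze", "Gold",
             "Gold", "Silver", "Bronze", "Silver"),
  stringsAsFactors = FALSE
)

# Keep one Olympiad only (the label is a placeholder).
my_games <- medalists[medalists$Games == "2016 Summer", ]

# Histogram with a chosen binwidth, fill color, border color, and title.
p_hist <- ggplot(my_games, aes(x = Age)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  ggtitle("Ages of medalists (toy data)")

# Rank NOCs by medal count and keep the top n (2 here; 6 in the assignment).
noc_counts <- sort(table(medalists$NOC), decreasing = TRUE)
top_nocs <- names(head(noc_counts, 2))
top <- medalists[medalists$NOC %in% top_nocs, ]

# Scatterplot with two categorical variables mapped to color and shape.
p_scatter <- ggplot(top, aes(x = Height, y = Weight, color = Sex, shape = NOC)) +
  geom_point() +
  ggtitle("Height vs. weight of medalists (toy data)")

# Bar plot of medal counts, one bar per medal type within each NOC.
p_bar <- ggplot(top, aes(x = NOC, fill = Medal)) +
  geom_bar(position = "dodge")
```

Typing the name of a plot object (for example, p_hist) at the console draws it. In geom_histogram(), bins = sets the number of bins while binwidth = sets the width of each bin; only one of the two is needed.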