Date: 2021-02-27 10:55

Exploratory Report

Due: Friday by 10pm. Points: 100. Submitting: a text entry box.

For this part of the final project, you will be creating a data report exploring a pair of data sets of your

choosing. Your report will introduce this data, provide some preliminary/summary information, and

then answer some data science questions about those data sets.

Creating this report will require you to apply all of the data analysis and programming skills you've

learned in the course so far. In addition, this project will also require you to use git as a

collaboration tool: you'll write the report together, on different computers using git to share code.

This is a group assignment! You will be working closely with your project group to complete it!

You will also be required to fill out a peer evaluation to

provide us with some data about how well your team collaborated.

Objectives

By completing this assignment you will practice and master the following skills:

Synthesizing skills, tools, and concepts from across the course

Performing a complete data analysis project, from data gathering to presentation

Working with a team of programmers to analyze data

Using git to manage code that is being modified by several people at the same time

Using git branches to organize code development and publish reports as web pages

Setup

For the project you will be working in the same repo you created for the Project Proposal. You will be

generating new files (script and R Markdown) and putting them in this repo.

Make sure that everyone in your group has Admin access

(https://help.github.com/articles/inviting-collaborators-to-a-personal-repository/) to this

repository, and that they have all cloned a copy to their local machines so that you can code

collaboratively.

Collaborating on the Report

One of the goals of the project is to practice collaborative coding. Each person will need to

contribute code to the solution. This means that everyone will need to write some of the code that

is utilized in the report (we will be checking commit history to ensure that everyone has contributed).
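
To preview what the commit history says about contributions before you submit, you can count commits per author. This is a minimal sketch, not necessarily the check the course staff will run; the function name `commits_per_author` is made up for illustration.

```shell
# Hypothetical helper: count commits per author name across all branches.
# Usage: commits_per_author [path-to-repo]   (defaults to the current directory)
commits_per_author() {
  git -C "${1:-.}" log --all --format='%an' | sort | uniq -c | sort -rn
}
```

Running `commits_per_author` from inside your project repository prints one line per author name with the number of commits recorded under that name. Commit counts are only a rough signal, since a teammate who commits under two different names will show up twice.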

iSchool Canvas Support

2021/2/25 Exploratory Report

https://canvas.uw.edu/courses/1434910/assignments/5937467 2/8

The report you will produce has multiple sections; some may be easy to "divide up" and let each

person take the lead, but others will require you to work together. This is not intended to be a project

with individual parts, but a collaborative team effort and will be graded as such.

We highly encourage pair-programming (https://en.wikipedia.org/wiki/Pair_programming) (two

people working at one computer). Just make sure you switch off who is the "driver", and that both

partners commit changes to the code.

You are also required to fill out a peer evaluation (https://forms.gle/Xy1D8knKGWezg8s87) to help

us evaluate group collaboration. Scores on this assignment will be adjusted to reflect unequal

contributions (if any), as discussed in lecture.

The project will be graded for the entire group; thus the entire group is responsible for all of the

content in the report.

Using Feature Branches

All of your development work on this report must be done using feature branches. This means that

for each feature (i.e., section) of the report, you will need to check out a new branch, develop your

code on that branch, and then merge those changes back into main . This is intended to help you

learn to work with branches, as well as to help organize the work.

Note that you can push and pull feature branches to GitHub (with git push origin branch_name ) in

order to share them, and merge commits between feature branches. Thus if two people want to

"combine" their code, they could merge their branches together, and then merge that work back into

the main branch.
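
The feature-branch cycle described above can be sketched as a short shell session. This is an illustration under stated assumptions, not a required workflow: it runs in a throwaway directory, a local bare repository stands in for GitHub, and the branch name `data-description`, the commit messages, and the identity are all placeholders.

```shell
set -e

# Throwaway sandbox: a bare repository stands in for the GitHub remote ("origin").
sandbox=$(mktemp -d)
git init -q --bare "$sandbox/origin.git"
git init -q "$sandbox/work"
cd "$sandbox/work"
git symbolic-ref HEAD refs/heads/main        # name the first branch "main"
git config user.name "Example Student"       # placeholder identity
git config user.email "student@example.com"
git remote add origin "$sandbox/origin.git"

# Initial commit on main.
echo "# Exploratory Report" > index.Rmd
git add index.Rmd
git commit -q -m "Add report skeleton"
git push -q origin main

# 1. Check out a feature branch for one section of the report.
git checkout -q -b data-description

# 2. Develop on that branch, then share it on the remote.
echo "## Data Description" >> index.Rmd
git commit -q -a -m "Draft data description section"
git push -q origin data-description

# 3. Merge the feature back into main. --no-ff records a merge commit
#    even when a fast-forward would be possible.
git checkout -q main
git merge -q --no-ff -m "Merge branch 'data-description'" data-description
git push -q origin main

# Show the resulting history, including the merge commit.
git log --oneline --graph
```

In your real repository you would skip the sandbox setup and start from step 1. The `--no-ff` flag is a design choice rather than a requirement: it keeps the merge visible in the history even when git could have fast-forwarded.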

Pro tip: If there is a merge conflict in the .html file generated by R Markdown, the easiest way to

resolve it is to fix any errors in the .Rmd file and then simply "re-knit" the output. This will create a

new .html file that will overwrite the previous one... thereby giving you a correctly "updated" file

that you can mark as resolving the conflict.

We will be looking for evidence that you successfully used such branching (e.g., commit messages

reflecting the branch merges, or branches that have been pushed to GitHub).
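
To see for yourself what that evidence looks like, you can list the merge commits and the branches that have been pushed. A minimal sketch, assuming a local clone; the function name `branch_evidence` is hypothetical.

```shell
# Hypothetical helper: list the traces that feature branches leave behind.
# Usage: branch_evidence [path-to-repo]   (defaults to the current directory)
branch_evidence() {
  repo="${1:-.}"
  echo "Merge commits:"
  git -C "$repo" log --all --merges --oneline
  echo "Branches pushed to the remote:"
  # Remote-tracking branches as of the last fetch or push from this clone.
  git -C "$repo" branch -r
}
```

Note that a fast-forward merge does not create a merge commit, so it will not appear in the first list; the pushed branch itself would then be the only trace.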

Report Structure

For this part of the project, you will be collaboratively producing a report that presents an exploratory

analysis of your chosen data sets.

Because this project is open-ended (and it's the final project), we will not necessarily give

explicit instructions about how to complete it, or even the order in which you should do the

work. Instead we will describe the overall requirements for your report, and it will be up to you

to determine how to meet those specifications.

Your report will be a web page created using R Markdown and knitr . Your report must be written in

a file called index.Rmd located in the root of your project repository (you can create this file using

the RStudio wizard).

Your report must include an appropriate title, as well as your names as the author (including your

group number, found on Canvas under "People > Groups"—search for your name to find your

group number).

Your data wrangling ( R code) must be written in separate .R script files, and you must use

source() to load those scripts into the R Markdown. You are welcome to use multiple script files

(e.g., one per report section or one per data set) to better organize your code. However, be sure that

you don't duplicate code between script files.

In particular, don't have multiple scripts read/load the same data file—you should only call

read.csv() once per data set for your entire knitted document. One way to do this is to load a

data file in the setup chunk, and then pass that data frame as an argument to functions defined in

the .R scripts.

Although it may be tempting, do not try to do "one .R file per person". Your code should be

organized based on the data wrangling that occurs, not based on who the author is! You will need

to collaborate on the programming, with multiple people editing the same code files.

Your report will be divided into distinct sections. Each section will need to include an appropriate second-level

heading. Additionally, make sure each section/subsection has at least a sentence of text

introducing it—don't just launch into data tables or graphics!

Note that each section/subsection can and should contain a code chunk used specifically by that

section, but any "shared calculations" (such as library() and source() calls) should be done in

the report's initial setup chunk.

The sections your report needs to include are described below.

Section 1. Problem Domain Description

The first section of your report will include a short text introduction to the problem domain that your

data sets are related to. After reading this section, we should understand the "what" and "why" of your

data set. You will explain the context and justification for your analysis. What domain knowledge is

needed to understand your report?

You can (and should!) copy this directly from your project proposal; you don't necessarily need

to do any "new" work for this section.

This section should be a paragraph or two in length (around 150-250 words, depending on the

complexity of the domain). Note that this section doesn't need to contain any R code, though you will

need to include Markdown formatting when appropriate. Be sure to include hyperlinks to relevant

resources.

Section 2. Data Description

This next section of your report will provide descriptions and examples of your chosen data sets. You

are explaining what data you have found that will be able to answer your data science questions,

below. This section will need to include the following content (in whatever order you choose):

It's fine to use a different subsection for each data set if that makes it easier to read or write.

1. A non-technical description of the data sets you will be using (what is the data?) This only needs

to be a sentence or two.

2. An explanation of where the data comes from, who originally collected the data, and any other

information we may need to know about how this data set came to be. Make sure you include

hyperlinks to the source—we must be able to follow the links and find your data sets ourselves.

Again, this only needs to be a couple sentences, and you can borrow from your project proposal.

3. A sample of the data set, so that we can see what raw data you'll be working with ("the data set

looks like this"). This means that you'll need to load the data set into R (e.g., with read.csv() )

and present it as a table (or multiple tables) in your report. Do not include the entire table; just

a few rows is sufficient. Think about the "user experience" of reading the report!

You don't need to include all columns of your data frames; only including the most

important/relevant ones is acceptable.

You are not required to do substantive data cleaning (or even rename columns), though it

wouldn't hurt to do some of that wrangling now instead of later.

Remember to do your data wrangling in the .R script file, not in the R Markdown file!

You will need to provide samples of both of your two data sets (so at least 2 tables!)

4. A written explanation of the sample data's structure. In particular, explain what each of the

features (columns) represents. You don't need to explain columns like "year" or "country", but any

columns that don't make sense on their own need to be explained. Be sure to note units for

different values (e.g., if a value is in dollars or kg).

A Markdown list is a nice format for providing this information.

Overall, this section will explain the data that you will be analyzing—while also ensuring that you can

both access and understand that data!

Subsection 2.2 Summary Analysis

As part of your data description, your report will also provide an overview summary analysis of the

two data sets. This will provide a more quantitative description of the data, helping the reader to

understand the range and distribution of the data and providing a baseline for your analysis.

This will be a subsection (with a third-level heading) of your data description. Your summary analysis

needs to include the following for each data set:

1. Provide summary descriptive statistics and central tendency measures (e.g., mean() , max() , etc.)

for all the major features of interest from each of your data sets. The built-in summary() function

can be helpful for this, as can other summarization functions.

Note that you are not required to include descriptive statistics for every feature/column, only

for the ones that will be relevant or interesting for your analysis.

You can present these statistics in the form of a table, but you'll need a sentence or two to

explain them. You can also use a list or inline expressions to provide this information.

This is a good point to determine and explain how you will handle missing or NA

values.

2. Include at least 3 graphics or plots illustrating the distribution or trends of the data. For example,

you could use a histogram, box plot, or violin plot to show how the values in your data are spread

out. You could also use a line or bar chart to show changes in values over time.

Each graphic can visualize a single variable (similar to the examples in Ch 15.2.1

(https://learning.oreilly.com/library/view/programming-skills-for/9780135159071/ch15.xhtml#ch15_2_1)

), or you can include multiple measures in a single

chart. Smooth geometry (https://ggplot2.tidyverse.org/reference/geom_smooth.html) can

also be good for showing trends among messy data.

Again, you don't need to illustrate the distribution of all of the data, only the most important

features. The goal is to get a general sense of the data's trends to then inform your more

specific analysis.

ALL graphics in your report should be well-designed. This means they have good use

of aesthetic mapping, labeling, etc.

3. Identify any significant outliers in your data sets. These are values that are particularly high or low

or missing. You'll want to note the outliers because of how they may skew your analysis—a single

very high value may throw off averages, and missing data can hinder the analysis.

You'll need to use prose to note the outliers (using inline R expressions to dynamically report

the values of course). If there are no significant outliers, you should mention that—and explain

why this might be the case!

Overall, this section should be a couple (2-4) paragraphs in length, in addition to the required

graphics.

Section 3: Specific Question Analyses

This last section will provide answers to specific data science questions (such as the ones you posed

in your Project Proposal). You will perform the data wrangling/analysis needed, explain your process,

and write up the results (answers).

You will need at least one good question per group member. Try to limit your analysis to no more

than two questions per member—anything more than that either means your project scope will be

too big or that your questions are not interesting enough.

At least 40% of your questions need to involve both data sets, such as comparing features from

each.

Each question will be addressed in its own subsection (with a third-level heading). This will help keep

your analysis organized, though you are welcome to refer to other subsections as appropriate. Each

subsection will need to provide the following:

1. Introduce and explain the question if necessary. For example, you may need to define

terminology you use, or otherwise scope the question. If the question is "are Uber rates greater

than taxi rates", then you might explain what you mean by "rate" (given variable pricing structures)

or identify which type of rides you'll be considering. This might only be a couple of sentences.

2. An explanation of your data analysis method. This can be a brief written description of your data

wrangling steps (e.g., "I took the average Uber rate and compared it to the average taxi rate for

each hour of the day"). You can even output specific code if it helps explain your process! The

goal is to enable the reader to know how to repeat your analysis so they can check your work.

This doesn't need to be lengthy; a sentence or two may be enough (though complex analysis may

require a whole paragraph).

The vast majority of your programming and data wrangling work will happen at this step!

3. Provide the results of your analysis. Results will need to be both quantitative (e.g., numbers and

tables) and graphical.

The quantitative results will be the relevant data that your wrangling produces—for example,

you might include a data table of the average Uber and taxi rates that you found. We don't

need the entire "raw" data frame, but there should be a table of values you produced and

considered.

Calculated values such as correlations are also considered quantitative results.

The graphical results will be at least one plot illustrating the results of your data wrangling. For

most questions these plots should be straightforward to design—in the Uber example, it might

be a line chart comparing average rates over time, or a scatterplot showing the distribution of

driver pay rates.

Remember that at least 40% of your questions need to involve both data sets. That

means that they'll almost certainly require joining the data sets together for your

analysis.

4. Finally, you must include an evaluation of your results. An evaluation is an interpretation of your

results—the information, not just the data. Your evaluation should draw a conclusion from the

results—it is the answer to your question! As (mostly) made-up examples: "Uber rates are

cheaper than taxis only in urban areas" or "Uber does not pay drivers a livable wage".

Note that your evaluation cannot rely purely on visual or anecdotal analysis (no "the line goes

up!" or "the measure for one state looks large!") Be sure to use descriptive statistics (e.g.,

mean/median) or measures of effect strength (e.g., correlations or predictive statistics) to

definitively state relationships among your data. You do not need to perform advanced

statistical analysis—this is not a stats class!—but your conclusions need to be grounded in the

data, not in the representation.

It's quite likely that the results may not provide the answer you expected, and that's okay! In

your evaluation, you can mention that, and offer a guess as to why your assumptions didn't

hold up.

Overall, we are looking for evidence of critical thinking—that you have thought about the

results of your data wrangling in addition to the programming. Considering how the answer to

your question would influence action is a good way to push yourself to think critically.

Overall, this part should present specific insights, using the data (results) itself as evidence

of your claims.

Publishing Your Report

As partially described in the course text (https://learning.oreilly.com/library/view/programming-skills-for/9780135159071/ch18.xhtml#sec18_4)

(and further detailed in class), you should publish your

knitted report online by deploying it to GitHub Pages (https://help.github.com/articles/user-organization-and-project-pages/)

(as a Project Pages site).

You must publish to GitHub pages by pushing to the gh-pages branch on GitHub (Do not use one of

the other publishing methods). The cleanest way to do this is to create a new gh-pages branch

immediately off of your main branch once you're ready to publish. Be sure to push this new gh-pages

branch to the gh-pages branch on GitHub.
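
The first publish can be sketched as follows. This is an illustration only: it runs in a throwaway directory, a local bare repository stands in for GitHub, and the file contents, commit message, and identity are placeholders. In your real repository only the last three `git` steps apply.

```shell
set -e

# Throwaway sandbox: a bare repository stands in for GitHub ("origin").
sandbox=$(mktemp -d)
git init -q --bare "$sandbox/origin.git"
git init -q "$sandbox/report"
cd "$sandbox/report"
git symbolic-ref HEAD refs/heads/main        # name the first branch "main"
git config user.name "Example Student"       # placeholder identity
git config user.email "student@example.com"
git remote add origin "$sandbox/origin.git"

# Stand-in for the knitted report committed on main.
echo "<h1>Exploratory Report</h1>" > index.html
git add index.html
git commit -q -m "Add knitted report"
git push -q origin main

# Create gh-pages immediately off of main, then push it to the
# gh-pages branch on the remote.
git checkout -q main
git checkout -q -b gh-pages
git push -q origin gh-pages

# Switch back to main right away so that no work happens on gh-pages.
git checkout -q main
```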

In order to "activate" GitHub pages, the first push to the gh-pages branch needs to be made by

someone with administrative privileges (not just write privileges)—so make sure all of your

teammates are Admins!

Once you've pushed your report to the gh-pages branch, you should then be able to visit

https://info201b-wi21.github.io/project-YOUR_GITHUB_USER_NAME/index.html

in order to view your report online (specifically, the generated index.html file). Note that sometimes it

takes a few minutes for GitHub pages to update with the new report, so be patient if it doesn't show

up initially. (You can check that the code is pushed correctly by checking the gh-pages branch on the

GitHub web portal; if the code is online but the page isn't showing, check in with us).

Be sure that any future changes are made on the main branch, not on gh-pages (we'll be

grading the code on main ). NEVER EDIT CODE ON THE gh-pages BRANCH. If you make

additional changes (on main ), you will need to then merge them into the gh-pages branch

(watching out for any merge conflicts). If you hit any problems, ask for help!
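
Carrying later changes over to the published branch might look like the sketch below. It assumes the changes are already committed on main, that the remote is named `origin`, and that a gh-pages branch already exists; the function name `republish` is made up for illustration.

```shell
# Hypothetical helper: carry new commits from main over to gh-pages.
# Run from inside the repository, after committing the changes on main.
# Each step runs only if the previous one succeeded, so a merge conflict
# stops the chain and leaves you on gh-pages to resolve it.
republish() {
  git checkout -q gh-pages &&
    git merge -q -m "Merge main into gh-pages" main &&
    git push -q origin gh-pages &&
    git checkout -q main
}
```

If the function stops partway, finish the merge by hand, push gh-pages, and then check out main again before making any further edits.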

Submit Your Project Report

In order to submit your project report:

1. Confirm that you've successfully completed the report (e.g., that your code is able to generate a

report that includes all of the required information).

Please proofread your report! Make sure there aren't any half finished sentences or

egregious typos, and that overall it is cleanly formatted and readable. It should be in better

condition than these assignment write-ups! Moreover, since there are multiple people on

your team to read it over, you should definitely have caught and fixed any mistakes.

2. Your group must merge the final version of the code (including the index.html file and any .R

files) to the main branch and push code to the GitHub repository (to the main branch).

3. Publish the report to GitHub Pages (push to the gh-pages branch). Make sure you publish the

final version of the knitted report!

4. Make sure that each group member has filled out the peer evaluation

(https://forms.gle/Xy1D8knKGWezg8s87) . All team members must complete this!

5. Submit the URL of your GitHub Repository AS WELL AS the URL of your published report (two

links!) as your assignment submission on Canvas (this page, at the top).

Since this is a group project, only one person needs to submit the link via Canvas.
