Date: 2022-08-25 09:41


The Australian National University School of Computing, CECS

COMP3430 / COMP8430 – Data Wrangling – 2022

Assignment 1

Due 11:55 pm on Friday 2 September 2022

Worth 10% of the final grade for COMP3430 / COMP8430

Last update June 22, 2022

Overview and Objectives

This assignment covers the topics of data quality, data exploration, and data profiling as presented in the first few weeks of

the course. It also includes questions about what data wrangling is, why it is important, and how it fits into the broader field

of data analytics. One task refers to the required readings from week 1 of the course while others ask you about practical

aspects of data exploration.

Important

The answers to this assignment have to be submitted online in Wattle; see the link Assignment 1 Submission in week 6 (29 August to 2 September).

Follow the instructions given for the maximum text length of free-format answers. If your answer is too long, it will attract a penalty (for details see the individual questions below and the corresponding answer submission forms in Wattle).

You can edit your answers many times and they will be saved by Wattle.

Make sure you submit the final version of your assignment answers before the submission deadline.

Note that Wattle does not allow us to access any earlier edited versions of your answers, so check very

carefully what you submit as the final version!

IMPORTANT: You can only submit your assignment once!

Make sure you do not forget to submit your assignment!

Penalties

Textual questions have maximum line and maximum word limits. If you exceed these limits, we will apply an over-word-limit penalty. For details of the limits, see the individual questions below and the corresponding pages in the assignment submission in Wattle.

Deadlines, Extensions and Late Submissions

The assignment is due 11:55 pm on Friday 2 September 2022.

Students will only be granted an extension on the submission deadline in extenuating circumstances, as defined by ANU

policy (http://www.anu.edu.au/students/program-administration/assessments-exams/deferred-examinations).

If you think you have grounds for an extension, you must notify the course convener as soon as possible and

provide written evidence in support of your case (such as a medical certificate). The course convener will then decide

whether to grant an extension and inform you as soon as practical.

In accordance with the CECS and ANU late submission policy, no late submissions will be accepted, except where an

extension has been approved by the course convener.

Assignment Structure

The assignment consists of four (4) tasks, described below, which can be worth different numbers of marks. Make sure you answer all aspects of each task.

If you have any questions on the assignment, please post them on Wattle. However, do not post any partial solutions, program code, equations, calculations, URLs, etc., or any hints on how to solve any of the assignment tasks.

Plagiarism

No group work is permitted for this assignment.

We do encourage you to discuss your work, but we expect you to do the assignment work by yourself. If you

are unsure about what constitutes plagiarism, make sure you carefully read the ANU Academic Honesty Policy

(http://academichonesty.anu.edu.au/).

If you do include ideas or material from other sources, you must clearly attribute them by providing a reference to the material or source in your submitted assignment answers. We do not require a specific referencing format, as long as you are consistent and your references allow us to find the source, should we need to while we are marking your assignment.

Marking

This assignment will be marked out of 10, and it will contribute 10% of your final course mark.

Note that not all tasks and questions are equally difficult. For some of the tasks there is no single right or wrong answer.

Marks will be awarded based on your reasoning and the justification of your decisions and explanations, as well as clarity

and correctness of writing.

We will endeavour to release your marks and feedback within two teaching weeks after the submission deadline. If you feel

we have made an error in marking, you have two weeks following the release of marks to raise any issues with the course

convener, after which time your mark will be considered final. If you request that we re-mark your assignment, we

will re-mark the entire assignment and your mark may go up or down as a result.

Data Set Generation for this Assignment and for Assignment 2

For this assignment and the upcoming Assignment 2 each of you will work on an individual data set that will be based on a

master data set we will provide, and a data generation program we will also provide.

Note that we have generated the master data set based on real data (such as lookup tables of names, addresses, etc.), and

we have then corrupted and modified certain aspects of that data set. We have intentionally tried to include the types of

relationships, features, errors, and other data quality issues that you might find in real data sets. Any similarity to real

persons or places is entirely coincidental.

Download the master data set from Wattle (to be made available in week 2), named dw_assignment_master.csv.gz, and the data generation program, named generate-student-dataset.py. Copy both files into the same folder / directory, and run the program using Python 3 in the following way:

python3 generate-student-dataset.py your_ANU_ID dw_assignment_master.csv.gz

The program will generate an output data set named data_wrangling_medical_2022_your_ANU_ID.csv, and print some output which contains the following important lines (for the example ANU ID u1234567):

$ python3 generate-student-dataset.py u1234567 dw_assignment_master.csv.gz
Your student data set for the data wrangling 2022 assignments has been generated and written into file:
data_wrangling_medical_2022_u1234567.csv
Your ANU ID check code is: d76225bc
Your student data set check code is: 216b3fef9401
*** Check this pair of numbers is in the list provided on Wattle, if not contact the course convenor.

Important

Write down your two check codes because you must provide them with the assignment submission. This

will allow us to validate that you have generated and used the correct data set.

Check that the pair of check codes you get (such as d76225bc and 216b3fef9401 in the example above) is in the list of check codes we will provide on Wattle (in week 2 under the assignment 1 document). This will allow you to check that you have generated the correct data set.

You must use your individual generated data set for task 4 of this assignment (and the tasks on data

cleaning in the upcoming Assignment 2).

Assignment Tasks

Task 1 (2 marks):

According to the paper (from week 1) by Rahm and Do (Data Cleaning: Problems and Current Approaches), data cleaning

generally deals with detecting errors and inconsistencies from data to improve the quality of data. As mentioned in this

paper, there are many issues and problems related to data cleaning.

Answer the following two questions, each in 10 or fewer lines of text (a maximum of 250 words each); one text entry field will be provided in Wattle per question.

(1) Do you think the problems and issues related to data cleaning raised in this paper (in the year 2000) are still relevant today? Justify why or why not.

(2) Imagine you are hired by the Australian Federal Department of Health as a data wrangler to deal with incoming data sets about COVID-19 cases (details of patients who were diagnosed with the virus) from the eight Australian states and territories. Your task is to clean and integrate these data sets to support decision making by the Australian government.

Briefly describe three (3) data wrangling aspects you will have to consider when dealing with such data sets.

Task 2 (1 mark): Following is a list L of age values (in years) of a group of people:

L = [74,14,20,32,42,55,91,56,84,42,13,7]

First, split your ANU ID (excluding the first character ‘u’) into four number segments (three pairs and a single number) and then add these four number segments to L. For example, if your ANU ID is u1204067, split it into 12, 04, 06, and 7, and add these numbers to L (leading zeros are dropped), so the final list becomes: L = [74, 14, 20, 32, 42, 55, 91, 56, 84, 42, 13, 7, 12, 4, 6, 7].

Now calculate and enter into the corresponding answer fields on Wattle (follow the specifications given in Wattle about

how many decimal places to report):

1. the mean and standard deviation of the final list L,

2. the median and median absolute deviation of the final list L, and

3. the mode of the final list L.
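The calculations for this task can be sketched in Python using only the standard library, applied here to the example ANU ID u1204067 from the text. The helper names split_anu_id and median_absolute_deviation are illustrative and not part of the assignment. The sketch assumes the median absolute deviation is the unscaled median of absolute deviations from the median, and it prints both the sample and the population standard deviation because the text does not say which one is expected. Note that the example list has two equally frequent values, so multimode is used instead of mode.

```python
import statistics


def split_anu_id(anu_id):
    """Split e.g. 'u1204067' into [12, 4, 6, 7]: three digit pairs and one digit."""
    digits = anu_id[1:]  # drop the leading 'u'
    segments = [digits[0:2], digits[2:4], digits[4:6], digits[6:]]
    return [int(s) for s in segments]  # int() drops leading zeros: '04' -> 4


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median (unscaled; an assumption)."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


base = [74, 14, 20, 32, 42, 55, 91, 56, 84, 42, 13, 7]
final = base + split_anu_id("u1204067")

mean = statistics.mean(final)
sample_std = statistics.stdev(final)       # divides by n - 1
population_std = statistics.pstdev(final)  # divides by n
median = statistics.median(final)
mad = median_absolute_deviation(final)
modes = statistics.multimode(final)        # every value tied for most frequent

print("mean:", mean)
print("sample std:", sample_std, "population std:", population_std)
print("median:", median, "MAD:", mad)
print("mode(s):", sorted(modes))
```

Which standard deviation and which form of the median absolute deviation to report depends on the definitions used in the lectures, so treat the choices above as placeholders.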

Task 3 (2 marks): Apply binning as covered in the lectures to the numbers in the list L as generated in the previous

task (i.e. L including the number segments based on your ANU ID appended).

Calculate and enter into the corresponding answer fields on Wattle the results when binning L using:

1. equal depth with two bins and smoothed by bin median,

2. equal width with three bins and smoothed by bin mean,

3. equal width with four bins and smoothed by bin boundaries, and

4. equal depth with four bins and smoothed by bin boundaries.

Clearly show the bins you generated when you enter your answers into Wattle answer fields by showing one bin per line,

for example (assume we have binned [1,2,3,4,5,6,7,8,9] into three bins with smoothing by bin medians):

Bin 1: [2, 2, 2]

Bin 2: [5, 5, 5]

Bin 3: [8, 8, 8]
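The binning variants listed above can be sketched in Python as follows, reproducing the [1,2,3,4,5,6,7,8,9] example. All function names are illustrative. Three conventions in this sketch are assumptions that may differ from the lectures: when the values do not divide evenly, the leftover values go to the first equal-depth bins; equal-width bins cut the range from the minimum to the maximum into equal intervals; and in smoothing by bin boundaries a value exactly halfway between the two boundaries goes to the lower one.

```python
import statistics


def equal_depth_bins(values, k):
    """Sort the values and cut them into k bins holding (nearly) equal counts."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), k)
    bins, start = [], 0
    for i in range(k):
        # Assumption: leftover values are given to the first bins.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def equal_width_bins(values, k):
    """Cut the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value is placed in the last bin.
        index = min(int((v - lo) / width), k - 1) if width else 0
        bins[index].append(v)
    return bins


def smooth(bins, statistic):
    """Replace every value in a bin by one statistic of that bin (mean, median)."""
    return [[statistic(b)] * len(b) if b else [] for b in bins]


def smooth_by_boundaries(bins):
    """Replace each value by the nearer of its bin's minimum and maximum."""
    result = []
    for b in bins:
        if not b:
            result.append([])
            continue
        lo, hi = min(b), max(b)
        # Assumption: ties go to the lower boundary.
        result.append([lo if v - lo <= hi - v else hi for v in b])
    return result


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
by_median = smooth(equal_depth_bins(data, 3), statistics.median)
by_mean = smooth(equal_width_bins(data, 3), statistics.mean)
by_boundary = smooth_by_boundaries(equal_depth_bins(data, 3))

for number, bin_values in enumerate(by_median, start=1):
    print(f"Bin {number}: {bin_values}")
```

An equal-width bin can be empty when no value falls into its interval; the sketch keeps such a bin as an empty list.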

Task 4 (5 marks):

For the last task of this assignment you must use the data set you generated as per instructions above. We ask

you to explore this data set using tools of your choice (Rattle, R, Python, Pandas, etc.) and answer the specific questions

about this data set given below.

Make sure to follow the instructions on the individual Wattle answer fields with regard to rounding, the

number of digits to provide after the decimal point, etc.

1. Provide the missingness patterns of values (as we discussed in the labs) for the three attributes: postcode, phone,

and email. You should provide the 0-1 missing value pattern table we discussed in the labs for the above three

attributes.

2. Calculate the correlation between the attributes (a) BMI and age at consultation, (b) BMI and height, and (c) state and valid marital status. In your answers you need to provide the numerical correlation value, the name of the correlation method you used, and a brief (one sentence) explanation of why you used that specific correlation function for each pair of attributes. (Note: the correlation statistic is not the same as the p-value.)

3. For the following attributes, calculate numerical values for the following data quality dimensions:

(a) Completeness for postcode and phone (consider these attributes individually).

(b) Validity for weight.

(c) Uniqueness for last name.

(d) Consistency between age at consultation and birth date (for valid age values).

Clearly describe how you calculated each of your results.

4. Calculate the distributions of the first digits (Benford’s law) for the attributes (a) cholesterol level, (b) blood pressure

and (c) medicare number. Describe for each in one or two sentences if it does follow Benford’s law or not, and

why you think it does or does not follow this law.

5. Describe in a few sentences three other (not covered in the first four questions) unusual characteristics you can

identify in this data set using data exploration and profiling.
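As an illustration of questions 1 and 3 above, the following Python sketch counts 0-1 missingness patterns and computes simple completeness, uniqueness, and validity ratios. The records, the attribute spellings, and the weight range are hypothetical stand-ins, not taken from the real data set. The sketch also assumes that a missing value is an empty string, that 1 marks a missing value in the pattern table, and that uniqueness and validity are measured over non-missing values only; the definitions from the lectures and labs may differ.

```python
from collections import Counter

# Hypothetical records; the real file has many more rows and attributes.
records = [
    {"postcode": "2600", "phone": "0412345678", "email": "a@example.org",
     "last_name": "smith", "weight": "72"},
    {"postcode": "", "phone": "0498765432", "email": "",
     "last_name": "smith", "weight": "-5"},
    {"postcode": "2601", "phone": "", "email": "",
     "last_name": "nguyen", "weight": "81"},
    {"postcode": "", "phone": "0498765432", "email": "",
     "last_name": "jones", "weight": ""},
]


def is_missing(value):
    # Assumption: missing values are None or empty / whitespace-only strings.
    return value is None or value.strip() == ""


def missingness_patterns(rows, attributes):
    """Count each 0-1 pattern, where 1 marks a missing value (an assumed coding)."""
    return Counter(tuple(int(is_missing(r[a])) for a in attributes) for r in rows)


def completeness(rows, attribute):
    """Fraction of records with a non-missing value."""
    return sum(not is_missing(r[attribute]) for r in rows) / len(rows)


def uniqueness(rows, attribute):
    """Distinct non-missing values divided by all non-missing values."""
    present = [r[attribute] for r in rows if not is_missing(r[attribute])]
    return len(set(present)) / len(present)


def validity(rows, attribute, is_valid):
    """Fraction of non-missing values accepted by the is_valid rule."""
    present = [r[attribute] for r in rows if not is_missing(r[attribute])]
    return sum(is_valid(v) for v in present) / len(present)


def plausible_weight(text):
    # Hypothetical validity rule: a number of kilograms between 0 and 500.
    try:
        return 0 < float(text) < 500
    except ValueError:
        return False


patterns = missingness_patterns(records, ["postcode", "phone", "email"])
for pattern, count in sorted(patterns.items()):
    print(pattern, count)
print("completeness(postcode):", completeness(records, "postcode"))
print("uniqueness(last_name):", uniqueness(records, "last_name"))
print("validity(weight):", validity(records, "weight", plausible_weight))
```

Whatever definitions are used, stating the numerator and denominator of each ratio is one way to describe how a result was calculated.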

You will receive up to one mark for correctly answering each of these questions, where both correct numerical values and correct, clearly written justifications of your answers will be considered.
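For questions 2 and 4 above, the sketch below shows how a correlation value and a first-digit distribution might be computed in plain Python. The function names and sample values are illustrative. Pearson's coefficient for two numerical attributes and Cramér's V for two categorical attributes are shown only as common options; which method suits each attribute pair is for you to decide and justify. The expected Benford proportion for a leading digit d is log10(1 + 1/d).

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cramers_v(xs, ys):
    """Cramér's V (0 to 1) for two categorical lists, from the chi-squared statistic."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row, col = Counter(xs), Counter(ys)
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(row), len(col)) - 1)))


def benford_expected():
    """Expected proportion of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_distribution(values):
    """Observed proportion of each leading non-zero digit 1..9."""
    digits = []
    for v in values:
        lead = str(v).lstrip("-0. ")  # drop sign, leading zeros, decimal point
        if lead and lead[0].isdigit():
            digits.append(int(lead[0]))
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


expected = benford_expected()
observed = first_digit_distribution(["120", "0.05", "-3.2", "19"])
for d in range(1, 10):
    print(d, round(observed[d], 3), round(expected[d], 3))
```

Comparing the observed and expected columns digit by digit gives a quick visual check; attributes whose values are assigned identifiers or fall in a narrow range often deviate from the expected proportions.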

Other Aspects

For all textual answers in this assignment, English writing mistakes and typographical errors will attract small penalties.

Do not upload any code into the answer boxes in Wattle.

