Date: 2018-09-17 01:25


SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTING

BIG DATA AND DATA ANALYTICS

LAB PROJECT 1

This lab project is based on a dataset from the National Institute of Diabetes and Digestive and Kidney Diseases, which is available from the UCI Machine Learning Repository (Lichman, 2013):

https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

EXERCISE 1 (1 MARK) [R-CODE]

Use R to load the dataset “pimadata.csv”. Determine the number of lines and columns in the dataset. Save these values into two separate variables called “numberoflines” and “numberofcolumns”. Use the cat-command to display these two variables on the R console.
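As a minimal sketch of the steps above, the following writes a small made-up stand-in file and reads it back. The toy values and the temporary file are assumptions for illustration only; in the lab, the path would point at the real “pimadata.csv”. Note that nrow() counts data rows and does not include the header line.

```r
# Toy stand-in for pimadata.csv (values are made up for illustration).
toy <- data.frame(BMI = c(33.6, 26.6, 23.3),
                  age = c(50, 31, 32),
                  diabetes = c(1, 0, 1))
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Load the file and record its dimensions.
pimadata <- read.csv(csv_path)
numberoflines <- nrow(pimadata)    # data rows, header excluded
numberofcolumns <- ncol(pimadata)

cat("Lines:", numberoflines, "Columns:", numberofcolumns, "\n")
```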

EXERCISE 2 (1 MARK) [R-CODE]

Use R to compute the mean BMI value (variable: BMI) for subjects with diabetes. Then, use R to compute the mean BMI value (variable: BMI) for subjects without diabetes. Display the difference in mean BMI between these two groups on the R console using the cat-command.
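One possible way to compute group means is logical indexing on the diabetes column. The sketch below uses a made-up five-row data frame in place of the real dataset, and the variable names are illustrative assumptions.

```r
# Toy data standing in for pimadata (values are made up).
toy <- data.frame(BMI = c(33.6, 26.6, 23.3, 28.1, 43.1),
                  diabetes = c(1, 0, 1, 0, 1))

# Mean BMI per group, selected by the diabetes indicator.
mean_bmi_diabetes <- mean(toy$BMI[toy$diabetes == 1])
mean_bmi_no_diabetes <- mean(toy$BMI[toy$diabetes == 0])

# Difference between the two group means.
bmi_difference <- mean_bmi_diabetes - mean_bmi_no_diabetes
cat("Difference in mean BMI:", bmi_difference, "\n")
```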

EXERCISE 3 (2 MARKS) [R-CODE]

Use R to determine the standard deviation and the variance of the variable TSFT for subjects with diabetes. Display these values on the screen using the cat-command. Note: Only take the observations of the subjects with TSFT greater than zero into account. How can a TSFT value of 0 be interpreted?
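The filtering described in the note can be sketched by combining two conditions in one index. The data below are made up for illustration, and the variable names are assumptions.

```r
# Toy data standing in for pimadata (values are made up).
toy <- data.frame(TSFT = c(35, 0, 29, 23, 0, 45),
                  diabetes = c(1, 1, 0, 1, 1, 1))

# Keep only subjects with diabetes and a TSFT value above zero.
tsft_pos <- toy$TSFT[toy$diabetes == 1 & toy$TSFT > 0]

tsft_sd <- sd(tsft_pos)    # sample standard deviation
tsft_var <- var(tsft_pos)  # sample variance (the square of the SD)

cat("SD:", tsft_sd, "Variance:", tsft_var, "\n")
```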

EXERCISE 4 (1 MARK)

In your own words, describe the difference between a standard error (of the mean) and a standard deviation using the variable TSFT in the “pimadata.csv” dataset as an example.
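As a numerical illustration of the relationship, the standard error of the mean is the standard deviation divided by the square root of the sample size. The TSFT values below are made up, not taken from the dataset.

```r
# Made-up TSFT measurements (mm) for illustration.
tsft <- c(35, 29, 23, 45, 32, 27)

# Standard deviation: spread of the individual observations.
tsft_sd <- sd(tsft)

# Standard error of the mean: uncertainty of the sample mean itself.
tsft_se <- tsft_sd / sqrt(length(tsft))

cat("SD:", tsft_sd, "SE of the mean:", tsft_se, "\n")
```

With more observations, the standard error shrinks while the standard deviation tends to stay roughly the same.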

EXERCISE 5 (2 MARKS) [R-CODE]

Based on the dataset “pimadata”, create a new column “agecat” in the dataframe that describes the age category of a person. Distinguish between the following categories: “21 to 35”, “36 to 55”, and “56 to 85”. Convert the column into a factor variable using the as.factor() command. For each of these age categories, calculate the median BMI (i.e., the median BMI for subjects aged 21 to 35, the median BMI for subjects aged 36 to 55, and the median BMI for subjects aged 56 to 85).


Use R to determine the median BMI across the three different categories (“21 to 35”, “36 to 55”, and “56 to 85”) and combine them into a vector using the c()-command. Note: Please only take the observations into account where the BMI is greater than zero. Save the vector into a variable called “medianBMIs”.

Then, use the cat-command to display the minimum median BMI (i.e., the lowest of the three median BMIs) and the maximum median BMI (i.e., the highest of the three median BMIs) on the R console. The minimum and maximum values should be determined based on the newly created variable “medianBMIs”.
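One way to sketch the categorisation and the per-category medians is with cut(), which is an assumption here; the assignment does not prescribe how to build “agecat”. The data are made up, and cut() already returns a factor, so as.factor() is kept only to mirror the exercise text.

```r
# Toy data standing in for pimadata (values are made up).
toy <- data.frame(age = c(21, 30, 35, 36, 50, 55, 56, 70),
                  BMI = c(22, 0, 28, 31, 33, 35, 27, 29))

# Bin ages into (20,35], (35,55], (55,85] and label the bins.
toy$agecat <- as.factor(cut(toy$age,
                            breaks = c(20, 35, 55, 85),
                            labels = c("21 to 35", "36 to 55", "56 to 85")))

# Drop observations where BMI is zero before computing medians.
valid <- toy[toy$BMI > 0, ]

medianBMIs <- c(median(valid$BMI[valid$agecat == "21 to 35"]),
                median(valid$BMI[valid$agecat == "36 to 55"]),
                median(valid$BMI[valid$agecat == "56 to 85"]))

cat("Minimum median BMI:", min(medianBMIs), "\n")
cat("Maximum median BMI:", max(medianBMIs), "\n")
```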

EXERCISE 6 (1 MARK) [R-CODE]

Use if() to compare the median BMIs in the “21 to 35” and the “36 to 55” categories and display a textual statement on the screen that describes which of the two median BMIs is higher.
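The comparison might look like the sketch below. The median values are hypothetical placeholders rather than results from the dataset, and the named vector is an illustrative choice.

```r
# Hypothetical medians for illustration (not computed from pimadata).
medianBMIs <- c("21 to 35" = 25, "36 to 55" = 33, "56 to 85" = 28)

young <- medianBMIs[["21 to 35"]]
middle <- medianBMIs[["36 to 55"]]

# Pick the message that matches the comparison, including the tie case.
if (young > middle) {
  msg <- "The median BMI is higher in the 21 to 35 group."
} else if (young < middle) {
  msg <- "The median BMI is higher in the 36 to 55 group."
} else {
  msg <- "The median BMI is the same in both groups."
}
cat(msg, "\n")
```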

EXERCISE 7 (2 MARKS) [R-CODE]

Write a function that determines the 99% CI of the mean for a given vector x. Call this function “calc99CI”. The function should return a vector of two values: (i) the lower bound of the 99% CI and (ii) the upper bound of the 99% CI. Use c() to combine the two values into a vector.

Use this function to display the 99% CI of the mean for the BMI for subjects aged “36 to 55”. Note: Please only take the observations into account where the BMI is greater than zero.

[Important: CI stands for confidence interval. In the lecture, we discussed the multiplier for the 95% CI, which is 1.96. The multiplier for the 99% CI is 2.575.]
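A minimal sketch of such a function, assuming the interval is computed as the mean plus or minus the multiplier times the standard error of the mean (the standard deviation divided by the square root of n). The multiplier 2.575 is the one given above; the BMI values are made up.

```r
# 99% CI of the mean: mean +/- 2.575 * standard error of the mean.
calc99CI <- function(x) {
  se <- sd(x) / sqrt(length(x))
  lower <- mean(x) - 2.575 * se
  upper <- mean(x) + 2.575 * se
  c(lower, upper)
}

# Made-up BMI values for illustration.
bmi <- c(31.0, 33.0, 35.0, 29.5, 36.5)
ci <- calc99CI(bmi)
cat("99% CI of the mean:", ci[1], "to", ci[2], "\n")
```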

REFERENCES

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.


DATASET

Pimadata Pima Indians Diabetes Database

Description

A diabetes dataset. All patients here are females at least 21 years old of Pima Indian heritage.

Note: Even though the dataset donors made no such statement, it seems very likely that zero values encode missing data for several variables.

Usage

Pimadata

Format

A data frame with 768 observations on the following 9 variables.

timesPregnant  Number of times pregnant
PCG            Plasma glucose concentration at 2 hours in an oral glucose tolerance test
DBP            Diastolic blood pressure (mm Hg)
TSFT           Triceps skin fold thickness (mm)
insulin        2-hour serum insulin (mu U/ml)
BMI            Body mass index (weight in kg/(height in m)^2)
DPF            Diabetes pedigree function. It provides some data on diabetes mellitus
               history in relatives and the genetic relationship of those relatives to the
               patient. This measure of genetic influence gives an idea of the hereditary
               risk one might have with the onset of diabetes mellitus.
age            Age (years)
diabetes       1 = tested positive for diabetes
               0 = tested negative for diabetes

Source

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

