
Date: 2018-09-02 01:47


In this assignment, you will create a program to predict people’s income based on their personal

records, such as age, education, ethnic background, occupation and other information.

You will be given two csv-format data files named “census.csv” and “census-test.csv” respectively.

Each data file contains a header line, and after that each line contains an individual personal

record. The two files have the same format, where “census.csv” contains the training data, and

“census-test.csv” contains the test data. The first 10 lines of “census.csv” are shown below.

age, workclass, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, label(IsOver50K)

39, State-gov, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, 0

50, Self-emp-not-inc, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, 0

40, Private, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, ?, 1

38, Private, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, 0

53, Private, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, 0

28, Private, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, 0

37, Private, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, 0

49, Private, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, 0

52, Self-emp-not-inc, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, 1

You can see that each record contains 14 attributes, and the attribute names are indicated on

the first header line. You need to create a machine learning model to predict the last attribute,

“label(IsOver50K)”, given the other 13 attributes. You will write a program to do this. Your

program should contain the following features - data loading, data preprocessing, model training,

parameter selection, and testing. Note that your program should be written in one of the compiled

programming languages (C, C++, C#, Java, etc.), rather than a scripting language such as Python

or Matlab. You need to implement from scratch all components described below and should not call

any external library or package for this work.

1 Data Loading

For the personal records, some of the attributes are in text values, and others are in numerical

values. It is up to you to design the appropriate data structure to store the records loaded from

the csv file. For the purpose of this assignment, you may assume that you know the number of

attributes and which attributes are numbers (or text values). However, for general applicability,

you cannot make such assumptions. It would be desirable if your data loading module could deal

with arbitrary csv format.

Note that the data files may contain missing values indicated by ’?’ marks. For example, the third

record (i.e. the fourth line above) in the “census.csv” file has a ’?’ for the native-country attribute. For

the moment, you can ignore all records with missing values.
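As a starting point, the line parsing and missing-value check described above can be sketched in C++ as follows. This is a minimal sketch, not a prescribed design: the function names (parseCsvLine, hasMissingValue) are illustrative, and it assumes the simple comma-separated format of the census files, with no quoted or escaped commas.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one CSV line into fields, trimming surrounding whitespace.
// Assumes no quoted or escaped commas, which holds for the census files.
std::vector<std::string> parseCsvLine(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) {
        size_t b = field.find_first_not_of(" \t\r");
        size_t e = field.find_last_not_of(" \t\r");
        fields.push_back(b == std::string::npos ? "" : field.substr(b, e - b + 1));
    }
    return fields;
}

// A record is skipped (for now) if any field is the '?' missing-value marker.
bool hasMissingValue(const std::vector<std::string>& fields) {
    for (const auto& f : fields)
        if (f == "?") return true;
    return false;
}
```

For general applicability, the same splitter can be reused on the header line to discover the column names and count instead of hard-coding them.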

2 Data Preprocessing

Data preprocessing is an important step towards training a machine learning model for prediction.

Firstly, all input values need to be in the numerical form. This means all text values need to

be converted to numbers. One can employ the one-hot trick to do this. The one-hot encoding

scheme assigns a K-dimensional vector to a text valued attribute, where K equals the total number

of distinct values for that attribute. For the kth distinct value, the corresponding K-dimensional

encoding vector contains a value of 1 for the kth element while all other elements are zero. For

example, if a column contains three distinct values, ’apple’, ’banana’ and ’orange’, with the one-hot

encoding scheme, ’apple’ is represented by vector [1, 0, 0], ’banana’ and ’orange’ are represented by

[0, 1, 0] and [0, 0, 1] respectively. Note that the specific order here is not important, as long as each

distinct attribute value is assigned a unique vector. Note that for some attributes, there might be a

large number of distinct values. For such attributes, the one-hot encoding scheme produces very high-dimensional vectors,

making subsequent learning of the predictive model slow and ineffective. You need to come up with a

solution to deal with such cases.
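The one-hot scheme above can be sketched in two steps: build a value-to-index map over a column (K = number of distinct values), then emit a K-dimensional vector per value. The function names here are illustrative, and the handling of unseen values (all-zero vector) is one possible choice, not mandated by the assignment.

```cpp
#include <map>
#include <string>
#include <vector>

// Build a value -> index map for one categorical column; K = index.size().
std::map<std::string, int> buildIndex(const std::vector<std::string>& column) {
    std::map<std::string, int> index;
    for (const auto& v : column)
        if (index.find(v) == index.end()) {
            int next = (int)index.size();
            index[v] = next;
        }
    return index;
}

// Encode one value as a K-dimensional one-hot vector.
std::vector<double> oneHot(const std::map<std::string, int>& index,
                           const std::string& value) {
    std::vector<double> vec(index.size(), 0.0);
    auto it = index.find(value);
    if (it != index.end()) vec[it->second] = 1.0;  // unseen values stay all-zero
    return vec;
}
```

For high-cardinality attributes, one option is to keep only values above a frequency threshold in the index and let all rare values share the all-zero (or a single "other") encoding, which bounds K.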

Scale for numerical attributes is another issue that needs to be considered in data preprocessing.

Certain numerical attributes may have larger value ranges than others. For example, ’capital-gain’

and ’capital-loss’ can have much larger values than ’age’. This could cause the subsequent machine

learning model to over-emphasize the attributes with larger ranges. Hence it is important to

normalize the numerical values for each attribute so that they are in the same range after normalization.

Here you need to implement a linear scaling scheme a ∗ x + b separately for each column

so that values after transformation are in the range of [0, 1].
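The linear scaling a ∗ x + b onto [0, 1] amounts to a = 1/(max − min) and b = −min/(max − min) per column. A minimal sketch (the constant-column fallback to 0 is one possible convention):

```cpp
#include <algorithm>
#include <vector>

// Map a numerical column onto [0, 1] via (x - min) / (max - min),
// i.e. the linear scheme a*x + b with a = 1/(max-min), b = -min/(max-min).
std::vector<double> minMaxScale(const std::vector<double>& column) {
    double lo = *std::min_element(column.begin(), column.end());
    double hi = *std::max_element(column.begin(), column.end());
    std::vector<double> scaled;
    for (double x : column)
        scaled.push_back(hi == lo ? 0.0 : (x - lo) / (hi - lo));  // constant column -> 0
    return scaled;
}
```

Note that the per-column lo and hi must be computed on the training data and stored, so the same transformation can later be applied to the test records.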

3 Model Training

After data preprocessing, the ith input record can be represented by a d-dimensional feature vector

xi = [xi,1, . . . , xi,d] and label yi, where d is the total number of features and xi,j is the jth feature

value for the ith input record. The data set can be represented by a collection of feature vectors

{x1, . . . , xn} and labels {y1, . . . , yn} where n is the total number of records in the training data set.

Note that we must use the data from the training file alone to create the model. Data from the

testing file can only be used for evaluating the performance of the model.

To create a model to predict value of y given arbitrary x, we need to find out a function f such

that f(xi) roughly equals yi for each training record i = 1, . . . , n. This can be cast as an optimization

function over f. Depending on how we define f and measure the error between f(x) and y, many

different types of the machine learning models can be used. For this assignment, you will create

a simple linear model based on the squared error between prediction values f(xi) and label yi by

optimizing the following objective function with respect to the weight vector w = [w1, . . . , wd]:

L(w) = sum_{i=1}^{n} (f(xi) − yi)^2 + λ sum_{j=1}^{d} wj^2    (1)

where f(xi) is the linear prediction function over input xi and is defined by the weight vector w as

f(xi) = sum_{j=1}^{d} wj xi,j    (2)

The second term in the objective function above is the regularization term that controls the tradeoff

between training error and generalization performance on the testing data. λ is a parameter

that controls the strength of regularization. Its value needs to be carefully selected. More on this in

the next section.

The simplest method to optimize the objective function L(w) above is to use gradient descent.

Gradient descent is an iterative optimization method which starts with an initial value for each wj.

In iteration t, each wj is updated by the following equation:

wj ← wj − β ∂L/∂wj

where ∂L/∂wj is the partial derivative of L with respect to wj, and β is the step size for the update, which

is usually set to a small value such as 0.01 or 0.001 to ensure the objective value L decreases with

updated w. The above iterations are repeated until convergence, where convergence is reached if

the weight values do not change (up to a small tolerance) between two consecutive iterations, or the decrease

in the value of L between two consecutive iterations is smaller than a predefined threshold.
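One iteration of the procedure above, for the regularized squared-error objective L(w) described in this section, might look like the following sketch. The gradient ∂L/∂wj = 2 Σi (f(xi) − yi) xi,j + 2λwj follows from differentiating the objective; the convergence test here uses the weight-change criterion.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One batch gradient-descent step for L(w) = sum_i (f(x_i)-y_i)^2 + lambda*sum_j w_j^2
// with f(x_i) = sum_j w_j * x_{i,j}. Updates w in place; returns true once the
// largest weight change falls below tol (convergence).
bool gradientStep(std::vector<double>& w,
                  const std::vector<std::vector<double>>& X,
                  const std::vector<double>& y,
                  double lambda, double beta, double tol) {
    size_t n = X.size(), d = w.size();
    std::vector<double> grad(d, 0.0);
    for (size_t i = 0; i < n; ++i) {
        double pred = 0.0;
        for (size_t j = 0; j < d; ++j) pred += w[j] * X[i][j];
        double err = pred - y[i];
        for (size_t j = 0; j < d; ++j) grad[j] += 2.0 * err * X[i][j];
    }
    double maxChange = 0.0;
    for (size_t j = 0; j < d; ++j) {
        grad[j] += 2.0 * lambda * w[j];   // derivative of the regularization term
        double step = beta * grad[j];
        w[j] -= step;
        maxChange = std::max(maxChange, std::fabs(step));
    }
    return maxChange < tol;
}
```

Calling this in a loop until it returns true (or an iteration cap is hit) implements the training procedure described above.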

4 Parameter Selection

Parameter λ in Equation (1) needs to be carefully tuned to ensure optimal performance for prediction.

However, we cannot tune this parameter based on test performance directly, since the

test data is unknown and can not be used at training time. For parameter selection, one can use

a validation scheme by further splitting the training data into two sets, the training set and the

validation set. For each parameter value, you train the model on the training set and evaluate it

on the validation set (See the next section on how to calculate the accuracy). The process is then

repeated for all parameter values. The parameter that achieves the best accuracy on the validation

set is the optimal parameter for this data and model. Finally, you need to train the model on the

full training data using the optimal parameter selected by the validation scheme.

For this assignment, you need to implement a validation scheme to select the best λ following the

process described above from four possible values of λ, namely {0.001, 0.01, 0.1, 1}. The label

values in the training data are unbalanced, with about 75% of values equal to 0 and the remaining

25% equal to 1. You therefore need to make sure that when you split the training data into training and

validation sets by randomly assigning records to one of these sets, each set preserves about

the same proportion of label values as the full data. This sampling strategy is also called stratified

sampling. Once the model is trained, you can save the weights w in a model file using either text

or binary format of your choice.
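The stratified split described above can be sketched by shuffling indices within each label class and handing the first portion of each class to the validation set, so both sets keep roughly the full data's 75/25 label proportion. The fraction and seed parameters are illustrative choices.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Stratified split of record indices into training and validation sets.
// labels must contain only 0/1; validFrac is the validation share per class.
void stratifiedSplit(const std::vector<int>& labels, double validFrac,
                     unsigned seed,
                     std::vector<size_t>& trainIdx, std::vector<size_t>& validIdx) {
    std::vector<std::vector<size_t>> byClass(2);
    for (size_t i = 0; i < labels.size(); ++i) byClass[labels[i]].push_back(i);
    std::mt19937 rng(seed);
    for (auto& cls : byClass) {
        std::shuffle(cls.begin(), cls.end(), rng);
        size_t nValid = (size_t)(cls.size() * validFrac);
        for (size_t k = 0; k < cls.size(); ++k)
            (k < nValid ? validIdx : trainIdx).push_back(cls[k]);
    }
}
```

Because sampling is done class by class, a 75/25-unbalanced data set yields training and validation sets with approximately the same 75/25 proportion.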

5 Testing

After obtaining the model, you need to apply the model to predict the labels for the test data and

calculate the predictive accuracy. First, once the test records are loaded from “census-test.csv”,

they also need to be preprocessed by applying the same preprocessing steps that were previously

applied to the training data set. After preprocessing, each test record is converted to a feature

vector. We then apply the linear prediction model in Equation (2) to get the prediction score. Since

label values are discrete (0 and 1), we need to convert the score to discrete values by rounding

the prediction score to the nearest label value. Finally we can calculate the predictive accuracy by

taking the ratio of correctly predicted labels to the total number of test records.

6 Bonus Tasks (Optional)

Your program should implement all tasks described in the sections above. In addition, if you have time

and are interested, you can attempt the bonus tasks below.

1. Handling missing values Both training and test data files contain missing values indicated

by ’?’ marks. Previously we skipped all records with missing values. For this task, you

need to come up with a solution to deal with records with missing values.

2. f with bias term We use a linear prediction function f in Equation (2) without the bias

term. For this task, you will update your training algorithm to deal with prediction function

f with the bias term as given by

f(xi) = sum_{j=1}^{d} wj xi,j + b

3. Cross validation In Section 4, we described a simple validation scheme for parameter

selection. For this task, you will implement a more sophisticated and robust scheme for

parameter selection called k-fold cross validation. With k-fold cross validation, the training

data is randomly split into k disjoint subsets of roughly equal size using the stratified sampling

strategy. For each candidate parameter value, you can train a model on k − 1 subsets and

evaluate its performance on the other subset. This is repeated k times, each time using

a different subset for evaluation and the rest for training. The performance is estimated by

averaging the performance metrics over k rounds. The parameter value achieving the highest

performance is chosen as the optimal parameter for the task, and the model is retrained on

the full training data using the optimal parameter chosen by cross validation.

7 Submission

You need to submit your source code written in a compiled language of your choice accompanied

with a Readme file briefly describing the design of your program and instructions for compiling

and running the program. If you attempted bonus tasks, please also include descriptions of your

solution in the attached Readme file.

