In this assignment, you will create a program to predict people’s income based on their personal
records, such as age, education, ethnic background, occupation and other information.
You will be given two csv-format data files named “census.csv” and “census-test.csv” respectively.
Each data file contains a header line, and after that each line contains an individual personal
record. The two files have the same format, where “census.csv” contains the training data, and
“census-test.csv” contains the test data. The first 10 lines of “census.csv” are shown below.
age, workclass, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, label(IsOver50K)
39, State-gov, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, 0
50, Self-emp-not-inc, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, 0
40, Private, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, ?, 1
38, Private, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, 0
53, Private, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, 0
28, Private, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, 0
37, Private, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, 0
49, Private, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, 0
52, Self-emp-not-inc, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, 1
You can see that each record contains 14 attributes, and the attribute names are indicated on
the first header line. You need to create a machine learning model to predict the last attribute,
“label(IsOver50K)”, given the other 13 attributes. You will write a program to do this. Your
program should contain the following components: data loading, data preprocessing, model training,
parameter selection, and testing. Note that your program should be written in a compiled
programming language (C, C++, C#, Java, etc.) rather than a scripting language such as Python
or Matlab. You need to implement all components described below from scratch and should not call
any external library or package for this work.
1 Data Loading
In the personal records, some of the attributes are text values and others are numerical
values. It is up to you to design an appropriate data structure to store the records loaded from
the CSV file. For the purpose of this assignment, you may assume that you know the number of
attributes and which attributes are numbers (or text values). In general, however,
you cannot make such assumptions, so it would be desirable for your data loading module to handle
arbitrary CSV files.
Note that the data file may contain missing values, indicated by ‘?’ marks. For example, the third
record (i.e. the fourth line above) in the “census.csv” file has a ‘?’ for the native-country attribute. For
the moment, you can ignore all records with missing values.
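The loading step described above could be sketched as follows. This is a minimal illustration rather than a required design: the names `trim`, `splitLine`, `Table`, and `loadCsv` are hypothetical, every field is kept as text at this stage, and quoted fields containing commas are assumed not to occur (the sample shown above has none).

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Remove leading and trailing whitespace from one CSV field.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Split one line on commas (ASSUMPTION: no quoted fields).
std::vector<std::string> splitLine(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) fields.push_back(trim(field));
    return fields;
}

// Hypothetical container: header names plus the records kept as text.
struct Table {
    std::vector<std::string> header;
    std::vector<std::vector<std::string>> rows;
};

// Read a header line followed by records. Records that contain a "?" field
// or whose field count differs from the header are skipped.
Table loadCsv(std::istream& in) {
    Table t;
    std::string line;
    if (std::getline(in, line)) t.header = splitLine(line);
    assert(!t.header.empty() && "input must start with a header line");
    while (std::getline(in, line)) {
        if (trim(line).empty()) continue;  // ignore blank lines
        const std::vector<std::string> fields = splitLine(line);
        bool complete = fields.size() == t.header.size();
        for (const std::string& f : fields) {
            if (f == "?") complete = false;
        }
        if (complete) t.rows.push_back(fields);
    }
    return t;
}
```

Passing a `std::ifstream` opened on “census.csv” to `loadCsv` would return the header and the complete records. Deciding which columns are numeric can then be done in a second pass, for example by checking whether every value in a column parses as a number.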
2 Data Preprocessing
Data preprocessing is an important step towards training a machine learning model for prediction.
Firstly, all input values need to be in numerical form. This means all text values need to
be converted to numbers. One can employ one-hot encoding to do this. The one-hot encoding
scheme assigns a $K$-dimensional vector to a text-valued attribute, where $K$ equals the total number
of distinct values for that attribute. For the $k$th distinct value, the corresponding $K$-dimensional
encoding vector contains a value of 1 for the $k$th element while all other elements are zero. For
example, if a column contains three distinct values, ‘apple’, ‘banana’ and ‘orange’, then with the one-hot
encoding scheme ‘apple’ is represented by the vector [1, 0, 0], and ‘banana’ and ‘orange’ are represented by
[0, 1, 0] and [0, 0, 1] respectively. Note that the specific order here is not important, as long as each
distinct attribute value is assigned a unique vector. Note also that some attributes might have a
large number of distinct values. For these, the one-hot encoding scheme produces very high-dimensional vectors,
making it slow and ineffective to learn the predictive model later. You need to come up with a
solution to deal with such cases.
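As an illustration of the one-hot scheme, the sketch below learns the distinct values of a single column and encodes one value at a time. The `OneHotEncoder` type and its members are hypothetical names. Mapping values that were not seen during fitting to the all-zero vector is an assumption made here, not something the assignment prescribes, and the sketch does not address attributes with many distinct values.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Maps each distinct text value of one column to an index 0..K-1.
struct OneHotEncoder {
    std::map<std::string, std::size_t> index;

    // Learn the distinct values from a (training) column, in order of first appearance.
    void fit(const std::vector<std::string>& column) {
        for (const std::string& v : column) {
            if (index.find(v) == index.end()) {
                const std::size_t next = index.size();
                index[v] = next;
            }
        }
    }

    // Return a K-dimensional vector with a single 1.
    // ASSUMPTION: values not seen in fit() map to the all-zero vector.
    std::vector<double> encode(const std::string& value) const {
        std::vector<double> out(index.size(), 0.0);
        const auto it = index.find(value);
        if (it != index.end()) out[it->second] = 1.0;
        return out;
    }
};
```

With a column containing ‘apple’, ‘banana’ and ‘orange’ in that order of first appearance, `encode("banana")` yields [0, 1, 0], matching the example in the text.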
The scale of numerical attributes is another issue that needs to be considered in data preprocessing.
Certain numerical attributes may have larger value ranges than others. For example, ‘capital-gain’
and ‘capital-loss’ can have much larger values than ‘age’. This can cause the subsequent machine
learning model to over-emphasize the attributes with larger ranges. Hence it is important to
normalize the numerical values of each attribute so that they are in the same range after normalization.
Here you need to implement a linear scaling scheme $a \cdot x + b$ separately for each column
so that the values after transformation are in the range [0, 1].
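One way to choose $a$ and $b$ is min-max scaling: with column minimum $m$ and maximum $M$, setting $a = 1/(M - m)$ and $b = -m/(M - m)$ maps $m$ to 0 and $M$ to 1. The sketch below assumes this choice; `LinearScale` and `fitScale` are hypothetical names, and mapping a constant column to 0 is an assumption added to avoid division by zero.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Coefficients of the per-column map x -> a * x + b.
struct LinearScale {
    double a;
    double b;
    double apply(double x) const { return a * x + b; }
};

// Fit a and b on one column so that its minimum maps to 0 and its maximum to 1.
LinearScale fitScale(const std::vector<double>& column) {
    assert(!column.empty());
    const auto mm = std::minmax_element(column.begin(), column.end());
    const double lo = *mm.first;
    const double hi = *mm.second;
    if (hi == lo) return {0.0, 0.0};  // ASSUMPTION: a constant column maps to 0
    const double a = 1.0 / (hi - lo);
    return {a, -lo * a};
}
```

If the coefficients are fitted on the training column and reused for the test data, test values outside the training range will fall slightly outside [0, 1].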
3 Model Training
After data preprocessing, the $i$th input record can be represented by a $d$-dimensional feature vector
$x_i = [x_{i,1}, \ldots, x_{i,d}]$ and label $y_i$, where $d$ is the total number of features and $x_{i,j}$ is the $j$th feature
value for the $i$th input record. The data set can be represented by a collection of feature vectors
$\{x_1, \ldots, x_n\}$ and labels $\{y_1, \ldots, y_n\}$, where $n$ is the total number of records in the training data set.
Note that we must use the data from the training file alone to create the model. Data from the
testing file can only be used for evaluating the performance of the model.
To create a model to predict the value of $y$ given an arbitrary $x$, we need to find a function $f$ such
that $f(x_i)$ roughly equals $y_i$ for each training record $i = 1, \ldots, n$. This can be cast as an optimization
problem over $f$. Depending on how we define $f$ and measure the error between $f(x)$ and $y$, many
different types of machine learning models can be used. For this assignment, you will create
a simple linear model based on the squared error between the predicted values $f(x_i)$ and the labels $y_i$ by
optimizing an objective function $L(w)$, referred to as Equation (1), with respect to the weight vector $w = [w_1, \ldots, w_d]$.
The objective function has two terms: the squared training error and a regularization term.
Here $f(x_i)$ is the linear prediction function over input $x_i$, defined by the weight vector $w$ and referred to as Equation (2):
$$f(x_i) = \sum_{j=1}^{d} w_j x_{i,j}$$
The second term in the objective function above is the regularization term that controls the tradeoff
between training error and generalization performance on the testing data. λ is a parameter
that controls the strength of regularization. Its value needs to be carefully selected. More on this in
the next section.
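Equation (1) is described here only in words: a squared-error term plus a regularization term weighted by λ. For illustration, one common form that matches this description is written out below. The squared L2 penalty and the scaling factors $\frac{1}{2n}$ and $\frac{\lambda}{2}$ are assumptions, not taken from the assignment; a different scaling changes the effective step size and the effective strength of λ, but not the overall procedure.

```latex
% ASSUMED form of the objective (scaling factors and L2 penalty are assumptions)
L(w) = \frac{1}{2n} \sum_{i=1}^{n} \bigl( f(x_i) - y_i \bigr)^2
       + \frac{\lambda}{2} \sum_{j=1}^{d} w_j^2,
\qquad
f(x_i) = \sum_{j=1}^{d} w_j x_{i,j}

% Partial derivative with respect to one weight, using \partial f(x_i) / \partial w_j = x_{i,j}
\frac{\partial L}{\partial w_j}
  = \frac{1}{n} \sum_{i=1}^{n} \bigl( f(x_i) - y_i \bigr) x_{i,j} + \lambda w_j
```

Under this assumed form, the first term of the derivative pulls the predictions towards the labels, while the second shrinks each weight towards zero in proportion to λ.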
The simplest method to optimize the objective function $L(w)$ above is to use gradient descent.
Gradient descent is an iterative optimization method which starts with an initial value for each $w_j$.
In iteration $t$, each $w_j$ is updated by the following equation:
$$w_j^{(t+1)} = w_j^{(t)} - \beta \frac{\partial L}{\partial w_j}$$
where $\frac{\partial L}{\partial w_j}$ is the partial derivative of $L$ with respect to $w_j$, evaluated at the current weights, and $\beta$ is the step size for the update, which
is usually set to a small value such as 0.01 or 0.001 to ensure the objective value $L$ decreases with
the updated $w$. The iterations are repeated until convergence, which is reached when
the weight values do not change (up to a small tolerance) between two consecutive iterations or the decrease
in the value of $L$ between two consecutive iterations is smaller than a predefined threshold.
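A minimal sketch of batch gradient descent for the linear model follows. It assumes a specific objective, $L(w) = \frac{1}{2n}\sum_i (f(x_i) - y_i)^2 + \frac{\lambda}{2}\sum_j w_j^2$; this scaling and the L2 penalty are assumptions made for the example, not taken from the assignment. The names `Vec`, `predict`, `objective`, and `train`, the zero initialization, and the `maxIter` safeguard are likewise hypothetical choices.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<double>;

// Linear prediction f(x) = sum_j w[j] * x[j] (no bias term).
double predict(const Vec& w, const Vec& x) {
    assert(w.size() == x.size());
    double s = 0.0;
    for (std::size_t j = 0; j < w.size(); ++j) s += w[j] * x[j];
    return s;
}

// ASSUMED objective: (1/2n) * sum_i (f(x_i) - y_i)^2 + (lambda/2) * sum_j w_j^2.
double objective(const Vec& w, const std::vector<Vec>& X, const Vec& y, double lambda) {
    double err = 0.0, reg = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        const double r = predict(w, X[i]) - y[i];
        err += r * r;
    }
    for (double wj : w) reg += wj * wj;
    return err / (2.0 * X.size()) + 0.5 * lambda * reg;
}

// Batch gradient descent starting from w = 0. Stops when the change in L
// between consecutive iterations is below tol, or after maxIter iterations.
Vec train(const std::vector<Vec>& X, const Vec& y, double lambda,
          double beta, double tol, int maxIter) {
    assert(!X.empty() && X.size() == y.size());
    const std::size_t n = X.size(), d = X[0].size();
    Vec w(d, 0.0);
    double prev = objective(w, X, y, lambda);
    for (int t = 0; t < maxIter; ++t) {
        // Gradient of the squared-error term at the current weights.
        Vec grad(d, 0.0);
        for (std::size_t i = 0; i < n; ++i) {
            const double r = predict(w, X[i]) - y[i];
            for (std::size_t j = 0; j < d; ++j) grad[j] += r * X[i][j] / n;
        }
        // Update every weight, adding the gradient of the regularization term.
        for (std::size_t j = 0; j < d; ++j) w[j] -= beta * (grad[j] + lambda * w[j]);
        const double cur = objective(w, X, y, lambda);
        if (std::fabs(prev - cur) < tol) break;
        prev = cur;
    }
    return w;
}
```

Checking the change in $L$ rather than the change in the weights is one of the two stopping rules mentioned in the text; either could be used. If $L$ increases from one iteration to the next, the step size is too large for the data and should be reduced.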
4 Parameter Selection
Parameter λ in Equation (1) needs to be carefully tuned to ensure optimal prediction performance.
However, we cannot tune this parameter based on test performance directly, since the
test data is unknown and cannot be used at training time. For parameter selection, one can use
a validation scheme that further splits the training data into two sets, the training set and the
validation set. For each parameter value, you train the model on the training set and evaluate it
on the validation set (see the next section on how to calculate the accuracy). The process is
repeated for all parameter values. The parameter that achieves the best accuracy on the validation
set is the optimal parameter for this data and model. Finally, you need to train the model on the
full training data using the optimal parameter selected by the validation scheme.
For this assignment, you need to implement a validation scheme to select the best λ, following the
process described above, from four possible values of λ, namely {0.001, 0.01, 0.1, 1}. The label
values for the training data are unbalanced, with about 75% of values equal to 0 and the remaining
25% equal to 1. You therefore need to make sure that when you split the training data into training and
validation sets by randomly assigning records to one of these sets, each set contains about
the same proportion of label values as the full data. This sampling strategy is called stratified
sampling. Once the model is trained, you can save the weights $w$ in a model file using either a text
or a binary format of your choice.
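The stratified split can be sketched by shuffling the record indices of each label separately and sending the same fraction of each group to the validation set. The names `Split` and `stratifiedSplit`, the `validationFraction` parameter, and the fixed seed are hypothetical choices; the assignment does not specify the split ratio.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Record indices assigned to each of the two sets.
struct Split {
    std::vector<std::size_t> train;
    std::vector<std::size_t> validation;
};

// Shuffle the indices of each label (0 or 1) separately, then move the same
// fraction of each group into the validation set.
Split stratifiedSplit(const std::vector<int>& labels, double validationFraction, unsigned seed) {
    assert(validationFraction > 0.0 && validationFraction < 1.0);
    std::vector<std::size_t> byLabel[2];
    for (std::size_t i = 0; i < labels.size(); ++i) {
        assert(labels[i] == 0 || labels[i] == 1);
        byLabel[labels[i]].push_back(i);
    }
    std::mt19937 rng(seed);
    Split out;
    for (std::vector<std::size_t>& group : byLabel) {
        std::shuffle(group.begin(), group.end(), rng);
        const std::size_t nVal = static_cast<std::size_t>(group.size() * validationFraction);
        for (std::size_t k = 0; k < group.size(); ++k) {
            (k < nVal ? out.validation : out.train).push_back(group[k]);
        }
    }
    return out;
}
```

For each candidate λ, a model would be trained on the `train` indices and scored on the `validation` indices; the λ with the highest validation accuracy is then used to retrain on all training records.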
5 Testing
After obtaining the model, you need to apply it to predict the labels for the test data and
calculate the predictive accuracy. First, once the test records are loaded from “census-test.csv”,
they need to be preprocessed by applying the same preprocessing steps that were previously
applied to the training data set. After preprocessing, each test record is converted to a feature
vector. We then apply the linear prediction model in Equation (2) to get the prediction score. Since
the label values are discrete (0 and 1), we need to convert the score to a discrete value by rounding
it to the nearest label value. Finally, we can calculate the predictive accuracy as
the ratio of correctly predicted records to the total number of test records.
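The scoring and accuracy computation can be sketched as follows. Rounding to the nearest of the two labels amounts to thresholding the score at 0.5; assigning a score of exactly 0.5 to label 1 is an arbitrary tie-breaking choice made here. The names `score`, `toLabel`, and `accuracy` are hypothetical.

```cpp
#include <cassert>
#include <vector>

using Vec = std::vector<double>;

// Linear score f(x) = sum_j w[j] * x[j], as in the bias-free model.
double score(const Vec& w, const Vec& x) {
    assert(w.size() == x.size());
    double s = 0.0;
    for (std::size_t j = 0; j < w.size(); ++j) s += w[j] * x[j];
    return s;
}

// Nearest label in {0, 1}: scores of 0.5 or more map to 1, anything lower to 0.
int toLabel(double s) { return s >= 0.5 ? 1 : 0; }

// Fraction of records whose predicted label equals the true label.
double accuracy(const Vec& w, const std::vector<Vec>& X, const std::vector<int>& y) {
    assert(!X.empty() && X.size() == y.size());
    std::size_t correct = 0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        if (toLabel(score(w, X[i])) == y[i]) ++correct;
    }
    return static_cast<double>(correct) / X.size();
}
```

Because about 75% of the training labels are 0, a model that always predicts 0 would already score roughly 75% on similarly distributed data, which is a useful baseline when judging the reported accuracy.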
6 Bonus Tasks (Optional)
Your program should implement all tasks described in the sections above. In addition, if you have time
and are interested, you can attempt the bonus tasks below.
1. Handling missing values. Both the training and test data files contain missing values, indicated
by a ‘?’ mark. Previously we skipped all records with missing values. For this task, you
need to come up with a solution to deal with records with missing values.
2. f with bias term. We use a linear prediction function $f$ in Equation (2) without the bias
term. For this task, you will update your training algorithm to deal with a prediction function
$f$ with the bias term $b$, as given by
$$f(x_i) = \sum_{j=1}^{d} w_j x_{i,j} + b$$
3. Cross validation. In Section 4, we described a simple validation scheme for parameter
selection. For this task, you will implement a more sophisticated and robust scheme for
parameter selection called k-fold cross validation. With k-fold cross validation, the training
data is randomly split into k disjoint subsets of roughly equal size using the stratified sampling
strategy. For each candidate parameter value, you train a model on k − 1 subsets and
evaluate its performance on the remaining subset. This is repeated k times, each time using
a different subset for evaluation and the rest for training. The performance is estimated by
averaging the performance metrics over the k rounds. The parameter value achieving the highest
performance is chosen as the optimal parameter for the task, and the model is retrained on
the full training data using the optimal parameter chosen by cross validation.
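For the cross validation task, the stratified fold assignment can be sketched as below: the records of each label are shuffled and then dealt to the k folds in round-robin order, so every fold receives about the same share of each label. The function name `stratifiedFolds` and the fixed seed are hypothetical choices.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Assign every record a fold id in [0, k). Records of each label (0 or 1)
// are shuffled and dealt round-robin, which keeps label proportions similar.
std::vector<int> stratifiedFolds(const std::vector<int>& labels, int k, unsigned seed) {
    assert(k >= 2);
    std::vector<std::size_t> byLabel[2];
    for (std::size_t i = 0; i < labels.size(); ++i) {
        assert(labels[i] == 0 || labels[i] == 1);
        byLabel[labels[i]].push_back(i);
    }
    std::mt19937 rng(seed);
    std::vector<int> fold(labels.size(), 0);
    int next = 0;  // carried across label groups so fold sizes stay balanced
    for (std::vector<std::size_t>& group : byLabel) {
        std::shuffle(group.begin(), group.end(), rng);
        for (std::size_t idx : group) {
            fold[idx] = next;
            next = (next + 1) % k;
        }
    }
    return fold;
}
```

In round r, the records with `fold[i] == r` form the evaluation subset and all other records form the training subset. Averaging the k accuracies for each candidate λ gives the score used to pick the parameter.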
7 Submission
You need to submit your source code, written in a compiled language of your choice, accompanied
by a Readme file briefly describing the design of your program and giving instructions for compiling
and running it. If you attempted bonus tasks, please also include descriptions of your
solutions in the Readme file.