ECE 730 - Statistical Learning 1
ECE 730 Project
Classifiers:
1. Logistic regression
2. Linear classifier based on the Gaussian generative model
3. Support vector machine
4. Decision tree
5. Random forest
6. Adaboost.
Feature selection method
Compute the coefficient of variation (CoV) for each feature, and select the features whose
CoV value is greater than a threshold. You can use other feature selection methods.
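For example, the CoV-based selection can be sketched in Python as follows (the threshold value here is arbitrary and would be tuned for your data):

```python
import numpy as np

def cov_select(X, threshold=0.1):
    """Return indices of features whose coefficient of variation
    (std / |mean|) exceeds the threshold."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Guard against division by zero for near-constant-mean features.
    cov = np.where(np.abs(mean) > 1e-12, std / np.abs(mean), 0.0)
    return np.where(cov > threshold)[0]

# Toy example: the second feature varies much more (relative to its
# mean) than the first, so only it survives the threshold.
X = np.array([[10.0, 1.0], [10.1, 5.0], [9.9, 9.0]])
print(cov_select(X, threshold=0.1))  # -> [1]
```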
Simulation of training data
Suppose that there are K ≥ 2 classes and p features. The first p1 < p features determine
the class label and the remaining features are irrelevant. First, select K p1 × 1 vectors mk,
k = 1, . . . , K, where all vectors are different; then append p − p1 zeros to each of these
vectors to form p × 1 vectors μk, k = 1, . . . , K. Generate nk data points from the Gaussian
distribution with mean μk and covariance σ²I, where I is the p × p identity matrix;
these are the nk data points from the kth class. Generate data points for all K classes to
obtain a training dataset of n = Σ_{k=1}^{K} nk samples.
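The simulation above can be sketched in Python as follows (the means mk are drawn randomly here, which makes them distinct with probability one; any choice of distinct vectors works, and equal class sizes are used only for simplicity):

```python
import numpy as np

def simulate_data(K, p, p1, n_per_class, sigma2, rng=None):
    """Generate the Gaussian training set described above:
    K classes, p features, only the first p1 features informative."""
    rng = np.random.default_rng(rng)
    # Distinct p1-dimensional means mk, padded with p - p1 zeros.
    means = [np.concatenate([rng.normal(0, 3, p1), np.zeros(p - p1)])
             for _ in range(K)]
    # n_per_class points per class from N(mu_k, sigma^2 I).
    X = np.vstack([m + rng.normal(0, np.sqrt(sigma2), (n_per_class, p))
                   for m in means])
    y = np.repeat(np.arange(K), n_per_class)
    return X, y

X, y = simulate_data(K=3, p=10, p1=4, n_per_class=50, sigma2=1.0, rng=0)
print(X.shape, y.shape)  # -> (150, 10) (150,)
```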
Real data
Two datasets from the book “The Elements of Statistical Learning” by Hastie, Tibshirani,
and Friedman: the spam dataset and the zip code dataset. Both are available at
https://web.stanford.edu/~hastie/ElemStatLearn/
Please select one classifier and do the following:
1. Implement the classifier with and without feature selection. For logistic regression, do
not use the CoV method to select features; instead, use the Lasso or elastic net penalty.
That is, you will implement logistic regression without regularization and with Lasso or
elastic net regularization. For the other classifiers, use the CoV method to select features.
If you choose logistic regression, SVM, or decision tree, you need to use cross-validation
to determine the parameters of the model. If you choose Adaboost, you can select any base
classifier (not necessarily one from this list).
You are allowed to use existing software that implements these classifiers. For example,
you can use the glmnet software for logistic regression regularized with the elastic net
penalty.
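As an alternative sketch of the regularized option, scikit-learn's LogisticRegressionCV fits an elastic-net-penalized logistic regression with built-in cross-validation over the regularization strength, analogous to glmnet (the toy data below are hypothetical, and the l1_ratio grid is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Hypothetical toy data: 100 samples, 20 features, binary labels
# driven by the first two features only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 100) > 0).astype(int)

# 5-fold CV over 10 regularization strengths; l1_ratio mixes the
# Lasso (1.0) and ridge (0.0) penalties.
clf = LogisticRegressionCV(Cs=10, cv=5, penalty="elasticnet",
                           solver="saga", l1_ratios=[0.5], max_iter=5000)
clf.fit(X, y)
selected = np.flatnonzero(clf.coef_[0])  # features with nonzero weight
print(len(selected), "features kept")
```

The nonzero coefficients play the role of the selected feature set, so the "with selection" classifier is simply the fitted model restricted to those features.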
2. Use simulated data to investigate the performance of the classifier. More specifically,
you need to investigate how the classification error is affected by the noise level (σ²),
the number of features p, the number of relevant features p1, the number of training
samples n, the feature selection method, etc.
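One possible harness for such an experiment, using a simple nearest-class-mean classifier as a stand-in for whichever classifier you chose (all sizes, grids, and the number of trials below are illustrative):

```python
import numpy as np

def error_vs_noise(sigma2_grid, K=2, p=10, p1=3, n=200, trials=20, rng=0):
    """Estimate test error of a nearest-class-mean classifier as the
    noise level sigma^2 grows, averaged over random repetitions."""
    rng = np.random.default_rng(rng)
    errors = []
    for s2 in sigma2_grid:
        errs = []
        for _ in range(trials):
            mu = np.zeros((K, p))
            mu[:, :p1] = rng.normal(0, 2, (K, p1))      # relevant features
            y = rng.integers(0, K, 2 * n)
            X = mu[y] + rng.normal(0, np.sqrt(s2), (2 * n, p))
            Xtr, ytr, Xte, yte = X[:n], y[:n], X[n:], y[n:]
            centers = np.array([Xtr[ytr == k].mean(axis=0)
                                for k in range(K)])
            pred = np.argmin(((Xte[:, None] - centers) ** 2).sum(-1), axis=1)
            errs.append((pred != yte).mean())
        errors.append(np.mean(errs))
    return errors

errs = error_vs_noise([0.1, 1.0, 10.0])
print(errs)  # error increases with sigma^2
```

The same loop structure applies to the other factors (p, p1, n, feature selection on/off): hold everything else fixed, vary one factor over a grid, and plot the averaged test error.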
3. Select one real dataset, run your classifier on it, and compare the performance of the
classifier with and without feature selection.
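An illustrative sketch of such a comparison, shown on synthetic stand-in data (loading the actual spam or zip code files is left to you and no file paths are assumed here; scikit-learn's random forest and the CoV threshold are used only as examples):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 30 features, of which only the first 5 carry
# the label; replace X, y with the real dataset for item 3.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 300)
X = rng.normal(5.0, 0.5, size=(300, 30))   # irrelevant features: CoV ~ 0.1
X[:, :5] += y[:, None] * 3.0               # informative features: CoV ~ 0.24

def cov_mask(X, threshold):
    """Boolean mask of features whose CoV (std / |mean|) exceeds threshold."""
    m, s = X.mean(axis=0), X.std(axis=0)
    cov = np.where(np.abs(m) > 1e-12, s / np.abs(m), 0.0)
    return cov > threshold

mask = cov_mask(X, 0.15)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
acc_full = cross_val_score(clf, X, y, cv=5).mean()
acc_sel = cross_val_score(clf, X[:, mask], y, cv=5).mean()
print(mask.sum(), "features kept;",
      round(acc_full, 3), "vs", round(acc_sel, 3))
```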
Each student selects one classifier and one of the following two options for simulation
and real data: 1) K = 2 for computer simulation and the spam dataset, or 2) K > 2 for
computer simulation and the zip code dataset. This yields 12 settings (6 classifiers × 2
options). No setting can be chosen by more than one student.
1. The report should be clearly written and include at least the following sections:
(a) Introduction. You should clearly describe the problem to be solved and
explain the purpose or motivation of the project. If you choose the project I
assigned, you can use part of my problem statement, but you need to state the
purpose or motivation based on your own understanding.
(b) Describe your approach to solving the problem in detail. You can choose a proper
title for this section.
(c) Results and Discussion. Describe your simulation setup and real data in enough
detail that others can repeat your experiments. Present your results using figures,
tables, etc., and interpret and discuss them.
(d) Conclusions. Summarize the project briefly and draw conclusions based on your
results.
2. Submit your computer program as well; I may ask you to run your program to
reproduce the results in your report. You can use any computer language; my
recommendation is R or Matlab.