Algorithm Assignment

Date: 2019-11-14 11:22

Objectives

Curriculum Design is a systematic and practical means of consolidating the theories and methods taught in the Data Mining course. For the Data Mining Curriculum Design, several simulated real-application data sets are provided and several curriculum design projects are planned. By completing the Curriculum Design, students will master the following:

1.Handling real application data with database techniques;

2.The steps of big data mining with elementary supervised learning methods;

3.The strategies for evaluating classifiers;

4.The main factors that affect a classifier’s performance;

5.The primary tools for solving real application problems with data mining.



Project 1: Comparison between supervised learning algorithms

1.Data set

Refer to the attached files: adult.train, adult.test and adult.desctiption.

The adult.train file is used for training and adult.test for testing; adult.desctiption describes the attributes in the data.

Missing values in the data are labeled ‘?’.

2.Tasks

(1)Data preprocessing. Migrate the data from the files to a database such as Oracle, then process the data using database techniques. Remove the tuples with missing values.

(2)Build prediction models using the training data. Train one classifier with each of the elementary supervised learning methods: Naïve Bayesian classification, ID3, C4.5, CART and BPANN.

(3)Compare the accuracy of the different classifiers.
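
Task (1) above can be sketched as follows, with SQLite from the Python standard library standing in for Oracle. That substitution is made only so the example runs anywhere; it is not what the assignment prescribes. The three-row sample, the four column names and the table name adult are assumptions chosen for illustration, and the real files contain more attributes.

```python
import csv
import io
import sqlite3

# Hypothetical excerpt in a comma-separated layout; only four attributes
# are kept so that the sketch stays short.
SAMPLE = """39, State-gov, Bachelors, <=50K
54, ?, HS-grad, >50K
28, Private, ?, <=50K
"""

COLUMNS = ["age", "workclass", "education", "income"]  # assumed column names


def load_and_clean(text):
    """Load rows into an in-memory table, then delete tuples containing '?'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE adult "
        "(age INTEGER, workclass TEXT, education TEXT, income TEXT)"
    )
    rows = [[field.strip() for field in row]
            for row in csv.reader(io.StringIO(text)) if row]
    conn.executemany("INSERT INTO adult VALUES (?, ?, ?, ?)", rows)
    # A tuple is removed if any attribute carries the missing-value marker.
    condition = " OR ".join(f"{col} = '?'" for col in COLUMNS)
    conn.execute(f"DELETE FROM adult WHERE {condition}")
    conn.commit()
    return conn


conn = load_and_clean(SAMPLE)
remaining = conn.execute(
    "SELECT age, workclass, education, income FROM adult"
).fetchall()
print(remaining)  # only the first sample row has no '?' and survives
```

With a real database server the same DELETE statement would apply, with the connection call replaced by the driver for that server.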



Project 2: Investigation of noisy data impact

1.Data set

Refer to the data for project 1.

2.Tasks

(1)Data preprocessing. Do not remove the tuples with missing values. Instead, replace each missing value with a suitable value derived from the same column, e.g., the mean value, a regressed value, or another value obtained by data imputation techniques.

(2)Build a prediction model using C4.5.

(3)Compare the accuracy of the C4.5 classifiers built on the two versions of the data: the one with the missing-value tuples removed and the one with the missing values replaced.
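
The replacement step in task (1) might look like the minimal sketch below. It uses the column mean for numeric attributes, as the task suggests, and the most frequent value for categorical attributes, which is one possible reading of "other values" and is an assumption here. The function name impute_column and the sample columns are hypothetical.

```python
from collections import Counter
from statistics import mean

MISSING = "?"


def impute_column(values, numeric):
    """Return a copy of one column with '?' replaced by the mean or the mode."""
    known = [v for v in values if v != MISSING]
    if numeric:
        fill = mean(float(v) for v in known)
        return [fill if v == MISSING else float(v) for v in values]
    # Categorical column: fall back to the most frequent known value.
    fill = Counter(known).most_common(1)[0][0]
    return [fill if v == MISSING else v for v in values]


ages = ["30", "?", "50"]
workclass = ["Private", "?", "Private", "State-gov"]

imputed_ages = impute_column(ages, numeric=True)
imputed_workclass = impute_column(workclass, numeric=False)
print(imputed_ages)       # [30.0, 40.0, 50.0]
print(imputed_workclass)  # ['Private', 'Private', 'Private', 'State-gov']
```

The fill values should be computed from the training data only and then reused for the test data, so that no information from the test set leaks into preprocessing.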



Project 3: Simulated application

1.Introduction to letter recognition application

The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 numerical attributes.


Examples of the character images generated by these procedures are presented in the Figure. Each character image was then scanned, pixel by pixel, to extract 16 numerical attributes. These attributes represent primitive statistical features of the pixel distribution. To achieve compactness, each attribute was then scaled linearly to a range of integer values from 0 to 15. This final set of values was adequate to provide a perfect separation of the 26 classes. That is, no feature vector mapped to more than one class. The attributes (before scaling to 0-15 range) are:

(1) The horizontal position, counting pixels from the left edge of the image, of the center of the smallest rectangular box that can be drawn with all "on" pixels inside the box.

(2) The vertical position, counting pixels from the bottom, of the above box.

(3) The width, in pixels, of the box.

(4) The height, in pixels, of the box.

(5) The total number of "on" pixels in the character image.

(6) The mean horizontal position of all "on" pixels relative to the center of the box and divided by the width of the box. This feature has a negative value if the image is "left-heavy", as would be the case for the letter L.

(7) The mean vertical position of all "on" pixels relative to the center of the box and divided by the height of the box.

(8) The mean squared value of the horizontal pixel distances as measured in 6 above. This attribute will have a higher value for images whose pixels are more widely separated in the horizontal direction as would be the case for the letters W or M.

(9) The mean squared value of the vertical pixel distances as measured in 7 above.

(10) The mean product of the horizontal and vertical distances for each "on" pixel as measured in 6 and 7 above. This attribute has a positive value for diagonal lines that run from bottom left to top right and a negative value for diagonal lines from top left to bottom right.

(11) The mean value of the squared horizontal distance times the vertical distance for each "on" pixel. This measures the correlation of the horizontal variance with the vertical position.

(12) The mean value of the squared vertical distance times the horizontal distance for each "on" pixel. This measures the correlation of the vertical variance with the horizontal position.

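(13) The mean number of edges (an "on" pixel immediately to the right of either an "off" pixel or the image boundary) encountered when making systematic scans from left to right at all vertical positions within the box.

(14) The sum of the vertical positions of edges encountered as measured in 13 above.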

(15) The mean number of edges (an "on" pixel immediately above either an "off" pixel or the image boundary) encountered when making systematic scans of the image from bottom to top over all horizontal positions within the box.

(16) The sum of horizontal positions of edges encountered as measured in 15 above.
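
To make the first seven attributes concrete, the sketch below computes them for a tiny hand-drawn letter L. The bitmap, the 0-based pixel coordinates and the dictionary keys are assumptions made for illustration; the original feature extractor and its exact conventions are not given in this document, and the final scaling to the 0-15 range is omitted.

```python
# A hypothetical 5x5 bitmap of the letter L; 'X' marks an "on" pixel.
LETTER_L = [
    ".....",
    ".X...",
    ".X...",
    ".X...",
    ".XXXX",
]


def box_features(image):
    """Compute attributes (1)-(7) for a binary image given as rows of '.'/'X'."""
    n_rows = len(image)
    # (x, y) of every "on" pixel: x counts from the left edge,
    # y counts from the bottom edge, both starting at 0 (an assumption).
    on = [(col, n_rows - 1 - row)
          for row, line in enumerate(image)
          for col, pixel in enumerate(line) if pixel == "X"]
    xs = [x for x, _ in on]
    ys = [y for _, y in on]
    width = max(xs) - min(xs) + 1
    height = max(ys) - min(ys) + 1
    x_center = (min(xs) + max(xs)) / 2
    y_center = (min(ys) + max(ys)) / 2
    return {
        "box_x": x_center,                                   # attribute (1)
        "box_y": y_center,                                   # attribute (2)
        "width": width,                                      # attribute (3)
        "height": height,                                    # attribute (4)
        "on_pixels": len(on),                                # attribute (5)
        "mean_x": (sum(xs) / len(xs) - x_center) / width,    # attribute (6)
        "mean_y": (sum(ys) / len(ys) - y_center) / height,   # attribute (7)
    }


features = box_features(LETTER_L)
print(features)
```

For this bitmap mean_x comes out negative, which matches the remark in (6) that a left-heavy image such as the letter L gives a negative value.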

2.Data set

Refer to the attached files: letter-recognition.data and letter-recognition.desctiption.

The letter-recognition.data file is used for both training and testing; letter-recognition.desctiption describes the attributes in the data.

3.Tasks

(1)Data preprocessing. Migrate the data from the files to a database such as Oracle.

(2)Partition the data by the hold-out method, i.e., randomly divide the data into two parts, 2/3 as the training set and 1/3 as the test set.

(3)Build a prediction model using C4.5 on the training set.

(4)Assess its accuracy on the test set.
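
The partition in task (2) and the accuracy measure in task (4) can be sketched as below. The function names, the fixed seed and the use of plain integers as stand-ins for the 20,000 records are assumptions; training the C4.5 model itself is not shown.

```python
import random


def holdout_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the records and cut it into training and test parts."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(true_labels, predicted_labels):
    """Fraction of test tuples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


records = list(range(20000))          # stand-in for the 20,000 stimuli
train, test = holdout_split(records)
print(len(train), len(test))          # 13333 6667

demo_accuracy = accuracy(["A", "B", "C"], ["A", "B", "D"])
print(demo_accuracy)                  # two of three labels match
```

Fixing the seed makes the split reproducible, which helps when the same partition has to be reused for later comparisons.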



Project 4: Comparison between evaluation methods

1.Data set

Refer to the data for project 3.

2.Tasks

(1)Build a prediction model/classifier using C4.5.

(2)Evaluate its accuracy by the hold-out method (i.e., as in project 3), random sampling (repeated hold-out), 10-fold cross-validation (10-CV), stratified 10-CV and the bootstrap, respectively.

(3)Compare the accuracy obtained for the C4.5 classifier under the different evaluation methods.
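
The resampling schemes in task (2) differ mainly in how the record indices are divided. The sketch below shows one way to generate those index sets for 10-CV, stratified 10-CV and the bootstrap; the function names and the two-class toy labels are assumptions, and fitting and scoring the classifier on each split is left out.

```python
import random
from collections import Counter, defaultdict


def kfold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint test folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def stratified_kfold_indices(labels, k=10, seed=0):
    """Deal each class round-robin so every fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for index in members:
            folds[position % k].append(index)
            position += 1
    return folds


def bootstrap_indices(n, seed=0):
    """Draw n training indices with replacement; unseen indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


labels = ["A"] * 30 + ["B"] * 20      # toy labels, 50 records
plain_folds = kfold_indices(len(labels))
strat_folds = stratified_kfold_indices(labels)
boot_train, boot_test = bootstrap_indices(len(labels))

print([len(fold) for fold in plain_folds])
print([Counter(labels[i] for i in fold) for fold in strat_folds])
print(len(boot_train), len(boot_test))
```

In each cross-validation round one fold serves as the test set and the remaining nine as the training set. Random sampling can be obtained by repeating a hold-out split with different seeds and averaging the accuracies.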



Project 5: Investigation of the effect of pruning on overfitting

1.Data set

Refer to the data for project 3.

2.Tasks

(1)Build a prediction model using CART.

(2)Build a prediction model using CART with cost-complexity pruning (CCP).

(3)Compare the accuracy of the CART classifiers without and with pruning.
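
The core step of CCP in task (2) is finding the "weakest link": for each internal node t, g(t) = (R(t) - R(T_t)) / (|leaves(T_t)| - 1), where R(t) is the error if t were collapsed into a leaf and R(T_t) is the total error of the leaves below it. The node with the smallest g(t) is pruned first. The sketch below applies this to a hand-made tree; the dictionary layout, the node names and the error counts are all hypothetical, and raw misclassification counts are used in place of error rates.

```python
def leaves(node):
    """Return all leaf nodes at or below the given node."""
    if not node.get("children"):
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]


def link_strength(node):
    """g(t) = (R(t) - R(T_t)) / (|leaves(T_t)| - 1) for an internal node t."""
    below = leaves(node)
    subtree_errors = sum(leaf["errors"] for leaf in below)
    return (node["errors"] - subtree_errors) / (len(below) - 1)


def internal_nodes(node):
    """Yield every node that still has children."""
    if node.get("children"):
        yield node
        for child in node["children"]:
            yield from internal_nodes(child)


def prune_weakest_link(root):
    """Collapse the internal node with the smallest g(t); return its name and g(t)."""
    weakest = min(internal_nodes(root), key=link_strength)
    strength = link_strength(weakest)
    weakest.pop("children")
    return weakest["name"], strength


# "errors" is the number of training tuples misclassified if the node were a leaf.
tree = {"name": "root", "errors": 40, "children": [
    {"name": "left", "errors": 10, "children": [
        {"name": "left-a", "errors": 4},
        {"name": "left-b", "errors": 5},
    ]},
    {"name": "right", "errors": 12},
]}

root_strength = link_strength(tree)   # (40 - 21) / (3 - 1) = 9.5
pruned = prune_weakest_link(tree)     # "left" has g = (10 - 9) / (2 - 1) = 1.0
print(root_strength, pruned, len(leaves(tree)))
```

Repeating the pruning step yields a sequence of progressively smaller trees; the one to keep is usually chosen by its accuracy on held-out data or by cross-validation.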



Requirements

1.The experiment is carried out in groups of no more than 5 students. Every group has to finish the 5 compulsory projects before the due date.

2.Python or R may be used to program your projects, but Python is preferred since it will help you find a good job in the near future.

3.To finish the projects, you may download packages from online resources and modify them, but you should understand all the code involved in your projects.

4.To ensure that the curriculum design can be carried out smoothly, each group should select one member as the head in charge of the team’s work. The head is responsible for organizing the team members to complete the five projects in collaboration, and has the right to assign tasks to each member and to decide the contribution rate of each member.

5.Finally, when writing the curriculum design report, each group member may, with the coordination and consent of the group head, choose at most one project to describe the work completed.



Evaluation of your work

1.Your performance in this course is evaluated on the basis of the curriculum design report. Everyone should complete the report according to the tasks. Your performance will be judged by:

Completeness. Each group should finish all 5 compulsory projects; otherwise the team score will be reduced by a certain amount for each missing project or task.

Correctness. Please repeat each project several times until you are sure that the final result is right.

Format. Please edit your report in a unified format. The font style and size, as well as the figures and tables, should each follow one consistent format throughout the report. Readability of the format accounts for 50% of the total score.

Plagiarism. Everyone should finish his or her curriculum design report independently. If any two reports are found to be identical, all of them will be penalized equally.

2.Curriculum Design Report. Everyone must write a report on his or her selected project. The report can be prepared as a Word file and written in either English or Chinese. You are required to submit two versions of your curriculum design report: a printed one and an electronic one.

3.How to submit the electronic curriculum design report.

Please refer to: the electronic file submission template for Database System Curriculum Design 2.

A blank page is left here

From the following page, your work is presented



Project ?: (please give your project number and name)

Contents

