CS 610 Spring 2019
Performance of Skip Lists
Course Project Option 1
Due May 3
The goal of this project is to analyze the performance of Skip List implementation of Ordered Dictionary Chapter 7 Assignment Due: To Be Determined by Class Vote by the End of the First Week of Teaching The penalty for late hand-in of coursework is 10% of the total marks per working day. No credit will be given after more than 5 working days. A working day is a 24-hour period starting from the original hand-in deadline, but skipping days on which the School is closed. 7.1 Individual Assignment (40%) We will study the so-called South-African Heart-Disease (SAHD) data. Necessary background is in [3, Section 4.4.2]. Briefly, observations on n = 462 individuals are available. On each individual, the data consists of: a binary indicator (value 0 or 1) of the presence of Coronary Heart Disease (1=have disease, 0=does not have disease), viewed as a response; and the following raw inputs: systolic blood pressure; tobacco use; LDL, also known as “bad cholesterol”; (binary) indicator of family history of heart disease; (measure of) obesity; (measure of) alcohol use; and age. In data file SAHD.mat, the response variable is chd, and the inputs are sbp, tobacco, ldl, famhist, obesity, alcohol, age respectively. We want to regress the response on inputs, where an input may be a raw input or a function of it (e.g., a quadratic function). The main tool is a logit (logistic) regression of the form Yi ~ Bernoulli(μi) and independent, where logit(μi) = X p j=1 Xj,iβj (7.1) where logit(μ) = log(μ/(1μ)), and (Yi , X1,i, . . . , Xp,i) is the i-th observation of the response and the inputs. Another possibility is probit regression of the form Yi ~ Bernoulli(μi) and independent, where μi = Φ(X p j=1 Xj,iβj ), (7.2) where Φ() is the Normal(0,1) cdf. [3, Section 4.4.2] fit logit models of the form (7.1). In [3, Section 5.2.2, pages 146-148], a more advanced model in equation (5.6) there captures the effect of a raw input Xj on the response by a spline (piecewise-polynomial) function hj (Xj ); the (estimated) functions are shown in Figure 5.4 for selected inputs. These findings suggest the following: ? Variables sbp, tobacco, and obesity may appear in (7.1) in quadratic form (b1x + b2x 2 , where x is the raw input) or in cubic form (b1x + b2x 2 + b3x 3 ). To capture the effect of age, which does not seem to resemble a low-order polynomial, consider as inputs the binary indicators of categorized age, similar to what is done in [5, Example 10]. For example, the sample deciles of age, namely 18, 28, 34, 40, 45, 49, 54, 58, 61, and 64, define the intervals (0, 18], (18, 28], (28, 34], ..., (61, 64], which give 10 categories of age. 47 Standard tasks (model estimation; testing the significance a certain coefficient; computing the AIC; etc.) may be done via sahd.m Part 1, which is similar to sco.m (lab 3, Section 6.2.2). (Part 2 supports the group assignment.) A very thorough exploration could involve transformed raw inputs beyond those seen there. Those wishing to have matlab on a non-University computer should start at software.soton.ac.uk. Install at least matlab (the basic product); and the toolboxes on statistics and optimization. 7.1.1 Deliverables and Marking Scheme Submit a typed report to the Faculty Office, with the usual cover about academic integrity. The audience is a hypothetical manager or analyst familiar with our notes, the brief and the analyzes in [3] cited above. Discuss results, and analysis as necessary, on the following: 1. Develop a preferred model of the response. A better (more parsimonious) model is often found by removing an input Xj if the hypothesis “βj = 0” cannot be rejected at some level α (the corresponding t value satisfies |t| < Φ 1 (1 α/2), where Φ1 () is the Normal(0,1) inverse cdf); and then re-fitting the model. Use of a model-selection criterion, such as Akaike Information (AIC), is recommended. 2. Use the preferred model to describe the effect of each of obesity and age on the likelihood of heart disease. For background on interpretation, see Section 4.5.4 and the second-fromlast paragraph in [3, Section 4.4.2]. 3. Brief explanation of how results were obtained can appear in an appendix. If significant extensions are made to any codes provided to you during the course (sahd.m or other), then: (a) submit the code on blackboard (individual assignment), and state in the report the submitting user (e.g., “xy1g09”) (so the code can be matched to the report); and (b) provide instructions on how the code can be used. Avoid re-stating the brief or the codes provided. Length: up to 400 words, excluding tables, figures, and their captions. Marking Scheme: 40%. Plausibility of the proposed model. The plausibility will be judged against the findings in [3] and against experiments the manager could attempt; thus your analysis should be informed by the above. For example, while Table 4.3 of [3] suggests that obesity and systolic blood pressure do not have a significant effect, the reverse is found in [3, Figure 5.4]; the latter suggests that appropriate functions of these inputs should enter the regression (e.g., as a quadratic function, or a categorization). 30%. Correctness of estimation of any model being proposed. 30%. Quality of presentation (clarity, coherence, ease of reading). 7.2 Group Assignment (60%) We wish to develop a classification (prediction) method of the response (0/1 indicator of heart disease) based on the inputs. The aim is to minimize the expected loss (per observation, as in (5.7)) under the following cost structure: correct classification costs nothing; misclassification of class 0 (true is 0, prediction is 1) costs 1 unit; and mis-classification of class 1 (true is 1, prediction is 0) costs 10 units. Similar to the analysis seen in lab 4, we consider classifiers defined by a Bernoulli (GLM) model of the response and a classification threshold t, as in (6.1); and the main aim is to minimize the expected loss by choice of a classifier. The more extensive the set of models and thresholds considered, the smaller the loss we can expect. Moreover, a “better” model (with an appropriate threshold) is more likely to minimize the loss; thus, the model chosen for the individual assignment seems (a priori) a strong contender. 48 Script sahd.m, Part 2, implements an analysis of the form above, similar to cvsco.m (lab 4). Specifically, a set of classifiers is developed; the expected (out-of-sample) loss is estimated via cross-validation (section 5.3, equation (5.10)); and the classifier minimizing the estimated loss is selected. A second (subsequent) task is to estimate the expected loss, L say, of the selected classifier. The CV of the selected classifier (minimum CV across all classifiers considered) is an optimistic (biased low) estimate of L [3, Section 7.2, page 222]. To obtain an unbiased estimate, the following is proposed: (a) on the first two thirds of the data, apply cross-validation for classifier selection; and (b) on the last third of the data, the selected classifier is applied; the average loss is an unbiased estimate of L. For this task, minor modification of sahd.m (or a new code) may be necessary. 7.2.1 Deliverables and Marking Scheme Submit a typed report to the Faculty Office, with the usual cover about academic integrity. The audience is a hypothetical manager or analyst familiar with our notes, the brief and the analyzes in [3] cited above. Discuss results, and analysis as necessary, on the following: 1. State a preferred classifier, with the aim of minimizing the expected loss. In estimating the loss, leave-one-out cross-validation (Section 5.3) seems preferable (the modest size of the sample makes this practical). 2. Provide an unbiased estimate of the expected loss L of the selected classifier. 3. Brief explanation of how results were obtained can appear in an appendix. If significant extensions are made to any codes provided to you during the course (sahd.m or other), then: (a) submit the code on blackboard (group assignment), and state in the report the submitting user (e.g., “xy1g09”) (so the code can be matched to the report); and (b) provide instructions on how the code can be used. Avoid re-stating the brief or the codes provided. Length: up to 500 words, excluding tables, figures, and their captions. Marking Scheme: 50%. Thoroughness of exploration of potential classifiers, and success in minimizing the expected loss. 20%. Correctness of estimation of the expected loss L of the selected classifier. 30%. Quality of presentation: clarity, coherence, ease of reading. Group Work Work in a group of size up to 4. Each group makes a single submission of the deliverables. Group members are responsible for functioning as a team. A common mark will be given to the members of a group. In exceptional cases where there is strong evidence of lack of contribution by a member, a correction may be made. A key requirement for making such a correction is majority opinion (i.e., 2 or more against 1). To help resolve such cases, keeping minutes of work meetings may be appropriate. 49
First define an interface for Ordered Dictionary ADT with the required methods and then provide the Skip list implementation. You should give your own implementation of Skip list (not borrowed from any web site or from anyone else).
Second part of the project requires you to write a program to test the performance of this implementation for findElement, insertElement, removeElement, closestKeyAfter operations. Specifically, you should generate randomly a sequence of unique integers in a range (say 1 to 100000) to be inserted as keys starting from an empty ADT. You should intersperse insert operations with equal number of findElement, removeElement and closestKeyAfter operations in the sequence. You should make sure there will be sufficient successful and unsuccessful searches and valid delete operations. One approach is as follows : Suppose you have a “findElement” operation as m-th operation in the sequence and let … be the keys inserted until then pick a number “t” randomly between 1 and 2m+1 and use key if t = 2l, key if t = 2l-1, for 1 ≤ l ≤ m and key + 1 if t = 2m+1. You can use same approach for “closestKeyAfter” operation. For “removeElement”, if … be the keys inserted until then pick a number “t” randomly between 1 and m and choose key if t = l.
You should run this sequence of these operations many times and also keep statistics to measure average time for the four ADT operations as a function of number of keys in the tree at the time when operation was executed. This will enable you to plot the average time as a function of input size and make conclusions about the asymptotic growth of its performance (big-oh notation).
You should implement it in either C++ or Java (preferable). For Java, submit a zip file containing (a) source code, (b) Jar file containing .class files, (c) a Windows command file that I can run to test the first part of the project by entering a sequence of ADT operations with key values when prompted and observing the skip list configuration printed by the program, (d) a Windows command file that I can run to execute the test program of second part of the project and see the performance results in a tabular form and (e) one-page summary of your findings regarding the asymptotic complexity analysis of for the various ADT operations along with the performance graphs to support your conclusions. For C++, instead of (b), submit the executables that can be run on an AFS UNIX host and for (c) and (d), submit Unix shell scripts to run the test programs.
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。