Coursework Assessment Pro-forma
Module Code: CMT309
Module Title: Computational Data Science
Assessment Title: CMT309 Programming Exercises
Assessment Number: 3
Date set: 06-03-2020
Submission date and time: 08-05-2020 at 9:30 am
Return date:
This assignment is worth 40% of the total marks available for this module. If coursework is
submitted late (and where there are no extenuating circumstances):
1 - If the assessment is submitted no later than 24 hours after the deadline, the
mark for the assessment will be capped at the minimum pass mark;
2 - If the assessment is submitted more than 24 hours after the deadline, a mark
of 0 will be given for the assessment.
Your submission must include the official Coursework Submission Cover sheet, which can
be found here:
https://docs.cs.cf.ac.uk/downloads/coursework/Coversheet.pdf
Submission Instructions
Your coursework should be submitted via Learning Central by the above deadline. You have
to upload the following files:
Description                               Type                    Name
Cover sheet                  Compulsory   One PDF (.pdf) file     Student_number.pdf
Your solution to question 1  Compulsory   One Python (.py) file   Q1.py
Your solution to question 2  Compulsory   One Python (.py) file   Q2.py
Your solution to question 3  Compulsory   One Word (.docx) file   Q3.docx
For the filename of the Cover Sheet, replace ‘Student_number’ by your student number, e.g.
“C1234567890.pdf”. Make sure to include your student number as a comment in all of the
Python files! Any deviation from the submission instructions (including the number and types
of files submitted) may result in a reduction of marks for the assessment or question part.
You can submit multiple times on Learning Central. ONLY files contained in the last attempt
will be marked, so make sure that you upload all files in the last attempt.
Assignment
Start by downloading the following files from Learning Central:
• Q1.py
• acronym_example1.txt, acronym_example2.txt, acronym_example3.txt,
acronym_example4.txt, acronym_tuples.txt
• Q2.py
• Q3.py
• Q3.docx
Then answer the following questions. You can use any Python expression or package that was
used in the lectures. Additional packages are not allowed unless instructed in the question.
Question 1 - What is the long form of the acronym? (Total 35 marks)
In this question, your task is to implement several functions that parse text strings for
acronyms and their long forms. Acronyms are abbreviations typically formed from the initial
letters of multiple words and pronounced as a word. For instance, the acronym "GPU" stands
for the long form "graphics processing unit". In this question, an acronym is defined as a
character sequence of at least two successive capital letters. Your task is to implement several
functions that together parse a text for acronyms and find their long forms.
As an example text, let us define the string
s = ("A GPU, which stands for graphics processing unit, is different "
     "from CPUs, says the IT expert. For some operations, a GPU is faster "
     "than a CPU. GPUs are not always faster though.")
Q1 a) Parse acronyms (10 marks)
Write a function read_file(filename) that receives as input a filename. The filename
includes the filepath. The function returns the entire content of the file as a single string.
Write a function find_acronyms(s) that receives as input a string s representing the text.
The function returns a list of acronyms. For our example above, find_acronyms(s) returns
the list ['GPU', 'CPU', 'IT']. Note: It is not important in which order the acronyms
appear in the returned list.
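A minimal sketch of both functions, assuming that a regular expression is an acceptable way to detect runs of capital letters. The UTF-8 encoding is an assumption; the brief does not state one. Duplicates are dropped while keeping the order of first occurrence, which is allowed because the order does not matter.

```python
import re


def read_file(filename):
    """Return the entire content of the file as a single string."""
    # Assumption: the example files are UTF-8 encoded.
    with open(filename, encoding="utf-8") as f:
        return f.read()


def find_acronyms(s):
    """Return the distinct runs of two or more capital letters found in s."""
    matches = re.findall(r"[A-Z]{2,}", s)
    # dict.fromkeys removes duplicates but keeps first-occurrence order.
    return list(dict.fromkeys(matches))
```

Note that "CPUs" yields the acronym "CPU" here because the trailing lowercase "s" ends the run of capitals.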
Q1 b) Find the long forms (15 marks)
In this question the hard work is done: given the acronyms, your task is to find their long
form in the text. To this end, write a function find_long_forms(s, acronyms). It receives
as input a string s representing the text and a Python list of acronyms. The function returns
a dictionary d with key-value pairs, where the key is the acronym and the value is its long
form. For instance, in our example above the output is the dictionary d = {'GPU' :
'graphics processing unit', 'CPU' : None, 'IT' : None}.
You can make the following assumptions:
• The long form is found in the same sentence as the acronym itself.
• If the acronym occurs multiple times in a text, its long form is found in the first
sentence that contains the acronym.
• Every '.' (dot) marks the end of a sentence. Sentences like "I talked to the Dr. and
raised my concerns." where dots are contained within the sentence will not occur.
• The first letter of the acronym is the same letter as the first letter of the first word of
the long form. All of the letters in the acronym need to appear in the long form.
• If no long form can be found for an acronym, it is set to None (Python's None type)
as in the dictionary above.
Four examples for texts with acronyms are given in the example files
acronym_example1.txt, acronym_example2.txt etc. The corresponding tuples
of (acronym, long form) are specified in the file acronym_tuples.txt.
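The following sketch illustrates one possible search strategy under the assumptions listed above. It deliberately uses a stricter rule than the brief requires: it only accepts a long form that has exactly one word per acronym letter, with matching initials. This simplification is an assumption made for illustration and will miss long forms that contribute several letters from one word.

```python
import re


def find_long_forms(s, acronyms):
    """Map each acronym to a long form found in its first sentence, or None."""
    sentences = s.split(".")  # every dot ends a sentence (see assumptions)
    d = {}
    for acronym in acronyms:
        d[acronym] = None
        # The long form is expected in the first sentence with the acronym.
        sentence = next((t for t in sentences if acronym in t), None)
        if sentence is None:
            continue
        words = re.findall(r"[A-Za-z]+", sentence)
        n = len(acronym)
        # Slide a window of n words and compare initials with the acronym.
        for i in range(len(words) - n + 1):
            window = words[i:i + n]
            initials = "".join(w[0] for w in window)
            if initials.lower() == acronym.lower():
                d[acronym] = " ".join(window)
                break
    return d
```

For the example string s, this returns 'graphics processing unit' for 'GPU' and None for 'CPU' and 'IT'.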
Q1 c) Replace acronyms by long forms (10 marks)
Assume we want to make the document more self-explanatory and replace its acronyms with
their corresponding long forms. To this end, write a function replace_acronyms(s, d). It
receives as input a string s representing the text, and a dictionary d which contains <acronym
: long_form> key-value pairs as defined in Q1b). The function returns another string as
output. In this output, all acronyms in s have been replaced with their long forms. The
following rules apply:
- If an acronym has a long form, the sentence wherein the long form was defined
remains unchanged. For any other sentence, the acronym is replaced by the long form.
- If an acronym has no long form, it is not replaced anywhere.
- If you add the long form at the beginning of a sentence, make sure that its first word is
capitalised.
For instance, in our example above the output of the function is the string:
"A GPU, which stands for graphics processing unit, is different from
CPUs, says the IT expert. For some operations, a graphics processing
unit is faster than a CPU. Graphics processing units are not always
faster though."
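The replacement rules can be sketched as follows. This is one possible approach, not a reference solution: it treats the first sentence containing the acronym as the defining sentence (following the assumptions in Q1b) and uses plain substring replacement, so "GPUs" becomes "graphics processing units".

```python
def replace_acronyms(s, d):
    """Replace acronyms by their long forms outside the defining sentence."""
    sentences = s.split(".")
    for acronym, long_form in d.items():
        if long_form is None:
            continue  # acronyms without a long form are never replaced
        defined = False
        for i, sentence in enumerate(sentences):
            if acronym not in sentence:
                continue
            if not defined:
                defined = True  # defining sentence stays unchanged
                continue
            stripped = sentence.lstrip()
            lead = sentence[:len(sentence) - len(stripped)]
            if stripped.startswith(acronym):
                # Long form opens the sentence: capitalise its first letter.
                capitalised = long_form[0].upper() + long_form[1:]
                stripped = capitalised + stripped[len(acronym):]
            sentences[i] = lead + stripped.replace(acronym, long_form)
    return ".".join(sentences)
```

A known limitation of this sketch is that a replaced long form could itself contain another acronym as a substring; a fuller solution would need to guard against that.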
As a starting point, use Q1.py from Learning Central. Do not rename the file or the function.
Question 2 - Statistics (Total 35 marks)
In this question, your task is to implement several statistical functions that perform t-tests,
linear regression, and variable selection.
Q2 a) Mass t-tests (10 marks)
In this question, your task is to implement two functions that perform dependent and
independent t-tests on input data. You can use the corresponding t-test functions in
scipy.stats.
Write a function mass_paired_ttest(X) that performs a series of paired-samples t-tests. It
receives as input a numpy array X with dimensions n × m, where n is the number of rows and
m is the number of columns. Each of the m columns represents one sample. Your function
should find the pair of columns that yields the lowest p-value, i.e. the 'most significant' pair.
Then the function returns a tuple with three elements (index of the first column from the pair,
index of the second column from the pair, corresponding p-value).
Example: imagine your dataset is of (100, 3) shape, i.e. it has three columns. Assume the p-values
for the three pairs of columns are p = 0.4 (col 0 vs col 1), p = 0.12 (col 0 vs col 2), p =
0.08 (col 1 vs col 2). The lowest p-value is obtained for col 1 vs col 2 and its value is 0.08, so
the tuple that is returned by the function is t = (1, 2, 0.08).
Write a similar function mass_independent_ttest(*X) that performs a series of
independent t-tests. It takes multiple inputs: Each input is a vector (1-D array) representing
a single sample, so inside the function X is a tuple of Numpy arrays. The arrays can have
different lengths. You can access each array using its index, e.g. X[0] is the first array, X[1] is
the second array etc. As for the paired-samples t-test, find the most significant pair of samples
and return the tuple of three elements (the two indices and the corresponding p-value).
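Both functions can be sketched with scipy.stats.ttest_rel and scipy.stats.ttest_ind, looping over all unordered pairs. This is an illustrative sketch; it assumes the default settings of both scipy functions (two-sided test, equal variances for the independent test), which the brief does not specify.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def mass_paired_ttest(X):
    """Return (i, j, p) for the pair of columns with the lowest p-value."""
    best = None
    for i, j in combinations(range(X.shape[1]), 2):
        p = stats.ttest_rel(X[:, i], X[:, j]).pvalue
        if best is None or p < best[2]:
            best = (i, j, p)
    return best


def mass_independent_ttest(*X):
    """Return (i, j, p) for the pair of samples with the lowest p-value."""
    best = None
    for i, j in combinations(range(len(X)), 2):
        # Assumption: default equal-variance, two-sided independent t-test.
        p = stats.ttest_ind(X[i], X[j]).pvalue
        if best is None or p < best[2]:
            best = (i, j, p)
    return best
```

Because combinations yields pairs with i < j, the first index in the returned tuple is always the smaller one, as in the example t = (1, 2, 0.08).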
Q2 b) Ridge regression (10 marks)
In this question your task is to implement ridge regression from scratch using Numpy. Do not
use statsmodels or scipy for this question. Ridge regression is a slightly modified version
of linear regression which is more stable for collinear data.
Your task: Write a function fit_ridge(y, X, a) that implements ridge regression as
defined above. It receives the following inputs:
- The response vector y is a numpy array with shape (n,1).
- The matrix X is a numpy array of predictors with shape (n, p). Note that X does not
contain the column of 1's, so you need to add it yourself.
- The input a represents the strength of regularization. a can be either a single number
(e.g. a = 1) or a list with multiple numbers (e.g. a = [1, 5, 10]).
If a is a single number, the function returns the ridge regression coefficients using a for
the regularization. If a is a list with multiple numbers, separate ridge regression solutions
should be calculated for each value of a. In this case, the function returns a Python list of
vectors of regression coefficients [β1, β2, ...], where β1 is the vector of regression coefficients
using the first value of a, β2 is the vector of regression coefficients using the second value of a,
and so on.
Tip: remember that the * operator operates element-wise on Numpy arrays. If you want
proper matrix or vector multiplication like in linear algebra, you can use the @ operator.
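A minimal Numpy sketch of the function, assuming the common closed-form ridge solution β = (XᵀX + aI)⁻¹ Xᵀy with the penalty applied to all coefficients including the intercept. The exact definition used in the module is not reproduced in this text, so check this assumption against the lecture material (for example, whether the intercept should be left unpenalised).

```python
import numpy as np


def fit_ridge(y, X, a):
    """Ridge coefficients for a single a, or a list of them for a list of a."""
    n = X.shape[0]
    X1 = np.column_stack([np.ones(n), X])  # prepend the column of 1's
    XtX = X1.T @ X1
    Xty = X1.T @ y
    identity = np.eye(X1.shape[1])

    def solve(alpha):
        # Assumption: the penalty alpha * I also covers the intercept.
        return np.linalg.solve(XtX + alpha * identity, Xty)

    if np.isscalar(a):
        return solve(a)
    return [solve(alpha) for alpha in a]
```

With y of shape (n, 1) each returned coefficient vector has shape (p + 1, 1), where the first entry belongs to the intercept. Using np.linalg.solve avoids forming an explicit matrix inverse.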
Q2 c) Variable selection in linear regression (15 marks)
In this question, your task is to use statsmodels to implement two variable selection
functions for standard linear regression (a.k.a. OLS regression). The motivation is that
regression models can have dozens or even hundreds of predictors. This can make it difficult
to interpret the relationship between the predictors and the response variable y. Ideally, one
wants to identify a subset of the predictors that carries most of the information about y. A
possible approach is variable selection. In variable selection (‘variable’ means the same as
‘predictor’), variables get iteratively added or removed from the regression model. Once
finished, the model typically contains only a subset of the original variables.
In the following, we will call a predictor "significant" if the p-value of its coefficient is
smaller than or equal to a given threshold. Your approach operates in two stages: In stage 1, you
iteratively remove predictors that are not significant. This leaves you with a subset of the
original predictors. In stage 2, you iteratively add interaction terms and keep them in the
model if they are significant. Remember what an interaction term is: if x1 and x2 are two
predictors, then their product x1 · x2 is their corresponding interaction term. We will split
the two stages into two functions:
Stage 1 (remove variables)
Write a function remove_variables(y, X, threshold = 0.05, variable_names =
None). The function receives the following inputs:
• y and X are numpy arrays like in Q2b).
• threshold is the cut-off value that determines whether a p-value is significant. If a
p-value <= threshold, it counts as significant.
• variable_names is a Python list of variable names that a user can provide. These are the
names for the columns of X (e.g. ['TV', 'radio', 'newspaper'] for the advertisement
dataset discussed in the lecture). If no variable names are provided, your function
should create the variable names ['x1', 'x2', 'x3', ...] where 'x1' is the name for the
first column of X, 'x2' is the name for the second column of X, and so on.
The function returns a tuple (new_X, new_variable_names) containing two variables:
• new_X is the matrix of predictors after non-significant variables have been removed. It
should not include the column of 1’s corresponding to the intercept.
• new_variable_names is a list of strings containing the variable names for the
columns of new_X.
Use the statsmodels function add_constant to make sure that X contains a column of 1's for
the intercept, and use the intercept in all fits. The details of how to implement stage 1 are
as follows:
• To start, fit an OLS model using all of the predictors in X.
• Identify the predictor whose coefficient has the largest p-value. If it is not significant,
remove it and fit the model again.
• Repeat this process until either all predictors have been removed or all predictors left are
significant.
• Never remove the intercept irrespective of whether or not it is significant.
• If no predictors are left after stage 1, return the tuple (None, None).
Tip: It might be useful to use Boolean arrays to select subsets of columns of X.
Stage 2 (add interaction terms)
Write a function add_interaction_terms(y, X, threshold = 0.05, variable_names
= None). The inputs have the same meaning as in remove_variables. The function
returns a tuple (new_X, new_variable_names) containing two variables:
• new_X is the matrix of predictors after the interaction terms have been added. Hence,
it contains the predictors in X plus the interaction terms that have been added as new
columns to the right. It should not contain the column of 1’s corresponding to the
intercept term.
• new_variable_names is a list of strings containing the variable names for the
columns of new_X. For interaction terms, use names that combine the two variable
names with a ‘*’ sign. For instance, if you add the interaction term for ‘tv’ and
‘radio’, then call their interaction variable ‘tv*radio’.
The function implements the following algorithm:
• To start, fit an OLS model using all of the predictors in X.
• Test whether it is useful to add interaction terms: For each pair of predictors, add their
interaction term into the model. If the interaction term is significant, keep it in the model.
If it is not significant, remove it again.
• Continue this until you checked every pair of predictors.
• Never add an interaction term involving the intercept.
• It can happen that when you add new interaction terms, predictors that you previously
added become non-significant. You can ignore this issue.
• Add the interaction terms in order, starting from the leftmost predictor in X. For instance,
if you have predictors with column indices 1, 2, 3, and 4, you first add the [1, 2]
interaction term, then [1, 3], [1, 4], [2, 3], [2, 4], and finally [3, 4].
• After you checked the interaction terms for all pairs of predictors, you are finished.
Return the new set of predictors and variable names as defined above.
Finally, note that it should be possible to run both functions one after the other. For instance,
given y and X, the following two lines of code
(new_X,new_variable_names)=remove_variables(y, X)
(new_X,new_variable_names)=add_interaction_terms(y, new_X, variable_names=new_variable_names)
should first perform removal of variables and then add interaction terms.
As a starting point, use Q2.py from Learning Central. Do not rename the file or the function.
Question 3 – Ethics (Total 30 Marks)
In this question you will investigate bias in text corpora (document collections). You are
provided with two datasets from a recent data science competition on Hyperpartisan News
Detection [1]. These datasets are
- bias_corpus.txt: a corpus of news articles from media that have been
classified as exhibiting right or left political bias.
- nobias_corpus.txt: a corpus of news articles that have been classified
as neutral.
These newspaper articles are mostly written in the context of US politics. They could be used
for building targeted political ads (reader of newspaper X will prefer to see ads of party Y),
user or community profiling, etc. However, some articles may depict certain protected
communities (women, immigrants or LGBT) in a negative way. This may bias any data
science model built on top of this data.
In this question you implement a 'pattern matching' procedure for investigating how protected
communities are depicted in both corpora (biased vs non-biased). As an inspiration, you can
start experimenting with Hearst patterns [2], which are often used to identify word pairs in
which a type-of relationship holds. An example for a Hearst pattern is ‘X is a type of Y’.
The slots X and Y will be filled with matches in corpora, e.g., ‘cat is a type of animal’
or ‘sofa is a type of furniture’. Such patterns can also be used to reveal how certain
communities are depicted. For example, ‘immigrants and other x’ would reveal how
immigrants are depicted in these media. A neutral example could be ‘immigrants and
other communities’, whereas a (negatively) biased example could be ‘immigrants and
other criminals’. An initial list of patterns is provided below (with actual examples from
the data). However, you are free and encouraged to experiment with text patterns of
your own design. You can experiment using only X, only Y, or both X and Y as empty slots
(regex groups).
Pattern                        Example occurrence
X is a Y                       Obama is a citizen
X is Y                         Trump is threatening
X and other Y                  Refugees and other criminals
marginalized Y, especially X   marginalized groups, especially gays
X works as a Y                 He works as a manager, or She works as a hairdresser
Your tasks:
• Download and uncompress the text corpora from this url:
https://drive.google.com/drive/folders/1ATp_zALwRRG5-rd9o0WEcP9IXKOGADSd?usp=sharing
• Decide on a person or community of interest. Define a pattern which you hypothesize
is likely to reveal how this person/community is depicted. This pattern could include
regular expressions and group matching for a slot x to fill. For example, the pattern
‘Trump is x’ is likely to match more verbs in non-biased media because they talk
more about what he does (‘Trump is speaking’ or ‘Trump is attending’). In
biased media, however, we could find more adjectives (‘Trump is arrogant’ or
‘Trump is great’).
• Retrieve and count the hits you get for each value of x in the biased and the
non-biased corpora separately and store those in the dbias and dnobias dictionaries. For
example, if ‘Trump is speaking’ occurs twice in the non-biased corpus, then the
dictionary entry dnobias['speaking'] has the value 2.
• Do this pattern extraction process for three different persons/communities to obtain a
total of three case studies. You can use the same or different patterns.
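The counting step can be sketched with a regular expression that has one group for the slot x. The helper name count_slot_fillers and the two short strings are hypothetical stand-ins used for illustration; in practice the texts would be read from bias_corpus.txt and nobias_corpus.txt. Lower-casing the matches is a design choice made here, not a requirement of the brief.

```python
import re
from collections import Counter


def count_slot_fillers(text, pattern):
    """Count how often each value of the slot (the regex group) occurs."""
    # With exactly one group, re.findall returns the group matches as strings.
    return Counter(match.lower() for match in re.findall(pattern, text))


pattern = r"\bTrump is (\w+)"

# Hypothetical toy texts standing in for the two corpora.
bias_text = "Trump is great. Some say Trump is arrogant. Trump is great again."
nobias_text = ("Trump is speaking today. Later Trump is speaking in Ohio. "
               "Trump is attending.")

dbias = dict(count_slot_fillers(bias_text, pattern))
dnobias = dict(count_slot_fillers(nobias_text, pattern))
```

Here dnobias['speaking'] has the value 2, matching the example in the task description. A pattern with two slots, such as r"(\w+) and other (\w+)", would make re.findall return tuples instead of strings, so the helper would need adapting.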
Then, report your results in the template document Q3.docx as follows: For each of the three
case studies, write a short justification (up to 300 words for each) with
1. Your initial hypothesis, why you chose the pattern and no other, how many patterns
you tried for the case you wanted to test, etc.
2. Provide the comparison match frequency table for the two corpora (see example,
provided as a comment, at the end of Q3.py).
3. Discuss differences, if any, between the results obtained from the two corpora, and
highlight the stereotypical or negative depictions you found.
As a starting point, use Q3.py and Q3.docx from Learning Central. Your submission should
only include the Word document, not the Python script.
References:
[1] Kiesel, J., Mestre, M., Shukla, R., Vincent, E., Adineh, P., Corney, D., ... & Potthast, M. (2019,
June). Semeval-2019 Task 4: Hyperpartisan news detection. In Proceedings of the 13th
International Workshop on Semantic Evaluation (pp. 829-839).
(Available at https://www.aclweb.org/anthology/S19-2145/)
[2] Hearst, M. A. (1992, August). Automatic acquisition of hyponyms from large text corpora.
In Proceedings of the 14th conference on Computational Linguistics-Volume 2 (pp. 539-545).
Association for Computational Linguistics.
(Available at https://www.aclweb.org/anthology/C92-2082/)
Learning Outcomes Assessed
• Carry out data analysis and statistical testing using code
• Critically analyse and discuss methods of data collection, management and storage
• Reflect upon the legal, ethical and social issues relating to data science and its
applications
Criteria for assessment
Credit will be awarded against the following criteria. The score for each implemented function
is judged by its functionality. For Q1 and Q2, the functions you have implemented will be
tested against different data sets to judge their functionality. Additionally, quality (Q1 and Q2)
and efficiency (Q1) will be assessed. For Q3, marks are based on the written report. The
criteria for each question are explained below.
Q2 criteria
Functionality (85%)
- Distinction (70-100%): Excellent working condition with no errors.
- Merit (60-69%): Mostly correct. Minor errors in output.
- Pass (50-59%): Major problem. Errors in output.
- Fail (0-50%): Mostly wrong or hardly implemented.
Quality (15%)
- Distinction (70-100%): Excellent documentation with usage of __docstring__ and comments.
- Merit (60-69%): Good documentation with a few comments missing.
- Pass (50-59%): Fair documentation.
- Fail (0-50%): No comments or documentation at all.
Q1 criteria
Functionality (70%)
- Distinction (70-100%): Fully working application that demonstrates an excellent
understanding of the assignment problem using a relevant Python approach.
- Merit (60-69%): All required functionality is met, and the application is working
properly with some minor errors.
- Pass (50-59%): Some of the functionality is developed, with incorrect output and major
errors.
- Fail (0-50%): Faulty application with wrong implementation and wrong output.
Efficiency (15%)
- Distinction (70-100%): Excellent performance using a concise and elegant solution.
- Merit (60-69%): Good performance using a concise and appropriate solution.
- Pass (50-59%): Partial performance showing an appropriate approach.
- Fail (0-50%): Incorrect or highly inefficient approach.
Quality (15%)
- Distinction (70-100%): Excellent documentation with usage of __docstring__ and comments.
- Merit (60-69%): Good documentation with a few comments missing.
- Pass (50-59%): Fair documentation.
- Fail (0-50%): No comments or documentation at all.
Q3 criteria
- Distinction (70-100%): All patterns and associated reflections implemented. Strong
presentation of the hypothesis and discussion of the results.
- Merit (60-69%): All patterns and associated reflections implemented. Strong presentation
of the hypothesis and discussion of the results, minor degree of overlap between the issues
encountered and discussed. Several protected communities addressed.
- Pass (50-59%): All patterns and associated reflections implemented. Weak presentation of
the hypothesis and discussion of the results, high degree of overlap between the issues
encountered and discussed. Overall report reflects poor understanding of the task.
- Fail (0-50%): No or incomplete patterns, or with major flaws, or missing reflection.
Discussion of results based on factually incorrect or non-existent data.
Feedback and suggestion for future learning
Feedback on your coursework will address the above criteria. Feedback and marks will be
returned within 4 weeks of your submission date via Learning Central. In case you require
further details, you are welcome to schedule a one-to-one meeting. Feedback from this
assignment will be useful for next year’s version of this module as well as the Python for Data
Analysis module.