CMPT 456 Assignment 3
Due: 11:59 pm, July 17, 2020
100 points in total
Please submit your assignment in Coursys.
Every student has to complete the assignment independently. While you are encouraged to learn
through discussion with the instructor, the TAs and the peer students, any plagiarisms are serious
violation of the university’s academic integrity policy. We have absolutely zero tolerance of such
behavior.
This assignment covers the materials in the sections of queries and interfaces, and information
retrieval models. The questions use the data sets D1 and D2 in Q1 of Assignment 2. Please also
merge D1 and D2 to form data set D. Do NOT stem the data sets.
Question 1 (50 points)
This question asks you to build a simple spelling correction function.
Your program should take a word w as input. It searches all words x in D that have edit distance
edit-dist(w, x) <= 2, where each of the editing operations, insertion, deletion, and replacement,
takes weight 1. Then, your program should output the top-5 words x that have the highest termfrequency
in D as the suggested corrections as well as their term-frequencies in D.
1. Describe your implementation of the above spelling correction function and use 3
examples to show the correction results. Particularly, explain how you can find those
candidate corrections and the top-5 candidates efficiently. Please submit your codes. (20
points)
2. The above design has a problem. If you input a correctly spelt word, like “heart”, it still
may suggest some similar words as possible corrections, such as “heard”. Can you
propose a fix to the problem so that your program can avoid suggesting corrections of
words frequently used in D? Describe your idea use examples, implement it, and test the
results. Does your program need any parameters? If so, discuss how those parameters
may contribute to or affect the performance of the program, that is, how those
parameters are tuned. Please submit your codes. (10 points)
3. A data set of tweets may often contain misspelt words. Your spelling correction program
trained from D may be used on D to improve the data quality. Apply your program to D
itself to catch misspelt words and correct them. Show the corrections you program makes.
Does your program make some mistakes in correction? Use the first 100 corrections your
program makes (or as many as possible if your program makes less than 100 corrections)
to calculate the precision. Explain (1) how your program picks the correction of a word
among all possible candidates; and (2) how your program avoids changing every word in
D, particularly those words correctly spelt but infrequent in D. Submit your codes. (20
points)
Question 2 (25 points)
This question asks you to use language modelsto understand the differences between the usages
of words in D1 and D2. Please remove the stop words in D1 and D2.
For a word w that appears in document D1 (D2), we collect all the tweets containing w and form
a document 𝑑𝑑!
"! (𝑑𝑑!
"" ). Compute the language model 𝑀𝑀!
"! (𝑀𝑀!
"" ) of 𝑑𝑑!
"! (𝑑𝑑!
"" ). Please use
Dirichlet smoothing, use 𝐷𝐷# ∪ 𝐷𝐷$ as the collection and set 𝜇𝜇 = 25. For the top-100 most
frequent words in D, compute 𝐾𝐾𝐾𝐾(𝑀𝑀!
"" || 𝑀𝑀!
"!) and sort the words in KL-divergence descending
order. Please submit your list of the words, the KL-divergence values and your codes.
Question 3 (25 points)
This question asks you to conduct clustering on the tweets.
D2 has 1000 tweets. Combine the first 100 tweets into one document 𝐷𝐷$
#, combine the next 100
tweets into one document 𝐷𝐷$
$, …, combine the last 100 tweets into one document 𝐷𝐷$
#%. Let the
language models on 𝐷𝐷$
#,𝐷𝐷$
$, … ,𝐷𝐷$
#% be 𝑀𝑀""
!, 𝑀𝑀""
", … , 𝑀𝑀""
!#, respectively.
For each tweet t in D1, we can form a vector 𝑡𝑡⃗ = ⟨𝑥𝑥#, 𝑥𝑥$, … , 𝑥𝑥#%⟩, where 𝑥𝑥& = 𝐾𝐾𝐿𝐿 6𝑀𝑀'||𝑀𝑀""
$ 7,
and 𝑀𝑀' is the language model of t. Again, please use Dirichlet smoothing, use 𝐷𝐷# ∪ 𝐷𝐷$ as the
collection and set 𝜇𝜇 = 25. The vectors should be normalized.
For any two tweets 𝑡𝑡# and 𝑡𝑡$ in D1, the similarity between them is the cosine similarity between
the two normalized associated vectors 𝑡𝑡#888⃗ and 𝑡𝑡$888⃗ . Run the k-means algorithm to cluster all tweets
in D1. You can learn the k-means algorithm on pages 382-384 in the textbook. You can also use
any implementations or libraries, such as scikit in Python <https://scikit-learn.org/stable/>, to
implement the k-means algorithm. Report your results when k is set to 3, 5, and 10, respectively.
For each cluster, report the number of tweets and the top-5 most frequent words in the tweets
in the cluster. Submit your codes as well.
Hint: many standard k-means clustering packages, such as scikit, uses Euclidean distance as
standard. There is a linear connection between cosine similarity on normalized vectors and
Euclidean distance. You can see an explanation at
https://stats.stackexchange.com/questions/299013/cosine-distance-as-similarity-measure-inkmeans
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。