CMPT 456 Assignment 3

Due: 11:59 pm, July 17, 2020

100 points in total

Please submit your assignment in Coursys.

Every student has to complete the assignment independently. While you are encouraged to learn through discussion with the instructor, the TAs, and your peers, any plagiarism is a serious violation of the university's academic integrity policy. We have absolutely zero tolerance for such behavior.

This assignment covers the material in the sections on queries and interfaces and on information retrieval models. The questions use the data sets D1 and D2 from Q1 of Assignment 2. Please also merge D1 and D2 to form data set D. Do NOT stem the data sets.

Question 1 (50 points)

This question asks you to build a simple spelling correction function.

Your program should take a word w as input. It searches for all words x in D with edit distance edit-dist(w, x) <= 2, where each editing operation (insertion, deletion, and replacement) takes weight 1. Your program should then output the top-5 words x with the highest term frequency in D as the suggested corrections, as well as their term frequencies in D.
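One possible implementation, sketched below as an illustration rather than the required design, assumes D has already been tokenized into a term-frequency table `tf`. It scans the vocabulary with a dynamic-programming edit distance that exits early once the distance is guaranteed to exceed 2, skips words whose length differs by more than 2, and keeps the five most frequent surviving candidates with a heap:

```python
from collections import Counter
import heapq

def edit_dist(a, b, cap=3):
    # Row-by-row Levenshtein distance (insertion, deletion, replacement,
    # each weight 1), with an early exit once every cell reaches `cap`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # replacement
        if min(cur) >= cap:
            return cap
        prev = cur
    return prev[-1]

def suggest(w, tf, k=5):
    # tf maps each word of D to its term frequency; keep the k most
    # frequent words within edit distance 2 of w.
    candidates = [(x, c) for x, c in tf.items()
                  if abs(len(x) - len(w)) <= 2 and edit_dist(w, x) <= 2]
    return heapq.nlargest(k, candidates, key=lambda p: p[1])

# Toy term-frequency table standing in for the real counts from D.
tf = Counter({"heart": 40, "hear": 30, "heard": 25, "head": 10, "hello": 7})
print(suggest("haert", tf))  # [('heart', 40)]
```

The length filter is a cheap pre-check: any pair of words whose lengths differ by more than 2 cannot be within edit distance 2, so the full dynamic program runs only on plausible candidates.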

1. Describe your implementation of the above spelling correction function and use 3 examples to show the correction results. In particular, explain how you find the candidate corrections and the top-5 candidates efficiently. Please submit your code. (20 points)

2. The above design has a problem: if you input a correctly spelt word, like "heart", it may still suggest similar words as possible corrections, such as "heard". Can you propose a fix so that your program avoids suggesting corrections for words frequently used in D? Describe your idea using examples, implement it, and test the results. Does your program need any parameters? If so, discuss how those parameters affect the performance of the program and how they should be tuned. Please submit your code. (10 points)

3. A data set of tweets may often contain misspelt words. Your spelling correction program trained on D may be applied to D itself to improve the data quality. Apply your program to D to catch misspelt words and correct them. Show the corrections your program makes. Does your program make any mistakes in correction? Use the first 100 corrections your program makes (or as many as possible if your program makes fewer than 100 corrections) to calculate the precision. Explain (1) how your program picks the correction of a word among all possible candidates; and (2) how your program avoids changing every word in D, particularly those words that are correctly spelt but infrequent in D. Submit your code. (20 points)
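One common way to approach the fix asked for in part 2, sketched here as an assumption rather than the required design, is to treat a query word as correct when its own term frequency in D reaches a threshold, and to suggest only candidates that are clearly more frequent than the input word. The parameter names `min_tf` and `boost` below are hypothetical tuning knobs, not values prescribed by the assignment:

```python
from collections import Counter

def needs_correction(w, tf, min_tf=10):
    # Treat words that already occur at least min_tf times in D as
    # correctly spelled, and leave them unchanged.
    return tf.get(w, 0) < min_tf

def filtered_suggestions(w, candidates, tf, boost=2.0):
    # Keep only candidates clearly more frequent than w itself.
    floor = boost * max(tf.get(w, 0), 1)
    return [(x, c) for x, c in candidates if c >= floor]

tf = Counter({"heart": 40, "heard": 25, "hrat": 1})
print(needs_correction("heart", tf))  # False: frequent, leave as-is
print(needs_correction("hrat", tf))   # True: rare, likely a misspelling
```

Raising `min_tf` makes the program more aggressive (it corrects more words, including some correctly spelt but rare ones), while lowering it makes the program conservative; this trade-off is what the tuning discussion in part 2 should examine.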

Question 2 (25 points)

This question asks you to use language models to understand the differences between the usages of words in D1 and D2. Please remove the stop words from D1 and D2.

For a word w that appears in D1 (D2), we collect all the tweets containing w and form a document d_w^D1 (d_w^D2). Compute the language model M_w^D1 (M_w^D2) of d_w^D1 (d_w^D2). Please use Dirichlet smoothing, use D1 ∪ D2 as the collection, and set μ = 25. For the top-100 most frequent words in D, compute KL(M_w^D2 || M_w^D1) and sort the words in descending order of KL-divergence. Please submit your list of the words, the KL-divergence values, and your code.
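Under Dirichlet smoothing, the probability of word w in document d is P(w | d) = (c(w, d) + μ · P(w | C)) / (|d| + μ), where C = D1 ∪ D2 and μ = 25. A minimal sketch of the smoothed language model and the KL-divergence computation follows; the toy documents here are illustrative stand-ins for the real d_w documents:

```python
import math
from collections import Counter

MU = 25.0  # Dirichlet prior, as set in the assignment

def dirichlet_lm(doc_tokens, coll_tf, coll_len):
    # P(w | d) = (c(w, d) + MU * P(w | C)) / (|d| + MU), with C = D1 ∪ D2.
    tf = Counter(doc_tokens)
    n = len(doc_tokens)
    return {w: (tf.get(w, 0) + MU * coll_tf[w] / coll_len) / (n + MU)
            for w in coll_tf}

def kl_divergence(p, q):
    # KL(p || q) = sum_w p(w) * log(p(w) / q(w)); smoothing keeps q(w) > 0.
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

collection = "the heart beats the heart races".split()
coll_tf, coll_len = Counter(collection), len(collection)
p = dirichlet_lm("the heart beats".split(), coll_tf, coll_len)
q = dirichlet_lm("the heart races".split(), coll_tf, coll_len)
print(kl_divergence(p, q))  # small positive value: the two models differ
```

Because every word in the collection gets nonzero smoothed probability, the log ratio in the KL sum is always defined, which is exactly why smoothing is required before computing divergences.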

Question 3 (25 points)

This question asks you to conduct clustering on the tweets.

D2 has 1000 tweets. Combine the first 100 tweets into one document d_1^D2, combine the next 100 tweets into one document d_2^D2, …, and combine the last 100 tweets into one document d_10^D2. Let the language models on d_1^D2, d_2^D2, …, d_10^D2 be M_1^D2, M_2^D2, …, M_10^D2, respectively.

For each tweet t in D1, we can form a vector v_t = (v_1, v_2, …, v_10), where v_i = KL(M_t || M_i^D2) and M_t is the language model of t. Again, please use Dirichlet smoothing, use D1 ∪ D2 as the collection, and set μ = 25. The vectors should be normalized.

For any two tweets t_1 and t_2 in D1, the similarity between them is the cosine similarity between the two normalized associated vectors v_t1 and v_t2. Run the k-means algorithm to cluster all tweets

in D1. You can learn the k-means algorithm from pages 382-384 of the textbook. You may also use any implementation or library, such as scikit-learn in Python <https://scikit-learn.org/stable/>, to implement the k-means algorithm. Report your results when k is set to 3, 5, and 10, respectively.

For each cluster, report the number of tweets and the top-5 most frequent words in the tweets

in the cluster. Submit your codes as well.

Hint: many standard k-means clustering packages, such as scikit-learn, use Euclidean distance by default. There is a linear connection between cosine similarity on normalized vectors and Euclidean distance. You can see an explanation at https://stats.stackexchange.com/questions/299013/cosine-distance-as-similarity-measure-in-kmeans
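Concretely, for unit-length vectors u and v, ||u - v||^2 = 2 - 2 cos(u, v), so minimizing Euclidean distance in k-means is equivalent to maximizing cosine similarity once the vectors are normalized. The sketch below checks that identity and runs scikit-learn's KMeans on random stand-in vectors; the real input would be the 10-dimensional KL vectors built from D1:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vecs = rng.random((50, 10))                          # stand-in KL vectors
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize to unit length

# On unit vectors: ||u - v||^2 = 2 - 2 * cos(u, v).
u, v = vecs[0], vecs[1]
assert abs(np.sum((u - v) ** 2) - (2 - 2 * np.dot(u, v))) < 1e-9

for k in (3, 5, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vecs)
    print(k, np.bincount(labels))  # number of tweets per cluster
```

The per-cluster word statistics asked for in the question would then come from grouping the original tweets by `labels` and counting terms within each group.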
