Overview
A synonym is a word with a similar meaning to another word, and differences often come down to differences in the way the words are used in practice. English is rich in synonymns. For example, some synonyms for fair are: just, objective, impartial, unbiased, while for the word hardworking, some synonyms are: diligent, determined, industrious, enterprising. The way one typically looks for synonyms for a word is via a thesaurus. (The link to the Wikipedia entry provides some interesting history. The first thesaurus-like work dates back to the 4th century.)
Until relatively recently, each Thesaurus was compiled by hand, but with computerisation, a machine learning approach can be taken. In machine learning, relationships between things are learned from data - lots of it! In particular, in this project we will exploit the idea that similar words tend to be used in similar contexts, and will broaden the notion of "synonym" to mean related words. The "context" will be appearance in the same sentence in piece of text.
The following piece of text is taken from "Life on the Mississippi", but Mark Twain (Project Guttenberg edition).
``THE Mississippi is well worth reading about. It is not a commonplace river, but on the contrary is in all ways remarkable. Considering the Missouri its main branch, it is the longest river in the world--four thousand three hundred miles. It seems safe to say that it is also the crookedest river in the world, since in one part of its journey it uses up one thousand three hundred miles to cover the same ground that the crow would fly over in six hundred and seventy-five. It discharges three times as much water as the St. Lawrence, twenty-five times as much as the Rhine, and three hundred and thirty-eight times as much as the Thames. No other river has so vast a drainage-basin: it draws its water supply from twenty-eight States and Territories; from Delaware, on the Atlantic seaboard, and from all the country between that and Idaho on the Pacific slope--a spread of forty-five degrees of longitude. The Mississippi receives and carries to the Gulf water from fifty-four subordinate rivers that are navigable by steamboats, and from some hundreds that are navigable by flats and keels. The area of its drainage-basin is as great as the combined areas of England, Wales, Scotland, Ireland, France, Spain, Portugal, Germany, Austria, Italy, and Turkey; and almost all this wide region is fertile; the Mississippi valley, proper, is exceptionally so.''
The text can be found at sample.txt.
The idea is that for every unique word in the text you will compute a "profile" of associated words; to be associated, a word has to appear in the same sentence at least once. The profile for a given word, e.g. "river", records all the words that appear in the same sentence as that word, and the number of sentences each has appeard in. So, in this example, the word "river" appears 4 times. Its profile will have the word "idaho", which appears once, and the word "drainage-basin", which appears once in the context of "river" (and a second time in a sentence that does not also have the word "river"). What your program will be doing is comparing profiles.
Some things to note:
A word is any sequence of letters bounded by the start of a sentence, the end of the sentence, blank space or punctuation marks, which include comma, single-quote, double-quote, colon, semin-colon and round or square brackets. Punctuation marks can be treated as space characters.
The end of a sentence is marked by a full stop, an exclamation mark or a question mark, but otherwise can be ignored.
Note that every word has to be converted to lower case so words appear the same even if they appear at the start of sentences.
You will notice that the word "the" appears in every sentence. This is not very useful as part of a profile, so, apart from a text corpus, from which the profiles are obtained, your program will also be given a list of common words such as "and", "the", "but", and so on, one word per line. The intention is that these words should be ignored as you build (and then compare) profiles. A sample list of common words can be found at: common.txt. Other lists of common words may be substituted, depending on context.
Your program will read sets of words, one per line, terminated by a blank line. Each such set is one experiment; another experiment can follow after the blank line, and so on. For example:
river mississippi water spain cairo
The first word of the set will be a target word. The other words will be test words. The aim is, by comparing profiles of the test words against the profile of the test word, a ranked list will be computed of the test words, in descending order of similarity with the target word. The first word in the ranked list will then be announced as the synonym for target word
Scoring Profiles
The metric I would like you to use to score the similarity between two profiles is a modified cosine metric.
where p and q are two profiles, e.g. due to "river" and "mississippi". The numerator is the inner product between the two profiles, but many of the words in one profile won't be in the other profile, so the product of the corresponding terms will be 0. The numerator therefore just covers the words shared by the two profiles, where the count for a word from the first profile is multiplied by the corresponding count from the other profile. For example, for the profiles "river" and "mississippi" the shared words are "water" and "drainage-basin". The list of shared words can be represented as: [('water', 1, 1), ('drainage-basin', 1, 1)], where the two integers after each word are the counts of that word in the respective profiles. The numerator is the sum of the products, i.e. 2.
The denominator is the square-root of the product of the sums of squares of each of terms in the respective profiles. For example, for "mississippi" is will be: 12 (for "spain") + 12 (for "carries") + .... and so on.
Tasks
The only function definition that you must have is:
def main(corpus_file_name, commonwords_file_name = None)
The names of the parameters are not relevant, but the function with that name and the arguments prepresenting those file names must be present so my testing program will be able to interact properly with your program. (The file of common words is optional.) If main() definition is not present, it will not be possible to test your program.
Otherwise, how you structure your program is up to you. The tasks your program will have to undertake are:
If the commonwords file is present, open the file and read the words into an appropriate data structure.
Open the corpus file and read in the lines of text.
As the file is being read in, break it up into sentences.
As each sentence is completed, break the sentence up into words.
For each unique word (other than common words) in a sentence, update the counts of the profile entries for that word with respect to all the others. For example, after reading in the first sentence in the sample, the profile associated with mississippi will contain references to "worth" and "reading", but not "is", "well" or "about" (as these are in the commonwords). Each term will now have a count of 1. Similarly, "mississippi" will have a count of 1 in the profile of "worth", and so on
I recommend you write a function that prints out the data structure you have created to represent the profiles. I won't be testing that you have such a function, but I find it good practice as you can make sure that things are the way you expect.
Process query word sets until just a blank line is read in.
Read in words, one per line until a blank line is read in. The first word is the target word.
For each word, extract the corresponding profile and compare the second and subsequent profiles to the first. Generate a score for each comparison and record the score against the corresponding word.
Sort the word-score list and print the words and the scores in order of descending score.
Print the computed synonym.
A script from a sample session can be found at: typescript. The sample session only contains a single set of target word plus query words; your program should be able to handle multiple sets (with a blank line between sets).
Hints
A couple suggestions that you may care to incorporate in your code (or not - entirely up to you):
One possible representation for the set of profiles is a dictionary of dictionaries, where the mapping is:
profile_word -> associated_word -> count for associated_word
A dictionary is a better way to represent the common words than a list, because a list needs to be traversed to find an item in the list (or to discover that the item is not there). Dictionaries can do that in a single operation. A more natural representation in this application than dictionaries are sets. Sets are beyond the scope of this unit, not because they are hard but because there simply isn't the time. There is no discussion of sets in the textbook, but there are numerous discussions available on the web, including this discussion.
As before, your program will be tested extensively, including possible error states.
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。