1. We begin with a feature extraction function. The features we are going to use are called trigrams. A trigram is simply a string of three contiguous characters. For example, in the string "I love computing" there are lots of trigrams (L-2 to be precise, where L is the length of the string): ["I l", " lo", "lov", "ove"] are the first four of them, in sequence.
Write a function count_trigrams(document) that takes a string and returns a default dictionary with the frequency counts of the trigrams within the string (noting that if the same trigram occurs more than once in the string, its frequency will be greater than one). Note that the output must be a default dictionary and not a standard dictionary, as this will be useful later. Note also that you should not modify the string in any way (e.g. remove punctuation, remove whitespace or convert to lower case) when calculating the frequencies.
Your code should behave as follows:
>>> count_trigrams("hel")
defaultdict(<class 'float'>, {'hel': 1.0})
>>> count_trigrams("aaaaa")
defaultdict(<class 'float'>, {'aaa': 3.0})
>>> count_trigrams("Boaty mcBoatFace.")
defaultdict(<class 'float'>, {'ty ': 1.0, 'Fac': 1.0, 'atF': 1.0, 'tFa': 1.0, 'mcB': 1.0, 'ce.': 1.0, 'cBo': 1.0, 'ace': 1.0, 'oat': 2.0, ' mc': 1.0, 'Boa': 2.0, 'y m': 1.0, 'aty': 1.0})
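For concreteness, here is one possible sketch of count_trigrams that matches the behaviour above. This is an illustration only, not the hidden reference implementation:

from collections import defaultdict

def count_trigrams(document):
    """Return a defaultdict(float) of trigram frequencies in document."""
    counts = defaultdict(float)
    # A string of length L contains L - 2 trigrams; slide a window of
    # width 3 across the string, leaving the string itself untouched.
    for i in range(len(document) - 2):
        counts[document[i:i + 3]] += 1.0
    return counts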
2. Write a function train_classifier(training_set) which takes the single argument training_set, the name of a CSV file as a string, and returns a dictionary of normalised trigram-counts (as dictionaries); that is, the returned dictionary should have the format {lang1: trigram_counts1, lang2: trigram_counts2, ...}.
The file training_set is of the following form:
lang1,text1
lang2,text2
lang3,text3
noting that there may be more than one document per language.
For an example, see example_tset.csv, accessible as a tab at the top right. Note that the contents were taken from Wikipedia articles for the different languages. While the individual documents have been automatically stripped of a lot of the document markup, they still include some formatting characters and other noise, which will form part of the trigram counts. Though we won't do anything with it here, dealing with this kind of 'noise' is an important part of the data wrangling step of data science.
We have provided a (hidden) implementation of the function count_trigrams(doc) from the previous question in hidden_lib. This function takes a document (a string) and returns a default dictionary of trigram-counts for the trigrams within the string.
Your code should behave as follows:
>>> d = train_classifier('example_tset.csv')
>>> d.keys()
dict_keys(['Indonesian', 'Icelandic', 'English'])
>>> type(d['English'])
<class 'collections.defaultdict'>
>>> d['English']['g t']
0.05794400216170997
Your code will be tested on a hidden training set which is much, much larger than the example set. It contains 3331 documents from the Wikipedias of 74 different languages. Consequently, the hidden test case might take a while to run.
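Here is one way the training step might look. Note the normalisation scheme below is an assumption: scaling each language's count vector to unit Euclidean length is consistent with the example value above, but the hidden implementation may normalise differently (e.g. by dividing by the total trigram count). The module name hidden_lib follows the previous question.

import csv
import math
from collections import defaultdict

from hidden_lib import count_trigrams  # provided helper from question 1

def train_classifier(training_set):
    """Return {language: normalised defaultdict of trigram counts}."""
    lang_counts = {}
    with open(training_set, newline='') as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip any blank lines in the CSV
            lang, text = row[0], row[1]
            # Accumulate counts across every document for this language.
            counts = lang_counts.setdefault(lang, defaultdict(float))
            for trigram, count in count_trigrams(text).items():
                counts[trigram] += count
    for counts in lang_counts.values():
        # Normalisation (assumed): scale to unit Euclidean length.
        norm = math.sqrt(sum(c * c for c in counts.values()))
        for trigram in counts:
            counts[trigram] /= norm
    return lang_counts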
3. Write a function score_document(document, lang_counts=default_lang_counts) which takes as input a document name as a string and a dictionary of dictionaries containing normalised language counts called lang_counts. It should return a dictionary of scores, one for each language in lang_counts, obtained by taking a 'dot product' of the trigram counts from the document with the normalised language counts. That is, multiply each trigram count from the document by the corresponding count in lang_counts and sum the results. If a trigram from the document is not in the dictionary for a given language, take its count for that language to be zero.
We have provided a stub of code which trains the classifier for you. We have also provided train_classifier(training_set) in a hidden library.
There are also two files included, visible in the tabs at the top right: en_163083.txt, written in English, and de_1231811.txt, written in German. These can be loaded and used to test your function, which should behave as follows:
>>> test1 = 'en_163083.txt'
>>> d = score_document(test1)
>>> d['Vietnamese']
9.427325768357315
>>> max([(v, n) for (n, v) in d.items()])
(21.428216914833023, 'English')
>>> test2 = 'de_1231811.txt'
>>> d = score_document(test2)
>>> d['Polish']
7.710346556417009
>>> max([(v,n) for (n, v) in d.items()])
(53.12937809633241, 'German')
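A possible sketch of the scoring step, consistent with the examples above. Following those examples, the argument here is taken to be a file name that the function opens and reads; the assumption that default_lang_counts comes from training on example_tset.csv, and that the hidden library is again named hidden_lib, are illustrative only.

from hidden_lib import count_trigrams, train_classifier  # provided helpers

# Assumed to be built by the provided training stub.
default_lang_counts = train_classifier('example_tset.csv')

def score_document(document, lang_counts=default_lang_counts):
    """Return {language: dot product of document and language counts}."""
    with open(document) as f:
        doc_counts = count_trigrams(f.read())
    scores = {}
    for lang, counts in lang_counts.items():
        # .get(..., 0.0) treats trigrams missing from a language as zero
        # without inserting new keys into the language's defaultdict.
        scores[lang] = sum(freq * counts.get(trigram, 0.0)
                           for trigram, freq in doc_counts.items())
    return scores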
4. We are finally ready to put all the pieces together! We can now extract trigram counts from documents, train our classifier, and score documents against each language. Write a function classify_doc(document, lang_counts=default_lang_counts) which takes a string document and a dictionary of normalised lang_counts, and returns the name of a language based on the per-language scores.
As before, we have provided a hidden implementation of score_document(document, lang_counts) in a hidden module (already imported) which takes a document and returns a dictionary of scores per language, as in the previous question. We have also provided a number of documents to play with.
Your function should return the language with the highest score. In the event of a tie it should return 'English', since the most common document language in the training set is English, suggesting that if the document comes from the same source (Wikipedia) it is probably written in English. Obviously not a perfect assumption, but better than nothing given no other information.
But how do we determine a tie? If the two top-ranking scores lie within 1e-10 of one another, then we shall say it's a tie (why do we do this, rather than testing equality directly?).
Your function should behave as follows:
>>> s = open('en_163083.txt').read()
>>> classify_doc(s)
'English'
>>> classify_doc('asdfhlj')
'Icelandic'
>>> s = open('pl_188313.txt').read()
>>> classify_doc(s)
'Polish'
>>> classify_doc('Hello Bob')
'Italian'
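Here is one hedged sketch of the classifier. It assumes that the hidden score_document here accepts the document text directly (as the examples in this question suggest), and that score_document and default_lang_counts are already available in the exercise environment.

# score_document and default_lang_counts are assumed to be provided by
# the hidden module (already imported in the exercise environment).

def classify_doc(document, lang_counts=default_lang_counts):
    """Return the best-scoring language, breaking near-ties as 'English'."""
    scores = score_document(document, lang_counts)
    # Sort languages from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_lang, best_score = ranked[0]
    # Scores within 1e-10 of the top score count as a tie: floating-point
    # sums accumulated in different orders rarely compare exactly equal.
    if len(ranked) > 1 and best_score - ranked[1][1] < 1e-10:
        return 'English'
    return best_lang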
5. Write a function select_scores(langs, document, lang_counts=default_lang_counts) which takes a list of languages and returns that same list of languages, together with a second list of the corresponding scores of the string document for those languages (a sketch follows the examples below). Use your implementation of score_document(document, lang_counts=default_lang_counts).
We have provided a function plot_scores(languages, scores) which takes a list of languages and their corresponding scores and plots them on a histogram using the matplotlib module. Try this out with the documents provided in the tabs at right. Are the scores highest for the languages the documents are written in?
It is interesting to see which languages have similar scores; perhaps this says something about how similar the languages are, and whether they are related in some way (e.g. perhaps they belong to neighbouring countries or evolved from the same language some time long ago).
Your function should behave as follows (results may differ in the final decimal places):
>>> select_scores(['English', 'Indonesian', 'Malay'],open('Indonesian1.txt').read())
(['English', 'Indonesian', 'Malay'], [2.3500733676922594, 12.076336401029348, 12.096929312152872])
>>> select_scores(['English','Indonesian','Malay'],open('English1.txt').read())
(['English', 'Indonesian', 'Malay'], [21.428216914833023
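A minimal sketch of select_scores, reusing score_document and default_lang_counts from the earlier questions (assumed to be available here): it computes all the scores and then picks out the requested languages.

def select_scores(langs, document, lang_counts=default_lang_counts):
    """Return (langs, scores) with scores aligned to the order of langs."""
    all_scores = score_document(document, lang_counts)
    # Pick out only the requested languages, preserving their order.
    return langs, [all_scores[lang] for lang in langs]

The result can then be fed straight into the provided plotting helper, e.g. plot_scores(*select_scores(['English', 'Indonesian', 'Malay'], open('Indonesian1.txt').read())).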