Search Engine
Homework #5
DATA 1050 Fall 2020
Due 23:59 EST, Wednesday, October 28, 2020
Optional: Coding Problem 3 Resubmission due by 23:59 EST, Monday, October 19, 2020
Setup & Submission
1. Use this GitHub Assignment Invitation Link: https://classroom.github.com/a/Vbl7I9d9
2. Run pip install --user conda install pytest pytest-timeout flake8 black nltk in a terminal to install all required
packages.
3. Push to master as much as you want before the homework deadline.
4. If your final push will be late, remember to use the Late Day Form *twice*, once before the deadline, and once
again after your final push.
5. When you are ready to submit, visit gradescope and submit your repo. Your report should be included in the repo
as report.md
6. Be sure to also Include a copy and paste of the test results and coverage information in coverage.md in each of
your stencil directories. Two students created a helpful utility script to do this for you.
7. For Coding Problem 3 Resubmission, please only include the coding problems you want to be
re-graded in a stencil_CodingProblems3 subdirectory of your hw5 repo and submit the entire repo via
the corresponding gradescope assignment “Coding Problems3 Resubmission”
Overview
Search engines retrieve a list of documents (typically web pages) that match a given query. Since multiple documents
may satisfy the query, most search engines also attempt to list more relevant documents first. In this project, you will use
the TF-IDF and PageRank algorithms to create part of a web search engine.
Operationally, each searchable document must be analyzed in advance (in a process called indexing) to gather and
integrate statistics about its content. This data is used to form an index (like the one found at the end of a book) that
allows user search queries to be performed in real-time without the need to revisit all of the original documents.
Document term count information is enough to perform simple query matching since it allows one to find all documents
that contain a particular set of terms. It also allows one rank documents relative to one another based on their term
frequencies. TF-IDF, otherwise known as term frequency–inverse document frequency, is one way to do this.
Web search engines use “spiders” to “crawl” to the web in order to automatically visit and index web pages. At each page,
a count of the number of times each word (“term”) appears in a document is recorded (i.e, the term histogram is
recorded), a list of outbound links (outlinks) is collected, and then the spider moves on.
When present, link information defines a document graph, which can be used for document ranking. For example,
documents that are referred to many other documents may be more important than ones with very few incoming links
(inlinks). Today, Google and most other web search engines use PageRank on link data to help determine web page
performance.
The PageRank algorithm was created by Larry Page and Sergey Brin (the founders of Google) while they were still
graduate students at Stanford. While Page’s original technical report suggested using it to rank web page importance
based on link information, it has since been used to analyze many other networks. Identifying social media influencers
(based on follower networks), and gauging the importance of scientific papers (via citation networks) are two examples.
A Note on Vocabulary
In this project handout,
● A “document” refers to a Wikipedia page
● A “term” refers to a word (or alphanumeric sequence containing no white space)
Details
We used the Wikipedia download interface to download some of its web pages. These documents (web pages) are
stored as XML records. In order to create your search engine, you will write code to process these XML files.
In the world of search and natural language processing, a collection of texts is called a corpus. A corpus (plural corpora) is
a collection of written texts (web pages with links in this case). Thus the corpora you will be working with are subsets of
the pages of the English-language Wikipedia.
This project has four main parts, which are described in detail in the sections below:
1. Calculating TF-IDF
2. Calculating PageRanks
3. Processing the XML files
4. Search Engine and user interface
Note: We suggest you implement your solution project in the sequence listed above. TF-IDF and PageRank are both
somewhat math-heavy, and it is important you get them working with artificial data for which you are sure of the answers
before you start processing real data. Specifically, this data will be used to create analytic test cases, some of which we
provide, and some of which you will provide. We encourage you to make git commits once a subcomponent is working.
This makes it easier for you to revert to a previous version.
Proceeding in this order will also help the course staff know which parts of the problem you have completed if you have
questions.
Stencil files provided:
● TFIDF.py, PageRank.py: Task: Implement of TF-IDF/PageRank algorithm.
● processing.py: Task: index the Wikipedia corpus
● SearchEngine.py: Task: Implement an API that takes search queries and returns search results ranked by the
chosen algorithm
● app.py: A simple command-line interface (CLI) we supply that will allow you to use your search engine for
interactively
● report.md: Task: Write your report here
Datasets provided:
(note that all datasets have been zipped together into wiki_data.zip. You’ll have to extract them into your stencil
directory before you can proceed)
● MiniWiki.xml, SmallWiki.xml: XML files with 2 and 107 Wikipedia articles respectively. As you work on
adding functionary to your search engine, we suggest you start by testing with these!
● MedWiki.xml: An XML containing 871 Wikipedia articles. You can try testing with this file once you get your code
working with SmallWiki.
● PageWiki.xml: An XML file containing 100 documents with titles 1 to 100. Each document contains nothing
except for a link to document 100. You can use this file to test your implementation of PageRank. Document 100
should have a much higher PageRank than the other documents.
Using VSCode
Jupyter notebooks are often used for visualization and analysis, and they work well for one-time coding tasks. In this
project and more generally in DATA 1050, we will not be coding in Jupyter notebooks but rather, in Python files. To help
with this we recommend you try using the Visual Studio Code Interactive Development Environment (IDE) with the
Microsoft Python Development extension.
Please read the VS Code Instructions document to learn how to do the following:
1. How to start The VS Code IDE on data.jupyter.brown.edu
2. Install the Python Development Extension
3. Install and configure other helpful packages and extensions
4. Run the debugger
VS Code is an immensely powerful software development tool. Learning how to use its many features takes a long time.
However, because it also provides integrated terminal window support, you can continue to use all of the command line
based python development skills you already have.
Since this is the first time we have run VS Code, we will update the VS Code Instructions as needed. We suggest you
add notifications for changes to the VS Code Instructions document.
Coding Style Requirements
Using a consistent coding style is a critical part of code quality. For this project, make your style consistent with PEP 8,
the style guide for python developers from Guido himself. There are many tools for checking if your code is styled
correctly. You can periodically run flake8 *.py in a terminal under your code directory. You can also install a flake8
extension to give you PEP-8 style warning as you code. Either way, resolve all style warnings you see to avoid receiving
point deductions.
An auto-formatting tool may also come in handy. From the terminal you can run black -l100 <name of Python file> to
automatically reformat your code (e.g., running black -l100 PageRank.py will reformat the code in your PageRank.py
file.) Various VS Code extensions provide similar functionality and allow you to select regions of text that to reformat.
Part 5.1 TF-IDF
As the name suggests, TF-IDF has two parts: TF (term frequency) and IDF (inverse document frequency). TF-IDF allows
us to capture the following two ideas:
● A term that appears many times within a document is likely to be more relevant to that document than a term that
appears fewer times (this is captured by TF). we calculate the term frequency as
TF(term, doc) = (term|doc)
total # of terms in the document
# of times term appears in the document = P
● A term that occurs in fewer documents in the corpus is likely to be a better signal of relevance compared to a
common term like “the”. (this is captured by IDF) A rare term often tells more about the document. The inverse
document frequency of a term is defined as
IDF(term) = log( )
# of documents in the corpus
# of documents in which the term appears
Please use natural log (which is what Python’s math.log does),
● TFIDF(term, doc) = TF(term, doc) × IDF(term)
Examples
Consider a document containing 100 terms wherein the term cat appears 3 times. The term frequency (i.e.,
tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the term cat appears
in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as ln(10,000,000 /
1,000) = 9.210. Thus, the tf-idf weight for the term cat in this document is the product of these quantities:
0.03 * 9.210 = 0.2763.
You may also find a more sophisticated example here.
Complete the following tasks in TFIDF.py.
Task 5.1
● calc_term_frequency: For each document and each term, calculate term frequency. Store the results in a
dictionary that maps from each term to a secondary dict. The secondary dict maps document title doc to
TF(term, doc). Indexing looks like this: [term][document that contains the term][frequency of that term in each
document that contains it] Pay close attention to the docstring. See the provided function test_tfidf_basic
for example inputs and outputs.
● calc_tf_idf: Iterate through the dictionary created for TF and multiply each term frequency by the term’s
inverse document frequency. Note that the returned dictionary can also serve as a “reverse index,” (i.e., given a
term, you can use a single dictionary lookup to get the documents that contain this term)
● search: Given a list of keywords, return a list of matching documents paired with their tf-idf scores. Each
returned document should contain at least 1 of the keywords. The returned list should be sorted by document
score. When a limit is specified, only return that many (doc, tf-idf-score) pairs (e.g., if limit = 10, you
would only return the top 10 matches). Compute tf-idf score by summing the tf-idf values of each query word
found in the document.
score(doc, keywords) = ∑ FIDF(term, doc)
term∈keywords
T
● Include your pytest compatible tests in TFIDF.py.
Notes
● Read carefully through the docstring (prose, parameter description). The docstrings most precisely specify the
behavior of each method.
● Use test_tfidf_basic as a starting point to help you understand the expected input and output.
● Some methods are labeled as @staticmethod. This means they do not depend on instance or class data,
notice they do not include a self variable. @staticmethods can be invoked without a class instance as illustrated
in test_tfidf_basic. Their output does not depend on instance or class variables (i.e. they are regular
stateless functions). This is sometimes helpful because each method can be tested separately. (For interested
readers, see here for a discussion on the difference between @classmethods (which can depend on class
variables) and @staticmethods.)
Report Q&A: Ordering and Logs
Hw5.1 Your boss has been reviewing your code and wants you to do the following:
“store word frequencies the way you compute”, i.e. s/he wants you to use a dict of dicts which is indexed via
{document:{word:word frequency}}
Think carefully about this suggestion and give an appropriate response.
Part 5.2 PageRank
So far we have scored documents based on their relevance to a given query only. This is how early search engines like
AltaVista worked. But then, in the late 90s, the founders of Google came up with an idea for ranking documents based on
their position in the web itself. Their PageRank algorithm gives an overall importance score and it pays no attention
whatsoever to the documents' content. Rather, the rankings it computes are based entirely on the link structure between
documents: i.e., which documents link to one another and which documents link to it. Documents with high scores (i.e.,
high PageRanks) have links from other documents that also have high ranks (which are linked to other documents with
high ranks). This is a kind of crazy circular idea, but when converted into the appropriate mathematics the algorithm works
quite well.
There are many variants of the PageRank algorithm. In this project, we will implement one of the most basic. Key
principles underlying the design of PageRank:
1. The more documents that link to a page, the more authoritative it should be.
○ For example, if my personal blog post on “Providence Internet Outages" is linked to by 5 different web
pages (5 different incoming links, aka 5 inlinks), and the Wikipedia document on "Providence, Rhode
Island" is linked to by 1000s of web pages (1000s of inlinks), then it makes sense to consider the
Wikipedia document more authoritative.
○ Many early search engines use incoming page counts, and at first, this worked well, but over time people
realized they could promote a page by making many fake pages that pointed to it.
○ The PageRank innovation was to also consider the authority score of incoming pages. Authoritative
pages should also be linked to by other authoritative pages.,
2. The ability of an authoritative page to promote other pages should be constrained by the total number of outgoing
links it has, fewer links provide better references.
○ Assume "Research at Brown" is linked to by an NYTimes document which links to only 10 other pages,
and my blog "About Me" page is linked to by "Rhode Island Technology Bloggers", which links to 2000
other pages. Because the "NYTimes" page has only a few links, and "Rhode Island Blogger” has many
links, a link from the "NYTimes" page lends a larger fraction of its importance than a link from the "Rhode
Island Technology Bloggers", leading us to attribute more authority to "Research at Brown" than "About
Me"
Let u be a document, Bu be the set of documents that have a link to u , L(u) be the number of outgoing edges from u .
The PageRank score for document u is defined as
PageRank(u) = ∑
v∈Bu
L(v)
PageRank(v)
The trouble with this equation is that it is circular; in order to calculate the weight of document u , we have to know the
weights of all document linking to u . How do we begin calculating the document weights? In practice, we use an iterative
approach and stop once the document weights have converged. Here is a suggested approach to implementing
PageRank:
● Create a dict weights. Initialize weights for all documents to 1/N, where N is the number of documents in
the corpus
● While maximum # of iterations not reached
○ Create a dict new_weights with weights for all documents set to 0
○ Iterate through each document x.
For each document y that document x links to, update
new_weights[y] += weights[x] / (# of documents x links to)
○ Calculate L2-distance between weights and new_weights.
○ Break out of the loop if distance < eps * N
○ Set weights = new_weights
Note:
● L2-distance is Euclidean distance
● If you sum up the PageRank value for each document in a corpus, you should get 1.
● Links from a document to itself are ignored. Links to documents outside the corpus are ignored. You may
need to filter such links.
● Documents that link to nothing (i.e. sinks, like B below) are considered linked (once) to all documents including
itself. Therefore, each document receives (weight of this sink) / N.
● Multiple links from one document to another are treated as a single link.
Examples
Let’s say that we have the corpus shown above, which consists of the documents (a.k.a. pages), A, B, and C. A
links to B and C, and C links to A. The table below shows the results of running the PageRank algorithm on this
corpus in order to determine the weight (a.k.a. rank) of each document. Remember that since B does not link to
anything, we treat it as if it links to every document including itself.
Iteration Weight of A Weight of B Weight of C
Formula B / 3 + C / 1 A / 2 + B / 3 A / 2 + B / 3
Initial 1/3 = .333 1/3 = .333 1/3 = .333
1 .333 / 3 + .333 / 1= .444 .333 / 2 + .333 / 3 = .277 .333 / 2 + .333 / 3= .277
2 .277 / 3 + .277 / 1 = .370 .444 / 2 + .277 / 3 = .314 .444 / 2 + .277 / 3 = .314
3 .314 / 3 + .314 / 1 = .419 .370 / 2 + .314 / 3 = .290 .370 / 2 + .314 / 3 = .290
4 .290 / 3 + .290 / 1 = .386 .419 / 2 + .290 / 3 = .306 .419 / 2 + .209 / 3 = .306
Task 5.2
● calc_page_rank: Given links between the documents, calculate the PageRank values for each page as
specified above.
● search: Searches for documents containing at least one of the query words, sorted by PageRank values and
return up to `limit` of them.
● Write unit tests for PageRank.py.
● Use test_page_rank_basic as a starting point to help you understand the expected input and output.
● Add file docstring and class docstring in PageRank.py and remove all TODOs.
Report Q&A: Page Rank Understanding
Hw5.2 There is a Markov Chain interpretation of the PageRank algorithm that PageRank value represents the frequency
that web surfers arrive at a specific web page following hyperlinks found on other pages.
Included in the model is the idea that a user can also visit a page without clicking a link (e.g., they can visit via bookmark
or a typed url). This random jumping process around process can be included in an iterative solution via a damping factor
d. Explain the following equation in terms of probability notation if PR(X) is the probability of arriving at node X, and L(X)
is the number of outbound links from X. (i.e., What does d represent, what does (1 - d)/N represent, and what does the
sum represent.)
If 32% of traffic to a webpage comes from non-inlink sources what should the value of d be?
See https://en.wikipedia.org/wiki/PageRank if you are stuck.
Part 5.3 Processing the XML file
Tf-idf computes word counts. In order to do this, irrelevant data (like punctuation and word spelling variations) must be
removed. Tf-idf calculation is then done with respect to this simplified corpus. PageRank requires all the links in the
corpus.
It is now time to process the Wikipedia dump XML to extract each type of data.
Introduction to XML files
The Wikipedia corpora used for this project are stored as XML files. XML stands for eXtensible Markup Language. It is
similar to HTML in that elements are surrounded by matching opening and closing tags. An opening tag looks like
<title>. The corresponding closing tag includes a slash, so it looks like <\title>.
Open SmallWiki.xml in a text editor and get familiar with the structure of XML files.
Task 5.3.1 Processing Text
You should work with process_text in processing.py.
● Tokenizing
Tokenization refers to the process of splitting a string into tokens. For our purposes, tokens are sequences of characters
with no punctuation or whitespace. See the docstring for more guidance.
Ex: `Are you there?` should become `Are`, `you`, `there`
● Stop Words
After tokenizing, you will need to remove stop words. Stop words are words that appear too often to be meaningful for
searching (e.g. "the", "of"). In building a search engine, it makes sense to discard stop words. Say you search the cat
in the hat, the shorter query cat hat gets at the essence of the longer query. The Python Natural Language Toolkit
NLTK package includes a list of typical English stopwords that you should filter out.
● Stemming
The next issue to overcome when building a search engine is we want to calculate a count for the stem of each word in
each document as opposed to each of its variants. walking, walked, and walks all essentially mean the same thing, so
we should reduce all of them down to walk. This is what stemming accomplishes. In the stencil code, we instance a
PorterStemmer for you. Note that its stem()method also lowercases all letters, which is desirable behavior.
Here is a suggested approach for the three steps mentioned above.
1. Use sent_tokenize and word_tokenize from nltk.tokenize for tokenizing. If the first execution of
these methods gives you an error, follow the printed instructions and fix it. Here are some examples.
2. Strip punctuations at both ends of each word, str.strip might be useful.
3. Stem words with the `stemmer` above.
4. Filter stop words and empty words. We’ve created a set of stopwords for you using the NLTK toolkit in the
_english_stop_words please filter using this variable. Doing so will let you (and us) test out different sets of
stop words.
Use test_processing_basic as a starting point to help you understand the expected input and output.
Task 5.3.2 Processing Links
The extract_links method is used for this part.
In the XML files you will be working with, links can take the following two forms:
1. [[Impeached Presidents]]: This is a normal link. Impeached Presidents is the text displayed on the page,
and if you click on it, you are directed to the Impeached Presidents Wikipedia page.
2. [[Washington|Presidents]]: The text that come before the | character is the text displayed on the page (we
will call this “display text”), but the page that is linked to is the text that comes after the |. In this case we would
see Washington displayed on the page, but when we click on it, we would be redirected to the Presidents
Wikipedia page.
○ Note: When calculating TF-IDF, Washington should be considered as a word on the page, but
Presidents should not.
○ Note: If a link contains more than one | (e.g. [[File:Essendon logo
2010.png|center|150px]]), then do not treat it as a link. The words should also not be counted as
words on the page.
To isolate links that appear in the text, it is helpful to use a regex (regular expression). The Python package that provides
regex functionality is called re, which has been imported at the top of the stencil code.
Here is a useful regex: r'\[\[[^\[]+?\]\]'
● Meaning: Match two left brackets (\[\[) and two right brackets (\]\]), making sure there's something in the
middle, but also making sure there is not a left bracket in the middle, which would mean that somehow another
link was starting inside this link ([ˆ[]+?).
If you want to play around with regex functions, take a look at this page
Feel free to take a look at the documentation for more information!
The code has been given to you in this part so no coding is required for this part, however, we expect you to fully
understand it.
Report Q&A: extract_links
Hw5.3 In report.md, explain how extract_links operates and what it returns. In addition, What would be different if
we don’t return the second return value (words extracted from display texts)?
Task 5.3.3 Working with XML files in Python
You should work with build_corpus in this part.
You will notice that at the top of processing.py, there is a line that says from xml.etree import ElementTree.
This is a Python package that makes it easy to work with XML files. Take a look at the documentation to learn how to use
this package.
You should iterate through all the <page> nodes in the given XML file. We are interested in the <title> and <text>
of each page. Methods you may find useful are root.iter and node.find(). Apply process_text and
extract_links on the text (or the text concatenated with the title and the display text from links). The first return value
is a corpus(a list, each item is a tuple of document title and processed text) used as the input to TF-IDF. The second
return value is the links that will be used as input to PageRank.
Use test_processing_basic as a starting point to help you understand the expected input and output.
Part 5.4 SearchEngine: Optimizing Search Results
class SearchEngine (located in SearchEngine.py) allows users to query the corpus with various ranking schemes.
Command Line Interface (CLI)
Once you have implemented enough SearchEngine functionality you run python app.py to start an interactive
command-line interface for your search engine (your result will not necessarily match this screenshot). The commands for
use will be displayed when you run it. (The TAs will also use the CLI to evaluate your implementation as part of the
grading process).
Tasks
● Read through __init__ and search. Make sure you understand them.
● Implement the 'smart' search mode in smart_search. You should take both TF-IDF and PageRank into
account. This is where you can get creative!
Keep track of how you develop your smart search. You will need to explain your approach to implementing
the 'smart' search mode in report.md. You will also include 2 sample searches you performed (and their
results) that demonstrate that your approach returns reasonable results. In addition, include a screenshot in that
section of the report of you using the CLI to query your smart engine.
● Add file, class and method docstring in SearchEngine.py and remove all TODOs.
● Be sure that the Command Line Interface (described below) is working for you, since it will be used to also grade
your project.
Report Q&A: process_test
Hw5.4 In report.md, explain why process_text is also applied on the query string.
Report Q&A: Incremental Updates
Hw5.5 In report.md, Web spiders continuously find new web pages and revisit old web pages. When a new page is
added or an old page is revisited an incremental update to the TF-IDF and PageRank must be made. Explain how you
might implement an incremental SearchEngine update() method.
Testing
Thoroughly testing your project should involve a combination of unit testing and integration testing. Use the type of testing
that fits the situation. Include tests that confirm the calculations computed by your search engine are correct and that the
other important pieces are functioning as you expect. Once you have a working search engine, also verify that search
results are reasonable. You should test various queries, keeping in mind that your search engine should handle all inputs
gracefully. We encourage you to create your own XML files for testing.
You will only receive credit for testing which is clearly documented in your report. Test coverage is not the only goal (but
you should still check coverage). Tell us about why we should be confident that your code survives any user inputs /
corpus.
Report
Write your project report in the file called report.md. Include the following under each section heading:
# Search Engine Report
## Overview
Overview of how your code works
## Known Bugs
Description and example inputs and outputs for any known bugs.
## Report Questions & Answers
Hw5.1 Respond to your boss’s suggestions to “store word frequencies the way you compute”. (See original
question above for more detail.)
Hw5.2 Explain the following equation in terms of probability notation if PR(X) is the probability of arriving at node
X, and L(X) is the number of outbound links from X. (i.e., What does d represent, what does (1 - d)/N
represent and, what does the sum represents.)
If 32% of traffic to a webpage comes from non-inlink sources what should the value of d be?
Hw5.3 Explain how extract_links operates and what it returns. In addition, What would be different if we
don’t return the second return value (words extracted from display texts)?
Hw5.4 Explain why process_text is also applied to query strings.
Hw5.5 Web spiders continuously find new web pages and revisit old web pages. When a new page is added or
an old page is revisited an incremental update to the TF-IDF and PageRank must be made. Explain how you
might implement an incremental SearchEngine update() method.
## Smart Search
Description of how you implemented the 'smart' search mode. Include at least two examples that show that it
makes sense, and explains why it can potentially be better than just PageRank or TF-IDF.
## Testing
Description of how you tested your code, include details on what you did beyond running the basic tests included
with the stencil.
Include screenshot from using CLI.
Rubric
● Coding Style: 5%
○ Coding style is a critical part of code quality. Run flake8 *.py in a terminal under your code directory.
Resolve all style warnings to avoid point deduction.
● TF-IDF: 25%
● PageRank: 25%
● SearchEngine & CLI (with Smart Search): 15%
● Processing XML file: 10%
● Report: 10%
● Testing: 10%
Special thanks to CS18 for letting us borrow from their course materials.
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。