
Search Engine

Homework #5

DATA 1050 Fall 2020

Due 23:59 EST, Wednesday, October 28, 2020

Optional: Coding Problem 3 Resubmission due by 23:59 EST, Monday, October 19, 2020

Setup & Submission

1. Use this GitHub Assignment Invitation Link: https://classroom.github.com/a/Vbl7I9d9

2. Run pip install --user pytest pytest-timeout flake8 black nltk in a terminal to install all required packages.

3. Push to master as much as you want before the homework deadline.

4. If your final push will be late, remember to use the Late Day Form *twice*, once before the deadline, and once

again after your final push.

5. When you are ready to submit, visit gradescope and submit your repo. Your report should be included in the repo

as report.md

6. Be sure to also Include a copy and paste of the test results and coverage information in coverage.md in each of

your stencil directories. Two students created a helpful utility script to do this for you.

7. For Coding Problem 3 Resubmission, please only include the coding problems you want to be

re-graded in a stencil_CodingProblems3 subdirectory of your hw5 repo and submit the entire repo via

the corresponding gradescope assignment “Coding Problems3 Resubmission”

Overview

Search engines retrieve a list of documents (typically web pages) that match a given query. Since multiple documents

may satisfy the query, most search engines also attempt to list more relevant documents first. In this project, you will use

the TF-IDF and PageRank algorithms to create part of a web search engine.

Operationally, each searchable document must be analyzed in advance (in a process called indexing) to gather and

integrate statistics about its content. This data is used to form an index (like the one found at the end of a book) that

allows user search queries to be performed in real-time without the need to revisit all of the original documents.

Document term count information is enough to perform simple query matching since it allows one to find all documents

that contain a particular set of terms. It also allows one to rank documents relative to one another based on their term

frequencies. TF-IDF, otherwise known as term frequency–inverse document frequency, is one way to do this.

Web search engines use “spiders” to “crawl” the web in order to automatically visit and index web pages. At each page,

a count of the number of times each word (“term”) appears in a document is recorded (i.e., the term histogram is

recorded), a list of outbound links (outlinks) is collected, and then the spider moves on.

When present, link information defines a document graph, which can be used for document ranking. For example,

documents that are referred to by many other documents may be more important than ones with very few incoming links

(inlinks). Today, Google and most other web search engines use PageRank on link data to help determine web page

importance.

The PageRank algorithm was created by Larry Page and Sergey Brin (the founders of Google) while they were still

graduate students at Stanford. While Page’s original technical report suggested using it to rank web page importance

based on link information, it has since been used to analyze many other networks. Identifying social media influencers

(based on follower networks), and gauging the importance of scientific papers (via citation networks) are two examples.

A Note on Vocabulary

In this project handout,

● A “document” refers to a Wikipedia page

● A “term” refers to a word (or alphanumeric sequence containing no white space)

Details

We used the Wikipedia download interface to download some of its web pages. These documents (web pages) are

stored as XML records. In order to create your search engine, you will write code to process these XML files.

In the world of search and natural language processing, a collection of written texts is called a corpus (plural corpora); in this case, the texts are web pages with links. Thus the corpora you will be working with are subsets of the pages of the English-language Wikipedia.

This project has four main parts, which are described in detail in the sections below:

1. Calculating TF-IDF

2. Calculating PageRanks

3. Processing the XML files

4. Search Engine and user interface

Note: We suggest you implement your solution in the sequence listed above. TF-IDF and PageRank are both

somewhat math-heavy, and it is important you get them working with artificial data for which you are sure of the answers

before you start processing real data. Specifically, this data will be used to create analytic test cases, some of which we

provide, and some of which you will provide. We encourage you to make git commits once a subcomponent is working.

This makes it easier for you to revert to a previous version.

Proceeding in this order will also help the course staff know which parts of the problem you have completed if you have

questions.

Stencil files provided:

● TFIDF.py, PageRank.py: Task: Implement the TF-IDF/PageRank algorithms.

● processing.py: Task: index the Wikipedia corpus

● SearchEngine.py: Task: Implement an API that takes search queries and returns search results ranked by the

chosen algorithm

● app.py: A simple command-line interface (CLI) we supply that will allow you to use your search engine interactively

● report.md: Task: Write your report here

Datasets provided:

(note that all datasets have been zipped together into wiki_data.zip. You’ll have to extract them into your stencil

directory before you can proceed)

● MiniWiki.xml, SmallWiki.xml: XML files with 2 and 107 Wikipedia articles respectively. As you work on

adding functionality to your search engine, we suggest you start by testing with these!

● MedWiki.xml: An XML file containing 871 Wikipedia articles. You can try testing with this file once you get your code

working with SmallWiki.

● PageWiki.xml: An XML file containing 100 documents with titles 1 to 100. Each document contains nothing

except for a link to document 100. You can use this file to test your implementation of PageRank. Document 100

should have a much higher PageRank than the other documents.

Using VSCode

Jupyter notebooks are often used for visualization and analysis, and they work well for one-time coding tasks. In this

project and more generally in DATA 1050, we will not be coding in Jupyter notebooks but rather in Python files. To help with this, we recommend you try using the Visual Studio Code Integrated Development Environment (IDE) with the

Microsoft Python Development extension.

Please read the VS Code Instructions document to learn how to do the following:

1. Start the VS Code IDE on data.jupyter.brown.edu

2. Install the Python Development Extension

3. Install and configure other helpful packages and extensions

4. Run the debugger

VS Code is an immensely powerful software development tool. Learning how to use its many features takes a long time.

However, because it also provides integrated terminal window support, you can continue to use all of the command line

based python development skills you already have.

Since this is the first time we have used VS Code in this course, we will update the VS Code Instructions as needed. We suggest you enable notifications for changes to the VS Code Instructions document.

Coding Style Requirements

Using a consistent coding style is a critical part of code quality. For this project, make your style consistent with PEP 8,

the style guide for Python developers from Guido himself. There are many tools for checking whether your code is styled

correctly. You can periodically run flake8 *.py in a terminal under your code directory. You can also install a flake8

extension to give you PEP-8 style warnings as you code. Either way, resolve all style warnings you see to avoid receiving

point deductions.

An auto-formatting tool may also come in handy. From the terminal you can run black -l100 <name of Python file> to

automatically reformat your code (e.g., running black -l100 PageRank.py will reformat the code in your PageRank.py

file.) Various VS Code extensions provide similar functionality and allow you to select regions of text to reformat.

Part 5.1 TF-IDF

As the name suggests, TF-IDF has two parts: TF (term frequency) and IDF (inverse document frequency). TF-IDF allows

us to capture the following two ideas:

● A term that appears many times within a document is likely to be more relevant to that document than a term that appears fewer times (this is captured by TF). We calculate the term frequency as

TF(term, doc) = (# of times term appears in the document) / (total # of terms in the document)

● A term that occurs in fewer documents in the corpus is likely to be a better signal of relevance than a common term like “the” (this is captured by IDF). A rare term often tells more about the document. The inverse document frequency of a term is defined as

IDF(term) = log( (# of documents in the corpus) / (# of documents in which the term appears) )

Please use the natural log (which is what Python’s math.log computes by default).

● TFIDF(term, doc) = TF(term, doc) × IDF(term)

Examples

Consider a document containing 100 terms wherein the term cat appears 3 times. The term frequency (i.e.,

tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the term cat appears

in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as ln(10,000,000 /

1,000) = 9.210. Thus, the tf-idf weight for the term cat in this document is the product of these quantities:

0.03 * 9.210 = 0.2763.

You may also find a more sophisticated example here.
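As a sanity check, the worked example above can be reproduced in a few lines of Python (this is just the arithmetic, not the stencil's API):

```python
import math

# Worked example from above: 'cat' appears 3 times in a 100-term document,
# and in 1,000 of the 10,000,000 documents in the corpus.
tf = 3 / 100                        # term frequency: 0.03
idf = math.log(10_000_000 / 1_000)  # natural log, as math.log computes by default
tf_idf = tf * idf                   # ~0.2763

print(tf, round(idf, 3), round(tf_idf, 4))
```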

Complete the following tasks in TFIDF.py.

Task 5.1

● calc_term_frequency: For each document and each term, calculate the term frequency. Store the results in a dictionary that maps each term to a secondary dict. The secondary dict maps a document title doc to TF(term, doc). Indexing looks like this: result[term][doc] gives the frequency of that term in that document. Pay close attention to the docstring. See the provided function test_tfidf_basic for example inputs and outputs.

● calc_tf_idf: Iterate through the dictionary created for TF and multiply each term frequency by the term’s

inverse document frequency. Note that the returned dictionary can also serve as a “reverse index,” (i.e., given a

term, you can use a single dictionary lookup to get the documents that contain this term)

● search: Given a list of keywords, return a list of matching documents paired with their tf-idf scores. Each

returned document should contain at least 1 of the keywords. The returned list should be sorted by document

score. When a limit is specified, only return that many (doc, tf-idf-score) pairs (e.g., if limit = 10, you

would only return the top 10 matches). Compute tf-idf score by summing the tf-idf values of each query word

found in the document.

score(doc, keywords) = ∑ TFIDF(term, doc), summed over term ∈ keywords

● Include your pytest compatible tests in TFIDF.py.
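To make the dict-of-dicts layout and the search scoring concrete, here is an end-to-end sketch on a toy corpus. The function names and input format are illustrative only; the stencil's docstrings, not this sketch, define the required signatures.

```python
import math

def calc_term_frequency(corpus):
    """Map each term to a secondary dict {doc_title: TF(term, doc)}."""
    tf = {}
    for title, terms in corpus:
        for term in terms:
            tf.setdefault(term, {}).setdefault(title, 0.0)
            tf[term][title] += 1 / len(terms)
    return tf

def calc_tf_idf(tf, num_docs):
    """Multiply each TF by the term's IDF; the result doubles as a reverse index."""
    return {
        term: {doc: f * math.log(num_docs / len(docs)) for doc, f in docs.items()}
        for term, docs in tf.items()
    }

def search(tf_idf, keywords, limit=None):
    """Sum the tf-idf of each keyword found in a document; sort by score."""
    scores = {}
    for term in keywords:
        for doc, score in tf_idf.get(term, {}).items():
            scores[doc] = scores.get(doc, 0.0) + score
    ranked = sorted(scores.items(), key=lambda pair: -pair[1])
    return ranked[:limit] if limit else ranked

corpus = [("d1", ["cat", "cat", "dog"]), ("d2", ["dog", "bird"])]
index = calc_tf_idf(calc_term_frequency(corpus), num_docs=2)
results = search(index, ["cat", "bird"])
```

Note how "dog", which appears in every document, gets an IDF of log(2/2) = 0 and so contributes nothing to any score.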

Notes

● Read carefully through the docstrings (prose and parameter descriptions). The docstrings most precisely specify the behavior of each method.

● Use test_tfidf_basic as a starting point to help you understand the expected input and output.

● Some methods are labeled as @staticmethod. This means they do not depend on instance or class data; notice they do not take a self parameter. @staticmethods can be invoked without a class instance, as illustrated in test_tfidf_basic. Their output does not depend on instance or class variables (i.e., they are regular stateless functions). This is sometimes helpful because each method can be tested separately. (For interested readers, see here for a discussion of the difference between @classmethods (which can depend on class variables) and @staticmethods.)

Report Q&A: Ordering and Logs

Hw5.1 Your boss has been reviewing your code and wants you to do the following:

“store word frequencies the way you compute them”, i.e., s/he wants you to use a dict of dicts which is indexed via {document: {word: word frequency}}. Think carefully about this suggestion and give an appropriate response.

Part 5.2 PageRank

So far we have scored documents based on their relevance to a given query only. This is how early search engines like

AltaVista worked. But then, in the late 90s, the founders of Google came up with an idea for ranking documents based on

their position in the web itself. Their PageRank algorithm gives an overall importance score and it pays no attention

whatsoever to the documents' content. Rather, the rankings it computes are based entirely on the link structure between

documents: i.e., which documents link to which. Documents with high scores (i.e.,

high PageRanks) have links from other documents that also have high ranks (which are linked to other documents with

high ranks). This is a kind of crazy circular idea, but when converted into the appropriate mathematics the algorithm works

quite well.

There are many variants of the PageRank algorithm. In this project, we will implement one of the most basic. Key

principles underlying the design of PageRank:

1. The more documents that link to a page, the more authoritative it should be.

○ For example, if my personal blog post on “Providence Internet Outages" is linked to by 5 different web

pages (5 different incoming links, aka 5 inlinks), and the Wikipedia document on "Providence, Rhode

Island" is linked to by 1000s of web pages (1000s of inlinks), then it makes sense to consider the

Wikipedia document more authoritative.

○ Many early search engines used incoming link counts, and at first this worked well, but over time people

realized they could promote a page by making many fake pages that pointed to it.

○ The PageRank innovation was to also consider the authority score of incoming pages. Authoritative

pages should also be linked to by other authoritative pages.

2. The ability of an authoritative page to promote other pages should be constrained by the total number of outgoing

links it has; fewer links provide better references.

○ Assume "Research at Brown" is linked to by an NYTimes document which links to only 10 other pages,

and my blog "About Me" page is linked to by "Rhode Island Technology Bloggers", which links to 2000

other pages. Because the NYTimes page has only a few links, and “Rhode Island Technology Bloggers” has many links, a link from the NYTimes page lends a larger fraction of its importance than a link from “Rhode Island Technology Bloggers”, leading us to attribute more authority to "Research at Brown" than to "About Me".

Let u be a document, B_u be the set of documents that have a link to u, and L(u) be the number of outgoing edges from u. The PageRank score for document u is defined as

PageRank(u) = ∑ PageRank(v) / L(v), summed over v ∈ B_u

The trouble with this equation is that it is circular; in order to calculate the weight of document u, we have to know the weights of all documents linking to u. How do we begin calculating the document weights? In practice, we use an iterative

approach and stop once the document weights have converged. Here is a suggested approach to implementing

PageRank:

● Create a dict weights. Initialize weights for all documents to 1/N, where N is the number of documents in

the corpus

● While maximum # of iterations not reached

○ Create a dict new_weights with weights for all documents set to 0

○ Iterate through each document x.

For each document y that document x links to, update

new_weights[y] += weights[x] / (# of documents x links to)

○ Calculate L2-distance between weights and new_weights.

○ Break out of the loop if distance < eps * N

○ Set weights = new_weights

Note:

● L2-distance is Euclidean distance

● If you sum up the PageRank value for each document in a corpus, you should get 1.

● Links from a document to itself are ignored. Links to documents outside the corpus are ignored. You may

need to filter such links.

● Documents that link to nothing (i.e. sinks, like B below) are considered linked (once) to all documents, including themselves. Therefore, each document receives (weight of the sink) / N.

● Multiple links from one document to another are treated as a single link.

Examples

Suppose we have a corpus consisting of the documents (a.k.a. pages) A, B, and C, where A links to B and C, and C links to A. The table below shows the results of running the PageRank algorithm on this corpus to determine the weight (a.k.a. rank) of each document. Remember that since B does not link to anything, we treat it as if it links to every document, including itself.

| Iteration | Weight of A | Weight of B | Weight of C |
|---|---|---|---|
| Formula | B/3 + C/1 | A/2 + B/3 | A/2 + B/3 |
| Initial | 1/3 = .333 | 1/3 = .333 | 1/3 = .333 |
| 1 | .333/3 + .333/1 = .444 | .333/2 + .333/3 = .277 | .333/2 + .333/3 = .277 |
| 2 | .277/3 + .277/1 = .370 | .444/2 + .277/3 = .314 | .444/2 + .277/3 = .314 |
| 3 | .314/3 + .314/1 = .419 | .370/2 + .314/3 = .290 | .370/2 + .314/3 = .290 |
| 4 | .290/3 + .290/1 = .386 | .419/2 + .290/3 = .306 | .419/2 + .290/3 = .306 |
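The suggested iterative procedure, including the sink rule, can be sketched directly in Python; running it on the A/B/C corpus above converges to A = 0.4, B = C = 0.3. The function name and input format here are illustrative, not the stencil's.

```python
import math

def page_rank(links, max_iters=100, eps=1e-8):
    """links: dict mapping each document to the set of documents it links to."""
    docs = list(links)
    n = len(docs)
    weights = {d: 1 / n for d in docs}          # initialize all weights to 1/N
    for _ in range(max_iters):
        new_weights = {d: 0.0 for d in docs}
        for x in docs:
            # Sinks are treated as linking (once) to every document, themselves included.
            targets = links[x] if links[x] else docs
            for y in targets:
                new_weights[y] += weights[x] / len(targets)
        # L2 (Euclidean) distance between old and new weights
        dist = math.sqrt(sum((new_weights[d] - weights[d]) ** 2 for d in docs))
        weights = new_weights
        if dist < eps * n:
            break
    return weights

# The three-document example: A links to B and C, B is a sink, C links to A.
w = page_rank({"A": {"B", "C"}, "B": set(), "C": {"A"}})
```

Since every iteration redistributes all of the existing weight, the PageRank values always sum to 1.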

Task 5.2

● calc_page_rank: Given links between the documents, calculate the PageRank values for each page as

specified above.

● search: Search for documents containing at least one of the query words; return up to `limit` of them, sorted by PageRank value.

● Write unit tests for PageRank.py.

● Use test_page_rank_basic as a starting point to help you understand the expected input and output.

● Add file docstring and class docstring in PageRank.py and remove all TODOs.

Report Q&A: PageRank Understanding

Hw5.2 There is a Markov chain interpretation of the PageRank algorithm, in which a page’s PageRank value represents the frequency with which web surfers arrive at that page by following hyperlinks found on other pages.

The model also includes the idea that a user can visit a page without clicking a link (e.g., they can visit via a bookmark or a typed URL). This random jumping process can be included in an iterative solution via a damping factor d, giving the standard damped update

PR(X) = (1 - d)/N + d × ∑ PR(Y)/L(Y), summed over the pages Y that link to X

Explain this equation in terms of probability notation if PR(X) is the probability of arriving at node X, and L(X) is the number of outbound links from X. (i.e., What does d represent, what does (1 - d)/N represent, and what does the sum represent?)

If 32% of traffic to a webpage comes from non-inlink sources, what should the value of d be?

See https://en.wikipedia.org/wiki/PageRank if you are stuck.

Part 5.3 Processing the XML file

TF-IDF is computed from word counts. In order to obtain these, irrelevant data (like punctuation and word spelling variations) must be removed. The TF-IDF calculation is then done with respect to this simplified corpus. PageRank requires all the links in the corpus.

It is now time to process the Wikipedia dump XML to extract each type of data.

Introduction to XML files

The Wikipedia corpora used for this project are stored as XML files. XML stands for eXtensible Markup Language. It is

similar to HTML in that elements are surrounded by matching opening and closing tags. An opening tag looks like

<title>. The corresponding closing tag includes a slash, so it looks like </title>.

Open SmallWiki.xml in a text editor and get familiar with the structure of XML files.

Task 5.3.1 Processing Text

You should work with process_text in processing.py.

● Tokenizing

Tokenization refers to the process of splitting a string into tokens. For our purposes, tokens are sequences of characters

with no punctuation or whitespace. See the docstring for more guidance.

Ex: `Are you there?` should become `Are`, `you`, `there`

● Stop Words

After tokenizing, you will need to remove stop words. Stop words are words that appear too often to be meaningful for

searching (e.g. "the", "of"). In building a search engine, it makes sense to discard stop words. If you search for the cat

in the hat, the shorter query cat hat gets at the essence of the longer query. The Python Natural Language Toolkit

NLTK package includes a list of typical English stopwords that you should filter out.

● Stemming

The next issue to overcome when building a search engine is that we want to count the stem of each word in each document, rather than each of its variants. walking, walked, and walks all essentially mean the same thing, so we should reduce all of them to walk. This is what stemming accomplishes. In the stencil code, we instantiate a PorterStemmer for you. Note that its stem() method also lowercases all letters, which is desirable behavior.

Here is a suggested approach for the three steps mentioned above.

1. Use sent_tokenize and word_tokenize from nltk.tokenize for tokenizing. If the first execution of

these methods gives you an error, follow the printed instructions and fix it. Here are some examples.

2. Strip punctuation at both ends of each word; str.strip might be useful.

3. Stem words with the `stemmer` above.

4. Filter out stop words and empty words. We’ve created a set of stop words for you using the NLTK toolkit in the _english_stop_words variable; please filter using it. Doing so will let you (and us) test out different sets of stop words.

Use test_processing_basic as a starting point to help you understand the expected input and output.
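The tokenize → strip → stem → filter order of operations can be sketched as below. Note this uses a crude regex tokenizer, a tiny hardcoded stop-word set, and a toy suffix-stripper purely as stand-ins for NLTK's word_tokenize, stop-word list, and PorterStemmer; your solution should use the NLTK tools described above.

```python
import re
import string

STOP_WORDS = {"the", "of", "a", "in", "are", "you"}  # tiny stand-in for NLTK's list

def toy_stem(word):
    """Crude suffix-stripper standing in for PorterStemmer (illustration only)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def process_text_sketch(text):
    tokens = re.findall(r"[A-Za-z0-9']+", text)      # 1. tokenize
    out = []
    for tok in tokens:
        tok = tok.strip(string.punctuation).lower()  # 2. strip punctuation, lowercase
        tok = toy_stem(tok)                          # 3. stem
        if tok and tok not in STOP_WORDS:            # 4. filter stop/empty words
            out.append(tok)
    return out

print(process_text_sketch("Are you there? The cats walked."))  # → ['there', 'cat', 'walk']
```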

Task 5.3.2 Processing Links

The extract_links method is used for this part.

In the XML files you will be working with, links can take the following two forms:

1. [[Impeached Presidents]]: This is a normal link. Impeached Presidents is the text displayed on the page,

and if you click on it, you are directed to the Impeached Presidents Wikipedia page.

2. [[Washington|Presidents]]: The text that comes before the | character is the text displayed on the page (we

will call this “display text”), but the page that is linked to is the text that comes after the |. In this case we would

see Washington displayed on the page, but when we click on it, we would be redirected to the Presidents

Wikipedia page.

○ Note: When calculating TF-IDF, Washington should be considered as a word on the page, but

Presidents should not.

○ Note: If a link contains more than one | (e.g. [[File:Essendon logo

2010.png|center|150px]]), then do not treat it as a link. The words should also not be counted as

words on the page.

To isolate links that appear in the text, it is helpful to use a regex (regular expression). The Python package that provides

regex functionality is called re, which has been imported at the top of the stencil code.

Here is a useful regex: r'\[\[[^\[]+?\]\]'

● Meaning: Match two left brackets (\[\[) and two right brackets (\]\]), making sure there is something in the middle ([^\[]+?), but also making sure there is no left bracket in the middle, which would mean that somehow another link was starting inside this link.

If you want to play around with regex functions, take a look at this page

Feel free to take a look at the documentation for more information!

The code for this part has been given to you, so no coding is required; however, we expect you to fully understand it.
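To make the behavior concrete, a simplified version of link extraction using the regex above might look like this. This is a hypothetical helper for illustration, not the stencil code; in particular, it treats a normal link's target as its display text too.

```python
import re

LINK_RE = re.compile(r'\[\[[^\[]+?\]\]')

def extract_links_sketch(text):
    """Return (link targets, display-text words that count as words on the page)."""
    targets, display = [], []
    for raw in LINK_RE.findall(text):
        parts = raw[2:-2].split('|')        # drop the [[ ]] wrapper, split on |
        if len(parts) == 1:                 # [[Target]]: display text is the target
            targets.append(parts[0])
            display.append(parts[0])
        elif len(parts) == 2:               # [[Display|Target]]
            display.append(parts[0])
            targets.append(parts[1])
        # more than one '|' (e.g. [[File:...|center|150px]]): not a link, skip it
    return targets, display

links, words = extract_links_sketch(
    "See [[Impeached Presidents]], [[Washington|Presidents]], "
    "and [[File:Essendon logo 2010.png|center|150px]]."
)
```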

Report Q&A: extract_links

Hw5.3 In report.md, explain how extract_links operates and what it returns. In addition, what would be different if we did not return the second return value (words extracted from display texts)?

Task 5.3.3 Working with XML files in Python

You should work with build_corpus in this part.

You will notice that at the top of processing.py, there is a line that says from xml.etree import ElementTree.

This is a Python package that makes it easy to work with XML files. Take a look at the documentation to learn how to use

this package.

You should iterate through all the <page> nodes in the given XML file. We are interested in the <title> and <text>

of each page. Methods you may find useful are root.iter and node.find(). Apply process_text and

extract_links on the text (or the text concatenated with the title and the display text from links). The first return value is a corpus (a list in which each item is a tuple of document title and processed text) used as the input to TF-IDF. The second return value is the links that will be used as input to PageRank.

Use test_processing_basic as a starting point to help you understand the expected input and output.
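A minimal illustration of the ElementTree calls mentioned above, using fromstring on an inline snippet (build_corpus will instead parse the provided XML files, e.g. with ElementTree.parse):

```python
from xml.etree import ElementTree

SAMPLE_XML = """<wiki>
  <page><title>A</title><text>A links to [[B]].</text></page>
  <page><title>B</title><text>Plain text only.</text></page>
</wiki>"""

root = ElementTree.fromstring(SAMPLE_XML)
# Iterate over all <page> nodes and pull out each <title> and <text>.
pages = [(page.find("title").text, page.find("text").text)
         for page in root.iter("page")]
# Each (title, text) pair would then be fed through process_text / extract_links.
```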

Part 5.4 SearchEngine: Optimizing Search Results

class SearchEngine (located in SearchEngine.py) allows users to query the corpus with various ranking schemes.

Command Line Interface (CLI)

Once you have implemented enough SearchEngine functionality, you can run python app.py to start an interactive command-line interface for your search engine. The available commands will be displayed when you run it. (The TAs will also use the CLI to evaluate your implementation as part of the grading process.)

Tasks

● Read through __init__ and search. Make sure you understand them.

● Implement the 'smart' search mode in smart_search. You should take both TF-IDF and PageRank into

account. This is where you can get creative!

Keep track of how you develop your smart search. You will need to explain your approach to implementing

the 'smart' search mode in report.md. You will also include 2 sample searches you performed (and their

results) that demonstrate that your approach returns reasonable results. In addition, include a screenshot in that

section of the report of you using the CLI to query your smart engine.

● Add file, class and method docstring in SearchEngine.py and remove all TODOs.

● Be sure that the Command Line Interface (described above) is working for you, since it will also be used to grade your project.
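There is no single right answer for smart_search, but one simple starting point is to blend the two scores multiplicatively, rewarding documents that are both query-relevant and well linked. The function name and the alpha tuning knob below are hypothetical, not part of the stencil.

```python
def smart_score(tf_idf_score, page_rank, alpha=0.5):
    """One possible blend: damp PageRank's influence with an exponent alpha."""
    return tf_idf_score * (page_rank ** alpha)

# A highly relevant but obscure page can still beat an authoritative off-topic one:
print(smart_score(0.9, 0.001) > smart_score(0.1, 0.01))  # → True
```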

Report Q&A: process_text

Hw5.4 In report.md, explain why process_text is also applied on the query string.

Report Q&A: Incremental Updates

Hw5.5 Web spiders continuously find new web pages and revisit old ones. When a new page is added or an old page is revisited, an incremental update to the TF-IDF and PageRank data must be made. In report.md, explain how you might implement an incremental SearchEngine update() method.

Testing

Thoroughly testing your project should involve a combination of unit testing and integration testing. Use the type of testing

that fits the situation. Include tests that confirm the calculations computed by your search engine are correct and that the

other important pieces are functioning as you expect. Once you have a working search engine, also verify that search

results are reasonable. You should test various queries, keeping in mind that your search engine should handle all inputs

gracefully. We encourage you to create your own XML files for testing.

You will only receive credit for testing which is clearly documented in your report. Test coverage is not the only goal (but

you should still check coverage). Tell us why we should be confident that your code survives any user inputs /

corpus.

Report

Write your project report in the file called report.md. Include the following under each section heading:

# Search Engine Report

## Overview

Overview of how your code works

## Known Bugs

Description and example inputs and outputs for any known bugs.

## Report Questions & Answers

Hw5.1 Respond to your boss’s suggestion to “store word frequencies the way you compute them”. (See the original question above for more detail.)

Hw5.2 Explain the damped PageRank equation in terms of probability notation if PR(X) is the probability of arriving at node X, and L(X) is the number of outbound links from X. (i.e., What does d represent, what does (1 - d)/N represent, and what does the sum represent?)

If 32% of traffic to a webpage comes from non-inlink sources, what should the value of d be?

Hw5.3 Explain how extract_links operates and what it returns. In addition, what would be different if we did not return the second return value (words extracted from display texts)?

Hw5.4 Explain why process_text is also applied to query strings.

Hw5.5 Web spiders continuously find new web pages and revisit old web pages. When a new page is added or

an old page is revisited an incremental update to the TF-IDF and PageRank must be made. Explain how you

might implement an incremental SearchEngine update() method.

## Smart Search

Description of how you implemented the 'smart' search mode. Include at least two examples that show that it makes sense, and explain why it can potentially be better than just PageRank or TF-IDF alone.

## Testing

Description of how you tested your code, include details on what you did beyond running the basic tests included

with the stencil.

Include screenshot from using CLI.

Rubric

● Coding Style: 5%

○ Coding style is a critical part of code quality. Run flake8 *.py in a terminal under your code directory.

Resolve all style warnings to avoid point deduction.

● TF-IDF: 25%

● PageRank: 25%

● SearchEngine & CLI (with Smart Search): 15%

● Processing XML file: 10%

● Report: 10%

● Testing: 10%

Special thanks to CS18 for letting us borrow from their course materials.

