Date: 2021-04-15 11:25

COSC431 Assignment 1: Write a Search Engine

PREAMBLE

DUE DATE: 16 April 2021

In the first part of this assignment you will write a parser. In the second you will write an indexer, and in the third you will write the search engine. Marks will be given for parts one and three only. If your search engine works then your indexer must work. This assignment is worth 20% of your final mark.

The files you need for this assignment are located in ~andrew/COSC431. The XML file you are to use is wsj.xml, the TREC Wall Street Journal collection. Your search engine must run in less than one second per query on the machine we mark it on (including startup, input, output, and shutdown).
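Because the one-second budget covers the whole process lifetime, it can help to time the search program from the outside rather than from within. The following is a minimal Python sketch of such a check; the helper name `time_query` is hypothetical, and the marking machine may be faster or slower than your own.

```python
import subprocess
import time


def time_query(command, query):
    """Run command once, feeding query on stdin; return (seconds, stdout).

    The elapsed time includes process startup, input, output, and
    shutdown, because the clock runs around the whole subprocess.
    """
    start = time.perf_counter()
    result = subprocess.run(
        command, input=query, capture_output=True, text=True, check=True
    )
    return time.perf_counter() - start, result.stdout
```

A call might look like `time_query(["python3", "search.py"], "some query terms\n")`, where `search.py` is a placeholder for however your own search program is started.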

PART 1: PARSER 10%

Your program will parse the Wall Street Journal collection of documents marked up in XML. They look roughly like this:

<DOC>
<DOCNO>WSJ920102-0154</DOCNO>
<DOCID>920102-0154.</DOCID>
<HL>Notice to Readers</HL>
<DATE>01/02/92</DATE>
<SO>WALL STREET JOURNAL (J), PAGE 1</SO>
<LP>
<P>Today's Wall Street Journal appears in two sections, the
second of which is the year-end review of markets.</P>
<P>Rhythm &amp; Blues, Inc. filed for bankruptcy today.</P>
</LP>
</DOC>

Your program will output a stream of tokens suitable for your indexer, one per line, with a blank line between documents. I do not recommend using a standard XML parser as the file is “real world” (i.e. not well formed).
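As a rough illustration of the token stream described above, here is a minimal Python sketch that uses regular expressions instead of an XML parser. Several choices are assumptions rather than requirements of the assignment: tokens are lowercased runs of letters and digits, the DOCNO is written as the first line of each document, all other markup is discarded, and the whole input is held in memory. The names `parse_documents` and `write_tokens` are hypothetical.

```python
import html
import re
import sys

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]*>")
TOKEN_RE = re.compile(r"[a-z0-9]+")  # assumed token definition

SAMPLE = """<DOC>
<DOCNO>WSJ920102-0154</DOCNO>
<HL>Notice to Readers</HL>
<P>Rhythm &amp; Blues, Inc. filed for bankruptcy today.</P>
</DOC>"""


def parse_documents(text):
    """Yield (docno, tokens) for each <DOC>...</DOC> block in text."""
    for match in DOC_RE.finditer(text):
        body = match.group(1)
        docno_match = DOCNO_RE.search(body)
        docno = docno_match.group(1) if docno_match else ""
        # Remove the DOCNO element so the identifier is not tokenized too.
        body = DOCNO_RE.sub(" ", body, count=1)
        # Strip tags first, then decode entities such as &amp;.
        plain = html.unescape(TAG_RE.sub(" ", body))
        yield docno, TOKEN_RE.findall(plain.lower())


def write_tokens(text, out=sys.stdout):
    """Write the DOCNO, then one token per line, then a blank line."""
    for docno, tokens in parse_documents(text):
        out.write(docno + "\n")
        for token in tokens:
            out.write(token + "\n")
        out.write("\n")


if __name__ == "__main__":
    write_tokens(SAMPLE)
```

Reading the full collection into one string is the simplest option but may not suit a large file; a streaming variant that scans line by line would follow the same idea.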

PART 2: INDEXER

Build an inverted-file index of the Wall Street Journal collection. The obvious approach is to build this on top of your parser from part 1 of this assignment. You should write the index to disk in whatever format you consider appropriate.
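One possible shape for such an index is sketched below in Python. It assumes a token stream in which the first line of each document is its DOCNO, the remaining lines are tokens, and a blank line ends the document. The on-disk layout (packed `(doc_id, tf)` pairs plus a JSON vocabulary of offsets) is only one illustrative choice among many, and the function names are hypothetical.

```python
import json
import struct
from collections import defaultdict


def build_index(token_lines):
    """Return (docnos, postings) built from an iterable of token lines.

    postings maps term -> {doc_id: term frequency}, where doc_id is the
    position of the document's DOCNO in the docnos list.
    """
    docnos = []
    postings = defaultdict(dict)
    doc_id = None
    for line in token_lines:
        token = line.strip()
        if not token:
            doc_id = None  # blank line: the current document has ended
        elif doc_id is None:
            doc_id = len(docnos)  # first line of a document is its DOCNO
            docnos.append(token)
        else:
            postings[token][doc_id] = postings[token].get(doc_id, 0) + 1
    return docnos, postings


def write_index(docnos, postings, postings_path, vocab_path):
    """Write packed postings and a JSON file of DOCNOs and term offsets."""
    vocab = {}
    with open(postings_path, "wb") as out:
        for term in sorted(postings):
            entries = sorted(postings[term].items())
            # Record where this term's postings start and how many there are.
            vocab[term] = (out.tell(), len(entries))
            for doc_id, tf in entries:
                out.write(struct.pack("<II", doc_id, tf))
    with open(vocab_path, "w", encoding="utf-8") as out:
        json.dump({"docnos": docnos, "vocab": vocab}, out)
```

Storing a byte offset and count per term means a search program can seek to and read only the postings for the query terms instead of loading the whole postings file.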

PART 3: SEARCH ENGINE 10%

Your search program will read a query from stdin and produce results on stdout. If it does not then you will not get any marks for this part of the assignment (we use programs to mark your programs).

Each query will consist of a set of one or more search terms. You are not expected to support phrase, adjacency or proximity searching – in other words, a document-level index is sufficient. You are not expected to support wildcard searching or regular expression searching.

For each query, your program will output one line per relevant document. Each line has two columns:

<docno> <rsv>

where <docno> is the contents of the <DOCNO> element in the Wall Street Journal collection and <rsv> is the retrieval status value (relevance score) that your program has assigned to the document. The output must be sorted from highest to lowest rsv. You must output all documents that you think are relevant (not just the top-10).

Your program will be tested on your indexes, but using queries you have not seen.
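To make the expected output concrete, here is a minimal Python sketch of ranking and printing results over a small in-memory index. The scoring function is a plain TF-IDF sum, which is an assumption: the assignment does not say how the rsv should be computed. The names `parse_query`, `rank`, and `print_results` are hypothetical, and the query is tokenized under the assumption that the indexed tokens are lowercased runs of letters and digits.

```python
import math
import re
import sys
from collections import defaultdict


def parse_query(line):
    """Split a query line into lowercased alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", line.lower())


def rank(query_terms, postings, num_docs):
    """Return [(doc_id, rsv)] sorted from highest to lowest rsv.

    postings maps term -> {doc_id: term frequency}. Every document that
    contains at least one query term is scored and returned.
    """
    scores = defaultdict(float)
    for term in query_terms:
        docs = postings.get(term)
        if not docs:
            continue  # a term that is not in the index contributes nothing
        idf = math.log(num_docs / len(docs))
        for doc_id, tf in docs.items():
            scores[doc_id] += tf * idf
    # Sort by descending rsv; ties are broken by doc_id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def print_results(results, docnos, out=sys.stdout):
    """Write one '<docno> <rsv>' line per result."""
    for doc_id, rsv in results:
        out.write(f"{docnos[doc_id]} {rsv:.4f}\n")
```

In a complete program the query would come from `sys.stdin.readline()`, and `postings` would be filled by reading only the query terms' entries from the index on disk.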

HOW TO SUBMIT YOUR ASSIGNMENT

Submit your program by emailing me a github.com (or equivalent) URL. I will mark the most recent commit.

You will include a README that tells me where, on the department file server, I can find your index, and how to compile and run your programs. You must NOT include wsj.xml or your indexes in your github archive.

