FIT5166 Information Retrieval Systems
Practical Assignment - Semester 2 2018
(18% of Total Marks)
Your task is to write an information retrieval engine, which will be able to index a collection of documents and, in response to a keyword query, retrieve matching documents. The information retrieval model your program will use is the vector-space model.
You must follow all of the instructions below:
SEPARATE SUBMISSIONS ARE REQUIRED FOR THE CREDIT LEVEL ASSIGNMENT
AND THE HIGH-DISTINCTION LEVEL ASSIGNMENT (IF ATTEMPTING THE HIGH-DISTINCTION LEVEL).
I. INSTRUCTIONS FOR THE CREDIT LEVEL ASSIGNMENT
(MAXIMUM MARK 69%)
1. Your program can be written in Java, Python or any other programming language of your choice. Note that since programming skills are a pre-requisite of this unit, your tutor is not to help you with the coding part of the assignment.
2. All your programming source files must be submitted as specified in Section III, and must all follow the standard convention of having a file extension depending on the programming language you use (e.g. .java, .py). Do not use package statements in your code.
3. The name of your program must be MySearchEngine (i.e. at a minimum your source code directory must contain a file called MySearchEngine.java which contains the main() method). You may split your code into multiple source files, as long as they compile to produce the final MySearchEngine.class file by issuing the command in instruction #4.
4. It must be possible to compile your program on the server by issuing the relevant runtime command from within the source code directory e.g.
javac *.java
5. Your program should be able to run from the command line and send its output to standard output (except for the index, which must be stored as a file).
6. Your program must be able to be invoked from the command line with the following usage/parameters:
java MySearchEngine [command]
where [command] is one of:
a. index collection_dir index_dir stopwords.txt
Index all the documents stored in collection_dir. The index so constructed should be stored in index_dir. The index file should be named index.txt. See instructions #7 and #8 for the prescribed tokenization/stemming rules and index format. Stopwords are contained in the file stopwords.txt, a plain text file with one stopword per line. Stopwords listed in stopwords.txt must not be stemmed into index terms.
for example:
java MySearchEngine index ~/mydocs ~/myindex ~/stopwords.txt
b. search index_dir num_docs keyword_list
Return a ranked list of the top num_docs documents that match the query specified in keyword_list. The most relevant document must appear first in the list. Note that keywords in the query are separated by white space on the command line. Refer to instruction #9 for a more detailed description of what should be returned by this command.
for example:
java MySearchEngine search ~/myindex 10 monash university
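The two usages in instruction #6 can be sketched as a small argument parser. This is an illustration in Python (one of the languages instruction #1 permits), not a required design; the function name parse_command and the dictionary keys are hypothetical.

```python
def parse_command(argv):
    """Map the arguments after the program name to a (command, options) pair.

    Hypothetical helper: the names used here are illustrative only.
    """
    if len(argv) == 4 and argv[0] == "index":
        return "index", {
            "collection_dir": argv[1],
            "index_dir": argv[2],
            "stopwords": argv[3],
        }
    if len(argv) >= 4 and argv[0] == "search":
        return "search", {
            "index_dir": argv[1],
            "num_docs": int(argv[2]),   # raises ValueError if not an integer
            "keywords": argv[3:],       # keywords are whitespace-separated arguments
        }
    raise ValueError(
        "usage: index collection_dir index_dir stopwords.txt"
        " | search index_dir num_docs keyword_list"
    )
```

A real program would typically pass sys.argv[1:] to such a function and then call its indexing or searching routine.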
7. When indexing documents, your program must first perform appropriate tokenization and stemming on the source document content.
You can assume the source documents will be English language and in plain text.
Tokenization of the documents must follow these rules:
a. Any words hyphenated across a line break must be joined into a single token (with the final token not containing the hyphen).
b. Email addresses, web URLs and IP addresses must be preserved as a single token.
c. Text within single quotation marks or inverted commas (e.g. ‘Word Press’) should be placed in a single token.
d. Two or more words separated by whitespace, all of which begin with a capital letter, must be preserved as a single token (i.e. include the whitespace in the token).
e. Acronyms should be preserved as a single token, with or without the full stop or period (e.g. C.A.T can result in CAT or C.A.T).
f. For all other text, split the text into tokens using as delimiters either whitespace or elements of the following subset of
(note this set includes the braces themselves).
After tokenization, tokens must be stemmed into index terms using the Porter stemmer. You may use code from the following website to implement the Porter stemmer (remember to reference the website in the comments in your code):
http://tartarus.org/martin/PorterStemmer/
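As a rough illustration of tokenization rules (a), (b) and (f) only, the following Python sketch joins words hyphenated across a line break, protects URLs, email addresses and IPv4 addresses with regular expressions, and splits the remaining text. The delimiter set ASSUMED_DELIMS is a placeholder assumption, because the prescribed set is not reproduced in this text, and the patterns are simplified approximations rather than complete definitions. Rules (c), (d), (e), stopword removal and Porter stemming are not covered here.

```python
import re

# Placeholder assumption: substitute the delimiter set prescribed in rule (f).
ASSUMED_DELIMS = ".,:;!?()[]{}\""

# Simplified patterns for tokens that rule (b) says must stay whole.
PROTECTED = re.compile(
    r"(?:https?://|www\.)\S+"             # web URLs
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"      # email addresses
    r"|\b(?:\d{1,3}\.){3}\d{1,3}\b"       # IPv4 addresses
)
SPLITTER = re.compile(r"[\s" + re.escape(ASSUMED_DELIMS) + r"]+")


def tokenize(text):
    """Tokenize text following rules (a), (b) and (f) only."""
    # Rule (a): join a word hyphenated across a line break, dropping the hyphen.
    text = re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", text)
    tokens, pos = [], 0
    for match in PROTECTED.finditer(text):
        # Rule (f): split the ordinary text that precedes the protected token.
        tokens += [t for t in SPLITTER.split(text[pos:match.start()]) if t]
        # Rule (b): keep the match whole, minus trailing sentence punctuation.
        tokens.append(match.group().rstrip(".,;:!?)"))
        pos = match.end()
    tokens += [t for t in SPLITTER.split(text[pos:]) if t]
    return tokens
```

Handling the protected tokens first means the delimiter split never sees the dots and slashes inside a URL or address.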
8. Each record in your index must have the following format (with fields separated by commas, lines separated by the end of line character and any non-integer quantities rounded to 3 decimal places). Inverse document frequencies should be calculated using natural log. Also, the denominator of the classical idf formula should be incremented by one to allow for query terms that do not appear in the index. Note that, below, {} indicates repetition, but the {} characters will not appear in your index.
term,{doc-id,tf},idf
For example, suppose in a corpus of 10 documents, that the stemmed term cat appears twice in document d4 and once in both documents d6 and d7. The index entry will be:
cat,d4,2,d6,1,d7,1,0.916
The document-id (doc-id) will be the simple filename of the document (e.g. the text that follows the last directory separator character in the absolute pathname of the file).
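The record format and idf calculation in instruction #8 can be checked with a short Python sketch. It reads the idf rule as ln(N / (1 + df)), where N is the number of documents in the corpus and df is the number of documents containing the term, which reproduces the 0.916 in the example. The function names idf and format_record are hypothetical.

```python
import math
import os


def idf(num_docs, doc_freq):
    """Natural-log idf with the denominator incremented by one."""
    return math.log(num_docs / (1 + doc_freq))


def format_record(term, postings, num_docs):
    """Build one index line: term,{doc-id,tf},idf.

    postings is a list of (file path, term frequency) pairs; the doc-id is
    the simple filename taken from each path.
    """
    fields = [term]
    for path, tf in postings:
        fields += [os.path.basename(path), str(tf)]
    fields.append(f"{idf(num_docs, len(postings)):.3f}")
    return ",".join(fields)


# Example from the instruction: 10 documents, cat in d4 (twice), d6 and d7.
record = format_record("cat", [("/home/me/mydocs/d4", 2), ("d6", 1), ("d7", 1)], 10)
print(record)  # cat,d4,2,d6,1,d7,1,0.916
```

Note that under this formula the idf becomes negative when a term appears in every document, since 1 + df then exceeds N.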
9. When used with the search parameter, your program will return a ranked list of documents (i.e. in decreasing order of cosine similarity) matching the query (as represented by the user-supplied query terms). There will be one line in your output for each returned document. The format of each line in your output must be (cosine-score rounded to 3 decimal places):
doc-id,cosine-score
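One possible way to produce this ranking is sketched below in Python. The weighting scheme is an assumption, since the instructions do not prescribe one: document weights are tf × idf, query weights are query-term-count × idf, and query terms absent from the index are ignored. The in-memory index structure and the function name rank are hypothetical, and query terms are assumed to have been stemmed already.

```python
import math
from collections import Counter, defaultdict


def rank(index, query_terms, num_docs):
    """Return the top num_docs lines in the form doc-id,cosine-score.

    index maps term -> (postings, idf), where postings maps doc-id -> tf.
    """
    # Squared length of every document vector, using tf * idf weights.
    doc_norm_sq = defaultdict(float)
    for postings, term_idf in index.values():
        for doc_id, tf in postings.items():
            doc_norm_sq[doc_id] += (tf * term_idf) ** 2

    dot = defaultdict(float)
    query_norm_sq = 0.0
    for term, count in Counter(query_terms).items():
        if term not in index:
            continue  # assumption: unknown query terms contribute nothing
        postings, term_idf = index[term]
        q_weight = count * term_idf
        query_norm_sq += q_weight ** 2
        for doc_id, tf in postings.items():
            dot[doc_id] += q_weight * tf * term_idf

    scores = []
    for doc_id, product in dot.items():
        denom = math.sqrt(doc_norm_sq[doc_id]) * math.sqrt(query_norm_sq)
        if denom > 0:
            scores.append((doc_id, product / denom))
    # Highest cosine first; ties broken by doc-id for a stable order.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return [f"{doc_id},{score:.3f}" for doc_id, score in scores[:num_docs]]
```

For a toy index in which cat has postings d1 (tf 2) and d2 (tf 1) with idf 1.0, and dog has posting d2 (tf 1) with idf 2.0, the query cat ranks d1 first with score 1.000 and d2 second with score 0.447.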
10. To submit the credit level assignment, follow the appropriate instructions in Section III.
II. INSTRUCTIONS FOR THE HIGH-DISTINCTION LEVEL ASSIGNMENT
(MAXIMUM MARK 100%)
1. Students may wish to gain further marks by extending the capability of their engine. To do so, you must first implement all of the instructions for the credit level. Remember to keep the high-distinction submission separate from the submission for the CREDIT level (refer to Section III).
2. Seek your tutor’s approval of how you wish to extend your program for extensions worthy of the HD grade (see footnote 1) by sending an email describing what additional capabilities you wish to add, with the following subject line:
Student-id-number FIT5166 HD extension
If your proposal is considered to be worthy of the HD grade should it be successfully implemented, your tutor will send you email approval. Only then may you implement the changes in your code.
3. Document the nature of your extensions, how they might improve the indexing and/or retrieval process, and provide instructions as to how to use your program.
4. To submit the high-distinction level assignment, follow the appropriate instructions in Section III.
III. ASSIGNMENT SUBMISSION INSTRUCTIONS AND DUE DATES
For assignment specifications, refer to documents provided separately.
Please follow these instructions exactly. Any amendments/clarifications will be posted on the unit website.
Plagiarism warning:
All assignment submissions will be put through plagiarism detection software, which automatically checks for their similarity with respect to other submissions in all years, and websites. Any plagiarism found will be dealt with under University procedures and may result in severe penalties.
Make sure you properly reference any code and resources that you submit but that were produced by other people.
Footnote 1: Generally, extensions will require modifications of both the indexing and searching components to achieve the HD grade.
The assignments are divided into 2 stages:
1. Assignment Stage 1 due on Moodle on Tuesday 18 September 2018 at 9am (Week 9)
Students are to upload a write-up of their plan on how to complete the assignment. The report should include the list of possible functionalities and test cases for each functionality.
This part of the assignment is not marked but is a hurdle requirement (must be
submitted for Assignment Stage 2 to be marked).
2. Assignment Stage 2 due on Moodle on Tuesday 9 October 2018 at 9am (Week 11)
Students are to submit the CREDIT level and, if applicable, the HIGH DISTINCTION level assignment with the following details.
Each submission is to include an Experiment Report of the tests that are conducted, the results of each test, and the analysis and conclusion drawn from the experiments.
Assignment Credit Level
1. Ensure that, before the due date/time, all your Java source code files and test data files are zipped into a file called FirstName-Surname-Assignment-Credit.zip.
2. Submit the Experiment Report (if not submitting HD level) and the zip file online on Moodle.
You will be required to attend an interview regarding your assignment submission.
Assignment High Distinction Level
1. Ensure that, before the due date/time, all your Java source code files and test data files are zipped into a file called FirstName-Surname-Assignment-HD.zip.
2. Submit the Experiment Report and the zip file online on Moodle.
You will be required to attend an interview regarding your assignment submission for your assignment to be marked.