COMP4650/6490 Document Analysis – Semester 2 / 2023
Assignment 1
Due 17:00 on Wednesday 16 August 2023 AEST (UTC +10)
Last updated July 28, 2023
Overview
In this assignment, your task is to implement a basic boolean and ranked information retrieval (IR)
system using a document collection, and then measure search performance based on predefined queries.
A document collection containing more than 30,000 government site descriptions is provided for this
assignment, along with a set of queries (in file gov/topics/gov.topics) and the expected returned
documents (in file gov/qrels/gov.qrels). The provided code implements most of an IR system.
Throughout this assignment you will make changes to the provided code to improve or complete existing
functions. Note that the provided code is designed to be simple to understand and modify; it is neither
efficient nor scalable. When developing a real-world IR system you would be better off using high-performance
software such as Apache Lucene [1].
Throughout this assignment:
1. You will develop a better understanding of indexing, including the tokeniser, parser, and normaliser
components, and how to improve the search performance given a predefined evaluation metric;
2. You will develop a better understanding of search algorithms, and how to obtain better search
results, and
3. You will find the best way to combine an indexer and search algorithm to maximise the performance
of an IR system.
Submission
You will produce an answers file with your responses to each question. Your answers file must be
a PDF file named u1234567.pdf where u1234567 should be replaced with your Uni ID.
The answers to this assignment (including your code files) have to be submitted online in Wattle.
You should submit a ZIP file containing all of the code files and your answers PDF file, BUT NO
DATA.
No late submission will be permitted without a pre-arranged extension. A mark of 0 will be
awarded if not submitted by the due date.
Marking
This assignment will be marked out of 100, and it will contribute 10% of your final course mark.
Your answers to coding questions will be marked based on the quality of your code (is it efficient, is it
readable, is it extendable, is it correct) and the solution in general (is it appropriate, is it reliable, does it
demonstrate a suitable level of understanding).
Your answers to discussion questions will be marked based on how convincing your explanations are (are
they sufficiently detailed, are they well-reasoned, are they backed by appropriate evidence, are they clear,
do they use appropriate aids such as tables and plots where necessary).
[1] https://lucene.apache.org/
This is an individual assignment. Group work is not permitted. Assignments will be checked for
similarities.
Question 1: Implement Boolean Queries (40%)
The construction of the inverted index is implemented in indexer.py. You should first run indexer.py to
store the following index data:
A dictionary (called index) mapping a token string to a sorted list of (doc_id, term_frequency)
tuples,
a dictionary doc_freq mapping a token string to its document frequency,
a dictionary (called doc_ids) mapping a doc_id to the path of the document, and
the number of documents in the collection (called num_docs).
Please double-check that you are using process_tokens_original within the function process_tokens
in string_processing.py before running indexer.py.
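For illustration only, the stored index data described above might look like the following (the token, document IDs, term frequencies, and counts are invented; only the path format matches the collection):

    # Illustrative shapes of the stored index data (all values invented).
    index = {
        "workbook": [(3, 2), (17, 1), (204, 5)],  # token -> sorted (doc_id, term_frequency) list
    }
    doc_freq = {
        "workbook": 3,  # token -> number of documents containing the token
    }
    doc_ids = {
        3: "./gov/documents/14/G00-14-2198849",  # doc_id -> relative path of the document
    }
    num_docs = 30000  # total number of documents in the collection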
Your task is to implement the ability to run boolean queries on the built index. The starting code is
provided in query_boolean.py. Specifically, you shall implement a simplified boolean query grammar.
You may assume that input queries consist of only AND and OR operators separated by single tokens. For
example, cat AND dog is a valid query while cat mask AND dog is not a valid query since cat mask is
not a single token. You are not required to implement NOT. The order of operations will be left to right
with no precedence for either of the operators, for example the query cat AND dog OR fish AND fox
should be done in the following order: ((cat AND dog) OR fish) AND fox. The brackets are provided
as an example; you can assume that the queries provided to your system will not contain brackets. To
score full marks on this question, your solution should implement O(n + m) sorted list intersection and
sorted list union algorithms – O(n + m) sorted list intersection was covered in the lectures, while union is
very similar – where n and m refer to the lengths of the two lists. Solutions using data structures such
as sets or dictionaries to implement the intersection and union operations will not score full marks.
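As a rough illustration of the merge-style approach (not the provided code; the function name is illustrative, and it assumes each postings list is a list of (doc_id, term_frequency) tuples sorted by doc_id, as built by indexer.py), an O(n + m) intersection advances one pointer per list:

    def intersect_sorted(postings_a, postings_b):
        # O(n + m) intersection of two postings lists sorted by doc_id.
        # Sketch only; the name and return format are illustrative.
        result = []
        i, j = 0, 0
        while i < len(postings_a) and j < len(postings_b):
            doc_a, doc_b = postings_a[i][0], postings_b[j][0]
            if doc_a == doc_b:
                result.append(postings_a[i])  # doc_id present in both lists
                i += 1
                j += 1
            elif doc_a < doc_b:
                i += 1  # advance the list with the smaller doc_id
            else:
                j += 1
        return result

Union follows the same two-pointer pattern, except that every doc_id seen in either list is kept and both pointers advance together when the ids match.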
Once you have completed your boolean query system, please run it on the following queries and list the
relative paths (e.g., ./gov/documents/14/G00-14-2198849) of the retrieved documents in your answers
PDF file.
HINT: none of the queries below give more than 10 results, and Query 0 has been done for you so that
you can check your system.
Query 0: Workbooks
Answer:
./gov/documents/14/G00-14-2198849
./gov/documents/36/G00-36-2337608
./gov/documents/50/G00-50-0062475
./gov/documents/69/G00-69-1400565
./gov/documents/69/G00-69-2624147
Query 1: Australasia OR Airbase
Query 2: Warm AND WELCOMING
Query 3: Global AND SPACE AND economies
Query 4: SCIENCE OR technology AND advancement AND PLATFORM
Query 5: Wireless OR Communication AND channels OR SENSORY AND INTELLIGENCE
Make sure you submit your code for this question (i.e., query_boolean.py) as well as your answers.
Question 2: Implement TF-IDF Cosine Similarity (30%)
Your task is to implement the get_doc_to_norm function and the run_query function in query_tfidf.py to
use TF-IDF and cosine similarity applied to TF-IDF vectors. In your solution both the query and the document
vectors should be TF-IDF vectors. Your implementation could be similar to the get_doc_to_norm and
run_query functions in query.py but should use TF-IDF instead of term frequency.
The TF-IDF variant you should implement is:

TF-IDF = tf × ln(N / (1 + df)),

where tf is the term frequency, df is the document frequency, and N is the total number of documents
in the collection. This is almost the standard TF-IDF variant, except that 1 is added to the document
frequency to avoid division by zero errors.
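As a minimal sketch of this weighting (the function name tfidf_weight is illustrative and not part of the provided code):

    import math

    def tfidf_weight(tf, df, num_docs):
        # tf * ln(num_docs / (1 + df)); adding 1 to df avoids division by
        # zero for terms that appear in no documents.
        return tf * math.log(num_docs / (1 + df))

For example, a term with tf = 3 appearing in df = 9 documents out of num_docs = 30000 would receive a weight of 3 × ln(30000 / 10) ≈ 24.0.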
Once you have implemented TF-IDF cosine similarity, run the query_tfidf.py file and record the top-5
retrieved documents as well as their similarity scores in your answers PDF file for the two queries below:
Query 1: Is nuclear power plant eco-friendly?
Query 2: How to stay safe during severe weather?
You should then run evaluate.py to evaluate the query results (for the queries in gov/topics/gov.topics)
against the ground truth, and record the evaluation results in your answers PDF file. Make sure you submit
your query_tfidf.py.
Question 3: Explore Linguistic Processing Techniques (30%)
For this question you will explore ways to improve the process_tokens function in string_processing.py.
The current function removes stopwords and lowercases the tokens. You should modify the function and
explore the results. To modify the function, you should make changes to the functions process_token_1,
process_token_2, and process_token_3 and then uncomment the one you want to test within the main
process_tokens function. You should pick at least three different modifications and evaluate them (you
can add new process_token functions if you want to evaluate more than three modifications). See the lectures
for some possible modifications. You might find the Python nltk library useful. The modifications
you make do not need to require significant coding; the focus of this question is on choosing reasonable
modifications and explaining the results.
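For instance, one modification of this kind could apply Porter stemming from nltk. This is a sketch only: whether it helps is exactly what you are asked to measure, and the exact signature should match the process_token_1 stub in the provided string_processing.py.

    from nltk.stem import PorterStemmer

    _stemmer = PorterStemmer()

    def process_token_1(token):
        # Example modification: lowercase the token, then apply Porter
        # stemming (e.g. "Economies" -> "economi"). Assumes the stub takes
        # and returns a single token string.
        return _stemmer.stem(token.lower())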
To evaluate each modification you made, you should
(1) run indexer.py to rebuild the index data, then
(2) run query.py (not query_tfidf.py), and
(3) run evaluate.py to evaluate the query results.
For each of the modifications you make you should describe in your answers:
What modifications you made.
Why you made them (in other words why you thought they might work).
What the new performance is.
Why you think the modification did or did not work, making sure to give (and explain) examples of
possible failure or success cases.
Finally, you should compare all the modifications by choosing one appropriate metric and deciding which
modification (or combination of modifications) performed the best. Your comparison should make use
of a table or chart as well as some discussion. Make sure to report all of this and your justification in
your answers, and to submit your string_processing.py showing each of the changes you made.