Assessed Coursework
Course Name: Information Retrieval H/M
Coursework Number: 1
Deadline: Time: 04:30 pm, Date: 15th February 2019
% Contribution to final course mark: 8%
Solo or Group: ✓ Solo ✓ Group
Anticipated Hours: 10 hours
Submission Instructions: See details in assignment.
Please Note: This Coursework cannot be Re-Done
Code of Assessment Rules for Coursework Submission
Deadlines for the submission of coursework which is to be formally assessed will be published in course
documentation, and work which is submitted later than the deadline will be subject to penalty as set out
below.
The primary grade and secondary band awarded for coursework which is submitted after the published
deadline will be calculated as follows:
(i) in respect of work submitted not more than five working days after the deadline
a. the work will be assessed in the usual way;
b. the primary grade and secondary band so determined will then be reduced by two secondary
bands for each working day (or part of a working day) the work was submitted late.
(ii) work submitted more than five working days after the deadline will be awarded Grade H.
Penalties for late submission of coursework will not be imposed if good cause is established for the late
submission. You should submit documents supporting good cause via MyCampus.
Penalty for non-adherence to Submission Instructions is 2 bands
You must complete an “Own Work” form via
https://studentltc.dcs.gla.ac.uk/ for all coursework
Version 2.0 – 16th January 2
Information Retrieval H/M
Exercise 1
January 2019
Introduction
The general objective of this exercise is to deploy an IR system and evaluate it on a medium-sized
Web dataset. Students will use the Terrier Information Retrieval platform (http://terrier.org) to
conduct their experiments. Terrier is a modular information retrieval platform, written in Java,
that supports experimentation with various test collections and retrieval techniques. This first
exercise involves installing and configuring Terrier to evaluate a number of retrieval models and
approaches. You will familiarise yourself with Terrier by deploying various retrieval approaches
and evaluating their impact on retrieval performance, and you will learn how to conduct an IR
experiment and how to analyse its results.
You will need to download the latest version of Terrier from http://terrier.org. We will provide a
sample of Web documents on which you will conduct your experiments (a signed user agreement is
required to access and use this sample). We will give a tutorial on Terrier in class, and you
can also use the Terrier Public Forum to ask questions.
Your work will be submitted through the Exercise 1 Quiz Instance available on Moodle. The Quiz
asks you various questions, which you should answer based on the experiments you have conducted.
Collection:
You will use a sample of a TREC Web test collection, of approx. 800k documents, with corresponding
topics & relevance assessments. Note that you will need to sign an agreement to access this
collection (See Moodle). The agreement needs to be signed by 16th January 2019, so that we can
open the directory the following day. You can find the document corpus and other resources in the
Unix directory /users/level4/software/IR/. This directory contains:
Dotgov_50pc/ (approx. 2.8GB) – the collection to index.
TopicsQrels/ - topics & qrels for three topic sets from TREC 2004: homepage, namedpage,
topic-distillation.
Resources/ - features & indices provided by us for Exercise 2; not used for Exercise 1.
Exercise Specification
There is little programming in this exercise but there are numerous experiments that need to be
conducted. In particular, you will conduct three tasks:
1. Index the provided Web collection using Terrier’s default indexing setup.
2. Implement a Simple TF*IDF weighting model (as described in Lecture 3), which you will have
to add to Terrier.
3. Evaluate and analyse the resulting system by performing the following experiments:
- Vary the weighting model: Simple TF*IDF vs. Terrier’s implemented TF*IDF vs. BM25 vs.
PL2.
- Apply a Query Expansion mechanism: retrieval with Query Expansion vs. retrieval without
Query Expansion.
These are too many experimental parameters to address all at once, hence you must follow the
prescribed activities given below. Once you conduct an activity, you should answer the
corresponding questions on the Exercise 1 Quiz instance. Ensure that you click the “Next Page”
button to save your answers on the Quiz instance.
Q1. Start by using Terrier’s default indexing setup: Porter Stemming & Stopwords removed. You will
need to index the collection, following the instructions in Terrier’s documentation. In addition, we
would like you to configure Terrier with the following additional property during indexing:
indexer.meta.reverse.keys=docno
Once you have indexed the collection, answer the Quiz questions asking you to enter the main
indexing statistics you obtained (number of tokens, size of files, time to index, etc.).
[1 mark]
Q2. Implement and add a new Simple TF*IDF class in Terrier following the instructions in Terrier’s
documentation. The Simple TF*IDF weighting model you are required to implement is the one described in
Lecture 3. Use the template class provided in the IRcourseHM project, available from the GitHub
repo (https://github.com/cmacdonald/IRcourseHM).
Paste your Simple TF*IDF Java method code when prompted by the Quiz instance. Then, answer the
corresponding questions by inspecting the retrieved results for the mentioned weighting models.
[5 marks]
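To make the idea of a per-term TF*IDF score concrete, the following is a minimal standalone sketch. It assumes a textbook form, tf * log2(N / df), which may differ from the definition given in Lecture 3, and it deliberately does not use Terrier's weighting model classes or the IRcourseHM template. The class name, method name and parameters are hypothetical; the lecture formula and the provided template remain the reference for your submission.

```java
/**
 * Standalone sketch of a textbook TF*IDF score.
 * Assumption: score = tf * log2(N / df). This is NOT Terrier's API and
 * may not match the Lecture 3 definition.
 */
final class SimpleTfIdfSketch {

    /**
     * Scores one query term in one document.
     *
     * @param tf      occurrences of the term in the document
     * @param df      number of documents containing the term
     * @param numDocs total number of documents in the collection
     */
    static double score(double tf, double df, double numDocs) {
        if (tf <= 0 || df <= 0) {
            return 0.0; // term absent from the document or the collection
        }
        // Assumed form: term frequency times log2 of inverse document frequency.
        return tf * (Math.log(numDocs / df) / Math.log(2.0));
    }

    public static void main(String[] args) {
        // Hypothetical statistics: 3 occurrences; 100 of 800,000 documents contain the term.
        System.out.printf("%.4f%n", score(3, 100, 800_000));
    }
}
```

A document's score for a multi-term query would then be the sum of these per-term scores over the query terms that occur in the document.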
Q3. Now you will experiment with all four weighting models (Simple TF*IDF, Terrier TF*IDF, BM25
and PL2) and analyse their results on 3 different topic sets, representing different Web retrieval
tasks: homepage finding (HP04), named page finding (NP04), and topic distillation (TD04). A
description of these topic sets and the underlying search tasks is provided on Moodle.
Provide the required MAP performances of each of the weighting models over the 3 topic sets.
Report your MAP performances to 4 decimal places. Also, provide the average MAP performance
of each weighting model across the three topic sets, when prompted by the Quiz instance.
[16 marks]
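The MAP values you report should come from your evaluation tool, but it can help to see how the measure is built. The sketch below computes Average Precision for one ranked list and then averages over queries. The class and method names are hypothetical, relevance is treated as binary, and the mean is taken over every query that has assessments; evaluation tools may follow slightly different conventions, so treat this as an aid to understanding rather than a replacement for the official evaluation.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of Average Precision (AP) and Mean Average Precision (MAP). */
final class MapSketch {

    /** AP of one ranked list of docnos against the set of relevant docnos. */
    static double averagePrecision(List<String> ranking, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranking.size(); i++) {
            if (relevant.contains(ranking.get(i))) {
                hits++;
                sum += (double) hits / (i + 1); // precision at this relevant document
            }
        }
        return sum / relevant.size(); // unretrieved relevant documents contribute 0
    }

    /** MAP: mean of AP over all queries present in the assessments. */
    static double meanAveragePrecision(Map<String, List<String>> runs,
                                       Map<String, Set<String>> qrels) {
        if (qrels.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (Map.Entry<String, Set<String>> e : qrels.entrySet()) {
            List<String> ranking = runs.getOrDefault(e.getKey(), List.of());
            total += averagePrecision(ranking, e.getValue());
        }
        return total / qrels.size();
    }

    public static void main(String[] args) {
        // Hypothetical query ids and docnos, for illustration only.
        Map<String, List<String>> runs = Map.of("q1", List.of("d1", "d2", "d3", "d4"));
        Map<String, Set<String>> qrels = Map.of("q1", Set.of("d1", "d3"));
        // Printed to 4 decimal places, as required when reporting MAP.
        System.out.printf("%.4f%n", meanAveragePrecision(runs, qrels));
    }
}
```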
Next, for each topic set (HP04, NP04, TD04), draw a single Recall-Precision graph showing the
performances for each of the 4 alternative weighting models (three Recall-Precision graphs in total).
Upload the resulting graphs into the Moodle instance when prompted. Then, answer the
corresponding question(s) on the Quiz instance.
[5 marks]
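A Recall-Precision graph is usually drawn from interpolated precision at the 11 standard recall levels (0.0, 0.1, ..., 1.0). The sketch below shows one common way to derive those points for a single query, taking the interpolated precision at a recall level to be the highest precision observed at any rank whose recall is at least that level. The class and method names are hypothetical, and your evaluation tool may already report these values, in which case you only need to plot them.

```java
import java.util.List;
import java.util.Set;

/** Sketch of 11-point interpolated precision for one query. */
final class RecallPrecisionSketch {

    /** Returns interpolated precision at recall 0.0, 0.1, ..., 1.0. */
    static double[] interpolatedPrecision(List<String> ranking, Set<String> relevant) {
        double[] points = new double[11];
        if (relevant.isEmpty()) {
            return points;
        }
        int n = ranking.size();
        double[] recall = new double[n];
        double[] precision = new double[n];
        int hits = 0;
        for (int i = 0; i < n; i++) {
            if (relevant.contains(ranking.get(i))) {
                hits++;
            }
            recall[i] = (double) hits / relevant.size();
            precision[i] = (double) hits / (i + 1);
        }
        for (int level = 0; level <= 10; level++) {
            double target = level / 10.0;
            double best = 0.0;
            for (int i = 0; i < n; i++) {
                // Small tolerance guards against floating-point rounding.
                if (recall[i] >= target - 1e-12 && precision[i] > best) {
                    best = precision[i];
                }
            }
            points[level] = best;
        }
        return points;
    }

    public static void main(String[] args) {
        // Hypothetical docnos, for illustration only.
        double[] p = interpolatedPrecision(List.of("d1", "d2", "d3", "d4"), Set.of("d1", "d3"));
        for (int level = 0; level <= 10; level++) {
            System.out.printf("%.1f\t%.4f%n", level / 10.0, p[level]);
        }
    }
}
```

Averaging these 11 values over all queries of a topic set gives one curve per weighting model, which can then be plotted with any graphing tool.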
Finally, state on the Quiz which weighting model is the most effective (in terms of Mean
Average Precision); you will use this model for the rest of Exercise 1. To identify it, simply
select the weighting model with the highest average MAP across the 3 topic sets.
[1 mark]
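The selection rule above (highest MAP averaged over the three topic sets) can be scripted in a few lines. In this sketch the model names and MAP values are placeholders invented for illustration, not real results, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: pick the model with the highest mean MAP across topic sets. */
final class BestModelSketch {

    static double mean(double[] values) {
        if (values.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Returns the key whose array of MAP values has the highest mean. */
    static String bestModel(Map<String, double[]> mapByModel) {
        String best = null;
        double bestMean = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : mapByModel.entrySet()) {
            double m = mean(e.getValue());
            if (m > bestMean) {
                bestMean = m;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Placeholder values in the order HP04, NP04, TD04; NOT real results.
        Map<String, double[]> scores = new LinkedHashMap<>();
        scores.put("ModelA", new double[] {0.10, 0.20, 0.30});
        scores.put("ModelB", new double[] {0.25, 0.15, 0.35});
        for (Map.Entry<String, double[]> e : scores.entrySet()) {
            System.out.printf("%s\t%.4f%n", e.getKey(), mean(e.getValue()));
        }
        System.out.println("Best: " + bestModel(scores));
    }
}
```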
Q4. You will now conduct the Query Expansion experiments using the weighting model that produces
the highest Mean Average Precision (MAP) across the 3 topic sets in Q3.
Query expansion has a few parameters, e.g. query expansion model, number of documents to
analyse, number of expansion terms – you should simply use the default query expansion settings of
Terrier: Bo1, 3 documents, 10 expansion terms.
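For intuition about what the default model does, Bo1 is commonly described as weighting each candidate expansion term by tfx * log2((1 + Pn) / Pn) + log2(1 + Pn), where tfx is the frequency of the term in the top-ranked documents and Pn is the term's collection frequency divided by the number of documents in the collection. The sketch below is a standalone illustration of that formula with hypothetical class, method and parameter names; it is not Terrier's own implementation, and you do not need to implement query expansion yourself for this exercise.

```java
/**
 * Standalone sketch of the Bo1 expansion-term weight.
 * Assumption: w(t) = tfx * log2((1 + Pn) / Pn) + log2(1 + Pn), Pn = F / N.
 */
final class Bo1Sketch {

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /**
     * @param tfx                 frequency of the term in the top-ranked documents
     * @param collectionFrequency frequency of the term in the whole collection (must be > 0)
     * @param numDocs             number of documents in the collection
     */
    static double weight(double tfx, double collectionFrequency, double numDocs) {
        double pn = collectionFrequency / numDocs;
        return tfx * log2((1.0 + pn) / pn) + log2(1.0 + pn);
    }

    public static void main(String[] args) {
        // Hypothetical statistics: the term occurs 5 times in the top 3 documents
        // and 1,000 times in a collection of 800,000 documents.
        System.out.printf("%.4f%n", weight(5, 1_000, 800_000));
    }
}
```

With the default settings, the candidate terms from the 3 top-ranked documents would be ranked by this weight and the 10 highest-weighted terms added to the query.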
First, run the best weighting model you identified in Q3 with Query Expansion on the homepage
finding (HP04), named page finding (NP04), and topic distillation (TD04) topic sets. Report the
obtained MAP performances in the Quiz instance. Report your MAP performances to 4 decimal
places.
[6 marks]
Next, for each topic set (HP04, NP04, TD04) draw a single Recall-Precision graph comparing the
performances of your system with and without the application of Query Expansion. Upload these
graphs into the Quiz instance.
[3 marks]
Now, for each topic set (HP04, NP04, TD04), draw a separate query-by-query histogram comparing
the performance of your system with and without query expansion (three histograms in total).
Since MAP is a mean over queries, the per-query quantity to plot is Average Precision (AP). Each
histogram should show two bars for each query of the topic set: one bar for the AP of the system
on that query with query expansion, and one bar for the AP of the system on that query without
query expansion.
Using these histograms and their corresponding data, you should now be able to answer the
corresponding questions of the Quiz instance.
[6 marks]
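Before plotting, it helps to line up the two per-query scores side by side. The sketch below pairs per-query values from a run without and a run with query expansion, and counts how many queries improved. The class and method names are hypothetical, the query ids and scores in `main` are placeholders rather than real results, and a query missing from the expanded run is treated as scoring 0, which is an assumption you may want to change.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: pair per-query scores with and without query expansion (QE). */
final class PerQueryComparisonSketch {

    /** Returns query id -> {score without QE, score with QE}, sorted by query id. */
    static Map<String, double[]> pair(Map<String, Double> withoutQe,
                                      Map<String, Double> withQe) {
        Map<String, double[]> rows = new TreeMap<>();
        for (Map.Entry<String, Double> e : withoutQe.entrySet()) {
            // Assumption: a query missing from the QE run scores 0.
            double qe = withQe.getOrDefault(e.getKey(), 0.0);
            rows.put(e.getKey(), new double[] {e.getValue(), qe});
        }
        return rows;
    }

    /** Counts queries whose score is strictly higher with QE. */
    static int countImproved(Map<String, double[]> rows) {
        int improved = 0;
        for (double[] row : rows.values()) {
            if (row[1] > row[0]) {
                improved++;
            }
        }
        return improved;
    }

    public static void main(String[] args) {
        // Placeholder query ids and scores; NOT real results.
        Map<String, Double> base = Map.of("q1", 0.50, "q2", 0.20);
        Map<String, Double> qe = Map.of("q1", 0.40, "q2", 0.60);
        Map<String, double[]> rows = pair(base, qe);
        for (Map.Entry<String, double[]> e : rows.entrySet()) {
            System.out.printf("%s\t%.4f\t%.4f%n", e.getKey(), e.getValue()[0], e.getValue()[1]);
        }
        System.out.println("Improved: " + countImproved(rows));
    }
}
```

The tab-separated rows printed by `main` can be fed directly into a plotting tool to produce the two bars per query.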
Finally, answer the final analysis questions and complete your Quiz submission.
[7 marks]
Hand-in Instructions: All your answers to Exercise 1 must be submitted on the Exercise 1 Quiz
instance, which will be available on Moodle. This exercise is worth 50 marks and 8% of the final
course grade.
NB 1: You can (and should) complete the answers to the quiz over several iterations.
However, please ensure that you save your intermediate work on the Quiz instance by clicking
the “Next Page” button every time you change anything on a given page of the quiz and
want it to be saved.
NB 2: To save yourself a lot of time, you are encouraged to write scripts for collecting and
managing your experimental data (and for ensuring that you don’t mix up your results), as well
as for producing graphs with any of the many existing tools.