
Assessed Coursework

Course Name: Information Retrieval H/M
Coursework Number: 1
Deadline: 04:30 pm, 15th February 2019
% Contribution to final course mark: 8%
Solo or Group: ✓ Solo   ✓ Group
Anticipated Hours: 10 hours
Submission Instructions: See details in assignment.

Please Note: This Coursework cannot be Re-Done

Code of Assessment Rules for Coursework Submission

Deadlines for the submission of coursework which is to be formally assessed will be published in course documentation, and work which is submitted later than the deadline will be subject to penalty as set out below.

The primary grade and secondary band awarded for coursework which is submitted after the published deadline will be calculated as follows:

(i) in respect of work submitted not more than five working days after the deadline
    a. the work will be assessed in the usual way;
    b. the primary grade and secondary band so determined will then be reduced by two secondary bands for each working day (or part of a working day) the work was submitted late.
(ii) work submitted more than five working days after the deadline will be awarded Grade H.

Penalties for late submission of coursework will not be imposed if good cause is established for the late submission. You should submit documents supporting good cause via MyCampus.

Penalty for non-adherence to Submission Instructions is 2 bands.

You must complete an “Own Work” form via https://studentltc.dcs.gla.ac.uk/ for all coursework.


Information Retrieval H/M

Exercise 1

January 2019

Introduction

The general objective of this exercise is to deploy an IR system and evaluate it on a medium-sized Web dataset. Students will use the Terrier Information Retrieval platform (http://terrier.org) to conduct their experiments. Terrier is a modular information retrieval platform, written in Java, which allows experimentation with various test collections and retrieval techniques. This first exercise involves installing and configuring Terrier to evaluate a number of retrieval models and approaches. In this exercise, you will familiarise yourself with Terrier by deploying various retrieval approaches and evaluating their impact on retrieval performance, as well as learning how to conduct an experiment in IR and how to analyse results.

You will need to download the latest version of Terrier from http://terrier.org. We will provide a sample of Web documents (a signed user agreement to access and use this sample will be required), on which you will conduct your experiments. We will give a tutorial on Terrier in class, and you can also use the Terrier Public Forum to ask questions.

Your work will be submitted through the Exercise 1 Quiz instance available on Moodle. The Quiz asks you various questions, which you should answer based on the experiments you have conducted.

Collection:

You will use a sample of a TREC Web test collection of approx. 800k documents, with corresponding topics & relevance assessments. Note that you will need to sign an agreement to access this collection (see Moodle). The agreement needs to be signed by 16th January 2019, so that we can open the directory the following day. You can find the document corpus and other resources in the Unix directory /users/level4/software/IR/. This directory contains:

Dotgov_50pc/ (approx. 2.8GB) – the collection to index.
TopicsQrels/ – topics & qrels for three topic sets from TREC 2004: homepage, namedpage, topic-distillation.
Resources/ – features & indices provided by us for Exercise 2; not used for Exercise 1.

Exercise Specification

There is little programming in this exercise, but there are numerous experiments that need to be conducted. In particular, you will conduct three tasks:

1. Index the provided Web collection using Terrier’s default indexing setup.
2. Implement a Simple TF*IDF weighting model (as described in Lecture 3), which you will have to add to Terrier.
3. Evaluate and analyse the resulting system by performing the following experiments:
   - Vary the weighting model: Simple TF.IDF vs. Terrier’s implemented TF.IDF vs. BM25 vs. PL2.
   - Apply a Query Expansion mechanism: use of a Query Expansion mechanism vs. non-use of Query Expansion.

There are too many experimental parameters to address all at once, so you must follow the prescribed activities given below. Once you have conducted an activity, you should answer the corresponding questions on the Exercise 1 Quiz instance. Ensure that you click the “Next Page” button to save your answers on the Quiz instance.

Q1. Start by using Terrier’s default indexing setup: Porter stemming & stopwords removed. You will need to index the collection, following the instructions in Terrier’s documentation. In addition, we would like you to configure Terrier with the following additional property during indexing:

indexer.meta.reverse.keys=docno

Once you have indexed the collection, answer the Quiz questions asking you to enter your main obtained indexing statistics (number of tokens, size of files, time to index, etc.).

[1 mark]
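
For reference, a minimal sketch of the relevant lines in etc/terrier.properties is shown below. The termpipelines value is Terrier’s documented default (stopword removal followed by Porter stemming), so it is listed only to make the setup explicit; your collection.spec file is assumed to list the Dotgov_50pc files to be indexed.

    # default term pipeline: stopword removal followed by Porter stemming
    termpipelines=Stopwords,PorterStemmer

    # additional property required for this exercise
    indexer.meta.reverse.keys=docno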

Q2. Implement and add a new Simple TF*IDF class in Terrier, following the instructions in Terrier’s documentation. The Simple TF*IDF weighting model you are required to implement is highlighted in Lecture 3. Use the template class provided in the IRcourseHM project, available from the GitHub repo (https://github.com/cmacdonald/IRcourseHM).
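
For orientation only, a minimal sketch of what such a class can look like is given below. It assumes the Simple TF*IDF formula tf * log2(N/df) and relies on the score(tf, docLength) and getInfo() methods and the protected statistics fields of Terrier’s org.terrier.matching.models.WeightingModel base class; the class name is a placeholder, and the exact formula (including whether the query-term frequency is used) must follow Lecture 3 and the provided template class.

    package org.terrier.matching.models;

    /**
     * Sketch of a Simple TF*IDF weighting model for Terrier.
     * Assumed formula: score(t, d) = tf * log2(N / df), where N is the number
     * of documents in the collection and df is the document frequency of t.
     */
    public class SimpleTFIDF extends WeightingModel {

        private static final long serialVersionUID = 1L;

        public SimpleTFIDF() {
            super();
        }

        /** Name reported by Terrier in run settings and result files. */
        @Override
        public String getInfo() {
            return "SimpleTFIDF";
        }

        /**
         * Scores a single (term, document) pair.
         * @param tf        frequency of the term in the document
         * @param docLength length of the document (unused by this simple model)
         */
        @Override
        public double score(double tf, double docLength) {
            // numberOfDocuments and documentFrequency are protected fields of
            // WeightingModel, populated by Terrier before scoring starts.
            double idf = Math.log(numberOfDocuments / documentFrequency) / Math.log(2.0);
            // keyFrequency is the frequency of the term in the query; drop it if
            // the formula in Lecture 3 does not include a query-term component.
            return keyFrequency * tf * idf;
        }
    }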

Paste your Simple TF*IDF Java method code when prompted by the Quiz instance. Then, answer the corresponding questions by inspecting the retrieved results for the mentioned weighting models.

[5 marks]

Q3. Now you will experiment with all four weighting models (Simple TF*IDF, Terrier TF*IDF, BM25 and PL2) and analyse their results on 3 different topic sets, representing different Web retrieval tasks: homepage finding (HP04), named page finding (NP04), and topic distillation (TD04). A description of these topic sets and the underlying search tasks is provided on Moodle.

Provide the required MAP performances of each of the weighting models over the 3 topic sets. Report your MAP performances to 4 decimal places. Also, provide the average MAP performance of each weighting model across the three topic sets, when prompted by the Quiz instance.

[16 marks]

Next, for each topic set (HP04, NP04, TD04), draw a single Recall-Precision graph showing the performances for each of the 4 alternative weighting models (three Recall-Precision graphs in total). Upload the resulting graphs into the Moodle instance when prompted. Then, answer the corresponding question(s) on the Quiz instance.

[5 marks]
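
One possible way to obtain the data points for these graphs is sketched below. It assumes each run has been evaluated with trec_eval (version 9.x), whose standard output reports the 11-point interpolated precision values as lines of the form “iprec_at_recall_0.10   all   0.2345”; the default file name is a placeholder, and the parsing should be adapted if you use Terrier’s own evaluation tool instead.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Map;
    import java.util.TreeMap;

    /**
     * Extracts the 11-point interpolated recall-precision values from a
     * trec_eval output file and prints them as CSV, ready for plotting.
     * Relevant lines look like: "iprec_at_recall_0.10   all   0.2345".
     */
    public class RecallPrecisionPoints {

        public static void main(String[] args) throws IOException {
            // Hypothetical file name; pass the real one as the first argument.
            String evalFile = args.length > 0 ? args[0] : "bm25_hp04.eval";

            Map<Double, Double> points = new TreeMap<>();
            for (String line : Files.readAllLines(Paths.get(evalFile))) {
                String[] fields = line.trim().split("\\s+");
                if (fields.length == 3 && fields[0].startsWith("iprec_at_recall_")) {
                    double recall = Double.parseDouble(
                            fields[0].substring("iprec_at_recall_".length()));
                    points.put(recall, Double.parseDouble(fields[2]));
                }
            }

            System.out.println("recall,precision");
            points.forEach((recall, precision) ->
                    System.out.printf("%.2f,%.4f%n", recall, precision));
        }
    }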

Finally, you should now indicate on the Quiz the most effective weighting model (in terms of Mean Average Precision), which you will use for the rest of Exercise 1. To identify this model, simply select the weighting model with the highest average MAP across the 3 topic sets.

[1 mark]

Q4. You will now conduct the Query Expansion experiments using the weighting model that produced the highest Mean Average Precision (MAP) across the 3 topic sets in Q3.

Query expansion has a few parameters, e.g. the query expansion model, the number of documents to analyse, and the number of expansion terms – you should simply use Terrier’s default query expansion settings: Bo1, 3 documents, 10 expansion terms.
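
For reference, the two Terrier properties that govern the number of feedback documents and expansion terms are shown below; the values are the documented defaults, so you should not normally need to set them yourself.

    # number of top-ranked documents examined for expansion terms (default)
    expansion.documents=3
    # number of terms added to the expanded query (default)
    expansion.terms=10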

First, run the best weighting model you identified in Q3 with Query Expansion on the homepage finding (HP04), named page finding (NP04), and topic distillation (TD04) topic sets. Report the obtained MAP performances in the Quiz instance. Report your MAP performances to 4 decimal places.

[6 marks]


Next, for each topic set (HP04, NP04, TD04), draw a single Recall-Precision graph comparing the performances of your system with and without the application of Query Expansion. Upload these graphs into the Quiz instance.

[3 marks]

Now, for each topic set (HP04, NP04, TD04), draw a separate query-by-query histogram comparing the MAP performance of your system with and without query expansion (three histograms to produce in total). Each histogram should show two bars for each query of the topic set: one bar corresponding to the MAP performance of the system on that given query with query expansion, and one bar corresponding to the MAP performance of the system on that given query without query expansion.

Using these histograms and their corresponding data, you should now be able to answer the corresponding questions of the Quiz instance.

[6 marks]

Finally, answer the final analysis questions and complete your Quiz submission.

[7 marks]

Hand-in Instructions: All your answers to Exercise 1 must be submitted on the Exercise 1 Quiz instance, which will be available on Moodle. This exercise is worth 50 marks and 8% of the final course grade.

NB 1: You can (and should) complete the answers to the Quiz over several iterations. However, please ensure that you save your intermediate work on the Quiz instance by clicking the “Next Page” button every time you make a change on a given page of the quiz that you want to be saved.

NB 2: To save yourself a lot of time, you are encouraged to write scripts for collecting and managing your experimental data (and to ensure that you don’t mix up your results), as well as for producing the graphs using any of the many existing tools.
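
As one example of such a script, the sketch below pairs up the per-query average precision values needed for the Q4 histograms. It assumes two evaluation files produced with trec_eval -q (one for the run with query expansion, one without), in which per-query values appear as lines of the form “map   <query-id>   0.3210”; the file names are placeholders.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.LinkedHashMap;
    import java.util.Map;

    /**
     * Pairs the per-query average precision values of two trec_eval -q outputs
     * (e.g. the best model with and without query expansion) and prints a CSV
     * that can be fed into any charting tool to draw the Q4 histograms.
     */
    public class PerQueryComparison {

        /** Reads "map <query-id> <value>" lines, skipping the "all" summary line. */
        static Map<String, Double> perQueryAP(String evalFile) throws IOException {
            Map<String, Double> ap = new LinkedHashMap<>();
            for (String line : Files.readAllLines(Paths.get(evalFile))) {
                String[] f = line.trim().split("\\s+");
                if (f.length == 3 && f[0].equals("map") && !f[1].equals("all")) {
                    ap.put(f[1], Double.parseDouble(f[2]));
                }
            }
            return ap;
        }

        public static void main(String[] args) throws IOException {
            // Hypothetical file names; adapt them to however you name your runs.
            Map<String, Double> noQE = perQueryAP("best_hp04.eval");
            Map<String, Double> withQE = perQueryAP("best_hp04_qe.eval");

            System.out.println("query,ap_without_qe,ap_with_qe");
            for (Map.Entry<String, Double> e : noQE.entrySet()) {
                double qe = withQE.getOrDefault(e.getKey(), 0.0);
                System.out.printf("%s,%.4f,%.4f%n", e.getKey(), e.getValue(), qe);
            }
        }
    }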

