Information Retrieval Assignment
Nigel Ward, UTEP
CS 4390/5319
In this assignment you will gain experience with information retrieval.
A. Write a concise description, using equations, of the behavior of baseir.pl
B. Test baseir.pl on some of the queries below
C. Extend baseir.pl to do at least 4 of the following:
- convert the query terms to lower case
- strip punctuation out of the query terms
- allow quoted phrases in the query to specify exact matches
- use the cosine similarity metric
- use term frequency damping
- use inverse document frequency
- normalize for document length
- use a stoplist
- allow users to specify weights for their query terms
- include an option to "retrieve more documents like this one"
- some other modification (check with the instructor or TA before starting)
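
As an illustration of how several of the extensions above fit together, here is a minimal sketch that combines lower-casing, punctuation stripping, a stoplist, term frequency damping, inverse document frequency, and cosine similarity (which also normalizes for document length). It is written in Python purely for illustration and would need to be ported to Perl; it does not describe what baseir.pl actually does. All function names (tokenize, build_index, cosine, search), the 1 + log(tf) damping, and the log(N/df) weighting are assumptions chosen for the example, not requirements of the assignment.

```python
import math
import re
from collections import Counter


def tokenize(text, stoplist=frozenset()):
    """Lower-case the text, strip punctuation, and drop stoplist words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in stoplist]


def build_index(docs, stoplist=frozenset()):
    """Return (vectors, idf): one weight vector per document, plus idf values.

    docs maps a document name to its text. Each weight is a damped term
    frequency, 1 + log(tf), multiplied by idf = log(N / df).
    """
    counts = {name: Counter(tokenize(text, stoplist))
              for name, text in docs.items()}
    n_docs = len(counts)
    df = Counter()
    for c in counts.values():
        df.update(c.keys())  # each document counts once per term
    idf = {t: math.log(n_docs / df[t]) for t in df}
    vectors = {name: {t: (1 + math.log(f)) * idf[t] for t, f in c.items()}
               for name, c in counts.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, vectors, idf, stoplist=frozenset()):
    """Return (score, name) pairs with a nonzero score, best match first."""
    q_counts = Counter(tokenize(query, stoplist))
    # Query terms that never occur in the corpus are ignored.
    q_vec = {t: (1 + math.log(f)) * idf[t]
             for t, f in q_counts.items() if t in idf}
    scored = [(cosine(q_vec, d_vec), name) for name, d_vec in vectors.items()]
    return sorted((pair for pair in scored if pair[0] > 0), reverse=True)


if __name__ == "__main__":
    # Toy documents invented for this example; not the course corpus.
    docs = {
        "a": "Graphics and animation course",
        "b": "Ethics in computing",
        "c": "The programming assignment is due",
    }
    stop = frozenset({"the", "and", "in", "is"})
    vectors, idf = build_index(docs, stop)
    print(search("Graphics!", vectors, idf, stop))
```

Note that a term occurring in every document gets an idf of zero under this weighting, so it contributes nothing to the score; whether that is desirable is a design decision worth discussing in your write-up.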
baseir.pl, test data, and a stoplist are available in
/share/classes/nigelward/ir-asst/
and /share/classes/nigelward/ir-corpus/
on the Unix workstations.
They may also be downloaded from
http://www.cs.utep.edu/nigel/nlp/
On April 8th, class time will be devoted to this assignment; the TA
will be available in the lab to help you with Perl syntax problems.
However, it would be wise to do most of the work before class.
Due 15:01, April 10th. You may help classmates, but each person should do
the assignment him/herself. Hand in
- the concise description of baseir.pl (part A)
- a floppy containing your final perl code
- a printout of the perl code, with your extensions noted
- the output of your program on the simple queries below
- other output demonstrating the abilities and advantages of your system
- a brief discussion of what went wrong for simple queries where the algorithm performed poorly
- a brief discussion of how your code was or could be extended to handle the difficult queries
Simple Queries
- Graphics
- karen
- fun
- ethics
- programming assignment
- assignments involving programming
Difficult Queries
- graphics, animation
- late penalty (if the document only includes "late penalties")
- profressional (if the document only includes "professional")
- Human Computer Interaction (if the document only includes "Human-Computer Interaction")
- HCI (if the document only includes "Human Computer Interaction")
- are there a lot of student activities? (if the document only includes "there are many student activities")
- are there a lot of student activities? (if the document only includes "ACM Student Chapter")
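
Some of the difficult queries above come down to surface mismatches between query and document. The following sketch, again in Python for illustration only and not taken from baseir.pl, shows one possible way to treat hyphens and other punctuation as spaces, strip simple plurals, build an acronym from a phrase, and find a close spelling in a vocabulary. The helper names (crude_stem, normalize, acronym, closest_term) are hypothetical, and the plural rule is deliberately crude; a real stemmer would do better. It does nothing for the last two queries, which need paraphrase or synonym handling.

```python
import difflib
import re


def crude_stem(word):
    """Strip simple English plurals: 'penalties' -> 'penalty', 'activities' -> 'activity'."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def normalize(text):
    """Lower-case, treat hyphens and other punctuation as spaces, then stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(w) for w in words]


def acronym(words):
    """Join the first letters of a word list: human computer interaction -> hci."""
    return "".join(w[0] for w in words)


def closest_term(word, vocabulary):
    """Return the vocabulary term most similar to word, or None if nothing is close."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else None


if __name__ == "__main__":
    print(normalize("late penalties") == normalize("late penalty"))
    print(normalize("Human-Computer Interaction"))
    print(acronym(normalize("Human Computer Interaction")))
    print(closest_term("profressional", ["professional", "program"]))
```

Applying the same normalization to both the query and the documents matters more than the exact rules: "graphics" becomes "graphic" on both sides, so the match is preserved even though the stem is not a dictionary word.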
Note: this assignment is also at
http://www.cs.utep.edu/nigel/nlp/ir/