Information Retrieval Assignment

Nigel Ward, UTEP
CS 4390/5319

In this assignment you will gain experience with information retrieval.

A. write a concise description, using equations, of the behavior of baseir.pl

B. test baseir.pl on some of the queries below

C. extend baseir.pl to do at least 4 of the following:

  1. convert the query terms to lower case
  2. strip punctuation out of the query terms
  3. allow quoted phrases in the query to specify exact matches
  4. use the cosine similarity metric
  5. use term frequency damping
  6. use inverse document frequency
  7. normalize for document length
  8. use a stoplist
  9. allow users to specify weights for their query terms
  10. include an option to "retrieve more documents like this one"
  11. some other modification (check with the instructor or TA before starting)
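
Items 4 through 7 are often combined into a single term-weighting scheme. The following is a minimal sketch in Python (not Perl, and not taken from baseir.pl) of one common combination, assuming whitespace tokenization, damped term frequency 1 + ln(tf), inverse document frequency ln(N/df), and cosine similarity, which also normalizes for document length. The function names (tf_weight, idf, vectorize, cosine, rank) and the toy documents are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def tf_weight(count):
    """Damped term frequency: 1 + ln(count), or 0 for absent terms."""
    return 1.0 + math.log(count) if count > 0 else 0.0


def idf(term, doc_counts):
    """Inverse document frequency ln(N / df); 0 if the term occurs nowhere."""
    df = sum(1 for counts in doc_counts if term in counts)
    return math.log(len(doc_counts) / df) if df else 0.0


def vectorize(counts, doc_counts):
    """Map raw term counts to a sparse tf-idf vector (dict: term -> weight)."""
    return {t: tf_weight(c) * idf(t, doc_counts) for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    doc_counts = [Counter(d.lower().split()) for d in docs]
    q_vec = vectorize(Counter(query.lower().split()), doc_counts)
    scores = [(cosine(q_vec, vectorize(c, doc_counts)), i)
              for i, c in enumerate(doc_counts)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Toy corpus, invented for illustration only (not the assignment's test data).
docs = [
    "programming assignment late penalty",
    "graphics and animation course",
    "ethics course for programming students",
]
ranking = rank("programming assignment", docs)
for score, index in ranking:
    print(f"{score:.3f}  {docs[index]}")
```

In this toy corpus the first document ranks highest because it contains both query terms, and "assignment" counts for more than "programming" because it occurs in fewer documents. Other damping functions and idf variants are equally reasonable choices.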

baseir.pl, test data, and a stoplist are available in /share/classes/nigelward/ir-asst/ and /share/classes/nigelward/ir-corpus/ on the Unix workstations. They may also be downloaded from http://www.cs.utep.edu/nigel/nlp/

On April 8th, class time will be devoted to this assignment; the TA will be available in the lab to help you with Perl syntax problems. However, it would be wise to do most of the work before class.

Due 15:01, April 10th. You may help classmates, but each person should do the assignment him/herself. Hand in

Simple Queries

  1. Graphics
  2. karen
  3. fun
  4. ethics
  5. programming assignment
  6. assignments involving programming

Difficult Queries

  1. graphics, animation
  2. late penalty (if the document only includes "late penalties")
  3. profressional (if the document only includes "professional")
  4. Human Computer Interaction (if the document only includes "Human-Computer Interaction")
  5. HCI (if the document only includes "Human Computer Interaction")
  6. are there a lot of student activities? (if the document only includes "there are many student activities")
  7. are there a lot of student activities? (if the document only includes "ACM Student Chapter")
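
Some of the difficult queries fail under exact string matching for purely surface reasons such as case, punctuation, and function words. The following is a minimal sketch in Python (not Perl, and not taken from baseir.pl), assuming that hyphens and other punctuation are treated as word separators. The names normalize, overlap, and STOPLIST are hypothetical, and the tiny stoplist is made up for illustration; it is not the stoplist distributed with the assignment.

```python
import re

# Tiny illustrative stoplist; the one provided with the assignment may differ.
STOPLIST = {"a", "are", "of", "the", "there"}


def normalize(text, stoplist=STOPLIST):
    """Lowercase, treat punctuation as separators, and drop stoplist words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stoplist]


def overlap(query, document):
    """Set of normalized terms shared by the query and the document."""
    return set(normalize(query)) & set(normalize(document))


# Hyphenation and case no longer block the match.
print(overlap("Human Computer Interaction", "Human-Computer Interaction"))

# Morphology and spelling still do: "penalty" vs. "penalties", "profressional".
print(overlap("late penalty", "late penalties"))
print(overlap("profressional", "professional"))

# Stoplist words are removed from the query before matching.
print(normalize("are there a lot of student activities?"))
```

Normalization of this kind addresses only the hyphenation case. Plurals, misspellings, acronyms, and paraphrases would need further techniques, such as stemming, approximate string matching, or query expansion.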

Note: this assignment is also at http://www.cs.utep.edu/nigel/nlp/ir/