Introduction
Welcome to this new blog post. In this post, we discuss a project requirement titled "Building an Information Retrieval System: Implementation, Evaluation, and Analysis". The project aims to develop an Information Retrieval system capable of indexing documents, performing queries, and generating ranked lists of documents, using techniques such as the Vector Space Model, BM25, and language models.
We'll first walk through the project requirements, highlighting the tasks at hand. Then, in the solution approach section, we'll explain how we tackled them and the techniques we applied.
Let's get started!
Project Requirement
In this project, you will apply the knowledge acquired during the first two courses of this module to implement a simple Information Retrieval system able to index a collection of documents, perform queries over it, and generate an output in the form of a ranked list of documents. As part of this project, you will also conduct a first evaluation of this system in order to assess its performance.
More specifically, the project will include the implementation of the following components:
Indexing.
Search and ranking. You need to implement three different retrieval models: Vector Space Model, BM25, and one Language Model of your choice.
Evaluation. You will build a pipeline for evaluating your system and compare the three different models that you have implemented. To generate the results of your experimental evaluation, you should use the TREC evaluation script ( trec_eval 9.0.7.tar.gz ) that you can find here. A description of the metrics computed by this script can be found here. Please follow the instructions in the README file to compile and run the script.
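To make the indexing and BM25 ranking requirements above more concrete, here is a minimal, self-contained sketch of an in-memory inverted index with BM25 scoring. It is only an illustration of the technique, not a prescribed implementation: the class name, the toy tokenizer, and the k1 and b values are all our own choices.

import math
from collections import Counter, defaultdict

def tokenize(text):
    # The collection is already whitespace-tokenized, so lowercasing and
    # splitting on whitespace is enough for this toy example.
    return text.lower().split()

class BM25Index:
    def __init__(self, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.postings = defaultdict(dict)   # term -> {doc_id: term frequency}
        self.doc_len = {}                   # doc_id -> number of tokens

    def add_document(self, doc_id, text):
        tokens = tokenize(text)
        self.doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query, top_k=1000):
        n_docs = len(self.doc_len)
        avg_len = sum(self.doc_len.values()) / n_docs
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Usage sketch with placeholder documents
index = BM25Index()
index.add_document("DOC1", "talk radio focuses on true facts")
index.add_document("DOC2", "broadcasting executives meeting in Los Angeles")
print(index.search("talk radio"))

The Vector Space Model and a language model can reuse the same index by swapping the scoring formula inside search().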
Data
In this project, you will be working with a subset of the CLEF ad-hoc robust task collection. This collection contains 303 topics and 29,459 documents extracted from newspapers.
The collection is covered by copyright; this means that, in order to download it, you need to sign the CLEF 2009 End User Agreement and return it to annalina.caputo@dcu.ie. You can find the agreement at this link.
It is important to point out that by this agreement the "ENDUSER is not permitted to reproduce the Language Resources for commercial or distribution purposes and to commercialize (or distribute for free) in any form or by any means the Language Resources or any derivative product or services based on all or a substantial part of it."
Here, ENDUSER means you, the student, and Language Resources means the collection (documents and queries).
The collection consists of the following files:
topics.zip is a zipped folder containing all 303 topic files.
A topic represents a query. Each topic is an XML file containing the following fields:
QUERYID contains the query identifier to be used when generating the result file.
TITLE contains the title of the query; this is the field normally used in TREC-like evaluations to generate the query.
DESC provides a longer description of the information need. It is sometimes used for generating the query. You are allowed to use this field in your evaluation.
NARR provides a description, intended for the human annotators, of what is considered relevant or not for evaluation purposes. This field is normally used when creating the relevance judgments and MUST NOT be used when creating a query.
collection.zip is a compressed folder containing all the documents in this collection.
Documents are in XML format structured as follows:
<?xml version="1.0" ?>
<DOC>
<DOCID>LA101494-0176</DOCID>
<HEADLINE> FCC CHIEF WANTS TALK RADIO SHOWS TO DEAL IN ' TRUE FACTS '</HEADLINE>
<TEXT> Federal Communications Commission Chairman Reed Hundt called_on the nation ' s broadcasters Thursday to conduct a more responsible " electronic public discussion " on talk radio and to improve minority hiring
policies . In his speech to members of the National Assn . of Broadcasters , who were meeting in Los Angeles for their annual convention , Hundt urged broadcasting executives to ensure that talk radio focuses_on " true facts " and avoid sensationalism and " engendering skepticism and disbelief. " " One-third of all talk radio listeners say they listen_in order </TEXT>
</DOC>
The DOCID contains the document identifier to be used when generating the results file. HEADLINE represents the headline of the news item, while TEXT contains its body. Please notice that each token (punctuation included) in the HEADLINE and TEXT fields is separated by a whitespace.
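Given this structure, the documents can be read with Python's standard ElementTree module. Below is a minimal sketch under the assumption that each file holds a single DOC element; the function name and the example path are illustrative only.

import xml.etree.ElementTree as ET

def read_document(path):
    # Parse one collection file and return its identifier and searchable text.
    root = ET.parse(path).getroot()                     # the DOC element
    doc_id = root.findtext("DOCID", default="").strip()
    headline = root.findtext("HEADLINE", default="").strip()
    text = root.findtext("TEXT", default="").strip()
    # HEADLINE and TEXT are already whitespace-tokenized, so later stages
    # can recover the tokens with a plain split().
    return doc_id, headline + " " + text

# Usage sketch (the path is a placeholder):
# doc_id, text = read_document("collection/LA101494-0176.xml")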
train_qrel.zip contains the qrel file to be given as input to the trec_eval script to generate the evaluation metric results.
The file contains the relevance judgments in the trec_eval format:
query_id <space> track_id <space> document_id <space> relevance_degree
Notice that track_id is an identifier of your experiment; you can use any string here, and it is irrelevant for the purpose of the evaluation.
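For the evaluation pipeline it can also be handy to load these judgments into memory; here is a small sketch based on the format above, with a placeholder file name.

from collections import defaultdict

def load_qrels(path):
    # Map each query_id to {document_id: relevance_degree}.
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            query_id, _track_id, document_id, relevance = line.split()
            qrels[query_id][document_id] = int(relevance)
    return qrels

# Usage sketch (adjust the path to wherever train_qrel.zip was extracted):
# qrels = load_qrels("train_qrel/qrels.txt")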
Programming Language & Code
You can use either Python or Java for this project. Along with your code, you need to provide a Readme file explaining all the steps required to run your program. This will include any dependencies from external libraries.
Ideally, you should use either a virtual environment with the associated requirements.txt file (Python) or maven with the associated pom.xml file (Java) to manage the dependencies of your project.
Failure to report any steps required to run the project, or any external dependency, will result in 0 marks for the execution part of this project (see the marking scheme below). Your project will be evaluated in terms of its overall architecture, the choices taken and the rationale behind them, and also your code, which should be well organised and documented.
Report
Along with the code of your project, you should submit a report documenting your activity. The report should be written using the ACM sigconf template, available in either format:
LaTeX
Word
The report length should be between 6 and 10 pages (pages beyond the limit will not be considered and the corresponding marks will be lost; reports shorter than 6 pages will also be penalised).
The report must contain a link to the repository of your code.
The report documents your project. You should describe the general architecture of your system and provide the motivations behind the specific choices you made when implementing its different components. When these choices were supported by scientific publications, you should provide references and include them in the Bibliography section.
Your report should contain at least the following sections:
Abstract. A quick overview of the content of the report.
Introduction. An introduction to the problem and an overview of the architecture of your system.
Indexing. A section describing the process of indexing the collection. Here you can include specific choices that you have made in terms of:
document analysis and pre-processing
indexing construction
data structures
Search and ranking. A section describing the search and ranking component of your system. Here you can provide details about how you implemented the retrieval and ranking of documents, as well as specific choices and motivation behind data structures.
Evaluation. In this section, you should provide details about how you specifically tackled the CLEF collection described above, in terms of the document structure, query creation, pre-processing, etc. This section must provide a table with the evaluation results in terms of MAP, P5, and NDCG. You must also provide a discussion of the results.
Conclusions. This section provides an overview of the main findings of your project: what worked well and what did not, what you would change and how this work can be extended in the future.
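For the evaluation section above, trec_eval is run on a qrel file and a results file in the standard TREC run format (query_id Q0 document_id rank score tag). As a rough illustration, the compiled script can be driven from Python via subprocess; every path below is a placeholder.

import subprocess

TREC_EVAL = "./trec_eval-9.0.7/trec_eval"   # wherever the script was compiled
QRELS = "train_qrel/qrels.txt"              # relevance judgments
RUN = "runs/bm25_run.txt"                   # ranked results produced by the system

# Report MAP, P@5 and NDCG for the run, matching the metrics required in the report.
for metric in ("map", "P.5", "ndcg"):
    completed = subprocess.run([TREC_EVAL, "-m", metric, QRELS, RUN],
                               capture_output=True, text=True, check=True)
    print(completed.stdout.strip())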
Solution Approach
In this project, we employed several methods and techniques to implement an Information Retrieval (IR) system capable of indexing documents, performing queries, and generating ranked lists of documents. Here's a breakdown of the approach we took:
Vector Space Model (VSM) Implementation: For the Vector Space Model, we used Python with libraries such as Pandas for data manipulation and scikit-learn for vectorization and cosine similarity calculations. The core script vec_space_model_complete_UPDATE.py covers the entire process from data loading to result generation; a condensed sketch of these steps is shown after the list below. Here's an overview of the steps:
Data Loading: We loaded query and document data from CSV files using Pandas.
Text Preprocessing: Preprocessing steps included removing punctuation and converting text to lowercase to normalize the data.
Vectorization: We employed the TF-IDF vectorization technique from scikit-learn to convert text data into numerical vectors.
Cosine Similarity: Using the vectorized representations, we calculated cosine similarity between query and document vectors to rank the documents.
Output Generation: Finally, we generated an output file containing query-document pairs along with their relevance scores.
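The steps above can be condensed into a short sketch. The CSV layout, column names, and file names here are illustrative rather than the exact ones used in vec_space_model_complete_UPDATE.py.

import string

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def preprocess(text):
    # Normalization as described above: lowercase and strip punctuation.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

# Illustrative CSV layout: one row per item with 'id' and 'text' columns.
docs = pd.read_csv("documents.csv")
queries = pd.read_csv("queries.csv")

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs["text"].map(preprocess))
query_vectors = vectorizer.transform(queries["text"].map(preprocess))

# similarity[i, j] = cosine similarity between query i and document j.
similarity = cosine_similarity(query_vectors, doc_vectors)

rows = []
for i, query_id in enumerate(queries["id"]):
    top = similarity[i].argsort()[::-1][:1000]          # best 1000 documents
    rows.extend((query_id, docs["id"].iloc[j], rank, similarity[i, j])
                for rank, j in enumerate(top, start=1))

pd.DataFrame(rows, columns=["query_id", "doc_id", "rank", "score"]).to_csv(
    "vsm_results.csv", index=False)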
XML Parsing for Data Extraction: Another essential aspect of our project was extracting data from the XML files containing topic queries and document information. We developed a Python script, xml_parse.py, for this purpose (a minimal sketch follows the list below). Here's what it does:
XML Parsing: Using the ElementTree library, the script parses XML files to extract query and document metadata.
Data Structuring: Extracted data is structured into Pandas DataFrames for further processing and analysis.
Data Export: The parsed data is exported to CSV files for ease of access and integration with other components of the system.
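Here is a minimal sketch of this kind of parsing script, assuming the topic files carry the QUERYID, TITLE, and DESC fields described earlier; the folder location and output names are placeholders.

import glob
import xml.etree.ElementTree as ET

import pandas as pd

rows = []
for path in glob.glob("topics/*.xml"):                  # placeholder location
    root = ET.parse(path).getroot()
    # Collect every element's text by tag so the exact nesting does not matter.
    fields = {el.tag: (el.text or "").strip() for el in root.iter()}
    rows.append({
        "id": fields.get("QUERYID", ""),
        "text": fields.get("TITLE", ""),
        "description": fields.get("DESC", ""),
    })

pd.DataFrame(rows).to_csv("queries.csv", index=False)   # consumed by the VSM sketch above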
System Integration and Data Preprocessing: Lastly, we integrated the parsed data with our main processing pipeline using Python scripts such as resit_main.py. This script brings together the query, document, and relevance judgment data for the overall system evaluation (a short sketch of this glue code follows the list below). Key steps included:
Loading query and document data.
Converting relevance judgment data to a standardized format.
Exporting integrated data for further analysis and evaluation.
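To give a flavour of this glue code, here is a hedged sketch that turns the ranked CSV produced by the retrieval step into a results file in the standard TREC run format accepted by trec_eval; the file names and the run tag are placeholders.

import pandas as pd

# Ranked output of the retrieval step (columns as in the earlier sketch).
results = pd.read_csv("vsm_results.csv")

with open("vsm_run.txt", "w") as out:
    for row in results.itertuples(index=False):
        # Standard TREC run format: query_id Q0 document_id rank score tag
        out.write(f"{row.query_id} Q0 {row.doc_id} {row.rank} {row.score:.4f} VSM\n")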
If you require any assistance with the project discussed in this blog, or if you find yourself in need of similar support for other projects, please don't hesitate to reach out to us. Our team can be contacted at any time via email at contact@codersarts.com.