Building a PDF-Based Question-Answering System Using BERT and GPT

Aug 23, 2024

In the realm of natural language processing (NLP), the ability to extract meaningful information from large documents is a powerful tool. This blog will walk you through the creation of a Python-based application that reads a PDF file, embeds its sentences with BERT, and answers user queries by passing the most relevant sentences to OpenAI's GPT model. We'll dig into the provided code, explaining each component's role in building this system.

Introduction

The objective of this project is to create a command-line application that can take a PDF file as input, process its text, and allow users to ask questions about the content. The application utilizes a combination of BERT for sentence embeddings and GPT (via OpenAI) for generating answers to user queries.

What You Will Learn

By following this tutorial, you will understand:

How to extract text from a PDF file using PyPDF2.
How to preprocess text using BERT to generate embeddings.
How to handle user queries and find relevant text within the PDF using cosine similarity.
How to generate responses to queries using OpenAI's GPT model.

Prerequisites

Before starting, ensure you have the following installed:

PyPDF2 (pip install PyPDF2)
PyTorch (pip install torch)
Transformers (pip install transformers)
OpenAI (pip install openai)
spaCy (pip install spacy), plus the English model the script loads (python -m spacy download en_core_web_sm)
NLTK (pip install nltk)
Scikit-learn (pip install scikit-learn)

Additionally, you will need an OpenAI API key to use the GPT model.
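
Before running anything, it can help to confirm that the packages import. The following is a small sketch, not part of the original script, that uses only the standard library; the helper name missing_modules is made up for illustration.

```python
import importlib.util

# Import names differ from some pip package names (e.g. scikit-learn -> sklearn).
REQUIRED_MODULES = ["PyPDF2", "torch", "transformers", "openai",
                    "spacy", "nltk", "sklearn", "numpy"]

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

This only checks that the modules can be located; it does not verify versions or that the spaCy model has been downloaded.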

Understanding the Provided Code

Let’s walk through the provided code, breaking it down into its core components.

1. Importing Required Libraries

import argparse
import PyPDF2
import openai
import nltk
import torch
import spacy
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from openapi import OPEN_API_KEY

The script begins by importing necessary libraries:

argparse: For handling command-line arguments.
PyPDF2: For extracting text from PDF files.
openai: To interact with OpenAI’s API for GPT.
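nltk, spacy: For natural language processing tasks. spaCy performs the sentence segmentation used later; NLTK's punkt tokenizer data is downloaded during setup.
torch, transformers: For working with BERT, including tokenization and embedding generation.
scikit-learn: Specifically for computing cosine similarity between text embeddings.
numpy: For stacking the embeddings and sorting the similarity scores.
openapi: Presumably a local module (openapi.py) that defines OPEN_API_KEY, rather than a package installed with pip.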

2. Setting Up NLP Tools and API Key

try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
nlp = spacy.load("en_core_web_sm")
openai.api_key = OPEN_API_KEY

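NLTK and spaCy Initialization: The script downloads NLTK's punkt tokenizer data if it is missing and loads spaCy's small English model (en_core_web_sm). Only the spaCy pipeline is used later, for sentence segmentation; nothing else in the script calls NLTK.
API Key: The OpenAI API key is imported from the openapi module as OPEN_API_KEY and assigned to openai.api_key. Keep that file out of version control, or have it read the key from an environment variable.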
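
One way to satisfy the import without hard-coding the key is to have openapi.py read it from the environment. This is a sketch under that assumption; the variable name OPENAI_API_KEY and the helper load_api_key are illustrative choices, not something the original code defines.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from an environment variable, failing loudly if unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key

# In openapi.py, the script's import could then be satisfied by:
# OPEN_API_KEY = load_api_key()
```

Failing at startup with a clear message is usually easier to debug than an authentication error on the first query.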

3. Suppressing Warnings

import logging
from transformers import logging as hf_logging

logging.basicConfig(level=logging.ERROR)
hf_logging.set_verbosity_error()

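These lines set the log level to ERROR for Python's root logger and for the transformers library, suppressing the warnings that would otherwise be printed while the BERT model loads and keeping the output clean during execution.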

4. Loading BERT Model and Tokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

Here, we load the pre-trained bert-base-uncased tokenizer and model from the Hugging Face Transformers library; the weights are downloaded and cached the first time the script runs:

Tokenizer: Converts text into tokens that the BERT model can process.
Model: The BERT model generates embeddings for the text, which are later used to compute similarities.
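
To make the tokenizer's role concrete, here is a toy version of the greedy longest-match-first splitting that WordPiece-style tokenizers such as BERT's use. The vocabulary is invented for the example and is far smaller than the real one; the actual BertTokenizer also handles punctuation, casing, and special tokens, which this sketch skips.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into sub-word pieces using greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical miniature vocabulary, for illustration only.
TOY_VOCAB = {"embed", "##ding", "##s", "token", "##ize", "pdf"}

print(wordpiece_tokenize("embeddings", TOY_VOCAB))  # ['embed', '##ding', '##s']
print(wordpiece_tokenize("tokenize", TOY_VOCAB))    # ['token', '##ize']
print(wordpiece_tokenize("xyz", TOY_VOCAB))         # ['[UNK]']
```

Because rare words are broken into several pieces, a sentence usually produces more tokens than it has words, which matters for the 512-token limit used later.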

5. Processing the PDF File

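def process_pdf(file_path):
    # Open in binary mode; the context manager closes the file when done
    with open(file_path, 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page in read_pdf.pages:
            # Guard against pages with no extractable text, and keep a
            # line break between pages so words do not run together
            text += (page.extract_text() or "") + "\n"
    return text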

This function handles the extraction of text from a PDF file:

Reading the PDF: Opens the PDF and reads its content page by page.
Text Extraction: Extracts the text from each page and concatenates it into a single string. PyPDF2 only reads a PDF's text layer, so scanned pages without one yield no text.
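
Because extraction quality varies by page, it can be useful to see which pages produced no text. The sketch below is an illustration rather than part of the original script: it works on any objects with an extract_text() method (PyPDF2 page objects have one), and FakePage is a stand-in invented so the example runs without a PDF.

```python
def collect_page_text(pages):
    """Join the text of all pages and report which ones came back empty."""
    parts, empty_pages = [], []
    for number, page in enumerate(pages, start=1):
        page_text = (page.extract_text() or "").strip()
        if page_text:
            parts.append(page_text)
        else:
            empty_pages.append(number)  # likely a scanned or image-only page
    return "\n".join(parts), empty_pages

class FakePage:
    """Hypothetical stand-in for a PDF page, used only for this demo."""
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text

text, empty = collect_page_text(
    [FakePage("First page."), FakePage(""), FakePage("Third page.")]
)
print(repr(text))  # 'First page.\nThird page.'
print(empty)       # [2]
```

With a real file, the same function could be called as collect_page_text(PyPDF2.PdfReader(pdf_file).pages).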

6. Preprocessing Text and Generating Embeddings

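def preprocess_text(text):
    # Split the text into sentences with spaCy, dropping whitespace-only ones
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
    embeddings = []
    for sent in sentences:
        inputs = tokenizer(sent, truncation=True, max_length=512, return_tensors='pt')
        with torch.no_grad():  # inference only, no gradients needed
            outputs = model(**inputs)
        # Mean-pool the token vectors into one (1, 768) sentence vector
        embeddings.append(outputs.last_hidden_state.mean(dim=1).numpy())
    # Stack into a single (num_sentences, 768) array
    embeddings = np.vstack(embeddings)
    return embeddings, sentences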

This function processes the extracted text:

Sentence Segmentation: The text is split into sentences using SpaCy.
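Embedding Generation: Each sentence is tokenized (truncated to BERT's 512-token limit) and passed through BERT. The token vectors in the last hidden state are then averaged to give one 768-dimensional vector per sentence.
Embedding Storage: The embeddings are collected in a list and then converted into a NumPy array with one row per sentence.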
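
The pooling step, outputs.last_hidden_state.mean(dim=1), averages the per-token vectors into one vector per sentence. In the script that average also covers the [CLS] and [SEP] positions, since no tokens are masked out. The sketch below reproduces the arithmetic on plain Python lists with made-up 3-dimensional vectors (real BERT vectors have 768 dimensions), so it runs without loading the model.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    count = len(token_vectors)
    # zip(*...) groups the values of each dimension across all tokens.
    return [sum(dim_values) / count for dim_values in zip(*token_vectors)]

# Made-up vectors for a 3-token sentence.
tokens = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
]
print(mean_pool(tokens))  # [2.0, 2.0, 2.0]
```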

7. Handling Direct Queries with GPT

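def handle_direct_query(query, text):
    max_tokens = 800                        # tokens reserved for the answer
    max_prompt_tokens = 4097 - max_tokens   # remaining budget for the prompt
    # BERT's WordPiece counts only approximate GPT's own tokenization;
    # special tokens are left out so they do not leak into the decoded prompt
    text_tokens = tokenizer.encode(text, add_special_tokens=False)
    query_tokens = tokenizer.encode(query, add_special_tokens=False)
    total_token_length = len(text_tokens) + len(query_tokens)

    # Truncate the context (not the query) if the prompt would be too long
    if total_token_length > max_prompt_tokens:
        text_tokens = text_tokens[:max_prompt_tokens - len(query_tokens)]
        text = tokenizer.decode(text_tokens)

    # Legacy Completions call: requires openai<1.0, and text-davinci-003
    # has since been retired by OpenAI
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=f"Using the Input Text: {text}\n\n Please answer the Question: {query}\nA:",
        temperature=0.9,
        max_tokens=max_tokens
    )
    return response.choices[0].text.strip()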

This function handles queries by interacting with the GPT model:

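Token Length Management: Reserves 800 tokens for the answer and keeps the text and query within the remaining 3,297 tokens of the model's 4,097-token window, truncating the context text when necessary. The count comes from BERT's tokenizer, which only approximates GPT's own tokenization, and text that has been truncated and decoded comes back lowercased because bert-base-uncased is an uncased model.
GPT Completion: Sends a prompt to OpenAI's GPT-3 (text-davinci-003) through the legacy Completions endpoint and returns the generated answer. This call requires a pre-1.0 release of the openai package, and OpenAI has since retired text-davinci-003, so running the script today means switching to a currently available model.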
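
The budget arithmetic can be checked in isolation. This sketch uses plain lists of integers as stand-in token IDs; the function name fit_to_budget is illustrative, and the 4097 and 800 figures are the ones used in handle_direct_query.

```python
def fit_to_budget(text_tokens, query_tokens, context_window=4097, max_answer_tokens=800):
    """Trim text_tokens so that text plus query fits in the prompt budget."""
    max_prompt_tokens = context_window - max_answer_tokens
    # Never let the bound go negative, even for an oversized query.
    room_for_text = max(max_prompt_tokens - len(query_tokens), 0)
    return text_tokens[:room_for_text]

# Stand-in token IDs: a 5000-token context and a 12-token query.
text_tokens = list(range(5000))
query_tokens = list(range(12))

trimmed = fit_to_budget(text_tokens, query_tokens)
print(4097 - 800)                        # 3297 tokens available for the prompt
print(len(trimmed))                      # 3285 tokens kept from the context
print(len(trimmed) + len(query_tokens))  # 3297
```

The max(..., 0) guard is an addition for the edge case of a query longer than the whole budget, where a negative slice bound would otherwise trim from the end of the list instead of returning nothing.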

8. Finding Similar Sentences

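def find_similar_sentences(query, sentences, embeddings, N=90):
    # Embed the query the same way as the PDF sentences
    inputs = tokenizer(query, truncation=True, max_length=512, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    query_embedding = outputs.last_hidden_state.mean(dim=1).numpy()
    # Cosine similarity between the query and every sentence
    similarities = cosine_similarity(query_embedding, embeddings)[0]
    # argsort is ascending, so reverse it to put the most similar sentences
    # first; any later truncation then drops the least relevant ones
    top_indices = np.argsort(similarities)[-N:][::-1]
    top_sentences = [sentences[i] for i in top_indices]
    return " ".join(top_sentences)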

This function finds sentences in the PDF text that are most similar to the user's query:

Query Embedding: The user's query is converted into an embedding.
Cosine Similarity: The similarity between the query and each sentence in the PDF is calculated.
Top Sentences: The N most similar sentences (90 by default) are selected and joined into a single context string for the query.
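
The retrieval logic can be tried on a tiny example without BERT. This is a standard-library sketch with made-up 2-dimensional vectors; the function names are illustrative, and the script itself relies on scikit-learn's cosine_similarity and NumPy's argsort instead.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_n_sentences(query_vec, sentence_vecs, sentences, n=2):
    """Return the n sentences most similar to the query, best match first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(sentence_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentences[idx] for _, idx in scored[:n]]

sentences = ["cats purr", "dogs bark", "stocks fell"]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query = [1.0, 0.1]

print(top_n_sentences(query, vectors, sentences))  # ['cats purr', 'dogs bark']
```

Returning the best match first matters once the context is truncated to fit the prompt, because truncation removes text from the end.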

9. Main Function and Command-Line Interface

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("file", help="Path to the PDF file")
    args = parser.parse_args()

    text = process_pdf(args.file)
    embeddings, sentences = preprocess_text(text)

    while True:
        query = input("Enter your query: ")
        similar_sentences = find_similar_sentences(query, sentences, embeddings)
    
        answer = handle_direct_query(query, similar_sentences)
        print(answer)

if __name__ == "__main__":
    main()

The main function orchestrates the application's workflow:

Argument Parsing: The PDF file path is taken as a command-line argument.
Text Processing: The text is extracted from the PDF and preprocessed to generate embeddings.
Interactive Query Handling: The application enters a loop in which, for each query, it retrieves the most similar sentences, passes them to GPT as context, and prints the answer. The loop has no exit condition, so stop the program with Ctrl+C.
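
A loop that exits cleanly is a small change. The sketch below is one possible variant, not the original code: run_query_loop and its read and write parameters are invented here so the loop can be exercised with scripted input instead of a terminal and a live model.

```python
def run_query_loop(answer_fn, read=input, write=print):
    """Ask for queries until the user types 'quit' or 'exit', or input ends."""
    while True:
        try:
            query = read("Enter your query (or 'quit' to exit): ").strip()
        except (EOFError, KeyboardInterrupt):
            break
        if query.lower() in ("quit", "exit"):
            break
        if not query:
            continue  # ignore empty lines instead of calling the model
        write(answer_fn(query))

# Demo with scripted input instead of a real terminal and model.
scripted = iter(["What is BERT?", "", "quit"])
outputs = []
run_query_loop(
    lambda q: f"(answer to: {q})",
    read=lambda prompt: next(scripted),
    write=outputs.append,
)
print(outputs)  # ['(answer to: What is BERT?)']
```

In the script, answer_fn would wrap the two existing calls: find_similar_sentences to build the context, then handle_direct_query to get the answer.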

Running the Application

To run the application, save the provided code as chatpdf.py, next to an openapi.py file that defines OPEN_API_KEY. Then, use the command line to execute the script:

python chatpdf.py <path_to_pdf_file>

Replace <path_to_pdf_file> with the actual path to your PDF document. You will then be able to input queries and receive answers based on the content of the PDF.
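
If the script fails at the OpenAI call, the installed openai package is probably version 1.0 or later, where openai.Completion was removed. The sketch below shows one possible port, under two assumptions that are not from the original code: that a completions-capable model such as gpt-3.5-turbo-instruct is available to your account, and that you pass in a client created with openai.OpenAI(). The helper names build_prompt and ask_model are illustrative.

```python
def build_prompt(text, query):
    """Assemble the same prompt template the script uses."""
    return f"Using the Input Text: {text}\n\n Please answer the Question: {query}\nA:"

def ask_model(client, text, query, model="gpt-3.5-turbo-instruct", max_tokens=800):
    """Send the prompt through a v1-style client and return the answer text.

    `client` is assumed to be an openai.OpenAI() instance (openai>=1.0).
    The model name is an assumption; substitute one your account can use.
    """
    response = client.completions.create(
        model=model,
        prompt=build_prompt(text, query),
        temperature=0.9,
        max_tokens=max_tokens,
    )
    return response.choices[0].text.strip()

# Usage (requires network access and an API key):
# from openai import OpenAI
# client = OpenAI(api_key=OPEN_API_KEY)
# print(ask_model(client, similar_sentences, query))
```

Passing the client in as an argument keeps the function easy to test with a stub and avoids relying on module-level state.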

Complete Code

import argparse
import PyPDF2
import openai
import nltk
import torch
import spacy
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from openapi import OPEN_API_KEY

# Download the NLTK tokenizer data if missing, load spaCy, and set the API key
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
nlp = spacy.load("en_core_web_sm")
openai.api_key = OPEN_API_KEY

# Suppress the warnings emitted while loading the embedding model
import logging
from transformers import logging as hf_logging

logging.basicConfig(level=logging.ERROR)
hf_logging.set_verbosity_error()

# Note: other variations were tried, i.e. Word2Vec embeddings and a smaller
# transformer for sentence embeddings. Those performed well only on type 1
# questions; type 2 and type 3 questions did not get appropriate answers,
# so the script switched to BERT embeddings.

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

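def process_pdf(file_path):
    # Use PyPDF2 to extract text from the PDF; the context manager
    # closes the file when done
    with open(file_path, 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page in read_pdf.pages:
            # Guard against pages with no extractable text, and keep a
            # line break between pages so words do not run together
            text += (page.extract_text() or "") + "\n"
    return text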

def preprocess_text(text):
    # Use spaCy to split the text into sentences, dropping whitespace-only ones
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
    embeddings = []
    for sent in sentences:
        inputs = tokenizer(sent, truncation=True, max_length=512, return_tensors='pt')
        with torch.no_grad():  # inference only, no gradients needed
            outputs = model(**inputs)
        # Mean-pool the token vectors into one (1, 768) sentence vector
        embeddings.append(outputs.last_hidden_state.mean(dim=1).numpy())
    # Stack into a single (num_sentences, 768) array
    embeddings = np.vstack(embeddings)
    return embeddings, sentences

def handle_direct_query(query, text):
    # Tokens reserved for the generated answer
    max_tokens = 800

    # Compute the maximum allowable tokens for the prompt
    max_prompt_tokens = 4097 - max_tokens

    # Approximate the token length of the text and query with BERT's tokenizer,
    # leaving out [CLS]/[SEP] so they do not end up in the decoded prompt
    text_tokens = tokenizer.encode(text, add_special_tokens=False)
    query_tokens = tokenizer.encode(query, add_special_tokens=False)

    # Compute the total token length
    total_token_length = len(text_tokens) + len(query_tokens)

    # Truncate the text tokens if necessary
    if total_token_length > max_prompt_tokens:
        text_tokens = text_tokens[:max_prompt_tokens - len(query_tokens)]
        text = tokenizer.decode(text_tokens)

    # Handle direct queries (legacy Completions API, requires openai<1.0;
    # text-davinci-003 has since been retired by OpenAI)
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=f"Using the Input Text: {text}\n\n Please answer the Question: {query}\nA:",
        temperature=0.9,
        max_tokens=max_tokens
    )
    return response.choices[0].text.strip()

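# Function to find the top N sentences most similar to the query
def find_similar_sentences(query, sentences, embeddings, N=90):
    # Embed the query the same way as the PDF sentences
    inputs = tokenizer(query, truncation=True, max_length=512, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    query_embedding = outputs.last_hidden_state.mean(dim=1).numpy()
    # Cosine similarity between the query and every sentence
    similarities = cosine_similarity(query_embedding, embeddings)[0]
    # argsort is ascending, so reverse it to put the most similar sentences
    # first; any later truncation then drops the least relevant ones
    top_indices = np.argsort(similarities)[-N:][::-1]
    top_sentences = [sentences[i] for i in top_indices]
    return " ".join(top_sentences)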

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("file", help="Path to the PDF file")
    args = parser.parse_args()

    text = process_pdf(args.file)
    embeddings, sentences = preprocess_text(text)

    while True:
        query = input("Enter your query: ")
        similar_sentences = find_similar_sentences(query, sentences, embeddings)
    
        answer = handle_direct_query(query, similar_sentences)
        print(answer)

if __name__ == "__main__":
    main()

Conclusion

This project shows how different NLP tools, BERT for embeddings and GPT for language generation, can be combined into a question-answering system. By following this guide, you now have a command-line application that can extract text from a PDF, embed its sentences, and interactively answer questions about its content.
