Extractive summarization is a natural language processing (NLP) technique where the most important sentences or phrases are extracted from a text to create a concise summary. This blog will guide you through building an extractive text summarization model using BERT, a state-of-the-art NLP model. We will walk through the provided code in a Jupyter Notebook (Bert_Extractive.ipynb) and explain each component in detail.
Introduction
The goal of this project is to develop an extractive summarization model using BERT, which will take a large text input and extract the most relevant sentences to create a summary. We will use a pre-trained BERT model from the bert-extractive-summarizer library to achieve this. The code also demonstrates how to preprocess the data, generate summaries, and save the results to a CSV file.
What You Will Learn
By the end of this tutorial, you will have learned:
How to preprocess text data for summarization.
How to use BERT for extractive summarization.
How to work with JSON data in Python.
How to save the generated summaries to a CSV file.
Prerequisites
Before you begin, ensure you have the following installed:
Jupyter Notebook
Python with the following packages: bert-extractive-summarizer, sentencepiece, and pandas (the csv module ships with Python's standard library, so it needs no installation)
You can install the required packages using the following commands:
pip install bert-extractive-summarizer
pip install sentencepiece
pip install pandas
Understanding the Provided Code
Let’s go through the provided code step by step.
1. Importing Required Libraries
import warnings
warnings.filterwarnings("ignore")
!pip install bert-extractive-summarizer
!pip install sentencepiece
import pandas as pd
import os
Warnings Filter: The code begins by suppressing any warnings to keep the output clean and focused.
Installing Packages: The !pip install commands ensure that the bert-extractive-summarizer and sentencepiece libraries are installed. These libraries are essential for using BERT for summarization and handling tokenization.
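If you prefer installing from inside the notebook, the %pip magic available in modern Jupyter installs into the same environment the kernel is running in, which avoids a common environment mismatch that plain !pip can cause:
# %pip targets the active kernel's environment
%pip install bert-extractive-summarizer sentencepiece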
Importing Pandas: pandas is imported for handling data in a tabular format. The os library is imported but not used directly in the provided code.
2. Loading and Preparing the Data
data = pd.read_json('dev-stats.jsonl', lines=True, nrows=3000)
data.head()
Loading Data: The dataset is loaded from a JSONL file (dev-stats.jsonl) using the pd.read_json function. The lines=True parameter specifies that each line in the file is a separate JSON object. The nrows=3000 argument limits the data to the first 3,000 rows.
Displaying Data: data.head() is used to display the first few rows of the dataset to inspect the structure.
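Because the next step selects columns by position, it is worth confirming the shape and column order of the loaded DataFrame first:
print(data.shape)    # e.g. (3000, n_columns)
print(data.columns)  # confirm which positions hold the text and summary fields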
sub_data = pd.DataFrame(columns=['text', 'summary'])
sub_data['text'] = data.iloc[:, 2] + ' ' + data.iloc[:, 4]
sub_data['summary'] = data.iloc[:, 5]
sub_data.head()
Creating a Subset: A new DataFrame, sub_data, is created to hold the texts and their reference summaries. The text is built by concatenating the third and fifth columns of the original dataset (selected by position with iloc), while the summary is taken from the sixth column.
Inspecting the Subset: sub_data.head() displays the first few rows of the sub_data DataFrame to ensure the data has been processed correctly.
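Positional indexing with iloc breaks silently if the column order of the file ever changes. When the column names are known, selecting by name is more robust. Here is a hypothetical sketch assuming the JSONL records use keys named title, body, and summary; check data.columns for the real names in your file:
# Hypothetical field names -- replace with the actual columns in dev-stats.jsonl
sub_data = pd.DataFrame()
sub_data['text'] = data['title'] + ' ' + data['body']
sub_data['summary'] = data['summary']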
3. Summarizing the Text with BERT
from summarizer import Summarizer
model = Summarizer()
Importing and Initializing the Model: The Summarizer class from the bert-extractive-summarizer library is imported, and an instance of the BERT summarizer model is created.
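By default, Summarizer() loads a large pre-trained BERT. The library also accepts a custom Hugging Face model and tokenizer, which is useful if you want a smaller or domain-specific backbone. A sketch following the custom-model pattern from the library's documentation, swapping in DistilBERT to trade some quality for speed (the rest of this tutorial uses the default model above):
from transformers import AutoConfig, AutoTokenizer, AutoModel
from summarizer import Summarizer

# A smaller backbone; output_hidden_states must be enabled for the summarizer
custom_config = AutoConfig.from_pretrained('distilbert-base-uncased')
custom_config.output_hidden_states = True
custom_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
custom_model = AutoModel.from_pretrained('distilbert-base-uncased', config=custom_config)

model_small = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)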
texts = list(sub_data.text)
predicted_summary = []

for i in texts:
    result = model(i, min_length=50)
    summary = "".join(result)
    predicted_summary.append(summary)
Extracting Texts: The text data from sub_data is converted into a list called texts.
Generating Summaries: The code iterates over each text entry and uses the BERT model to generate a summary. Note that min_length=50 does not set a minimum summary length; in bert-extractive-summarizer it is the minimum character length a sentence must have to be considered as a summary candidate. Since model() returns a string, the "".join(result) call simply passes it through (it mirrors the library's own examples). Each generated summary is appended to the predicted_summary list.
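The call above relies on the library's default selection ratio (20% of sentences). If you want tighter control over summary length, the summarizer also accepts a ratio argument, and recent versions of the library support an explicit sentence count:
# Keep roughly 30% of the sentences
short_summary = model(texts[0], ratio=0.3)

# Request a fixed number of sentences (supported in recent library versions)
three_sentences = model(texts[0], num_sentences=3)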
4. Preparing the Summaries for Export
predict_list = []

for i in predicted_summary:
    predict_list.append([i])
Formatting Summaries: The predicted_summary list is reshaped into predict_list, where each summary is wrapped in its own single-element list, because csv.writerows expects each row to be an iterable of fields. A list comprehension does the same in one line, as shown below.
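Equivalently:
predict_list = [[summary] for summary in predicted_summary]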
5. Writing the Summaries to a CSV File
import csv

with open('pred_extract.csv', 'w', encoding='utf8') as csv_file:
    write = csv.writer(csv_file, lineterminator='\n')
    write.writerow(['extractive_summary'])
    write.writerows(predict_list)
Writing to CSV: The summaries are written to a CSV file named pred_extract.csv, with each summary stored under the column extractive_summary. Setting lineterminator='\n' avoids the blank lines that the csv module's default '\r\n' row terminator can produce when the file is opened in text mode on Windows.
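Since pandas is already loaded, an equivalent one-liner writes the same CSV without the manual wrapping step:
pd.DataFrame({'extractive_summary': predicted_summary}).to_csv('pred_extract.csv', index=False, encoding='utf8')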
6. Inspecting the Output
predicted_summary[0]
sub_data.text[0]
Inspecting the Summaries: These cells display the first generated summary and the first original text so you can compare the model's output against its source. Use the same index in both calls so the pair actually corresponds, and change the index to spot-check other examples.
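For a broader spot check, you can print the source text, the dataset's reference summary, and the model's prediction side by side for a few examples:
for idx in range(3):
    print(f"--- Example {idx} ---")
    print("TEXT:     ", sub_data.text[idx][:300], "...")
    print("REFERENCE:", sub_data.summary[idx])
    print("PREDICTED:", predicted_summary[idx])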
Running the Code
To run the provided code:
Ensure that you have Jupyter Notebook installed and set up.
Copy the provided code into a Jupyter Notebook cell.
Run each cell sequentially.
Make sure the dataset (dev-stats.jsonl) is in the same directory or provide the correct path to the file.
In this blog, we walked through the process of creating an extractive text summarization model using BERT. We loaded a dataset, processed the text, used BERT to generate summaries, and saved the results to a CSV file. This project showcases the power of BERT in handling complex NLP tasks and provides a foundation for further exploration into text summarization.
If you require any assistance with this or other Machine Learning projects, please do not hesitate to contact us. We have a team of experienced developers who specialize in Machine Learning and can provide the support and expertise needed to ensure the success of your project. You can reach us through our website, or contact us directly via email or phone.