Extractive summarization is a natural language processing (NLP) technique where the most important sentences or phrases are extracted from a text to create a concise summary. This blog will guide you through building an extractive text summarization model using BERT, a state-of-the-art NLP model. We will walk through the provided code in a Jupyter Notebook (Bert_Extractive.ipynb) and explain each component in detail.
Introduction
The goal of this project is to develop an extractive summarization model using BERT, which will take a large text input and extract the most relevant sentences to create a summary. We will use a pre-trained BERT model from the bert-extractive-summarizer library to achieve this. The code also demonstrates how to preprocess the data, generate summaries, and save the results to a CSV file.
What You Will Learn
By the end of this tutorial, you will have learned:
How to preprocess text data for summarization.
How to use BERT for extractive summarization.
How to work with JSON data in Python.
How to save the generated summaries to a CSV file.
Prerequisites
Before you begin, ensure you have the following installed:
Jupyter Notebook
Python with the following packages: bert-extractive-summarizer, sentencepiece, and pandas (the csv module ships with Python's standard library, so it needs no installation)
You can install the required packages using the following commands:
pip install bert-extractive-summarizer
pip install sentencepiece
pip install pandas
Understanding the Provided Code
Let’s go through the provided code step by step.
1. Importing Required Libraries
import warnings
warnings.filterwarnings("ignore")
!pip install bert-extractive-summarizer
!pip install sentencepiece
import pandas as pd
import os
Warnings Filter: The code begins by suppressing any warnings to keep the output clean and focused.
Installing Packages: The !pip install commands ensure that the bert-extractive-summarizer and sentencepiece libraries are installed. These libraries are essential for using BERT for summarization and handling tokenization.
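If you prefer installing from inside the notebook, the %pip magic available in modern Jupyter installs into the same environment the kernel is running in, which avoids a common environment mismatch that plain !pip can cause:
# %pip targets the active kernel's environment
%pip install bert-extractive-summarizer sentencepiece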
Importing Pandas: pandas is imported for handling data in a tabular format. The os library is imported but not used directly in the provided code.
2. Loading and Preparing the Data
data = pd.read_json('dev-stats.jsonl', lines=True, nrows=3000)
data.head()
Loading Data: The dataset is loaded from a JSONL file (dev-stats.jsonl) using the pd.read_json function. The lines=True parameter specifies that each line in the file is a separate JSON object. The nrows=3000 argument limits the data to the first 3,000 rows.
Displaying Data: data.head() is used to display the first few rows of the dataset to inspect the structure.
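Because the next step selects columns by position, it is worth confirming the shape and column order of the loaded DataFrame first:
print(data.shape)    # e.g. (3000, n_columns)
print(data.columns)  # confirm which positions hold the text and summary fields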
sub_data = pd.DataFrame(columns=['text', 'summary'])
sub_data['text'] = data.iloc[:, 2] + ' ' + data.iloc[:, 4]
sub_data['summary'] = data.iloc[:, 5]
sub_data.head()
Creating a Subset: A new DataFrame, sub_data, is created to hold the texts and their reference summaries. The text is built by concatenating the third and fifth columns of the original dataset (selected by position with iloc), while the summary is taken from the sixth column.
Inspecting the Subset: sub_data.head() displays the first few rows of the sub_data DataFrame to ensure the data has been processed correctly.
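Positional indexing with iloc breaks silently if the column order of the file ever changes. When the column names are known, selecting by name is more robust. Here is a hypothetical sketch assuming the JSONL records use keys named title, body, and summary; check data.columns for the real names in your file:
# Hypothetical field names -- replace with the actual columns in dev-stats.jsonl
sub_data = pd.DataFrame()
sub_data['text'] = data['title'] + ' ' + data['body']
sub_data['summary'] = data['summary']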
3. Summarizing the Text with BERT
from summarizer import Summarizer
model = Summarizer()
Importing and Initializing the Model: The Summarizer class from the bert-extractive-summarizer library is imported, and an instance of the BERT summarizer model is created.
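By default, Summarizer() loads a large pre-trained BERT. The library also accepts a custom Hugging Face model and tokenizer, which is useful if you want a smaller or domain-specific backbone. A sketch following the custom-model pattern from the library's documentation, swapping in DistilBERT to trade some quality for speed (the rest of this tutorial uses the default model above):
from transformers import AutoConfig, AutoTokenizer, AutoModel
from summarizer import Summarizer

# A smaller backbone; output_hidden_states must be enabled for the summarizer
custom_config = AutoConfig.from_pretrained('distilbert-base-uncased')
custom_config.output_hidden_states = True
custom_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
custom_model = AutoModel.from_pretrained('distilbert-base-uncased', config=custom_config)

model_small = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)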
texts = list(sub_data.text)
predicted_summary = []

for i in texts:
    result = model(i, min_length=50)
    summary = "".join(result)
    predicted_summary.append(summary)
Extracting Texts: The text data from sub_data is converted into a list called texts.
Generating Summaries: The code iterates over each text entry and uses the BERT model to generate a summary. Note that min_length=50 does not set a minimum summary length; in bert-extractive-summarizer it is the minimum character length a sentence must have to be considered as a summary candidate. Since model() returns a string, the "".join(result) call simply passes it through (it mirrors the library's own examples). Each generated summary is appended to the predicted_summary list.
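The call above relies on the library's default selection ratio (20% of sentences). If you want tighter control over summary length, the summarizer also accepts a ratio argument, and recent versions of the library support an explicit sentence count:
# Keep roughly 30% of the sentences
short_summary = model(texts[0], ratio=0.3)

# Request a fixed number of sentences (supported in recent library versions)
three_sentences = model(texts[0], num_sentences=3)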
4. Preparing the Summaries for Export
predict_list = []

for i in predicted_summary:
    predict_list.append([i])
Formatting Summaries: The predicted_summary list is reshaped into predict_list, where each summary is wrapped in its own single-element list, because csv.writerows expects each row to be an iterable of fields. A list comprehension does the same in one line, as shown below.
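Equivalently:
predict_list = [[summary] for summary in predicted_summary]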
5. Writing the Summaries to a CSV File
import csv

with open('pred_extract.csv', 'w', encoding='utf8') as csv_file:
    write = csv.writer(csv_file, lineterminator='\n')
    write.writerow(['extractive_summary'])
    write.writerows(predict_list)
Writing to CSV: The summaries are written to a CSV file named pred_extract.csv, with each summary stored under the column extractive_summary. Setting lineterminator='\n' avoids the blank lines that the csv module's default '\r\n' row terminator can produce when the file is opened in text mode on Windows.
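Since pandas is already loaded, an equivalent one-liner writes the same CSV without the manual wrapping step:
pd.DataFrame({'extractive_summary': predicted_summary}).to_csv('pred_extract.csv', index=False, encoding='utf8')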
6. Inspecting the Output
predicted_summary[0]
sub_data.text[0]
Inspecting the Summaries: These cells display the first generated summary and the first original text so you can compare the model's output against its source. Use the same index in both calls so the pair actually corresponds, and change the index to spot-check other examples.
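For a broader spot check, you can print the source text, the dataset's reference summary, and the model's prediction side by side for a few examples:
for idx in range(3):
    print(f"--- Example {idx} ---")
    print("TEXT:     ", sub_data.text[idx][:300], "...")
    print("REFERENCE:", sub_data.summary[idx])
    print("PREDICTED:", predicted_summary[idx])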
Running the Code
To run the provided code:
Ensure that you have Jupyter Notebook installed and set up.
Copy the provided code into a Jupyter Notebook cell.
Run each cell sequentially.
Make sure the dataset (dev-stats.jsonl) is in the same directory or provide the correct path to the file.
In this blog, we walked through the process of creating an extractive text summarization model using BERT. We loaded a dataset, processed the text, used BERT to generate summaries, and saved the results to a CSV file. This project showcases the power of BERT in handling complex NLP tasks and provides a foundation for further exploration into text summarization.
If you require any assistance with this or other Machine Learning projects, please do not hesitate to contact us. We have a team of experienced developers who specialize in Machine Learning and can provide the support and expertise needed to ensure the success of your project. You can reach us through our website, or contact us directly via email or phone.