Named Entity Recognition (NER) is a crucial task in natural language processing (NLP) that involves identifying and classifying key information (entities) in text. These entities could be names of people, organizations, locations, or in this case, specific medical terms such as diseases. In this blog, we'll walk through the creation of a custom NER model using SpaCy, with the aid of transformer-based embeddings. The provided code is structured as a Jupyter Notebook and demonstrates how to train and evaluate a custom NER model on a medical dataset.
Introduction
The objective of this project is to build a custom NER model that can recognize specific medical entities in text, such as diseases and medical conditions. The model will be trained using SpaCy and transformer-based embeddings, specifically the en_core_web_trf transformer model from SpaCy.
What You Will Learn
By the end of this tutorial, you will learn:
How to set up a custom NER training pipeline using SpaCy and transformers.
How to prepare and preprocess your dataset for NER tasks.
How to train and evaluate the custom NER model on a specific dataset.
Prerequisites
Before starting, ensure you have the following installed:
Python
Jupyter Notebook
SpaCy (pip install spacy)
Transformers (pip install spacy-transformers)
Understanding the Provided Code
Let's go through the provided code step by step.
1. Setting Up the Environment
!pip install --upgrade spacy
Upgrading SpaCy: This command upgrades SpaCy to the latest version so the notebook can use the newest features and bug fixes.
Output :
Import libraries
import os
import pandas as pd
import numpy as np
import zipfile
Importing Libraries: os for file and directory management, pandas for handling data in a tabular format, numpy for numerical operations, and zipfile for handling zipped files.
Check GPU
! nvidia-smi
GPU Check: The nvidia-smi command checks if a compatible GPU is available for training, which can significantly speed up the process.
Output
2. Mounting Google Drive
from google.colab import drive
drive.mount('/gdrive')
# %cd /gdrive
Mounting Google Drive: This step is specific to Google Colab, where you mount your Google Drive to access files directly from it.
Changing Directory: The %cd command is used to navigate to the Google Drive directory, though it is commented out.
3. TensorFlow Setup
import tensorflow as tf
print(tf.__version__, tf.config.list_physical_devices('GPU'))
TensorFlow GPU Setup: The code prints the TensorFlow version and lists the GPU devices TensorFlow can see. This is mainly a runtime check that the notebook has GPU access; note that SpaCy's transformer pipeline itself runs on PyTorch via spacy-transformers, not TensorFlow.
Output :
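Since the transformer pipeline runs on PyTorch, you can optionally confirm that PyTorch also sees the GPU. This is a minimal sketch and is not part of the original notebook:
import torch
print(torch.cuda.is_available())           # True if a CUDA-capable GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # name of the first visible GPU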
4. Setting Up Directories
data_dir = os.path.join(os.getcwd(),'My Drive/A115 NLP')
os.chdir(data_dir)
dataset_dir= os.path.join(os.getcwd(),'NCBI-data (Part-2)')
os.chdir(dataset_dir)
output_dir = os.path.join(data_dir,'NER')
os.listdir()
Output :
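Setting Up Directories: The code navigates into the project folder on Google Drive, then into the folder holding the NCBI corpus files, and defines output_dir as the location where the serialized .spacy files and training output will be stored. Note that output_dir is only constructed here, not created; if the NER folder does not already exist in your Drive, a minimal optional safeguard (not in the original notebook) is:
os.makedirs(output_dir, exist_ok=True)  # create the output folder if it is missing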
5. Loading the Dataset
with open('NCBItrainset_corpus.txt', 'r') as trainfile:
    file = trainfile.read()
with open('NCBItestset_corpus.txt', 'r') as testfile:
    file_test = testfile.read()
Reading Files: The dataset is loaded from text files (NCBItrainset_corpus.txt and NCBItestset_corpus.txt) into Python strings for further processing.
6. Preprocessing the Data
# Train set
train = []
for line in file.split('\n\n'):
    train.append(line)

# Test set
test = []
for line in file_test.split('\n\n'):
    test.append(line)
Data Segmentation: The dataset is segmented into individual entries by splitting the text on double newline characters. This step organizes the data into manageable chunks for further processing.
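To make the format concrete, you can print the first entry. In the NCBI disease corpus, each entry has a title line of the form PMID|t|..., an abstract line of the form PMID|a|..., followed by one tab-separated annotation line per disease mention (start offset, end offset, mention text, entity type, concept ID). This quick peek is not part of the original notebook:
# Inspect the first training entry: title line, abstract line, then annotation lines.
print(train[0][:500])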
7. Extracting Articles and Titles
import re

def get_article(text):
    # Grab the abstract text that follows the "|a|" marker in an entry
    article = re.findall(r'a\|(.*)\n', text)
    return article[0]

def get_title(text):
    # Grab the title text that follows the "|t|" marker in an entry
    title = re.findall(r't\|(.*)\n', text)
    return title[0]
# Articles for train data
train_article = [get_title(x)+' '+get_article(x) for x in train]
# Articles for test data
test_article = [get_title(x)+' '+get_article(x) for x in test]
len(test_article)
Regular Expressions for Extraction: The get_article and get_title functions use regular expressions to extract articles and titles from the dataset.
Concatenating Articles and Titles: The extracted titles and articles are concatenated to form complete documents for both training and testing datasets.
Output:
100
8. Creating DataFrames
train_df = pd.DataFrame(columns=['article'])
train_df['article'] = train_article
test_df = pd.DataFrame(columns=['article'])
test_df['article'] = test_article
DataFrames for Articles: The extracted articles are stored in train_df and test_df DataFrames, which will be used to associate articles with their corresponding labels.
train_df.head()
test_df.head()
Output :
train
test data
9. Extracting Labels
def get_labels(text):
    # Each annotation line looks like: PMID \t start \t end \t mention \t type \t concept ID.
    # Capture everything after the PMID, then keep the start/end offsets and the entity type.
    l = re.findall(r'\t(.*)', text)
    l = [x.split('\t') for x in l]
    labels = []
    for i in l:
        labels.append((int(i[0]), int(i[1]), i[3]))
    return labels
Label Extraction: The get_labels function uses regular expressions to extract entity labels from the dataset. These labels include the start and end positions of the entity in the text and the entity type.
train_labels = [get_labels(x) for x in train]
test_labels = [get_labels(x) for x in test]
Label Lists: The labels are then extracted and stored in lists corresponding to the training and testing datasets.
len(train_labels)
Output: 593
len(test_labels)
Output: 100
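As a quick sanity check (not in the original notebook), you can confirm that the extracted character offsets line up with the concatenated title-plus-abstract text built earlier:
# The NCBI offsets are character positions into the title + ' ' + abstract string,
# so slicing the article should reproduce the annotated mention.
start, end, label = train_labels[0][0]
print(train_article[0][start:end], '->', label)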
10. Adding Labels to DataFrames
train_df['labels'] = train_labels
test_df['labels'] = test_labels
Associating Labels: The extracted labels are added to the train_df and test_df DataFrames, linking the articles with their corresponding annotations.
train_df.head()
Output :
test_df.head()
Output:
11. Preparing Training Data
training_data = []
for i, j in zip(train_article, train_labels):
    training_data.append((i, j))
Training Data Preparation: The articles and labels are combined into tuples and stored in the training_data list. This will be used to train the custom NER model.
training_data[0]
Output:
12. Initializing SpaCy and Preparing for Training
import spacy
import spacy.training
import spacy_transformers
from spacy.tokens import DocBin
nlp = spacy.load("en_core_web_trf")
Loading SpaCy Model: The en_core_web_trf model is loaded, a transformer-based SpaCy pipeline built on a pretrained RoBERTa model that provides richer contextual embeddings than the standard CNN-based pipelines.
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        if span is not None:
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk(os.path.join(output_dir, "train.spacy"))
Creating a Training Dataset: The code iterates through the training data, creating SpaCy Doc objects with annotated entities. These objects are then serialized and saved to disk using DocBin.
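One practical caveat: doc.char_span returns None whenever an annotation does not align exactly with token boundaries, so those entities are silently dropped above, and assigning doc.ents raises an error if any spans overlap. A hedged variant of the same loop that mitigates both issues, using char_span's alignment_mode argument and SpaCy's filter_spans helper, could look like this:
from spacy.util import filter_spans

db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        # "expand" snaps offsets that fall inside a token out to the token boundary
        # instead of returning None and losing the annotation.
        span = doc.char_span(start, end, label=label, alignment_mode="expand")
        if span is not None:
            ents.append(span)
    # filter_spans removes overlapping spans (keeping the longest), which doc.ents requires.
    doc.ents = filter_spans(ents)
    db.add(doc)
db.to_disk(os.path.join(output_dir, "train.spacy"))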
13. Preparing Testing Data
testing_data = []
for i, j in zip(test_article, test_labels):
    testing_data.append((i, j))
Testing Data Preparation: Similar to training data, the articles and labels for the testing set are combined and stored in testing_data.
db = DocBin()
for text, annotations in testing_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        if span is not None:
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk(os.path.join(output_dir, "dev.spacy"))
Creating a Testing Dataset: The testing data is processed similarly, with the resulting Doc objects saved for evaluation purposes.
14. Training the Custom NER Model
!pip install spacy-transformers
!python -m spacy download en_core_web_trf
Installing Necessary Packages: Ensure that spacy-transformers is installed and the transformer model is downloaded for training.
Output:
!python -m spacy init fill-config --help
Output :
!python -m spacy init fill-config '/gdrive/My Drive/A115 NLP/NER/base_config.cfg' '/gdrive/My Drive/A115 NLP/NER/config.cfg'
Filling Configuration: The spacy init fill-config command takes a partial base configuration file and fills in the remaining defaults, producing a complete config.cfg ready for training.
Output:
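If you do not already have a base_config.cfg in your Drive (the notebook assumes one generated from SpaCy's training quickstart), an alternative sketch is to let SpaCy generate a complete transformer-ready config directly, in which case the fill-config step above is not needed; the path below simply mirrors the earlier folder layout:
!python -m spacy init config '/gdrive/My Drive/A115 NLP/NER/config.cfg' --lang en --pipeline ner --optimize accuracy --gpu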
train_df
Output:
training_data[0]
Output :
!python -m spacy train '/gdrive/My Drive/A115 NLP/NER/config.cfg' --output '/gdrive/My Drive/A115 NLP/NER/output' --paths.train '/gdrive/My Drive/A115 NLP/NER/train.spacy' --paths.dev '/gdrive/My Drive/A115 NLP/NER/dev.spacy' --gpu-id 0
Training the Model: The custom NER model is trained using the specified configuration file and dataset. The --gpu-id 0 flag ensures that the training runs on the first GPU.
Output :
15. Evaluating the Model
unique_labels = set()
for example in testing_data:
    entities = example[1]
    for entity in entities:
        entity_label = entity[2]
        unique_labels.add(entity_label)
unique_labels_list = list(unique_labels)
Extracting Unique Labels: The code extracts the unique entity labels present in the testing dataset for display purposes.
colors = {
    "Modifier": "#FF69B4",
    "CompositeMention": "#39FF14",
    "SpecificDisease": "#FFFF00",
    "DiseaseClass": "#87CEEB"
}
options = {"ents": ["SpecificDisease", "Modifier", "CompositeMention", "DiseaseClass"], "colors": colors}
Setting Visualization Options: Custom colors and entity types are defined for visualizing the entities recognized by the model.
print('Entities to be recognised in the provided medical text : ')
print(unique_labels_list)
Output :
import spacy
nlp = spacy.load(os.path.join(output_dir, 'output', 'model-best'))  # best checkpoint written by spacy train
doc = nlp(testing_data[5][0])
spacy.displacy.render(doc, style="ent", jupyter=True, options = options)
Loading and Testing the Model: The trained model is loaded, and a sample document from the testing dataset is processed. The recognized entities are then visualized directly in the Jupyter Notebook using SpaCy's displacy visualization tool.
Output :
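Beyond visually inspecting a few documents, you can also obtain quantitative precision, recall, and F-scores for the NER component with SpaCy's built-in evaluate command. The paths below are assumptions that mirror the earlier training step:
!python -m spacy evaluate '/gdrive/My Drive/A115 NLP/NER/output/model-best' '/gdrive/My Drive/A115 NLP/NER/dev.spacy' --gpu-id 0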
Running the Code
To run this code:
Copy the provided code into a Jupyter Notebook.
Ensure that the necessary files and directories (e.g., datasets, configuration files) are correctly set up in Google Drive.
Execute each cell sequentially.
Monitor the training process and evaluate the model on the provided test data.
Project Demo Video
This blog post guided you through building a custom NER model using SpaCy and transformer-based embeddings. We started by preparing the dataset, creating training and testing datasets, and then trained a custom model capable of recognizing specific medical entities. The final step involved evaluating and visualizing the model’s performance, which showcased its ability to identify relevant entities in medical text.
If you require any assistance with this project or Machine Learning projects, please do not hesitate to contact us. We have a team of experienced developers who specialize in Machine Learning and can provide you with the necessary support and expertise to ensure the success of your project. You can reach us through our website or by contacting us directly via email or phone.