INTRODUCTION
In this blog, you will be introduced to text processing: cleaning data with regular expressions, lemmatizing words, and using stop words to remove unwanted words from the text.
IMPLEMENTATION
To begin with, you will need to import essential libraries.
# IMPORT THE ESSENTIAL LIBRARIES
import numpy as np
import pandas as pd
import re
import string
from string import punctuation
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import WordPunctTokenizer
import spacy

# download the NLTK resources needed below
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

en_stop = set(nltk.corpus.stopwords.words('english'))
nlp = spacy.load('en_core_web_sm')
all_stopwords = nlp.Defaults.stop_words
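As a quick optional check, you can inspect the two stopword sets that were just created. This is only a sketch; the exact counts vary with the installed NLTK and spaCy versions.
# OPTIONAL: inspect the two stopword sets (sizes vary by library version)
print(len(en_stop))           # NLTK English stopwords (about 180 words)
print(len(all_stopwords))     # spaCy default stopwords (about 326 words)
print('the' in all_stopwords) # True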
Now, you will need to import the dataset to work with. I have used the Coronavirus Tweets NLP dataset, which you can download from this link: https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification
# IMPORT THE DATASET
dataset = pd.read_csv("/content/Corona_NLP_train.csv", encoding='latin-1')
To get information about the dataset, we will use the .info() method:
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41157 entries, 0 to 41156
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   UserName       41157 non-null  int64
 1   ScreenName     41157 non-null  int64
 2   Location       32567 non-null  object
 3   TweetAt        41157 non-null  object
 4   OriginalTweet  41157 non-null  object
 5   Sentiment      41157 non-null  object
dtypes: int64(2), object(4)
memory usage: 1.9+ MB
In the above table, we need only the OriginalTweet column, which contains the tweets related to COVID-19.
dataset = pd.DataFrame(dataset["OriginalTweet"])
Now, to find the total number of tweets, we will use the .shape attribute:
dataset.shape
(41157, 1)
For simplicity, I will use only the first 1,000 rows:
dataset = dataset.iloc[:1000, :]
dataset.head()
Now we need to clean the data. To do that, we will use regular expressions to remove unwanted characters and tokens, such as punctuation marks and usernames, from the tweets.
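For instance, a single substitution that strips a Twitter handle can look like the snippet below (a minimal sketch; the sample string is made up for illustration):
# A hypothetical example of one substitution
sample = "@example_user Prices at the supermarket are rising!"
print(re.sub(r'@[^\s]+', '', sample))
# -> ' Prices at the supermarket are rising!'
The full cleaning function below chains many such substitutions: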
def clean_text(text):
    text = text.lower()
    # remove @usernames before stripping special characters,
    # otherwise the '@' is gone and the handle text survives
    text = re.sub(r'@[^\s]+', '', text)
    # keep only letters, digits and a few punctuation characters
    text = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", text)
    # drop leftover patterns (inline $...$ spans, "table"/"figure" references)
    text = re.sub(r"(\$+)(?:(?!\1)[\s\S])*\1", '', text)
    text = re.sub(r"\b(?:table|figure)\b", '', text)
    # expand or drop common contractions
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"\r", "", text)
    text = re.sub(r"he's", "", text)
    text = re.sub(r"she's", "", text)
    text = re.sub(r"it's", "", text)
    text = re.sub(r"that's", "", text)
    text = re.sub(r"what's", "", text)
    text = re.sub(r"where's", "", text)
    text = re.sub(r"how's", "", text)
    text = re.sub(r"\'ll", "", text)
    text = re.sub(r"\'ve", "", text)
    text = re.sub(r"\'re", "", text)
    text = re.sub(r"\'d", "", text)
    text = re.sub(r"won't", "", text)
    text = re.sub(r"can't", "", text)
    text = re.sub(r"n't", "", text)
    text = re.sub(r"n'", "ng", text)
    text = re.sub(r"'bout", "", text)
    text = re.sub(r"'til", "until", text)
    # collapse repeated whitespace and remove remaining punctuation
    text = re.sub(r"\s{2,}", " ", text)
    text = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r"\W", " ", text)
    # drop tokens that contain digits
    text = re.sub(r'\S*\d\S*\s*', '', text)
    # remove the standalone retweet marker 'rt' (word boundary,
    # so words like "support" are left intact)
    text = re.sub(r"\brt\b", "", text)
    return text
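Before applying the function to the whole column, you can try it on a single made-up tweet (the text below is a hypothetical example, not a row from the dataset):
# Hypothetical sample tweet, for illustration only
sample_tweet = "@shopper_99 Stocking up on hand sanitizer, prices up 20%!!!"
print(clean_text(sample_tweet))
# prints roughly: ' stocking up on hand sanitizer prices up '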
Now, it is time to clean the tweets:
# TEXT CLEANING
dataset["OriginalTweet"] = dataset["OriginalTweet"].apply(clean_text)
dataset.head()
As you can see, all the usernames and punctuation marks have been removed.
Now it is time to perform lemmatization, which converts words to their base form while preserving their context.
lemmatizer = WordNetLemmatizer()
# APPLYING WORDNET LEMMATIZATION
def wordnet_lemmatization(df):
    # lemmatize every word of every tweet in the column
    for i in range(df.shape[0]):
        df.iat[i, 0] = ' '.join([lemmatizer.lemmatize(word) for word in df.iat[i, 0].split()])
    return df
# Let's call the function
wordnet_lemmatization(dataset)
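As a quick illustration on standalone words (not rows from the dataset), the WordNet lemmatizer treats every word as a noun unless you pass a POS tag, so verbs are left untouched by default:
# WordNet lemmatization defaults to treating words as nouns
print(lemmatizer.lemmatize("companies"))         # company
print(lemmatizer.lemmatize("crises"))            # crisis
print(lemmatizer.lemmatize("running"))           # running (unchanged without a POS tag)
print(lemmatizer.lemmatize("running", pos="v"))  # run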
Now, we will perform spaCy lemmatization and see the difference:
# SPACY LEMMATIZATION
def spacy_lemmatization(df):
    # iterate over all the rows and replace each tweet with its lemmas
    for i in range(df.shape[0]):
        doc = nlp(df.iat[i, 0])
        df.iat[i, 0] = " ".join([token.lemma_ for token in doc])
    return df
spacy_lemmatization(dataset)
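Unlike the WordNet approach above, spaCy tags parts of speech itself, so verbs and irregular forms are reduced without any extra arguments. A small sketch on a made-up sentence:
# spaCy infers POS tags, so verbs and plurals are handled automatically
doc = nlp("prices were rising at the supermarkets")
print(" ".join(token.lemma_ for token in doc))
# expected output along the lines of: price be rise at the supermarket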
Now you will need to remove stopwords: words that occur very frequently in text (at, the, it, in, on, to) and are of little use for text models.
# REMOVE STOPWORDS
def remove_stopwords(df):
    # keep only the words that are not in spaCy's default stopword list
    for i in range(df.shape[0]):
        df.iat[i, 0] = ' '.join([word for word in df.iat[i, 0].split() if word not in all_stopwords])
    return df
remove_stopwords(dataset)
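As a quick sanity check on a standalone string (again, not a dataset row), stopword removal keeps only the content words:
# Hypothetical example string, for illustration only
sample = "there is a shortage of toilet paper in the local stores"
print(' '.join(w for w in sample.split() if w not in all_stopwords))
# expected output along the lines of: shortage toilet paper local stores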
If you need an implementation of any of the topics mentioned above, or assignment help on any of their variants, feel free to contact us.