
TEXT PROCESSING

Updated: Aug 20, 2022

INTRODUCTION

In this blog, you will be introduced to text processing: cleaning data with regular expressions, lemmatizing words, and removing unwanted words from text using stop words.


IMPLEMENTATION

To begin with, you will need to import the essential libraries.

# IMPORT THE ESSENTIAL LIBRARIES
import re
import string
import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import WordPunctTokenizer

import spacy

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')

en_stop = set(stopwords.words('english'))

nlp = spacy.load('en_core_web_sm')
all_stopwords = nlp.Defaults.stop_words
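
As a quick sanity check, you can compare the two stopword collections we just loaded. This is a minimal sketch; the exact counts and entries depend on your NLTK and spaCy versions.

# COMPARE THE TWO STOPWORD COLLECTIONS (counts vary by library version)
print(len(en_stop), len(all_stopwords))
print(sorted(en_stop)[:5])
print(sorted(all_stopwords)[:5])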

Now, you will need to import a dataset to work with. I have used the Coronavirus Tweets NLP dataset, which you can download from this link: https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification

# IMPORT THE DATASET
dataset = pd.read_csv("/content/Corona_NLP_train.csv", encoding='latin-1')

To get information about the dataset, we will use the .info() method.

dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41157 entries, 0 to 41156
Data columns (total 6 columns):
#   Column         Non-Null Count  Dtype
---  ------         --------------  -----
0   UserName       41157 non-null  int64
1   ScreenName     41157 non-null  int64
2   Location       32567 non-null  object
3   TweetAt        41157 non-null  object
4   OriginalTweet  41157 non-null  object
5   Sentiment      41157 non-null  object
dtypes: int64(2), object(4)
memory usage: 1.9+ MB

In the above table, we only need the OriginalTweet column, which contains the tweets related to COVID-19.

dataset = pd.DataFrame(dataset["OriginalTweet"])

Now, to find the total number of tweets, we will check the .shape attribute.

dataset.shape
(41157, 1)

For simplicity, I will use only the first 1,000 rows.

dataset = dataset.iloc[:1000, :]
dataset.head()

Now, we need to clean the data. To do that, we will use regular expressions to remove unwanted characters and tokens, such as punctuation marks, usernames, and URLs, from the tweets.

def clean_text(text):
  text = text.lower()
  # Remove @usernames and URLs before stripping other characters
  text = re.sub(r'@\S+', '', text)
  text = re.sub(r'https?://\S+|www\.\S+', '', text)
  # Expand common contractions instead of deleting the whole word
  text = re.sub(r"i'm", "i am", text)
  text = re.sub(r"won't", "will not", text)
  text = re.sub(r"can't", "cannot", text)
  text = re.sub(r"n't", " not", text)
  text = re.sub(r"'ll", " will", text)
  text = re.sub(r"'ve", " have", text)
  text = re.sub(r"'re", " are", text)
  text = re.sub(r"'d", " would", text)
  text = re.sub(r"'s", "", text)
  text = re.sub(r"'til", "until", text)
  # Remove the remaining punctuation
  text = text.translate(str.maketrans('', '', string.punctuation))
  # Drop tokens that contain digits (e.g. "covid19" once "#" is stripped)
  text = re.sub(r'\S*\d\S*', '', text)
  # Remove carriage returns and collapse repeated whitespace
  text = re.sub(r'\r', ' ', text)
  text = re.sub(r'\s{2,}', ' ', text)
  return text.strip()

Now, it's time to clean the texts:

# TEXT CLEANING
dataset["OriginalTweet"] = dataset["OriginalTweet"].apply(clean_text)
dataset.head()

As you can see, all the usernames and punctuation marks have been removed.
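
To double-check the function on a single example, you can run it on a made-up tweet (the text below is illustrative and not taken from the dataset):

# QUICK CHECK ON A SAMPLE STRING (illustrative tweet, not from the dataset)
sample = "@MeNyrbie Shopping online at https://t.co/abc123, I'm staying home! #COVID19"
print(clean_text(sample))
# -> shopping online at i am staying home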


Now, it is time to perform lemmatization, which reduces each word to its base form while preserving its context.

lemmatizer = WordNetLemmatizer()

# APPLYING WORDNET LEMMATIZATION
def wordnet_lemmatization(df):
  # Lemmatize every word of every tweet, then rejoin the words
  for i in range(df.shape[0]):
    df.iat[i, 0] = ' '.join([lemmatizer.lemmatize(word) for word in df.iat[i, 0].split()])
  return df

# Let's call the function
wordnet_lemmatization(dataset)
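
One caveat: WordNetLemmatizer treats every word as a noun unless you pass a part-of-speech tag, so verbs often pass through unchanged. A quick illustration:

# WORDNET LEMMATIZATION DEFAULTS TO THE NOUN PART OF SPEECH
print(lemmatizer.lemmatize("stores"))        # store
print(lemmatizer.lemmatize("running"))       # running (treated as a noun)
print(lemmatizer.lemmatize("running", "v"))  # run (treated as a verb)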


Now, we will perform spaCy lemmatization and see the difference.

# SPACY LEMMATIZATION
def spacy_lemmatization(df):
  # Iterate over all the rows and replace each tweet with its lemmas
  for i in range(df.shape[0]):
    doc = nlp(df.iat[i, 0])
    df.iat[i, 0] = " ".join([token.lemma_ for token in doc])
  return df

spacy_lemmatization(dataset)
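
Unlike the WordNet approach above, spaCy tags each token's part of speech before lemmatizing, so verbs and plurals are both handled. A small comparison on one sentence (the exact output may vary slightly across spaCy model versions):

# SPACY LEMMATIZES USING EACH TOKEN'S PART OF SPEECH
doc = nlp("the stores were running out of supplies")
print(" ".join(token.lemma_ for token in doc))
# e.g. -> the store be run out of supply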

Now you will need to remove the stopwords: words that occur frequently in text (e.g., at, the, it, in, on, to) but carry little useful signal for text models.

# REMOVE STOPWORDS
def remove_stopwords(df):
  for i in range(df.shape[0]):
    df.iat[i, 0] = ' '.join([word for word in df.iat[i, 0].split() if word not in all_stopwords])
  return df
  
remove_stopwords(dataset)
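
As with the other steps, it helps to verify the effect on one sentence first. Here is a minimal example using the spaCy stopword set loaded earlier:

# STOPWORD REMOVAL ON A SAMPLE SENTENCE
sample = "there is a shortage of supplies at the local store"
print(' '.join(word for word in sample.split() if word not in all_stopwords))
# e.g. -> shortage supplies local store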
If you need an implementation of any of the topics mentioned above, or assignment help with any of their variants, feel free to contact us.


