
NLP WITH PYSPARK

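In this blog post, you will be introduced to natural language processing (NLP) with PySpark. We will tokenize IMDB movie reviews, remove stopwords, build n-grams, and compute TF-IDF and count-vector features. The commands below assume a notebook environment such as Google Colab.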


To begin, we will update the Ubuntu package index so that we don't run into errors while installing Java.

!sudo apt update

Now, we will install the Java JDK.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Now, we will download and unpack Apache Spark, which ships with PySpark.

# INSTALL APACHE SPARK AND HADOOP
!wget -q https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
!tar xf spark-3.3.0-bin-hadoop3.tgz

Set up the environment variables so that Spark can find Java and its own installation directory

import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.0-bin-hadoop3"

Locate the Spark installation with findspark, which makes PySpark importable

!pip install -q findspark
import findspark
findspark.init()

Import the essential libraries

import pandas as pd
import psutil
import matplotlib.pyplot as plt
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import col, udf, lit, array_remove, rand
from pyspark.sql.types import IntegerType
from pyspark.ml.feature import (Tokenizer, RegexTokenizer, StopWordsRemover,
                                NGram, HashingTF, IDF, CountVectorizer)

Create the Spark session

from pyspark.sql import SparkSession
spark = (SparkSession
 .builder
 .appName("NAME_OF_THE_APP")
 .getOrCreate())

Import the dataset

df = spark.read.csv('/content/IMDB Dataset.csv', sep=',', inferSchema=True, header=True)
df.show(5)
+--------------------+--------------------+
|              review|           sentiment|
+--------------------+--------------------+
|One of the other ...|            positive|
|"A wonderful litt...| not only is it w...|
|"I thought this w...| but spirited you...|
|Basically there's...|            negative|
|"Petter Mattei's ...| power and succes...|
+--------------------+--------------------+
only showing top 5 rows
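
Notice that the sentiment column above holds review text in some rows. The reviews themselves contain commas and quotation marks, so those rows are being split in the wrong place. The sketch below uses Python's built-in csv module on a made-up line (not taken from the dataset) to show the difference between splitting on every comma and honouring quoted fields. In Spark, a likely fix is to pass quote='"' and escape='"' to spark.read.csv, assuming the file escapes quotes by doubling them; verify this against your copy of the data.

```python
import csv
import io

# Made-up example row: the review contains a comma and doubled quotes.
raw = 'review,sentiment\n"A ""wonderful"" little production, very well done.",positive\n'

# Naive approach: split the data line on every comma.
naive_fields = raw.splitlines()[1].split(",")

# Quote-aware approach: let the csv module handle quoting and escaping.
parsed_fields = list(csv.reader(io.StringIO(raw)))[1]

print(len(naive_fields))   # the review is broken into extra fields
print(parsed_fields)       # review and sentiment stay in their own columns
```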

Select only the first 50 reviews for text processing

df = df.select("review").limit(50)

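Perform tokenization. Tokenizer lowercases the text and splits it on whitespace, while RegexTokenizer (used later for stopword removal) splits on the non-word pattern \W and so also strips punctuation.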

tokenizer = Tokenizer(inputCol = 'review', outputCol = 'words')
regex_tokenizer = RegexTokenizer(inputCol = 'review', outputCol='words', pattern = '\\W')
tokenized = tokenizer.transform(df)
tokenized.show(5)
+--------------------+--------------------+
|              review|               words|
+--------------------+--------------------+
|One of the other ...|[one, of, the, ot...|
|"A wonderful litt...|["a, wonderful, l...|
|"I thought this w...|["i, thought, thi...|
|Basically there's...|[basically, there...|
|"Petter Mattei's ...|["petter, mattei'...|
+--------------------+--------------------+
only showing top 5 rows
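
To see why the tokens above still carry quotation marks and apostrophes, here is a plain-Python approximation of the two tokenizers. It is only a sketch of their default behaviour (lowercasing, then splitting), not Spark's implementation, and the sample sentence is made up.

```python
import re

def whitespace_tokenize(text):
    # Rough equivalent of Tokenizer: lowercase, then split on whitespace.
    return re.split(r"\s", text.lower())

def regex_tokenize(text, pattern=r"\W"):
    # Rough equivalent of RegexTokenizer with default settings:
    # lowercase, split on the pattern, and drop empty tokens.
    return [tok for tok in re.split(pattern, text.lower()) if tok]

sample = "\"Petter Mattei's film\" is great!"
print(whitespace_tokenize(sample))  # punctuation stays attached to the words
print(regex_tokenize(sample))       # punctuation is treated as a separator
```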

Count the number of tokens for each row.

count_tokens = udf(lambda words: len(words), IntegerType())
token_df = tokenized.withColumn('tokens', count_tokens(col('words')))
token_df.show()
+--------------------+--------------------+------+
|              review|               words|tokens|
+--------------------+--------------------+------+
|One of the other ...|[one, of, the, ot...|   307|
|"A wonderful litt...|["a, wonderful, l...|    70|
|"I thought this w...|["i, thought, thi...|   130|
|Basically there's...|[basically, there...|   138|
|"Petter Mattei's ...|["petter, mattei'...|    37|
|"Probably my all-...|["probably, my, a...|    77|
|I sure would like...|[i, sure, would, ...|   150|
|This show was an ...|[this, show, was,...|   174|
|Encouraged by the...|[encouraged, by, ...|   130|
|If you like origi...|[if, you, like, o...|    33|
|"Phil the Alien i...|["phil, the, alie...|    96|
|I saw this movie ...|[i, saw, this, mo...|   180|
|"So im not a big ...|["so, im, not, a,...|   304|
|The cast played S...|[the, cast, playe...|   117|
|This a fantastic ...|[this, a, fantast...|    50|
|Kind of drawn in ...|[kind, of, drawn,...|   140|
|Some films just s...|[some, films, jus...|   146|
|This movie made i...|[this, movie, mad...|   228|
|I remember this f...|[i, remember, thi...|   129|
|An awful film! It...|[an, awful, film!...|   133|
+--------------------+--------------------+------+
only showing top 20 rows

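Remove the stopwords. Here we switch to the RegexTokenizer output, so punctuation is stripped before filtering.

regex_tokenized = regex_tokenizer.transform(df)
token_df = regex_tokenized.withColumn('token', count_tokens(col('words')))
stopwords_remover = StopWordsRemover(inputCol = 'words', outputCol = 'filtered')
token_df = stopwords_remover.transform(token_df)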
token_df.show()
+--------------------+--------------------+-----+--------------------+
|              review|               words|token|            filtered|
+--------------------+--------------------+-----+--------------------+
|One of the other ...|[one, of, the, ot...|  320|[one, reviewers, ...|
|"A wonderful litt...|[a, wonderful, li...|   72|[wonderful, littl...|
|"I thought this w...|[i, thought, this...|  135|[thought, wonderf...|
|Basically there's...|[basically, there...|  141|[basically, famil...|
|"Petter Mattei's ...|[petter, mattei, ...|   38|[petter, mattei, ...|
|"Probably my all-...|[probably, my, al...|   80|[probably, time, ...|
|I sure would like...|[i, sure, would, ...|  161|[sure, like, see,...|
|This show was an ...|[this, show, was,...|  181|[show, amazing, f...|
|Encouraged by the...|[encouraged, by, ...|  130|[encouraged, posi...|
|If you like origi...|[if, you, like, o...|   34|[like, original, ...|
|"Phil the Alien i...|[phil, the, alien...|  101|[phil, alien, one...|
|I saw this movie ...|[i, saw, this, mo...|  184|[saw, movie, 12, ...|
|"So im not a big ...|[so, im, not, a, ...|  313|[im, big, fan, bo...|
|The cast played S...|[the, cast, playe...|  122|[cast, played, sh...|
|This a fantastic ...|[this, a, fantast...|   51|[fantastic, movie...|
|Kind of drawn in ...|[kind, of, drawn,...|  143|[kind, drawn, ero...|
|Some films just s...|[some, films, jus...|  147|[films, simply, r...|
|This movie made i...|[this, movie, mad...|  240|[movie, made, one...|
|I remember this f...|[i, remember, thi...|  131|[remember, film, ...|
|An awful film! It...|[an, awful, film,...|  139|[awful, film, mus...|
+--------------------+--------------------+-----+--------------------+
only showing top 20 rows
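
Conceptually, stopword removal is a filter against a fixed word list. A minimal sketch follows, assuming a tiny hand-picked stopword set (Spark's default English list is much longer) and case-insensitive matching.

```python
# Hypothetical, deliberately small stopword set for illustration only.
STOPWORDS = {"i", "a", "this", "of", "the", "was", "is", "it"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Keep every token whose lowercase form is not in the stopword set.
    return [tok for tok in tokens if tok.lower() not in stopwords]

tokens = ["i", "thought", "this", "was", "a", "wonderful", "way"]
filtered = remove_stopwords(tokens)
print(filtered)
```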

Count the tokens that remain after stopword removal

token_df.withColumn('filtered_words', count_tokens(col('filtered'))).show()
+--------------------+--------------------+-----+--------------------+--------------+
|              review|               words|token|            filtered|filtered_words|
+--------------------+--------------------+-----+--------------------+--------------+
|One of the other ...|[one, of, the, ot...|  320|[one, reviewers, ...|           174|
|"A wonderful litt...|[a, wonderful, li...|   72|[wonderful, littl...|            41|
|"I thought this w...|[i, thought, this...|  135|[thought, wonderf...|            71|
|Basically there's...|[basically, there...|  141|[basically, famil...|            73|
|"Petter Mattei's ...|[petter, mattei, ...|   38|[petter, mattei, ...|            22|
|"Probably my all-...|[probably, my, al...|   80|[probably, time, ...|            42|
|I sure would like...|[i, sure, would, ...|  161|[sure, like, see,...|            70|
|This show was an ...|[this, show, was,...|  181|[show, amazing, f...|            86|
|Encouraged by the...|[encouraged, by, ...|  130|[encouraged, posi...|            67|
|If you like origi...|[if, you, like, o...|   34|[like, original, ...|            19|
|"Phil the Alien i...|[phil, the, alien...|  101|[phil, alien, one...|            58|
|I saw this movie ...|[i, saw, this, mo...|  184|[saw, movie, 12, ...|            89|
|"So im not a big ...|[so, im, not, a, ...|  313|[im, big, fan, bo...|           180|
|The cast played S...|[the, cast, playe...|  122|[cast, played, sh...|            59|
|This a fantastic ...|[this, a, fantast...|   51|[fantastic, movie...|            27|
|Kind of drawn in ...|[kind, of, drawn,...|  143|[kind, drawn, ero...|            71|
|Some films just s...|[some, films, jus...|  147|[films, simply, r...|            54|
|This movie made i...|[this, movie, mad...|  240|[movie, made, one...|           130|
|I remember this f...|[i, remember, thi...|  131|[remember, film, ...|            65|
|An awful film! It...|[an, awful, film,...|  139|[awful, film, mus...|            63|
+--------------------+--------------------+-----+--------------------+--------------+
only showing top 20 rows


Generate bigrams (n-grams with n = 2) from the filtered tokens

ngram = NGram(n = 2, inputCol = 'filtered', outputCol = 'grams')
ngram.transform(token_df).select('grams').show(truncate = False)
+------------------------------------------------------------------------------------------+
|grams                                                                                     |
+------------------------------------------------------------------------------------------+
|[one reviewers, reviewers mentioned, mentioned watching, watching 1, 1 oz, oz episode, episode ll, ll hooked, hooked right, right exactly, exactly happened, happened br, br br, br first, first thing, thing struck, struck oz, oz brutality, brutality unflinching, unflinching scenes, scenes violence, violence set, set right, right word, word go, go trust, trust show, show faint, faint hearted, hearted timid, timid show, show pulls, pulls punches, punches regards, regards drugs, drugs sex, sex violence, violence hardcore, hardcore classic, classic use, use word, word br, br br, br called, called oz, oz nickname, nickname given, given oswald, oswald maximum, maximum security, security state, state penitentary, penitentary focuses, focuses mainly, mainly emerald, emerald city, city experimental, experimental section, section prison, prison cells, cells glass, glass fronts, fronts face, face inwards, inwards privacy, privacy high, high agenda, agenda em, em city, city home, home many, many aryans, aryans, gangstas latinos, latinos christians, christians italians, italians irish, irish scuffles, scuffles death, death stares, stares dodgy, dodgy dealings, dealings shady, shady agreements, agreements never, never far, far away, away br, br br, br say, say main, main appeal, appeal show, show due, due fact, fact goes, goes shows, shows wouldn, wouldn dare, dare forget, forget pretty, pretty pictures, pictures painted, painted mainstream, mainstream audiences, audiences forget, forget charm, charm forget, forget romance, romance oz, oz doesn, doesn mess, mess around, around first, first episode, episode ever, ever saw, saw struck, struck nasty, nasty surreal, surreal couldn, couldn say, say ready, ready watched, watched developed, developed taste, taste oz, oz got, got accustomed, accustomed high, high levels, levels graphic, graphic violence, violence violence, violence injustice, injustice crooked, crooked guards, guards ll, ll sold, sold nickel, nickel inmates, inmates ll, ll kill, kill order, order get, get away, away well, well mannered, mannered middle, middle class, class inmates, inmates turned, turned prison, prison bitches, bitches due, due lack, lack street, street skills, skills prison, prison experience, experience watching, watching oz, oz may, may become, become comfortable, comfortable uncomfortable, uncomfortable viewing, viewing thats, thats get, get touch, touch darker, darker side]|

|[awful film, film must, must real, real stinkers, stinkers nominated, nominated golden, golden globe, globe ve, ve taken, taken story, story first, first famous, famous female, female renaissance, renaissance painter, painter mangled, mangled beyond, beyond recognition, recognition complaint, complaint ve, ve taken, taken liberties, liberties facts, facts story, story good, good perfectly, perfectly fine, fine simply, simply bizarre, bizarre accounts, accounts true, true story, story artist, artist made, made far, far better, better film, film come, come dishwater, dishwater dull, dull script, script suppose, suppose weren, weren enough, enough naked, naked people, people factual, factual version, version hurriedly, hurriedly capped, capped end, end summary, summary artist, artist life, life saved, saved couple, couple hours, hours d, d favored, favored rest, rest film, film brevity]|
+------------------------------------------------------------------------------------------+
only showing top 20 rows
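
Each bigram in the output above is simply a pair of neighbouring tokens joined by a space. A small plain-Python sketch of that sliding window is shown here; the function name is ours, not part of PySpark.

```python
def ngrams(tokens, n=2):
    # Slide a window of size n over the tokens and join each window with a space.
    # If there are fewer than n tokens, the result is an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["one", "reviewers", "mentioned", "watching"]
print(ngrams(tokens))        # bigrams
print(ngrams(tokens, n=3))   # trigrams
```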

Compute TF-IDF. First, HashingTF maps each token to an index in a fixed-size vector and counts the term frequencies.

hashing_tf = HashingTF(inputCol='filtered', outputCol='rawFeatures')
featurized_data = hashing_tf.transform(token_df)
featurized_data.show(5)
+--------------------+--------------------+-----+--------------------+--------------------+
|              review|               words|token|            filtered|         rawFeatures|
+--------------------+--------------------+-----+--------------------+--------------------+
|One of the other ...|[one, of, the, ot...|  320|[one, reviewers, ...|(262144,[2325,328...|
|"A wonderful litt...|[a, wonderful, li...|   72|[wonderful, littl...|(262144,[3928,655...|
|"I thought this w...|[i, thought, this...|  135|[thought, wonderf...|(262144,[1043,139...|
|Basically there's...|[basically, there...|  141|[basically, famil...|(262144,[6512,853...|
|"Petter Mattei's ...|[petter, mattei, ...|   38|[petter, mattei, ...|(262144,[4319,172...|
+--------------------+--------------------+-----+--------------------+--------------------+
only showing top 5 rows
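
The rawFeatures column holds sparse vectors: 262144 is the vector size (HashingTF's default number of features, 2^18), followed by the indices of the non-zero entries. The sketch below illustrates the hashing trick behind it. It uses zlib.crc32 as a stand-in hash function, which is an assumption made for illustration, so the indices will not match Spark's.

```python
import zlib
from collections import Counter

NUM_FEATURES = 262144  # same size as HashingTF's default (2**18)

def hashed_tf(tokens, num_features=NUM_FEATURES):
    # Map each token to a bucket index and count how often each bucket is hit.
    # crc32 is only a stand-in here; Spark uses a different hash function.
    counts = Counter(zlib.crc32(tok.encode("utf-8")) % num_features for tok in tokens)
    return dict(sorted(counts.items()))

vector = hashed_tf(["movie", "good", "movie"])
print(vector)  # {index: term frequency} for the non-zero buckets
```

Because the index comes from a hash, no vocabulary has to be stored, but two different words can occasionally land in the same bucket.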

Then, fit an IDF model on the term frequencies and use it to rescale them

idf = IDF(inputCol='rawFeatures', outputCol='features')
idf_model = idf.fit(featurized_data)
rescaled_data = idf_model.transform(featurized_data)
rescaled_data.select('features').show(5)
+--------------------+
|            features|
+--------------------+
|(262144,[2325,328...|
|(262144,[3928,655...|
|(262144,[1043,139...|
|(262144,[6512,853...|
|(262144,[4319,172...|
+--------------------+
only showing top 5 rows
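
The rescaling multiplies each term frequency by an inverse document frequency, which Spark's documentation gives as log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. A worked sketch on three made-up documents follows. It keys the weights by term rather than by hashed index, purely for readability.

```python
import math
from collections import Counter

def idf_weights(docs):
    # df counts the documents that contain each term at least once.
    m = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}

def tf_idf(doc, idf):
    # Raw term counts scaled by the IDF weight of each term.
    tf = Counter(doc)
    return {term: count * idf[term] for term, count in tf.items()}

docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "film"]]
idf = idf_weights(docs)
print(tf_idf(docs[2], idf))
```

Terms that occur in many documents, such as "movie" here, get a smaller weight than terms that occur in only one.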

Apply CountVectorizer. With minDF = 2.0, a term must appear in at least two documents to enter the vocabulary.

cv = CountVectorizer(inputCol ='filtered', outputCol = 'vectorized_features', minDF= 2.0)
model = cv.fit(featurized_data)
vectorized_features = model.transform(featurized_data)
vectorized_features.show(5)
+--------------------+--------------------+-----+--------------------+--------------------+--------------------+
|              review|               words|token|            filtered|         rawFeatures| vectorized_features|
+--------------------+--------------------+-----+--------------------+--------------------+--------------------+
|One of the other ...|[one, of, the, ot...|  320|[one, reviewers, ...|(262144,[2325,328...|(484,[0,3,8,13,14...|
|"A wonderful litt...|[a, wonderful, li...|   72|[wonderful, littl...|(262144,[3928,655...|(484,[0,6,7,19,23...|
|"I thought this w...|[i, thought, this...|  135|[thought, wonderf...|(262144,[1043,139...|(484,[0,3,5,7,13,...|
|Basically there's...|[basically, there...|  141|[basically, famil...|(262144,[6512,853...|(484,[0,1,2,4,6,7...|
|"Petter Mattei's ...|[petter, mattei, ...|   38|[petter, mattei, ...|(262144,[4319,172...|(484,[1,2,7,34,75...|
+--------------------+--------------------+-----+--------------------+--------------------+--------------------+
only showing top 5 rows
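
In the vectorized_features column, 484 is the size of the learned vocabulary, and the indices refer to positions in that vocabulary rather than to hash buckets. The following plain-Python sketch shows the idea on made-up documents. The helper names and the alphabetical tie-breaking are our own assumptions, not PySpark behaviour.

```python
from collections import Counter

def fit_vocabulary(docs, min_df=2):
    # Keep terms that occur in at least min_df documents,
    # ordered by total corpus frequency (ties broken alphabetically here).
    doc_freq = Counter(term for doc in docs for term in set(doc))
    corpus_freq = Counter(term for doc in docs for term in doc)
    kept = [term for term in corpus_freq if doc_freq[term] >= min_df]
    return sorted(kept, key=lambda term: (-corpus_freq[term], term))

def vectorize(doc, vocab):
    # Sparse count vector: {vocabulary index: count}; unknown terms are ignored.
    index = {term: i for i, term in enumerate(vocab)}
    counts = Counter(index[term] for term in doc if term in index)
    return dict(sorted(counts.items()))

docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "film"]]
vocab = fit_vocabulary(docs, min_df=2)
print(vocab)
print([vectorize(doc, vocab) for doc in docs])
```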

If you need an implementation of any of the topics mentioned above, or assignment help on any of their variants, feel free to contact us.
