This post introduces basic NLP text preprocessing with PySpark: tokenization, stopword removal, n-grams, TF-IDF, and count vectorization.
To begin, update the Ubuntu package index so that installing Java does not fail on stale package lists.
!sudo apt update
Next, install the Java 8 JDK, which Spark needs to run.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
Now, download and unpack Apache Spark (built for Hadoop 3), which PySpark runs on.
# INSTALL APACHE SPARK AND HADOOP
!wget -q https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
!tar xf spark-3.3.0-bin-hadoop3.tgz
Set the environment variables so that PySpark can find Java and Spark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.0-bin-hadoop3"
Use findspark to locate the Spark installation and make PySpark importable
!pip install -q findspark
import findspark
findspark.init()
Import the essential libraries
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import lit, array_remove
from pyspark.sql.functions import rand
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
from pyspark.ml.feature import NGram
from pyspark.ml.feature import HashingTF, IDF
from pyspark.ml.feature import CountVectorizer
import pandas as pd
import psutil
import matplotlib.pyplot as plt
Create the Spark session
from pyspark.sql import SparkSession
spark = (SparkSession
.builder
.appName("NAME_OF_THE_APP")
.getOrCreate())
Download the dataset from this link: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
Import the dataset
df = spark.read.csv('/content/IMDB Dataset.csv', sep=',', inferSchema=True, header=True)
df.show(5)
+--------------------+--------------------+
| review| sentiment|
+--------------------+--------------------+
|One of the other ...| positive|
|"A wonderful litt...| not only is it w...|
|"I thought this w...| but spirited you...|
|Basically there's...| negative|
|"Petter Mattei's ...| power and succes...|
+--------------------+--------------------+
only showing top 5 rows
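In the sample output above, some review text spills into the sentiment column for rows whose review starts with a quotation mark. A likely cause (an assumption, not verified against this file) is that the quoted reviews escape inner quotation marks by doubling them, whereas Spark's CSV reader uses a backslash as its default escape character. The sketch below uses Python's built-in csv module on a made-up two-row sample to show that quoting convention. In Spark, passing quote='"' and escape='"' to spark.read.csv may resolve the problem.

```python
import csv
import io

# Hypothetical sample in the same style as the dataset: the second review is
# wrapped in quotes and contains both a comma and doubled inner quotes.
sample = (
    'review,sentiment\n'
    'Plain review without commas,negative\n'
    '"A ""wonderful"" little film, well acted",positive\n'
)

# csv.reader treats a doubled quote inside a quoted field as one literal quote.
rows = list(csv.reader(io.StringIO(sample)))
for row in rows:
    print(row)
```

Each parsed row has exactly two fields, so the comma inside the quoted review no longer splits it.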
Select only the first 50 reviews for text processing
df = df.select("review").limit(50)
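Perform tokenization. Tokenizer lowercases the text and splits it on whitespace, while RegexTokenizer splits on a pattern (here '\\W', any non-word character).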
tokenizer = Tokenizer(inputCol = 'review', outputCol = 'words')
regex_tokenizer = RegexTokenizer(inputCol = 'review', outputCol='words', pattern = '\\W')
tokenized = tokenizer.transform(df)
tokenized.show(5)
+--------------------+--------------------+
| review| words|
+--------------------+--------------------+
|One of the other ...|[one, of, the, ot...|
|"A wonderful litt...|["a, wonderful, l...|
|"I thought this w...|["i, thought, thi...|
|Basically there's...|[basically, there...|
|"Petter Mattei's ...|["petter, mattei'...|
+--------------------+--------------------+
only showing top 5 rows
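The difference between the two tokenizers is easier to see outside Spark. The following is a minimal plain-Python sketch that approximates their behaviour. The helper names whitespace_tokenize and regex_tokenize and the sample string are made up for illustration, and the sketch assumes the default settings (lowercasing on, pattern treated as a delimiter).

```python
import re

def whitespace_tokenize(text):
    # Lowercase, then split on whitespace (roughly what Tokenizer does).
    return text.lower().split()

def regex_tokenize(text, pattern=r"\W"):
    # Lowercase, split wherever the pattern matches, and drop empty strings
    # (roughly what RegexTokenizer does when the pattern marks the gaps).
    parts = re.split(pattern, text.lower(), flags=re.ASCII)
    return [tok for tok in parts if tok]

sample = '"Petter Mattei\'s film, is great!"'
print(whitespace_tokenize(sample))  # punctuation stays attached to the words
print(regex_tokenize(sample))       # punctuation is stripped away
```

The whitespace version keeps tokens such as '"petter' and 'film,', which matches the leading quotation marks visible in the words column above.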
Count the number of tokens in each row.
count_tokens = udf(lambda words: len(words), IntegerType())
token_df = tokenized.withColumn('tokens', count_tokens(col('words')))
token_df.show()
+--------------------+--------------------+------+
| review| words|tokens|
+--------------------+--------------------+------+
|One of the other ...|[one, of, the, ot...| 307|
|"A wonderful litt...|["a, wonderful, l...| 70|
|"I thought this w...|["i, thought, thi...| 130|
|Basically there's...|[basically, there...| 138|
|"Petter Mattei's ...|["petter, mattei'...| 37|
|"Probably my all-...|["probably, my, a...| 77|
|I sure would like...|[i, sure, would, ...| 150|
|This show was an ...|[this, show, was,...| 174|
|Encouraged by the...|[encouraged, by, ...| 130|
|If you like origi...|[if, you, like, o...| 33|
|"Phil the Alien i...|["phil, the, alie...| 96|
|I saw this movie ...|[i, saw, this, mo...| 180|
|"So im not a big ...|["so, im, not, a,...| 304|
|The cast played S...|[the, cast, playe...| 117|
|This a fantastic ...|[this, a, fantast...| 50|
|Kind of drawn in ...|[kind, of, drawn,...| 140|
|Some films just s...|[some, films, jus...| 146|
|This movie made i...|[this, movie, mad...| 228|
|I remember this f...|[i, remember, thi...| 129|
|An awful film! It...|[an, awful, film!...| 133|
+--------------------+--------------------+------+
only showing top 20 rows
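Switch to the regex tokenizer, which strips punctuation, recount the tokens, and remove the stopwords
regex_tokenized = regex_tokenizer.transform(df)
token_df = regex_tokenized.withColumn('token', count_tokens(col('words')))
stopwords_remover = StopWordsRemover(inputCol = 'words', outputCol = 'filtered')
token_df = stopwords_remover.transform(token_df)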
token_df.show()
+--------------------+--------------------+-----+--------------------+
| review| words|token| filtered|
+--------------------+--------------------+-----+--------------------+
|One of the other ...|[one, of, the, ot...| 320|[one, reviewers, ...|
|"A wonderful litt...|[a, wonderful, li...| 72|[wonderful, littl...|
|"I thought this w...|[i, thought, this...| 135|[thought, wonderf...|
|Basically there's...|[basically, there...| 141|[basically, famil...|
|"Petter Mattei's ...|[petter, mattei, ...| 38|[petter, mattei, ...|
|"Probably my all-...|[probably, my, al...| 80|[probably, time, ...|
|I sure would like...|[i, sure, would, ...| 161|[sure, like, see,...|
|This show was an ...|[this, show, was,...| 181|[show, amazing, f...|
|Encouraged by the...|[encouraged, by, ...| 130|[encouraged, posi...|
|If you like origi...|[if, you, like, o...| 34|[like, original, ...|
|"Phil the Alien i...|[phil, the, alien...| 101|[phil, alien, one...|
|I saw this movie ...|[i, saw, this, mo...| 184|[saw, movie, 12, ...|
|"So im not a big ...|[so, im, not, a, ...| 313|[im, big, fan, bo...|
|The cast played S...|[the, cast, playe...| 122|[cast, played, sh...|
|This a fantastic ...|[this, a, fantast...| 51|[fantastic, movie...|
|Kind of drawn in ...|[kind, of, drawn,...| 143|[kind, drawn, ero...|
|Some films just s...|[some, films, jus...| 147|[films, simply, r...|
|This movie made i...|[this, movie, mad...| 240|[movie, made, one...|
|I remember this f...|[i, remember, thi...| 131|[remember, film, ...|
|An awful film! It...|[an, awful, film,...| 139|[awful, film, mus...|
+--------------------+--------------------+-----+--------------------+
only showing top 20 rows
Count the tokens that remain after stopword removal
token_df.withColumn('filtered_words',count_tokens(col('filtered'))).show()
+--------------------+--------------------+-----+--------------------+--------------+
| review| words|token| filtered|filtered_words|
+--------------------+--------------------+-----+--------------------+--------------+
|One of the other ...|[one, of, the, ot...| 320|[one, reviewers, ...| 174|
|"A wonderful litt...|[a, wonderful, li...| 72|[wonderful, littl...| 41|
|"I thought this w...|[i, thought, this...| 135|[thought, wonderf...| 71|
|Basically there's...|[basically, there...| 141|[basically, famil...| 73|
|"Petter Mattei's ...|[petter, mattei, ...| 38|[petter, mattei, ...| 22|
|"Probably my all-...|[probably, my, al...| 80|[probably, time, ...| 42|
|I sure would like...|[i, sure, would, ...| 161|[sure, like, see,...| 70|
|This show was an ...|[this, show, was,...| 181|[show, amazing, f...| 86|
|Encouraged by the...|[encouraged, by, ...| 130|[encouraged, posi...| 67|
|If you like origi...|[if, you, like, o...| 34|[like, original, ...| 19|
|"Phil the Alien i...|[phil, the, alien...| 101|[phil, alien, one...| 58|
|I saw this movie ...|[i, saw, this, mo...| 184|[saw, movie, 12, ...| 89|
|"So im not a big ...|[so, im, not, a, ...| 313|[im, big, fan, bo...| 180|
|The cast played S...|[the, cast, playe...| 122|[cast, played, sh...| 59|
|This a fantastic ...|[this, a, fantast...| 51|[fantastic, movie...| 27|
|Kind of drawn in ...|[kind, of, drawn,...| 143|[kind, drawn, ero...| 71|
|Some films just s...|[some, films, jus...| 147|[films, simply, r...| 54|
|This movie made i...|[this, movie, mad...| 240|[movie, made, one...| 130|
|I remember this f...|[i, remember, thi...| 131|[remember, film, ...| 65|
|An awful film! It...|[an, awful, film,...| 139|[awful, film, mus...| 63|
+--------------------+--------------------+-----+--------------------+--------------+
only showing top 20 rows
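As a rough illustration of what stopword removal does to the token counts, here is a plain-Python sketch. The STOPWORDS set is a tiny made-up subset rather than the list StopWordsRemover ships with, and remove_stopwords is a hypothetical helper.

```python
# Tiny illustrative stopword list (an assumption; the real default list is longer).
STOPWORDS = {"a", "i", "of", "the", "this", "was"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Keep tokens whose lowercase form is not in the stopword set.
    return [tok for tok in tokens if tok.lower() not in stopwords]

tokens = ["i", "thought", "this", "was", "a", "wonderful", "way"]
filtered = remove_stopwords(tokens)
print(filtered)                    # ['thought', 'wonderful', 'way']
print(len(tokens), len(filtered))  # 7 3
```

The drop from 7 to 3 tokens mirrors the gap between the token and filtered_words columns in the table above.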
Generate n-grams (here bigrams, n = 2) from the filtered tokens
ngram = NGram(n = 2, inputCol = 'filtered', outputCol = 'grams')
ngram.transform(token_df).select('grams').show(truncate = False)
+------------------------------------------------------------------------------------------+
|grams |
+------------------------------------------------------------------------------------------+
|[one reviewers, reviewers mentioned, mentioned watching, watching 1, 1 oz, oz episode, episode ll, ll hooked, hooked right, right exactly, exactly happened, happened br, br br, br first, first thing, thing struck, struck oz, oz brutality, brutality unflinching, unflinching scenes, scenes violence, violence set, set right, right word, word go, go trust, trust show, show faint, faint hearted, hearted timid, timid show, show pulls, pulls punches, punches regards, regards drugs, drugs sex, sex violence, violence hardcore, hardcore classic, classic use, use word, word br, br br, br called, called oz, oz nickname, nickname given, given oswald, oswald maximum, maximum security, security state, state penitentary, penitentary focuses, focuses mainly, mainly emerald, emerald city, city experimental, experimental section, section prison, prison cells, cells glass, glass fronts, fronts face, face inwards, inwards privacy, privacy high, high agenda, agenda em, em city, city home, home many, many aryans, aryans, gangstas latinos, latinos christians, christians italians, italians irish, irish scuffles, scuffles death, death stares, stares dodgy, dodgy dealings, dealings shady, shady agreements, agreements never, never far, far away, away br, br br, br say, say main, main appeal, appeal show, show due, due fact, fact goes, goes shows, shows wouldn, wouldn dare, dare forget, forget pretty, pretty pictures, pictures painted, painted mainstream, mainstream audiences, audiences forget, forget charm, charm forget, forget romance, romance oz, oz doesn, doesn mess, mess around, around first, first episode, episode ever, ever saw, saw struck, struck nasty, nasty surreal, surreal couldn, couldn say, say ready, ready watched, watched developed, developed taste, taste oz, oz got, got accustomed, accustomed high, high levels, levels graphic, graphic violence, violence violence, violence injustice, injustice crooked, crooked guards, guards ll, ll sold, sold nickel, nickel inmates, inmates ll, 
ll kill, kill order, order get, get away, away well, well mannered, mannered middle, middle class, class inmates, inmates turned, turned prison, prison bitches, bitches due, due lack, lack street, street skills, skills prison, prison experience, experience watching, watching oz, oz may, may become, become comfortable, comfortable uncomfortable, uncomfortable viewing, viewing thats, thats get, get touch, touch darker, darker side]|
|[awful film, film must, must real, real stinkers, stinkers nominated, nominated golden, golden globe, globe ve, ve taken, taken story, story first, first famous, famous female, female renaissance, renaissance painter, painter mangled, mangled beyond, beyond recognition, recognition complaint, complaint ve, ve taken, taken liberties, liberties facts, facts story, story good, good perfectly, perfectly fine, fine simply, simply bizarre, bizarre accounts, accounts true, true story, story artist, artist made, made far, far better, better film, film come, come dishwater, dishwater dull, dull script, script suppose, suppose weren, weren enough, enough naked, naked people, people factual, factual version, version hurriedly, hurriedly capped, capped end, end summary, summary artist, artist life, life saved, saved couple, couple hours, hours d, d favored, favored rest, rest film, film brevity] |
+------------------------------------------------------------------------------------------+
only showing top 20 rows
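The bigrams above are simply every pair of adjacent tokens joined by a space. A minimal plain-Python sketch of the same sliding-window idea follows; the ngrams helper is hypothetical, and the tokens are taken from the start of the last row shown above.

```python
def ngrams(tokens, n=2):
    # Slide a window of size n over the tokens and join each window with a space.
    # Fewer than n tokens yields an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["awful", "film", "must", "real", "stinkers"]
print(ngrams(tokens))
# ['awful film', 'film must', 'must real', 'real stinkers']
```

A list of k tokens produces k - n + 1 n-grams, so 5 tokens give 4 bigrams.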
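Compute TF-IDF. HashingTF turns each list of tokens into a term-frequency vector, and IDF then down-weights terms that appear in many documents.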
hashing_tf = HashingTF(inputCol='filtered', outputCol='rawFeatures')
featurized_data = hashing_tf.transform(token_df)
featurized_data.show(5)
+--------------------+--------------------+-----+--------------------+--------------------+
| review| words|token| filtered| rawFeatures|
+--------------------+--------------------+-----+--------------------+--------------------+
|One of the other ...|[one, of, the, ot...| 320|[one, reviewers, ...|(262144,[2325,328...|
|"A wonderful litt...|[a, wonderful, li...| 72|[wonderful, littl...|(262144,[3928,655...|
|"I thought this w...|[i, thought, this...| 135|[thought, wonderf...|(262144,[1043,139...|
|Basically there's...|[basically, there...| 141|[basically, famil...|(262144,[6512,853...|
|"Petter Mattei's ...|[petter, mattei, ...| 38|[petter, mattei, ...|(262144,[4319,172...|
+--------------------+--------------------+-----+--------------------+--------------------+
only showing top 5 rows
idf = IDF(inputCol='rawFeatures', outputCol='features')
idf_model = idf.fit(featurized_data)
rescaled_data = idf_model.transform(featurized_data)
rescaled_data.select('features').show(5)
+--------------------+
| features|
+--------------------+
|(262144,[2325,328...|
|(262144,[3928,655...|
|(262144,[1043,139...|
|(262144,[6512,853...|
|(262144,[4319,172...|
+--------------------+
only showing top 5 rows
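To make the numbers behind the features column less opaque, here is a small plain-Python sketch of the weighting. It assumes the smoothed formula given in the Spark MLlib documentation, idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. Unlike HashingTF, it keys terms by name in a dictionary instead of hashing them into 262144 buckets, and the three-document corpus is made up.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-filtered tokens.
docs = [
    ["awful", "film", "film"],
    ["wonderful", "film"],
    ["wonderful", "little", "production"],
]

m = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))
# Smoothed inverse document frequency (natural logarithm).
idf = {term: math.log((m + 1) / (count + 1)) for term, count in doc_freq.items()}

def tf_idf(doc):
    # Raw term count in the document multiplied by the term's IDF weight.
    tf = Counter(doc)
    return {term: tf[term] * idf[term] for term in tf}

for doc in docs:
    print(tf_idf(doc))
```

Here "film" appears in two of the three documents, so its weight log(4/3) is lower than that of "awful", which appears in only one and gets log(4/2).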
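Apply CountVectorizer, which learns a vocabulary from the data and counts each term per document. With minDF = 2.0, a term must appear in at least two documents to be kept.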
cv = CountVectorizer(inputCol ='filtered', outputCol = 'vectorized_features', minDF= 2.0)
model = cv.fit(featurized_data)
vectorized_features = model.transform(featurized_data)
vectorized_features.show(5)
+--------------------+--------------------+-----+--------------------+--------------------+--------------------+
| review| words|token| filtered| rawFeatures| vectorized_features|
+--------------------+--------------------+-----+--------------------+--------------------+--------------------+
|One of the other ...|[one, of, the, ot...| 320|[one, reviewers, ...|(262144,[2325,328...|(484,[0,3,8,13,14...|
|"A wonderful litt...|[a, wonderful, li...| 72|[wonderful, littl...|(262144,[3928,655...|(484,[0,6,7,19,23...|
|"I thought this w...|[i, thought, this...| 135|[thought, wonderf...|(262144,[1043,139...|(484,[0,3,5,7,13,...|
|Basically there's...|[basically, there...| 141|[basically, famil...|(262144,[6512,853...|(484,[0,1,2,4,6,7...|
|"Petter Mattei's ...|[petter, mattei, ...| 38|[petter, mattei, ...|(262144,[4319,172...|(484,[1,2,7,34,75...|
+--------------------+--------------------+-----+--------------------+--------------------+--------------------+
only showing top 5 rows
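In the output above, the leading 484 in each vectorized_features entry is the size of the learned vocabulary after the minDF filter. The plain-Python sketch below illustrates the idea on a made-up corpus: keep terms that occur in at least min_df documents, order them by overall frequency, then count each one per document. The helper names are hypothetical, and breaking ties alphabetically is a choice made for this sketch only.

```python
from collections import Counter

def build_vocabulary(docs, min_df=2):
    # Number of documents containing each term, and total count of each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    term_freq = Counter(term for doc in docs for term in doc)
    kept = [term for term in term_freq if doc_freq[term] >= min_df]
    # Most frequent terms first; ties broken alphabetically (sketch-only choice).
    return sorted(kept, key=lambda term: (-term_freq[term], term))

def vectorize(doc, vocabulary):
    # Dense count vector aligned with the vocabulary order.
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

docs = [
    ["awful", "film", "film"],
    ["wonderful", "film"],
    ["wonderful", "little", "production"],
]
vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]
print(vocab)    # ['film', 'wonderful']
print(vectors)  # [[2, 0], [1, 1], [0, 1]]
```

Spark stores the same information as sparse vectors, which is why each entry shows the vocabulary size followed by the indices of the terms that are present.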