Introduction
Apache Spark is an open-source, distributed processing engine that lets users process large datasets across a cluster of machines. PySpark is the Python interface to Apache Spark, which itself is written in Scala.
On large datasets, PySpark can perform operations much faster than Pandas because it executes work in parallel across all the cores of multiple machines, whereas Pandas runs on a single machine.
In this blog, you will learn how to install PySpark in a Colab notebook and how to perform some basic operations with it.
Configure the Environment for PySpark
To begin with, we need to install the Java JDK, because Spark is written in Scala, which runs on the Java Virtual Machine. But even before that, we need to update Ubuntu's package catalog so that we can download the Java JDK on Colab without issues.
To update the packages, we will use the sudo apt update command.
!sudo apt update
Now, we are ready to download the Java JDK.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
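As an optional sanity check (not part of the original steps), you can list the installed JVMs; you should see a java-8-openjdk-amd64 directory, which is the path we will point JAVA_HOME to in a later step.
!ls /usr/lib/jvm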
After that, we will download Apache Spark prebuilt for Hadoop. Here we are using Apache Spark 3.2.1 built for Hadoop 2.7.
You can choose the versions you want to work with. You can download the latest release of the software through this link: Download Spark
!wget -q https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz
We will now extract the compressed 'spark-3.2.1-bin-hadoop2.7.tgz' file that we downloaded with the previous command.
!tar xf spark-3.2.1-bin-hadoop2.7.tgz
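If you want to confirm that the archive was extracted, you can list the resulting directory; the folder name should match the version you downloaded.
!ls /content/spark-3.2.1-bin-hadoop2.7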
Now it's time to configure the environment variables for Java JDK and PySpark.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop2.7"
Then we will install the SQLite JDBC driver so that we can use a SQLite database in our program. If you want to use the latest JDBC driver, you can download it using the following link: SQLite JDBC Driver
!wget -q https://repo1.maven.org/maven2/org/xerial/sqlite-jdbc/3.36.0.3/sqlite-jdbc-3.36.0.3.jar -P /content/spark-3.2.1-bin-hadoop2.7/jars
Now, we will install findspark, which locates the PySpark installation so that we can import it in the program.
!pip install -q findspark
import findspark
findspark.init()
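If findspark has located the installation correctly, PySpark can now be imported. As an optional check, you can print its version, which should match the release downloaded above.
import pyspark
print(pyspark.__version__)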
Now, it's time to import the essential PySpark libraries.
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import lit, array_remove
from pyspark.sql import SparkSession
spark = (SparkSession
.builder
.appName("ADVERTISEMENTS")
.config("/content/spark-3.2.1-bin-hadoop2.7/jars", "sqlite-jdbc-3.36.0.3.jar")
.getOrCreate())
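If the session was created successfully, you can confirm the Spark version and the application name with the following optional check.
print(spark.version)
print(spark.sparkContext.appName)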
Data Analysis
First, we will import the essential libraries.
import sqlite3
import pandas as pd
import psutil
It's time to import the dataset.
social_network_pd = pd.read_csv("/content/Social_Network_Ads.csv")
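Before writing the data to SQLite, it can be helpful to take a quick look at it with Pandas. This optional check prints the number of rows and columns and displays the first few records.
print(social_network_pd.shape)
social_network_pd.head()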
We create a database named information.db and, at the same time, establish a connection to it.
After that, we upload the data that was loaded through Pandas into information.db; later, this database will be analysed using Spark.
try:
    # Creating the database and connecting to it
    connection = sqlite3.connect("information.db")
    # Loading the dataset into the SQL database
    social_network_pd.to_sql("social_network_pd", connection, if_exists='replace', index=False)
    # Connecting PySpark with the SQLite database
    social_network_df = spark.read.format("jdbc")\
        .option("url", "jdbc:sqlite:information.db")\
        .option("dbtable", "social_network_pd")\
        .option("driver", "org.sqlite.JDBC")\
        .option("user", "root")\
        .option("password", "passkey")\
        .load()
    # Creating a view instance of the table
    social_network_df.createOrReplaceTempView("social_network_df_view")
except Exception:
    print("Error: error raised in the try block. Make sure that the name of the "
          "database file is correct and that the SQLite JDBC driver is available.")
To get information about the schema, we will use the .printSchema() method.
social_network_df.printSchema()
root
|-- User ID: integer (nullable = true)
|-- Gender: string (nullable = true)
|-- Age: integer (nullable = true)
|-- EstimatedSalary: integer (nullable = true)
|-- Purchased: integer (nullable = true)
Now, it's time to get some insights into the data. To get the top five rows of the dataset, we will use the .show(number_of_rows) method, which is the equivalent of the .head() function in Pandas.
social_network_df.show(5)
+--------+------+---+---------------+---------+
| User ID|Gender|Age|EstimatedSalary|Purchased|
+--------+------+---+---------------+---------+
|15624510| Male| 19| 19000| 0|
|15810944| Male| 35| 20000| 0|
|15668575|Female| 26| 43000| 0|
|15603246|Female| 27| 57000| 0|
|15804002| Male| 19| 76000| 0|
+--------+------+---+---------------+---------+
only showing top 5 rows
To get the total number of rows, we will use the .count() method.
social_network_df.count()
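The DataFrame API also supports simple aggregations. As an illustrative sketch that goes slightly beyond the original walkthrough, the following groups the data by Gender and Purchased and counts the rows in each group.
social_network_df.groupBy("Gender", "Purchased").count().show()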
Now, we can also query the temporary view that we created earlier in this notebook.
top_ten_rows = spark.sql("""
    SELECT *
    FROM social_network_df_view
    LIMIT 10""")
Now, we can view the first ten rows of the dataset.
top_ten_rows.show()
+--------+------+---+---------------+---------+
| User ID|Gender|Age|EstimatedSalary|Purchased|
+--------+------+---+---------------+---------+
|15624510| Male| 19| 19000| 0|
|15810944| Male| 35| 20000| 0|
|15668575|Female| 26| 43000| 0|
|15603246|Female| 27| 57000| 0|
|15804002| Male| 19| 76000| 0|
|15728773| Male| 27| 58000| 0|
|15598044|Female| 27| 84000| 0|
|15694829|Female| 32| 150000| 1|
|15600575| Male| 25| 33000| 0|
|15727311|Female| 35| 65000| 0|
+--------+------+---+---------------+---------+
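For comparison, the same result can be obtained without SQL through the DataFrame API. The sketch below uses .limit() as an equivalent of the LIMIT clause and also demonstrates a simple filter with the lit() function imported earlier; the threshold values are arbitrary examples.
# Equivalent of the SQL query above, using the DataFrame API
social_network_df.limit(10).show()
# Example filter: customers older than 30 who made a purchase
social_network_df.filter((social_network_df.Age > 30) & (social_network_df.Purchased == lit(1))).show(5)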
If you need an implementation of any of the topics mentioned above, or assignment help on any of their variants, feel free to contact us.