Prerequisites:
You must have Python 3.7 or later installed on your system.
You must have Hadoop and PySpark installed on your system.
You need Spyder or Jupyter Notebook on your system. Both ship with Anaconda, so you only need to launch them after installing Anaconda.
If you work in Google Colab, you do not need to install Python or an IDE. Sign in to Google Colab and install PySpark by running “!pip install pyspark”.
Skills required:
Python programming language
Basic statistical analysis skills
Machine learning concepts
What you’ll learn:
How to read data into a PySpark DataFrame
How to perform basic exploratory data analysis using PySpark
How to calculate the silhouette score
How to create clusters by applying the k-means algorithm using PySpark
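To make the silhouette score concrete, here is a minimal sketch in plain Python, independent of PySpark. For each point it compares a, the mean distance to the other members of its own cluster, with b, the smallest mean distance to any other cluster, and scores the point as (b - a) / max(a, b). The function name silhouette_score and the toy points are illustrative assumptions and are not taken from the project. This sketch uses plain Euclidean distance, whereas Spark's ClusteringEvaluator defaults to squared Euclidean distance, so the two values will not match exactly.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (illustrative helper)."""
    # Group the points by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so divide by len - 1).
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data (made up for illustration): two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # about 0.90
bad = silhouette_score(points, [0, 1, 0, 1])   # negative: groups are mixed
```

A score close to 1 means the clusters are compact and well separated, a score near 0 means they overlap, and a negative score suggests that points were assigned to the wrong cluster.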
Problem Statement or Description:
This project shows how to cluster data by applying the k-means algorithm with PySpark to an animal milk dataset. The animals are grouped by the composition of their milk: how much protein, fat, lactose, ash, and water it contains.
Key highlights or essence of the project:
This project is about clustering analysis.
This project shows you how to read the data and perform basic exploratory data analysis using PySpark.
This project shows you how to perform data preprocessing.
This project shows you how to cluster unlabeled data.
Packages and modules used:
PySpark
VectorAssembler
KMeans
ClusteringEvaluator
Matplotlib
StandardScaler
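The modules listed above typically fit together as shown in the following sketch. It is one possible arrangement under stated assumptions, not the project's actual code: the helper name cluster_milk, the CSV path argument, the column names in FEATURE_COLS, and the defaults k=3 and seed=1 are all hypothetical. The column names are guessed from the milk features named in the problem statement, so adjust them to the real header of your file.

```python
# Assumed column names; the real dataset header may differ.
FEATURE_COLS = ["water", "protein", "fat", "lactose", "ash"]


def cluster_milk(csv_path, k=3, seed=1):
    """Hypothetical helper: cluster animals by milk composition.

    Returns the DataFrame with a "prediction" column and the silhouette score.
    """
    # Imported here so the module can be loaded even where PySpark
    # is not installed; the imports run only when the function is called.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler, StandardScaler
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.evaluation import ClusteringEvaluator

    spark = SparkSession.builder.appName("milk-clustering").getOrCreate()

    # Read the CSV into a DataFrame, letting Spark infer numeric types.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Pack the numeric columns into the single vector column Spark ML expects.
    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="features")
    assembled = assembler.transform(df)

    # Standardize each feature so that water, which has the largest values,
    # does not dominate the distance computation.
    scaler = StandardScaler(
        inputCol="features", outputCol="scaled", withMean=True, withStd=True
    )
    scaled = scaler.fit(assembled).transform(assembled)

    # Fit k-means on the scaled features and assign each row to a cluster.
    model = KMeans(featuresCol="scaled", k=k, seed=seed).fit(scaled)
    predictions = model.transform(scaled)

    # ClusteringEvaluator computes the silhouette score by default.
    score = ClusteringEvaluator(featuresCol="scaled").evaluate(predictions)
    return predictions, score
```

A common way to choose k is to call the helper for several values, for example 2 through 10, and keep the value with the highest silhouette score. Matplotlib can then be used to plot the score against k.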
Recommended projects:
Mall customer analysis
Online retail customer analysis
Credit card analysis
Wine data analysis
Customer personality analysis using clustering
Skills:
Clustering, PySpark, K-means