Data Analysis Assignment Help with Spark In Python

Codersarts is a top-rated website for students looking for online help with Data Analytics assignments, homework and coursework in Apache Spark, PySpark, MLlib, Tweepy and other libraries and tools. We help students at all levels, whether it is school, college or university coursework or a real-time project. Hire us and get your projects done by Data Analytics experts.

Two common kinds of data analytics over social media data are:

Machine Learning Algorithms: Apply classification to Tweets
Real-time analysis of Tweets: Spark Streaming library

Data Analysis

Recommendation Models:

Content-based filtering
Collaborative filtering
Matrix factorization
Alternating least squares

Classification Models:

Linear models
Logistic regression
Support vector machines (SVM)
Decision trees
Naïve Bayes

Clustering Models:

k-means clustering
Hierarchical clustering
Kohonen networks (self-organizing maps)

Classification

Classification is a form of supervised learning where we train a model with labeled training examples.

Can be used for:

Predicting the probability of Internet users clicking on an online advert; here, the classes are binary in nature (that is, click or no click)
Classifying images, video or sounds
Assigning categories or tags to news articles, web pages, tweets (multiclass)
Discovering e-mail and web spam (binary)
Ranking customers or users in order of probability that they might purchase a product or use a service
Predicting customers or users who might stop using a product, service or provider (called churn)
And other cases
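
As a small illustration of assigning tags to tweets, here is a minimal sketch of a Naïve Bayes text classifier in plain Python. It is not Spark or MLlib code; the function names (train_naive_bayes, predict) and the four training tweets are made up for this example, and a real project would train on far more data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count labels and per-label words from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        # Start from the log prior of the class.
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            smoothed = word_counts[label][word] + 1
            score += math.log(smoothed / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, hand-written training tweets (illustration only).
TRAINING_TWEETS = [
    ("i love this phone", "positive"),
    ("what a great day", "positive"),
    ("i hate waiting in line", "negative"),
    ("this service is terrible", "negative"),
]

model = train_naive_bayes(TRAINING_TWEETS)
prediction = predict(model, "i love this great day")
```

The same two steps (train on labeled examples, then predict a class for new text) apply to the other classification models listed earlier; only the way the model scores each class changes.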

Clustering

Clustering is a form of unsupervised learning where each training example is assigned to a segment called a cluster.

Can be used for:

Segmenting users or customers into different groups based on behavior characteristics and metadata
Grouping content on a website or products in a retail business
Segmenting communities in social media networks
Topic clustering of Tweets

K-means clustering approach

Clustering is the process of grouping a set of objects into classes of similar objects:

Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.

In principle, the optimal partition is achieved by minimising the sum of squared distances from each object to the “representative object” of its cluster (in k-means, the cluster mean, or centroid).
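
The k-means idea can be sketched in a few lines of plain Python. This is only an illustration of the usual iterative procedure (assign each point to its nearest centroid, then move each centroid to the mean of its points), not Spark MLlib code. The names k_means and within_cluster_sse, the seed, and the sample points are assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def assign(points, centroids):
    """Put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for point in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: squared_distance(point, centroids[i]),
        )
        clusters[nearest].append(point)
    return clusters


def k_means(points, k, iterations=20, seed=0):
    """Alternate assignment and centroid updates until nothing changes."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k random data points
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


def within_cluster_sse(centroids, clusters):
    """Sum of squared distances of points to their cluster centroid."""
    return sum(
        squared_distance(p, c) for c, cluster in zip(centroids, clusters) for p in cluster
    )


# Hypothetical 2-D points forming two obvious groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centroids, clusters = k_means(POINTS, k=2)
sse = within_cluster_sse(centroids, clusters)
```

The value returned by within_cluster_sse is the quantity that the procedure tries to make small. Note that k-means only finds a local minimum, so the result can depend on the starting centroids.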

Historical Data Analysis with MLlib (MLDataAnalysis.scala):

Data Representation
Clustering tweets by text
Classification of tweets by sentiment (negative, positive, etc.)
Result visualization in Zeppelin
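
Before tweets can be clustered or classified, their text has to be turned into numbers. The source does not say which representation is used, so the following is only an assumed example: a hashed term-frequency vector, which follows the same general idea as MLlib's HashingTF but is written in plain Python with CRC32 as the hash. The function name and the vector size of 16 are made up for this sketch.

```python
import zlib


def hashed_term_frequencies(text, num_features=16):
    """Map a tweet to a fixed-length vector of word counts.

    Each word is hashed to one of num_features slots, so no vocabulary
    has to be stored. Different words can collide in the same slot.
    """
    vector = [0] * num_features
    for token in text.lower().split():
        # CRC32 is stable across runs, unlike Python's built-in hash().
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


# Hypothetical tweet text (illustration only).
vector = hashed_term_frequencies("Spark makes tweet clustering easy")
```

Vectors built this way all have the same length, so they can be passed to a distance-based method such as k-means or used as features for a classifier.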

Streaming Data Analysis (CollectingTweetsToFile.scala, CollectingTweetsToMongoDB.scala, witterStreamingAnalyzer.scala):

Stream tweets to a JSON file
Stream tweets to MongoDB
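
As a rough sketch of the "tweets to a JSON file" step, the plain Python example below appends each incoming tweet as one JSON object per line. It is not the Scala code from the files named above. The function append_tweets_as_json_lines, the fake_tweet_stream generator and the tweet fields are hypothetical stand-ins for a real Twitter stream.

```python
import json
import os
import tempfile


def append_tweets_as_json_lines(tweets, path):
    """Append each tweet (a dict) to the file as one JSON object per line."""
    written = 0
    with open(path, "a", encoding="utf-8") as handle:
        for tweet in tweets:
            handle.write(json.dumps(tweet, ensure_ascii=False) + "\n")
            written += 1
    return written


def fake_tweet_stream():
    """Hypothetical stand-in for a live stream of tweets."""
    yield {"id": 1, "text": "Spark Streaming is fun"}
    yield {"id": 2, "text": "Clustering tweets with k-means"}


# Write to a temporary folder so the example leaves no files behind in
# the working directory.
output_path = os.path.join(tempfile.mkdtemp(), "tweets.jsonl")
count = append_tweets_as_json_lines(fake_tweet_stream(), output_path)

# Read the file back to check that every line is valid JSON.
with open(output_path, encoding="utf-8") as handle:
    stored = [json.loads(line) for line in handle]
```

Writing one object per line means the file can be appended to while the stream is running and read back later line by line for historical analysis.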