What is Data Analysis?
Let's look at the data analytics pipeline:
Step 1: Data Acquisition
Step 2: Data Preparation
Step 3: Data Representation
Step 4: Data Analytics
Step 5: Results Interpretation
In this post, we are going to do data analysis on Twitter data.
What kind of analytics can be done on Twitter data?
Counting. Real-time counting analytics such as how many requests per day, how many sign-ups, how many times a certain word appears, etc. (see the counting sketch after this list).
Correlation. Near-real-time analytics such as desktop vs. mobile users, which devices fail at the same time, etc.
Monitoring. Tracking customer opinions about a brand.
Research. More in-depth analytics that run in batch mode on the historical data such as what features get re-tweeted, detecting sentiments, etc.
Network Analysis: ego network analysis, monitoring follower growth, community analysis
Sentiment Analysis: tracking events, hot topics and trends; monitoring customer opinions about a product
Other: tweet engagement (most popular tweets)
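To make the counting idea concrete, here is a minimal PySpark sketch that counts how many tweets mention a given word and how many times each word appears. The input file tweets.txt (one tweet per line) and the word "spark" are hypothetical.

```python
# Minimal counting sketch in PySpark; "tweets.txt" (one tweet per line)
# and the word "spark" are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tweet-counting").getOrCreate()
tweets = spark.sparkContext.textFile("tweets.txt")

# How many tweets mention a given word (case-insensitive)
mentions = tweets.filter(lambda t: "spark" in t.lower()).count()
print("tweets mentioning 'spark':", mentions)

# How many times each word appears (classic word count)
word_counts = (tweets.flatMap(lambda t: t.lower().split())
                     .map(lambda w: (w, 1))
                     .reduceByKey(lambda a, b: a + b))
print(word_counts.take(10))
```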
What is Spark?
Apache Spark is a framework for developing distributed computing applications.
It started as a research project at the University of California, Berkeley.
Features of Apache Spark
Speed (in-memory computations)
Supports multiple languages (Java, Scala, Python, R)
Advanced analytics (SQL queries, streaming data, machine learning and graph algorithms)
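Here is a minimal sketch of starting a Spark application from Python (assuming PySpark 2.x); the SparkSession created below is the single entry point to the SQL, streaming, machine learning and graph APIs. The application name is arbitrary.

```python
# Minimal sketch (assuming PySpark 2.x): create the SparkSession, the single
# entry point to Spark SQL, streaming, MLlib and graph processing.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("twitter-analysis")   # hypothetical application name
         .master("local[*]")            # run locally on all available cores
         .getOrCreate())

print(spark.version)
spark.stop()
```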
System requirements
Java 8+
Scala 2.11.x
Python 2.7+
8 GB RAM and an 8-16 core CPU
Who uses Spark and Why?
Data Scientists:
Analyze and model the data to obtain insights from it
Transform the data into a usable format
Statistics, machine learning, SQL
Advanced analytics
Engineers:
Develop a data processing system or applications
Monitor, inspect and tune the applications
What is Spark SQL?
Spark SQL allows you to:
load data in .csv, .json and .parquet file formats (a sketch follows this list)
express relational queries in SQL or Scala
use SchemaRDD (an abstract table)
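A sketch of those three loaders, plus the same query written both in SQL and with the DataFrame API. The file names and the lang column are assumptions about the Twitter dataset.

```python
# Loading .csv, .json and .parquet data, then querying it; file names and
# the "lang" column are assumptions about the Twitter dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

csv_df     = spark.read.csv("tweets.csv", header=True, inferSchema=True)
json_df    = spark.read.json("tweets.json")
parquet_df = spark.read.parquet("tweets.parquet")

# A relational query expressed in SQL ...
json_df.createOrReplaceTempView("tweets")
spark.sql("SELECT lang, COUNT(*) AS n FROM tweets GROUP BY lang").show()

# ... and the same query through the DataFrame API
json_df.groupBy("lang").count().show()
```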
DataFrame, Dataset
A Dataset is a strongly typed collection of domain-specific objects.
Each Dataset has an untyped view called a DataFrame (a Dataset of Row).
Two types of operations on Datasets (sketched below):
transformations (e.g., map(), filter(), select(), aggregate(), etc.)
actions (e.g., count(), show(), etc.)
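A short sketch of the difference: transformations only build up an execution plan, while actions trigger the actual computation. The text and retweet_count columns are assumed to exist in the tweet data.

```python
# Transformations are lazy (they only build a plan); actions run the job.
# The "text" and "retweet_count" columns are assumptions about the data.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, length

spark = SparkSession.builder.appName("ops-demo").getOrCreate()
df = spark.read.json("tweets.json")   # hypothetical input file

# Transformations: nothing is computed yet
popular = (df.select("text", "retweet_count")
             .filter(col("retweet_count") > 100)
             .withColumn("text_len", length(col("text"))))

# Actions: these trigger execution
print(popular.count())
popular.show(5)
```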
Getting the data / loading the data into Spark (SQLContext)
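A sketch of this loading step using the SQLContext entry point mentioned above; tweets.json is a hypothetical dump of tweets, one JSON object per line.

```python
# Loading tweets through SQLContext; "tweets.json" is a hypothetical file
# with one JSON-encoded tweet per line.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="load-tweets")
sqlContext = SQLContext(sc)

tweets_df = sqlContext.read.json("tweets.json")
tweets_df.printSchema()
tweets_df.show(5)
```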
Understanding the data:
calculating basic statistics
making histograms
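A sketch of this understanding step: describe() gives the basic statistics, and a groupBy over bucketed values gives a coarse histogram. The retweet_count column is an assumption about the schema.

```python
# Basic statistics and a coarse histogram; "retweet_count" is an assumed column.
from pyspark.sql import SparkSession
from pyspark.sql.functions import floor, col

spark = SparkSession.builder.appName("describe-demo").getOrCreate()
tweets_df = spark.read.json("tweets.json")   # hypothetical input

tweets_df.describe("retweet_count").show()   # count, mean, stddev, min, max

# Histogram: bucket retweet counts into widths of 10 and count rows per bucket
(tweets_df
 .withColumn("bucket", floor(col("retweet_count") / 10) * 10)
 .groupBy("bucket").count()
 .orderBy("bucket")
 .show())
```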
Cleaning the data:
filtering data
dealing with missing or incomplete data
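A cleaning sketch: filter out rows we do not want and handle missing values with fillna()/dropna(). The lang, text, retweet_count and user columns are assumptions.

```python
# Cleaning sketch: filtering rows and handling missing or incomplete data.
# The "lang", "text", "retweet_count" and "user" columns are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("cleaning-demo").getOrCreate()
tweets_df = spark.read.json("tweets.json")   # hypothetical input

# Keep only English tweets that actually have text
clean_df = tweets_df.filter((col("lang") == "en") & col("text").isNotNull())

# Missing / incomplete data: either fill with a default or drop the row
clean_df = clean_df.fillna({"retweet_count": 0})
clean_df = clean_df.dropna(subset=["user"])
```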
Feature extraction:
dealing with categorical data
clustering (using MLlib)
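A sketch of both items using the DataFrame-based MLlib API: a StringIndexer turns a categorical column into a numeric index, a VectorAssembler builds the feature vector, and KMeans clusters the tweets. The source and retweet_count columns are assumptions about the data.

```python
# Feature extraction and clustering with Spark MLlib (DataFrame-based API);
# the "source" and "retweet_count" columns are assumptions about the data.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("features-demo").getOrCreate()
tweets_df = spark.read.json("tweets.json")   # hypothetical input

# Encode a categorical column (the client app a tweet was sent from) as an index
indexer = StringIndexer(inputCol="source", outputCol="source_idx", handleInvalid="skip")
indexed = indexer.fit(tweets_df).transform(tweets_df)

# Assemble numeric columns into the single feature vector MLlib expects
assembler = VectorAssembler(inputCols=["source_idx", "retweet_count"], outputCol="features")
features = assembler.transform(indexed)

# Cluster the tweets into k groups with k-means
model = KMeans(k=3, seed=42, featuresCol="features").fit(features)
model.transform(features).select("source", "retweet_count", "prediction").show(5)
```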
Saving the data to:
files
MongoDB database
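A saving sketch: writing to files is built into Spark, while writing to MongoDB assumes the MongoDB Spark Connector is on the classpath; the output paths and the connection URI are hypothetical.

```python
# Saving results; file paths and the MongoDB URI are hypothetical, and the
# MongoDB write assumes the MongoDB Spark Connector package is available.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-demo").getOrCreate()
tweets_df = spark.read.json("tweets.json")   # hypothetical input

# To files
tweets_df.write.mode("overwrite").parquet("tweets_clean.parquet")
tweets_df.write.mode("overwrite").json("tweets_clean.json")

# To a MongoDB collection (exact options per the connector's documentation)
(tweets_df.write
 .format("com.mongodb.spark.sql.DefaultSource")
 .option("uri", "mongodb://localhost:27017/twitter.tweets")
 .mode("append")
 .save())
```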
Visualizing Data (Zeppelin or d3.js)
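Zeppelin can render a DataFrame as a chart directly in the notebook; outside of Zeppelin, one quick-look alternative (not listed above, but commonly used) is to aggregate in Spark, convert the small result to pandas, and plot it with matplotlib:

```python
# Quick-look visualization: aggregate in Spark, then hand a small result to
# pandas/matplotlib. The "lang" column and the file names are hypothetical.
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt

spark = SparkSession.builder.appName("viz-demo").getOrCreate()
tweets_df = spark.read.json("tweets.json")    # hypothetical input

per_lang = tweets_df.groupBy("lang").count().orderBy("count", ascending=False)

pdf = per_lang.limit(10).toPandas()           # keep only a small, plottable result
pdf.plot(kind="bar", x="lang", y="count", legend=False)
plt.ylabel("tweets")
plt.tight_layout()
plt.savefig("tweets_per_lang.png")
```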