What is Data Analysis?
Let's look at the data analytics pipeline:
Step 1: Data Acquisition
Step 2: Data Preparation
Step 3: Data Representation
Step 4: Data Representation
Step 5: Data Analytics
Step 6: Result Interpretation
In this post, we are going to do data analysis on Twitter data.
What kind of analytics can be done on Twitter?
Counting. Real-time counting analytics such as how many requests per day, how many sign-ups, how many times a certain word appears, etc.
Correlation. Near-real-time analytics such as desktop vs. mobile users, which devices fail at the same time, etc.
Monitoring. Tracking customer opinions about a brand.
Research. More in-depth analytics that run in batch mode on the historical data such as what features get re-tweeted, detecting sentiments, etc.
Network Analysis: ego network analysis, monitoring follower growth, community analysis
Sentiment Analysis: tracking events, hot topics, and trends; monitoring customer opinions about the product
Other: tweet engagement (top popular tweets)
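To make the counting and engagement ideas above concrete, here is a minimal plain-Python sketch. The sample tweets, the `retweets` field, and the helper names `count_words` and `top_tweets` are made up for illustration; real data would come from the Twitter API and would be processed with Spark rather than with in-memory lists.

```python
from collections import Counter
import re

# Hypothetical sample tweets; real ones would come from the Twitter API.
tweets = [
    {"text": "Spark makes big data simple #spark", "retweets": 12},
    {"text": "Learning Spark SQL today #spark #sql", "retweets": 3},
    {"text": "Big data needs big clusters", "retweets": 7},
]

def count_words(tweets):
    """Count how many times each lowercase word or hashtag appears."""
    counts = Counter()
    for tweet in tweets:
        # "#?\w+" keeps hashtags such as "#spark" separate from the word "spark".
        counts.update(re.findall(r"#?\w+", tweet["text"].lower()))
    return counts

def top_tweets(tweets, n=2):
    """Return the n most retweeted tweets (a simple engagement measure)."""
    return sorted(tweets, key=lambda t: t["retweets"], reverse=True)[:n]

word_counts = count_words(tweets)
print(word_counts["spark"], word_counts["#spark"], word_counts["big"])
print([t["retweets"] for t in top_tweets(tweets)])
```

The same two questions (how often does a word appear, which tweets are most popular) map onto a group-and-count and a sort in Spark once the data no longer fits on one machine.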
What is Spark?
Apache Spark is a framework for developing distributed computing applications.
It began as a research project at the University of California, Berkeley.
Features of Apache Spark
Speed (in-memory computations)
Supports multiple languages (Java, Scala, Python, R)
Advanced analytics (SQL queries, streaming data, machine learning, and graph algorithms)
System requirements
Java 8+
Scala 2.11.x
Python 2.7+
8 GB RAM and an 8-16 core CPU
Who uses Spark and Why?
Data Scientist:
Analyze and model the data to obtain insights
Transform the data into a usable format
Statistics, machine learning, SQL
Advanced analytics
Develop a data processing system or applications
Monitor, inspect and tune the applications
What is Spark SQL?
Load data from .csv, .json, and .parquet files
Relational queries expressed in SQL or Scala
use of SchemaRDD (abstract table)
DataFrame, Dataset
A Dataset is a strongly typed collection of domain-specific objects.
Each Dataset has an untyped view called a DataFrame (a Dataset of Row).
Two types of operations on Datasets:
transformations (e.g., map(), filter(), select(), aggregate(), etc.)
actions (e.g., count(), show(), etc.)
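The split between transformations and actions matters because Spark evaluates transformations lazily: they only describe a computation, and nothing runs until an action asks for a result. The plain-Python sketch below imitates that behaviour with a hypothetical `LazyDataset` class. It is a toy stand-in, not the Spark API, and it only borrows the method names `map`, `filter`, `count`, and `collect`.

```python
class LazyDataset:
    """Toy stand-in for a Spark Dataset: transformations are recorded, not run."""

    def __init__(self, source, steps=()):
        self._source = source  # a re-iterable collection of rows, e.g. a list
        self._steps = steps    # tuple of pending (kind, function) pairs

    # Transformations: return a new LazyDataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        """Chain the recorded steps into one lazy iterator over the source."""
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return rows

    # Actions: trigger the computation and return a result.
    def count(self):
        return sum(1 for _ in self._run())

    def collect(self):
        return list(self._run())


calls = []  # records every row the mapping function actually touches

def text_length(text):
    calls.append(text)
    return len(text)

words = LazyDataset(["spark", "sql", "dataset"])
long_lengths = words.map(text_length).filter(lambda n: n > 3)

calls_before_action = len(calls)  # still 0: only a plan has been built
result = long_lengths.collect()   # the action runs the whole chain
print(calls_before_action, result, len(calls))
```

Each action re-runs the recorded chain from the source, which mirrors why Spark offers caching for Datasets that are reused across several actions.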
Getting the data: load the data into Spark (SQLContext)
Understanding the data:
calculating basic statistics
making histograms
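As a small illustration of this step, the sketch below computes summary statistics and a fixed-width histogram with the Python standard library. The `retweets` values and the `histogram` helper are invented for the example; on a real DataFrame, Spark would compute the equivalent figures across the cluster.

```python
import statistics
from collections import Counter

# Hypothetical retweet counts for a sample of tweets.
retweets = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

summary = {
    "count": len(retweets),
    "mean": statistics.mean(retweets),
    "median": statistics.median(retweets),
    "stdev": statistics.stdev(retweets),  # sample standard deviation
    "min": min(retweets),
    "max": max(retweets),
}

def histogram(values, bin_width):
    """Group values into fixed-width bins keyed by each bin's lower edge."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))

bin_width = 10
hist = histogram(retweets, bin_width)

print(summary)
for lower, n in hist.items():
    # One '#' per tweet that falls into the bin [lower, lower + bin_width).
    print(f"{lower:>2}-{lower + bin_width - 1:<2} {'#' * n}")
```

A skewed histogram like this one (most tweets get few retweets, a handful get many) is a hint that the median is a more useful summary than the mean.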
Cleaning the data:
filtering data
dealing with missing, incomplete data
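A rough sketch of the cleaning step in plain Python follows. The records, the field names, and the policy (drop rows without a user or text, fill a missing retweet count with 0) are assumptions chosen for illustration; the right policy depends on the data set.

```python
# Hypothetical raw records; None and empty strings stand for missing values.
raw_tweets = [
    {"user": "alice", "text": "Spark is fast", "lang": "en", "retweets": 4},
    {"user": "bob", "text": "", "lang": "en", "retweets": 1},
    {"user": "carol", "text": "Hola Spark", "lang": "es", "retweets": None},
    {"user": None, "text": "Anonymous tweet", "lang": "en", "retweets": 2},
    {"user": "dave", "text": "Datasets are typed", "lang": "en", "retweets": None},
]

def clean(tweets, lang="en", default_retweets=0):
    """Keep tweets in one language, drop incomplete ones, fill missing counts."""
    cleaned = []
    for tweet in tweets:
        if tweet["lang"] != lang:
            continue  # filtering: keep only the requested language
        if not tweet["user"] or not tweet["text"]:
            continue  # drop rows that lack a required field
        fixed = dict(tweet)  # copy so the raw record stays untouched
        if fixed["retweets"] is None:
            fixed["retweets"] = default_retweets  # fill in a default value
        cleaned.append(fixed)
    return cleaned

cleaned = clean(raw_tweets)
print([(t["user"], t["retweets"]) for t in cleaned])
```

Dropping and filling are two different choices: dropping loses rows, while filling keeps them at the cost of introducing a made-up value, so it helps to record which fields were filled.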
Feature extraction:
Dealing with categorical data
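Most clustering and machine learning algorithms expect numbers, so categorical values such as a device type need to be encoded first. The sketch below shows one common approach, one-hot encoding, in plain Python; the `devices` values and the helper names are hypothetical. Spark ML ships `StringIndexer` and `OneHotEncoder` feature transformers for the same purpose.

```python
def fit_categories(values):
    """Map each distinct category to an index, in sorted order."""
    return {category: i for i, category in enumerate(sorted(set(values)))}

def one_hot(value, index):
    """Encode one value as a 0/1 vector with a single 1 at its category index."""
    vector = [0] * len(index)
    vector[index[value]] = 1
    return vector

# Hypothetical categorical column: the device each tweet was sent from.
devices = ["mobile", "desktop", "mobile", "tablet"]

index = fit_categories(devices)
encoded = [one_hot(d, index) for d in devices]

print(index)
print(encoded)
```

One-hot vectors avoid implying an order between categories, which a plain integer index (desktop = 0, mobile = 1, tablet = 2) would do.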
Clustering (using MLlib)
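MLlib provides a distributed `KMeans` implementation. To show what such a clustering step does, here is a minimal single-machine sketch of k-means (Lloyd's algorithm) in plain Python. It is not MLlib code; the feature values and the choice of initial centers are assumptions made for the example.

```python
import math

def nearest_center(point, centers):
    """Index of the center closest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: assign points to the nearest center, then recenter."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest_center(p, centers)].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels

# Hypothetical features per user: (tweets per day, average retweets per tweet).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 9.0), (9.0, 8.0), (8.5, 9.5)]

# Assumption: start from one point of each visible group instead of random centers.
centers, labels = kmeans(points, centers=[points[0], points[3]])
print(labels)
print(centers)
```

The result depends on the initial centers and on the scale of each feature, so in practice features are usually normalized before clustering.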
Saving the data to:
MongoDB database
Visualizing data (Zeppelin or d3.js)