What is Data Analysis?
Let's look at the data analytics pipeline:
Step 1: Data Acquisition
Step 2: Data Preparation
Step 3: Data Representation
Step 4: Data Analytics
Step 5: Results Interpretation
In this post, we are going to do data analysis on Twitter data.
What kind of analytics can be done on Twitter data?
Counting. Real-time counting analytics such as how many requests per day, how many sign-ups, how many times a certain word appears, etc. (see the counting sketch after this list).
Correlation. Near-real-time analytics such as desktop vs. mobile users, which devices fail at the same time, etc.
Monitoring. Tracking customer opinions about a brand.
Research. More in-depth analytics that run in batch mode on the historical data such as what features get re-tweeted, detecting sentiments, etc.
Network Analysis: ego network analysis, monitoring follower growth, community analysis
Sentiment Analysis: tracking events, hot topics and trends; monitoring customer opinions about a product
Other: tweet engagement (most popular tweets)
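To make the counting idea concrete, here is a minimal PySpark sketch that counts how many tweets mention a given word and how many times each word appears. The input file tweets.txt (one tweet per line) and the word "spark" are hypothetical.

```python
# Minimal counting sketch in PySpark; "tweets.txt" (one tweet per line)
# and the word "spark" are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tweet-counting").getOrCreate()
tweets = spark.sparkContext.textFile("tweets.txt")

# How many tweets mention a given word (case-insensitive)
mentions = tweets.filter(lambda t: "spark" in t.lower()).count()
print("tweets mentioning 'spark':", mentions)

# How many times each word appears (classic word count)
word_counts = (tweets.flatMap(lambda t: t.lower().split())
                     .map(lambda w: (w, 1))
                     .reduceByKey(lambda a, b: a + b))
print(word_counts.take(10))
```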
What is Spark?
Apache Spark is a framework for developing distributed computing applications.
It started as a research project at the University of California, Berkeley.
Features of Apache Spark
Speed (in-memory computations)
Supports multiple languages (Java, Scala, Python, R)
Advanced analytics (SQL queries, streaming data, machine learning and graph algorithms)
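Here is a minimal sketch of starting a Spark application from Python (assuming PySpark 2.x); the SparkSession created below is the single entry point to the SQL, streaming, machine learning and graph APIs. The application name is arbitrary.

```python
# Minimal sketch (assuming PySpark 2.x): create the SparkSession, the single
# entry point to Spark SQL, streaming, MLlib and graph processing.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("twitter-analysis")   # hypothetical application name
         .master("local[*]")            # run locally on all available cores
         .getOrCreate())

print(spark.version)
spark.stop()
```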
System requirements
Java 8+
Scala 2.11.x
Python 2.7+
8 GB RAM and an 8-16 core CPU
Who uses Spark and Why?
Data Scientists:
Analyze and model the data to obtain insights from it
Transform the data into a usable format
Statistics, machine learning, SQL
Advanced analytics
Engineers:
Develop a data processing system or applications
Monitor, inspect and tune the applications
What is Spark SQL?
Spark SQL allows you to:
load data in .csv, .json and .parquet file formats (a sketch follows this list)
express relational queries in SQL or Scala
use SchemaRDD (an abstract table)
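A sketch of those three loaders, plus the same query written both in SQL and with the DataFrame API. The file names and the lang column are assumptions about the Twitter dataset.

```python
# Loading .csv, .json and .parquet data, then querying it; file names and
# the "lang" column are assumptions about the Twitter dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

csv_df     = spark.read.csv("tweets.csv", header=True, inferSchema=True)
json_df    = spark.read.json("tweets.json")
parquet_df = spark.read.parquet("tweets.parquet")

# A relational query expressed in SQL ...
json_df.createOrReplaceTempView("tweets")
spark.sql("SELECT lang, COUNT(*) AS n FROM tweets GROUP BY lang").show()

# ... and the same query through the DataFrame API
json_df.groupBy("lang").count().show()
```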
DataFrame, Dataset
A Dataset is a strongly typed collection of domain-specific objects.
Each Dataset has an untyped view called a DataFrame (a Dataset of Row).
Two types of operations on Datasets (sketched below):
transformations (e.g., map(), filter(), select(), aggregate(), etc.)
actions (e.g., count(), show(), etc.)
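A short sketch of the difference: transformations only build up an execution plan, while actions trigger the actual computation. The text and retweet_count columns are assumed to exist in the tweet data.

```python
# Transformations are lazy (they only build a plan); actions run the job.
# The "text" and "retweet_count" columns are assumptions about the data.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, length

spark = SparkSession.builder.appName("ops-demo").getOrCreate()
df = spark.read.json("tweets.json")   # hypothetical input file

# Transformations: nothing is computed yet
popular = (df.select("text", "retweet_count")
             .filter(col("retweet_count") > 100)
             .withColumn("text_len", length(col("text"))))

# Actions: these trigger execution
print(popular.count())
popular.show(5)
```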
Getting the data / loading the data into Spark (SQLContext)
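A sketch of this loading step using the SQLContext entry point mentioned above; tweets.json is a hypothetical dump of tweets, one JSON object per line.

```python
# Loading tweets through SQLContext; "tweets.json" is a hypothetical file
# with one JSON-encoded tweet per line.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="load-tweets")
sqlContext = SQLContext(sc)

tweets_df = sqlContext.read.json("tweets.json")
tweets_df.printSchema()
tweets_df.show(5)
```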
Understanding the data:
calculating basic statistics
making histograms
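A sketch of this understanding step: describe() gives the basic statistics, and a groupBy over bucketed values gives a coarse histogram. The retweet_count column is an assumption about the schema.

```python
# Basic statistics and a coarse histogram; "retweet_count" is an assumed column.
from pyspark.sql import SparkSession
from pyspark.sql.functions import floor, col

spark = SparkSession.builder.appName("describe-demo").getOrCreate()
tweets_df = spark.read.json("tweets.json")   # hypothetical input

tweets_df.describe("retweet_count").show()   # count, mean, stddev, min, max

# Histogram: bucket retweet counts into widths of 10 and count rows per bucket
(tweets_df
 .withColumn("bucket", floor(col("retweet_count") / 10) * 10)
 .groupBy("bucket").count()
 .orderBy("bucket")
 .show())
```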
Cleaning the data:
filtering data
dealing with missing or incomplete data
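A cleaning sketch: filter out rows we do not want and handle missing values with fillna()/dropna(). The lang, text, retweet_count and user columns are assumptions.

```python
# Cleaning sketch: filtering rows and handling missing or incomplete data.
# The "lang", "text", "retweet_count" and "user" columns are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("cleaning-demo").getOrCreate()
tweets_df = spark.read.json("tweets.json")   # hypothetical input

# Keep only English tweets that actually have text
clean_df = tweets_df.filter((col("lang") == "en") & col("text").isNotNull())

# Missing / incomplete data: either fill with a default or drop the row
clean_df = clean_df.fillna({"retweet_count": 0})
clean_df = clean_df.dropna(subset=["user"])
```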
Feature extraction:
dealing with categorical data
clustering (using MLlib)
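A sketch of both items using the DataFrame-based MLlib API: a StringIndexer turns a categorical column into a numeric index, a VectorAssembler builds the feature vector, and KMeans clusters the tweets. The source and retweet_count columns are assumptions about the data.

```python
# Feature extraction and clustering with Spark MLlib (DataFrame-based API);
# the "source" and "retweet_count" columns are assumptions about the data.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("features-demo").getOrCreate()
tweets_df = spark.read.json("tweets.json")   # hypothetical input

# Encode a categorical column (the client app a tweet was sent from) as an index
indexer = StringIndexer(inputCol="source", outputCol="source_idx", handleInvalid="skip")
indexed = indexer.fit(tweets_df).transform(tweets_df)

# Assemble numeric columns into the single feature vector MLlib expects
assembler = VectorAssembler(inputCols=["source_idx", "retweet_count"], outputCol="features")
features = assembler.transform(indexed)

# Cluster the tweets into k groups with k-means
model = KMeans(k=3, seed=42, featuresCol="features").fit(features)
model.transform(features).select("source", "retweet_count", "prediction").show(5)
```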
Saving the data to:
files
MongoDB database
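A saving sketch: writing to files is built into Spark, while writing to MongoDB assumes the MongoDB Spark Connector is on the classpath; the output paths and the connection URI are hypothetical.

```python
# Saving results; file paths and the MongoDB URI are hypothetical, and the
# MongoDB write assumes the MongoDB Spark Connector package is available.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-demo").getOrCreate()
tweets_df = spark.read.json("tweets.json")   # hypothetical input

# To files
tweets_df.write.mode("overwrite").parquet("tweets_clean.parquet")
tweets_df.write.mode("overwrite").json("tweets_clean.json")

# To a MongoDB collection (exact options per the connector's documentation)
(tweets_df.write
 .format("com.mongodb.spark.sql.DefaultSource")
 .option("uri", "mongodb://localhost:27017/twitter.tweets")
 .mode("append")
 .save())
```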
Visualizing Data (Zeppelin or d3.js)
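Zeppelin can render a DataFrame as a chart directly in the notebook; outside of Zeppelin, one quick-look alternative (not listed above, but commonly used) is to aggregate in Spark, convert the small result to pandas, and plot it with matplotlib:

```python
# Quick-look visualization: aggregate in Spark, then hand a small result to
# pandas/matplotlib. The "lang" column and the file names are hypothetical.
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt

spark = SparkSession.builder.appName("viz-demo").getOrCreate()
tweets_df = spark.read.json("tweets.json")    # hypothetical input

per_lang = tweets_df.groupBy("lang").count().orderBy("count", ascending=False)

pdf = per_lang.limit(10).toPandas()           # keep only a small, plottable result
pdf.plot(kind="bar", x="lang", y="count", legend=False)
plt.ylabel("tweets")
plt.tight_layout()
plt.savefig("tweets_per_lang.png")
```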