Quick Start to Machine Learning and Useful Online Resources

In this article, I'm going to give you a detailed guide to the free online resources you can use while learning machine learning skills. It should also stay useful to you later on.

Let's first look at an overview of the machine learning workflow.

Step 1: Data collection and preparation

Importing data
Cleaning data
Splitting it into train/test or cross-validation sets
Pre-processing and transformations
Feature engineering
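
To make Step 1 concrete, here is a minimal sketch of importing, cleaning, feature engineering, and splitting, using only the Python standard library. The inline CSV, its column names, and the BMI feature are made up for illustration; a real project would more likely read a file and use a data library.

```python
import csv
import io
import random

# Hypothetical inline CSV standing in for a real file; one row has a missing value.
RAW_CSV = """height_cm,weight_kg,label
170,65,0
182,,1
160,55,0
175,80,1
168,62,0
190,95,1
"""

def load_rows(text):
    """Import: parse CSV text into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))

def clean_rows(rows):
    """Clean: drop rows with an empty field, then convert strings to numbers."""
    cleaned = []
    for row in rows:
        if any(value.strip() == "" for value in row.values()):
            continue
        cleaned.append({
            "height_cm": float(row["height_cm"]),
            "weight_kg": float(row["weight_kg"]),
            "label": int(row["label"]),
        })
    return cleaned

def add_bmi(rows):
    """Feature engineering: derive body-mass index from height and weight."""
    return [dict(r, bmi=r["weight_kg"] / (r["height_cm"] / 100) ** 2) for r in rows]

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Split: shuffle a copy and hold out a fraction of rows for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = add_bmi(clean_rows(load_rows(RAW_CSV)))
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Dropping incomplete rows is only one cleaning choice; filling in missing values is another, and which is better depends on the data.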

Step 2: Model creation and fitting

There are many machine learning models you can train and fit to your data. Which one to choose depends on your intuition about how well a model fits the problem, so you first have to understand the nature of the data and then select a model.

You can ask the following questions before selecting a model:

Which models are robust to missing data?
Which models handle categorical features well?
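
When a model cannot deal with missing values or categorical features on its own, you usually handle them yourself before fitting. The sketch below shows two common workarounds, mean imputation and one-hot encoding, in plain Python. The function names are my own, and these are only two of several possible strategies.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def one_hot(values):
    """Encode a categorical column as 0/1 indicator vectors.

    Returns the sorted category list and one indicator row per input value.
    """
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

print(impute_mean([1.0, None, 3.0]))
print(one_hot(["red", "blue", "red"]))
```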

Step 3: Predicting results with the trained model and testing them on held-out data
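
Steps 2 and 3 can be sketched together with a very simple model. The example below uses a nearest-centroid classifier, chosen here only because it fits in a few lines; it is not a recommendation for any particular problem, and the toy data is made up.

```python
def fit_centroids(features, labels):
    """Fit: compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in total] for y, total in sums.items()}

def predict(centroids, x):
    """Predict: return the class whose centroid is closest to x."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(center, x))
    return min(centroids, key=lambda y: squared_distance(centroids[y]))

def accuracy(centroids, features, labels):
    """Test: fraction of examples whose predicted class matches the label."""
    correct = sum(predict(centroids, x) == y for x, y in zip(features, labels))
    return correct / len(labels)

# Made-up toy data: two well-separated classes.
train_X = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [4.0, 4.2], [3.8, 4.1], [4.2, 3.9]]
train_y = ["a", "a", "a", "b", "b", "b"]
test_X = [[1.0, 1.0], [4.0, 4.0]]
test_y = ["a", "b"]

model = fit_centroids(train_X, train_y)
print("test accuracy:", accuracy(model, test_X, test_y))
```

The important part is that accuracy is measured on examples the model did not see during fitting.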

Data Sources

UCI Machine Learning Repository – Maintains 481 data sets as a service to the machine learning community. You can browse all of the data sets according to your interests.
Kaggle Datasets – 100+ datasets uploaded by the Kaggle community. There are some really fun datasets here, including PokemonGo spawn locations and Burritos in San Diego.
data.gov – Open datasets released by the U.S. government. Great place to look if you’re interested in social sciences.

Sport Data Analysis

The sports world has a ton of data to play with. Data for teams, games, scores, and players are all tracked and freely available online. There are plenty of fun machine learning projects for beginners.

For example, you could try

Sports betting: Predict box scores given the data available at the time right before each new game.
Talent scouting: Use college statistics to predict which players would have the best professional careers.
General managing: Create clusters of players based on their strengths in order to build a well-rounded team.
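
As a rough illustration of the clustering idea, here is a plain k-means sketch in standard-library Python. The player statistics (points and assists per game) are invented, and k-means is just one of many clustering methods you could try.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns (centroids, assignment) for a list of numeric vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

# Made-up stats: (points per game, assists per game).
players = ["P1", "P2", "P3", "P4", "P5", "P6"]
stats = [[25.0, 3.0], [27.0, 2.0], [24.0, 4.0], [8.0, 10.0], [9.0, 11.0], [7.0, 9.0]]

centroids, assignment = kmeans(stats, k=2)
for name, cluster in zip(players, assignment):
    print(name, "-> cluster", cluster)
```

With real data you would normally scale the features first so that one statistic does not dominate the distance.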

Sports is also an excellent domain for practicing data visualization and exploratory analysis. You can use these skills to help you decide which types of data to include in your analyses.

Data Sources

Sports Statistics Database – Sports statistics and historical data covering many professional sports and several college ones. Clean interface makes it easier for web scraping.
Sports Reference – Another database of sports statistics. More cluttered interface, but individual tables can be exported as CSV files.
cricsheet.org – Ball-by-ball data for international and IPL cricket matches. CSV files for IPL and T20 internationals matches are available.

Predict Stock Prices

The stock market is like candy-land for any data scientist who is even remotely interested in finance.

First, you have many types of data that you can choose from. You can find prices, fundamentals, global macroeconomic indicators, volatility indices, etc… the list goes on and on.

Second, the data can be very granular. You can easily get time series data by day (or even minute) for each company, which allows you to think creatively about trading strategies.

Finally, the financial markets generally have short feedback cycles. Therefore, you can quickly validate your predictions on new data.

Some examples of beginner-friendly machine learning projects you could try include:

Quantitative value investing: Predict 6-month price movements based on fundamental indicators from companies’ quarterly reports.

Forecasting: Build time series models, or even recurrent neural networks, on the delta between implied and actual volatility.

Statistical arbitrage: Find similar stocks based on their price movements and other factors and look for periods when their prices diverge.
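
To give a feel for the statistical arbitrage idea, the sketch below measures how similarly two stocks move (correlation of daily returns) and flags days when their price ratio drifts far from its average. The prices are made up, the threshold of 2 is an arbitrary assumption, and this is a learning exercise rather than a trading method.

```python
import statistics

def daily_returns(prices):
    """Simple day-over-day returns."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def ratio_zscores(prices_a, prices_b):
    """Z-score of the price ratio a/b on each day, a simple divergence measure."""
    ratios = [a / b for a, b in zip(prices_a, prices_b)]
    mean = statistics.fmean(ratios)
    spread = statistics.pstdev(ratios)
    return [(r - mean) / spread for r in ratios]

# Made-up closing prices: the two series move together until the last day.
stock_a = [100.0, 101.0, 102.0, 101.0, 103.0, 104.0, 110.0]
stock_b = [50.0, 50.5, 51.0, 50.5, 51.5, 52.0, 52.2]

similarity = pearson(daily_returns(stock_a[:6]), daily_returns(stock_b[:6]))
zscores = ratio_zscores(stock_a, stock_b)
diverging_days = [i for i, z in enumerate(zscores) if abs(z) > 2]
print("correlation over the first six days:", round(similarity, 3))
print("days where the ratio diverges:", diverging_days)
```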

Building trading models to practice machine learning is simple. Making them profitable is extremely difficult. Nothing here is financial advice, and we do not recommend trading real money.

Data Sources

Quandl – Data market that provides free (and premium) financial and economic data. For example, you can bulk download end-of-day stock prices for over 3000 US companies or economic data from the Federal Reserve.

Quantopian – Quantitative finance community that offers a free platform for developing trading algorithms. Includes datasets.

US Fundamentals Archive – 5 years of fundamentals data for 5000+ U.S. companies.

Teach a Neural Network to Read Handwriting

Neural networks and deep learning are two success stories in modern artificial intelligence. They’ve led to major advances in image recognition, automatic text generation, and even in self-driving cars. To get involved with this exciting field, you should start with a manageable dataset.

The MNIST Handwritten Digit Classification Challenge is the classic entry point. Image data is generally harder to work with than “flat” relational data. The MNIST data is beginner-friendly and is small enough to fit on one computer. Handwriting recognition will challenge you, but it doesn’t need high computational power.

To start, we recommend beginning with the first chapter of the tutorial below. It will teach you how to build a neural network from scratch that solves the MNIST challenge with high accuracy.

Data Sources

MNIST – MNIST is a modified subset of two datasets collected by the U.S. National Institute of Standards and Technology. It contains 70,000 labeled images of handwritten digits.
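
Before training anything, you need to get the images into memory. The sketch below assumes the data comes in the IDX binary layout used by the original MNIST files (a big-endian header followed by unsigned bytes) and that any .gz compression has already been removed. The function names are my own, and the tiny byte strings at the end are synthetic stand-ins for the real files.

```python
import struct

def parse_idx_images(data):
    """Parse IDX image bytes into a list of images, each a list of pixel rows."""
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:  # magic number for IDX image files
        raise ValueError(f"unexpected magic number {magic}")
    pixels = data[16:]
    size = rows * cols
    images = []
    for i in range(count):
        flat = pixels[i * size:(i + 1) * size]
        images.append([list(flat[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images

def parse_idx_labels(data):
    """Parse IDX label bytes into a list of integer labels."""
    magic, count = struct.unpack(">II", data[:8])
    if magic != 2049:  # magic number for IDX label files
        raise ValueError(f"unexpected magic number {magic}")
    return list(data[8:8 + count])

# Synthetic example: two 2x2 "images" and their labels.
fake_images = struct.pack(">IIII", 2051, 2, 2, 2) + bytes([0, 255, 128, 64, 1, 2, 3, 4])
fake_labels = struct.pack(">II", 2049, 2) + bytes([7, 3])

print(parse_idx_images(fake_images))
print(parse_idx_labels(fake_labels))
```

For the real files you would read the bytes with open(path, "rb") and pass them to the same functions.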

Investigate Enron

The Enron scandal and collapse was one of the largest corporate meltdowns in history.

In the year 2000, Enron was one of the largest energy companies in America. Then, after being outed for fraud, it spiraled downward into bankruptcy within a year.

Luckily for us, we have the Enron email database. It contains 500 thousand emails between 150 former Enron employees, mostly senior executives. It’s also the only large public database of real emails, which makes it more valuable.

In fact, data scientists have been using this dataset for education and research for years.

Examples of machine learning projects for beginners you could try include:

Anomaly detection: Map the distribution of emails sent and received by hour and try to detect abnormal behavior leading up to the public scandal.

Social network analysis: Build network graph models between employees to find key influencers.

Natural language processing: Analyze the body messages in conjunction with email metadata to classify emails based on their purposes.
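
The first two ideas both start from the email headers. Here is a minimal sketch, assuming each message is available as raw text with standard From, To, and Date headers. The three sample messages are invented and are not taken from the Enron data.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

def hourly_counts(raw_messages):
    """Count messages by hour of day, using the time as written in the Date header."""
    counts = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        counts[parsedate_to_datetime(msg["Date"]).hour] += 1
    return counts

def edge_counts(raw_messages):
    """Count (sender, recipient) pairs, the edges of a simple email graph."""
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = getaddresses([msg.get("From", "")])
        recipients = getaddresses([msg.get("To", "")])
        for _, sender in senders:
            for _, recipient in recipients:
                edges[(sender, recipient)] += 1
    return edges

# Invented sample messages.
SAMPLE = [
    "From: alice@example.com\nTo: bob@example.com\n"
    "Date: Mon, 01 Oct 2001 09:15:00 -0700\nSubject: Status\n\nSee attached.",
    "From: alice@example.com\nTo: bob@example.com, carol@example.com\n"
    "Date: Mon, 01 Oct 2001 09:40:00 -0700\nSubject: Re: Status\n\nThanks.",
    "From: bob@example.com\nTo: alice@example.com\n"
    "Date: Tue, 02 Oct 2001 23:05:00 -0700\nSubject: Late note\n\nCall me.",
]

print(hourly_counts(SAMPLE))
print(edge_counts(SAMPLE))
```

From here, unusual hours can be compared against the typical hourly distribution, and the edge counts can be loaded into a graph library to look for central employees.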

Data Sources

Enron Email Dataset – This is the Enron email archive hosted by CMU.

Description of Enron Data (PDF) – Exploratory analysis of Enron email data that could help you get your grounding.

Write ML Algorithms from Scratch

Writing machine learning algorithms from scratch is an excellent learning tool for two main reasons.

First, there’s no better way to build true understanding of their mechanics. You’ll be forced to think about every step, and this leads to true mastery.

Second, you’ll learn how to translate mathematical instructions into working code. You’ll need this skill when adapting algorithms from academic research.

To start, we recommend picking an algorithm that isn’t too complex. There are dozens of subtle decisions you’ll need to make for even the simplest algorithms.

After you’re comfortable building simple algorithms, try extending them for more functionality. For example, try extending a vanilla logistic regression algorithm with lasso (L1) or ridge (L2) penalties by adding regularization parameters.
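
As a rough illustration of this kind of extension, the sketch below implements logistic regression with batch gradient descent and an optional L2 (ridge-style) penalty. The learning rate, epoch count, penalty strength, and toy data are arbitrary choices for the example, and the L1 (lasso) variant is left out because it needs a different update rule.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, l2=0.0, lr=0.1, epochs=2000):
    """Batch gradient descent; l2=0.0 gives the vanilla (unregularized) model."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += error * xi[j]
            grad_b += error
        # Average gradient plus the L2 penalty term (the bias is not penalized).
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return class 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Made-up, linearly separable toy data.
X = [[-2.0, -1.0], [-1.5, -0.5], [-1.0, -1.5], [1.0, 1.5], [1.5, 0.5], [2.0, 1.0]]
y = [0, 0, 0, 1, 1, 1]

w_plain, b_plain = fit_logistic(X, y)
w_ridge, b_ridge = fit_logistic(X, y, l2=1.0)
print("plain weights:", w_plain)
print("ridge weights:", w_ridge)
```

The penalty pulls the weights toward zero, which is the whole point of the regularization parameter: on this toy data the regularized weights come out smaller than the plain ones.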

Finally, here’s a tip every beginner should know: Don’t be discouraged if your algorithm is not as fast or fancy as those in existing packages. Those packages are the fruits of years of development!

Mine Social Media Sentiment

Social media has almost become synonymous with “big data” due to the sheer amount of user-generated content.

Mining this rich data can provide unprecedented ways to keep a pulse on opinions, trends, and public sentiment. Facebook, Twitter, YouTube, WeChat, WhatsApp, Reddit… the list goes on and on.

Furthermore, every generation is spending even more time on social media than its predecessors. This means that social media data will become even more relevant for marketing, branding, and business as a whole.

While there are many popular social media platforms out there, Twitter is the classic entry point for practicing machine learning.

With Twitter data, you get an interesting blend of data (tweet contents) and metadata (location, hashtags, users, re-tweets, etc.) that opens up nearly endless paths for analysis.
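
A very first pass at sentiment can be done without any model at all. The sketch below scores text against a tiny hand-made word list and pulls out hashtags with a regular expression. The word lists are invented for illustration and far too small for real use, and the code works on plain strings rather than calling any Twitter API.

```python
import re

# Tiny made-up lexicons; a real project would use a much larger, curated list.
POSITIVE = {"love", "great", "good", "happy", "win"}
NEGATIVE = {"hate", "bad", "awful", "sad", "lose"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_score(text):
    """Positive word count minus negative word count."""
    tokens = tokenize(text)
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

def hashtags(text):
    """Extract hashtags (without the # sign), lowercased."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]

tweets = [
    "I love this phone, great battery!",
    "Awful service, I hate waiting",
    "Big #Win for the team #GoTeam",
]
for tweet in tweets:
    print(sentiment_score(tweet), hashtags(tweet), tweet)
```

Once this baseline exists, you can replace the word lists with a trained classifier and compare the two.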

Data Sources

Twitter API – The Twitter API is a classic source for streaming data. You can track tweets, hashtags, and more.

StockTwits API – StockTwits is like Twitter for traders and investors. You can expand this dataset in many interesting ways by joining it to time series datasets using the timestamp and ticker symbol.
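
The join on timestamp and ticker symbol can be sketched as below. The record layout (the "ticker", "timestamp", and "text" fields) and the price table are hypothetical and are not the actual StockTwits API schema; the point is only to show the key used for the join.

```python
from datetime import datetime

def join_messages_to_prices(messages, prices):
    """Attach the same-day closing price to each message.

    messages: list of dicts with hypothetical "ticker" and "timestamp" fields.
    prices: dict mapping (ticker, "YYYY-MM-DD") to a closing price.
    Messages without a matching price row are skipped.
    """
    joined = []
    for message in messages:
        day = datetime.fromisoformat(message["timestamp"]).date().isoformat()
        close = prices.get((message["ticker"], day))
        if close is not None:
            joined.append({**message, "date": day, "close": close})
    return joined

# Made-up example data.
messages = [
    {"ticker": "ABC", "timestamp": "2024-01-02T14:30:00", "text": "Looking strong"},
    {"ticker": "XYZ", "timestamp": "2024-01-02T15:00:00", "text": "Not sure about this"},
    {"ticker": "ABC", "timestamp": "2024-01-06T10:00:00", "text": "Weekend thoughts"},
]
prices = {("ABC", "2024-01-02"): 10.5, ("XYZ", "2024-01-02"): 20.0}

for row in join_messages_to_prices(messages, prices):
    print(row["ticker"], row["date"], row["close"], row["text"])
```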

Improve Health Care

Another industry that’s undergoing rapid changes thanks to machine learning is global health and health care.

In most countries, becoming a doctor requires many years of education. It’s a demanding field with long hours, high stakes, and an even higher barrier to entry.

As a result, there has recently been significant effort to alleviate doctors’ workload and improve the overall efficiency of the health care system with the help of machine learning.

Use cases include:

Preventative care: Predicting disease outbreaks on both the individual and the community level.
Diagnostic care: Automatically classifying image data, such as scans, x-rays, etc.
Insurance: Adjusting insurance premiums based on publicly available risk factors.

As hospitals continue to modernize patient records and as we collect more granular health data, there will be an influx of low-hanging fruit opportunities for data scientists to make a difference.

Data Sources

Large Health Data Sets – Collection of large health-related datasets.

data.gov/health – Datasets related to health and health care provided by the U.S. government.

Health Nutrition and Population Statistics – Global health, nutrition, and population statistics provided by the World Bank.