Overview
KNIME Analytics Platform is a tool for data analytics, exploration, and visualization. Its visual interface lets us build data processing workflows from a wide range of ready-made functions. In this guide, we use KNIME to clean and preprocess a dataset so that it is ready for further analysis.
Introduction
Data cleaning is a crucial step in data science: meaningful analytic results and reliable predictions both depend on it. This assignment focuses on developing the knowledge and analytical skills needed to clean unprocessed data for data analysis and model training. The given dataset is a collection of airline data with eight (8) attributes and 6,150 records. Each attribute is described in the data dictionary below. The goal is to clean the dataset and answer the questions in Parts A-C.
Attribute Description
Airline ID: Unique OpenFlights identifier for this airline.
Name: Name of the airline.
Alias: Alias of the airline.
IATA: 2-letter IATA code, if available.
ICAO: 3-letter ICAO code, if available.
Callsign: Airline callsign.
Country: The airline's incorporated country or territory.
Active: "Y" if the airline is or has until recently been operational, "N" if it is defunct.
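For readers who want to inspect the file outside KNIME, the data dictionary above can be mirrored in a few lines of Python. This is only a sketch: it assumes the CSV has a header row and that its columns appear in the order listed above, neither of which is confirmed here. The `read_airlines` helper and the sample row are made up for illustration.

```python
import csv
import io

# Column names taken from the data dictionary; their order in the file is an assumption.
COLUMNS = ["Airline ID", "Name", "Alias", "IATA", "ICAO", "Callsign", "Country", "Active"]

def read_airlines(stream, has_header=True):
    """Return one dict per record, keyed by the data-dictionary column names."""
    reader = csv.reader(stream)
    if has_header:
        next(reader, None)  # skip the header row, if the file has one
    return [dict(zip(COLUMNS, row)) for row in reader]

# Made-up sample content for illustration, not taken from airlines_2022.csv.
sample = io.StringIO(
    "Airline ID,Name,Alias,IATA,ICAO,Callsign,Country,Active\n"
    "1,Example Air,\\N,EA,EXA,EXAMPLE,Nowhere,Y\n"
)
rows = read_airlines(sample)
print(rows[0]["Name"])
```

To read the real file, an opened file object can be passed instead of the `io.StringIO` sample, for example `open("airlines_2022.csv", newline="", encoding="utf-8")`; the encoding is also an assumption.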
Problem Statement
The assigned tasks require us to perform several data cleaning and preprocessing steps on the provided dataset. Our goal is to clean and prepare the unprocessed data for further analysis and model training. The steps in this guide describe what a valid value looks like for each attribute and how invalid values should be handled.
Using KNIME Analytics Platform
KNIME Analytics Platform offers a comprehensive set of tools and functionalities to perform the assigned tasks efficiently. With its visual workflow editor, extensive library of nodes, and powerful data processing capabilities, KNIME enables us to tackle complex data cleaning challenges and derive accurate and reliable results.
Assignment Tasks
The dataset (airlines_2022.csv) is dirty, and the objective is to clean it according to the instructions from a senior data scientist. The following steps are proposed to clean the dataset in order to meet the requirements:
STEP 1: Check for duplicate tuples based on the value of Airline ID. Remove any duplicate tuples in the dataset.
STEP 2: Clean the Airline ID attribute. Airline ID values must be non-negative integers. If the original Airline ID is negative, set it to zero. Add a new column to store the cleaned data without overwriting the original Airline ID.
STEP 3: Clean the Name attribute. A valid airline Name starts with an English letter or a digit. If the original airline Name starts with any other character, replace it with "unknown". Add a new column to store the cleaned data without overwriting the original airline Name.
STEP 4: Clean the Alias attribute. If the Alias value is "\N", "\n", or missing, replace it with "unknown". If the Alias value contains other symbols, also replace it with "unknown". Add a new column to store the cleaned data without overwriting the original Alias value.
STEP 5: Clean the IATA attribute. A valid IATA code consists of exactly two characters, each an English letter or a digit: two letters, two digits, or one letter and one digit. If the IATA value is invalid or missing, replace it with "unknown". Add a new column to store the cleaned data without overwriting the original IATA value.
STEP 6: Clean the ICAO attribute. A valid ICAO code consists of exactly three characters, each an English letter or a digit. If the ICAO value is "\N", "\n", invalid, or missing, replace it with "unknown". Add a new column to store the cleaned data without overwriting the original ICAO value.
STEP 7: Clean the Callsign attribute. A valid Callsign consists of English letters and/or digits only. If the Callsign value is "\N", "\n", invalid, or missing, replace it with "unknown". Add a new column to store the cleaned data without overwriting the original Callsign value.
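The rules in Steps 1-7 can be made concrete with a short Python sketch. This is not the KNIME workflow itself, only one possible reading of the rules, and it rests on several assumptions: the first row is kept when Airline IDs repeat, every Airline ID parses as an integer, a missing Name becomes "unknown", the Alias may contain letters, digits, and spaces (the text does not say which symbols count as "other symbols"), and a Callsign containing a space is treated as invalid. The function names and the " (clean)" column suffix are hypothetical.

```python
import re

UNKNOWN = "unknown"

# Assumed validity patterns. The Alias pattern is a guess at what "other symbols" excludes.
PATTERNS = {
    "Alias": r"[A-Za-z0-9 ]+",
    "IATA": r"[A-Za-z0-9]{2}",
    "ICAO": r"[A-Za-z0-9]{3}",
    "Callsign": r"[A-Za-z0-9]+",
}

def drop_duplicate_ids(rows):
    """Step 1: keep the first row seen for each Airline ID (assumed tie-break)."""
    seen = set()
    unique = []
    for row in rows:
        if row["Airline ID"] not in seen:
            seen.add(row["Airline ID"])
            unique.append(row)
    return unique

def clean_airline_id(value):
    """Step 2: negative IDs become 0; other IDs are kept."""
    return max(int(value), 0)

def clean_name(value):
    """Step 3: a valid name starts with an English letter or a digit."""
    return value if value and re.match(r"[A-Za-z0-9]", value) else UNKNOWN

def clean_by_pattern(value, pattern):
    """Steps 4-7: keep the value only if the whole string matches the pattern.

    "\\N", "\\n", empty, and missing values all fail the match and become "unknown".
    """
    if value is None or not re.fullmatch(pattern, value):
        return UNKNOWN
    return value

def clean_rows(rows):
    """Apply Steps 1-7, writing results to new columns and leaving originals intact."""
    cleaned = []
    for row in drop_duplicate_ids(rows):
        new_row = dict(row)
        new_row["Airline ID (clean)"] = clean_airline_id(row["Airline ID"])
        new_row["Name (clean)"] = clean_name(row.get("Name"))
        for column, pattern in PATTERNS.items():
            new_row[column + " (clean)"] = clean_by_pattern(row.get(column), pattern)
        cleaned.append(new_row)
    return cleaned

# Made-up rows for illustration, not taken from airlines_2022.csv.
rows = [
    {"Airline ID": "-1", "Name": "Unknown", "Alias": "\\N", "IATA": "-", "ICAO": "N/A", "Callsign": "\\N"},
    {"Airline ID": "-1", "Name": "Unknown", "Alias": "\\N", "IATA": "-", "ICAO": "N/A", "Callsign": "\\N"},
    {"Airline ID": "2", "Name": "Example Air", "Alias": "", "IATA": "E1", "ICAO": "EXA", "Callsign": "EXAMPLE"},
]
cleaned = clean_rows(rows)
for row in cleaned:
    print(row["Airline ID (clean)"], row["Name (clean)"], row["IATA (clean)"], row["ICAO (clean)"])
```

Using `re.fullmatch` rather than `re.match` matters here: it rejects values that merely begin with valid characters, such as a three-character IATA code.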
Questions
Part A:
How many unique tuples are there in the dataset after Step 1?
How many unique values are there in the Airline ID attribute after Step 2?
How many unique values are there in the Name attribute after Step 3?
How many unique values are there in the Alias attribute after Step 4?
How many unique values are there in the IATA attribute after Step 5?
How many unique values are there in the ICAO attribute after Step 6?
How many unique values are there in the Callsign attribute after Step 7?
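The Part A questions all reduce to counting distinct values, which can be sketched in Python as follows. The helper names and the sample rows are made up for illustration; the real counts depend on the actual dataset and are not reproduced here. Note that under this reading "unknown" counts as one distinct value.

```python
def count_unique_values(rows, column):
    """Number of distinct values in one column."""
    return len({row[column] for row in rows})

def count_unique_rows(rows):
    """Number of distinct rows, comparing every column."""
    return len({tuple(sorted(row.items())) for row in rows})

# Made-up cleaned rows for illustration, not taken from airlines_2022.csv.
rows = [
    {"Airline ID": 0, "IATA": "unknown"},
    {"Airline ID": 0, "IATA": "E1"},
    {"Airline ID": 2, "IATA": "unknown"},
]
print(count_unique_values(rows, "Airline ID"))  # 2
print(count_unique_values(rows, "IATA"))        # 2
print(count_unique_rows(rows))                  # 3
```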
If you require a solution for the above project or have any further inquiries, please feel free to contact us. At CodersArts, our team has expertise in data cleaning, analysis, and model training. We can help you transform your unprocessed data into valuable insights and assist you in achieving your project goals. Contact us via email or through our website to get started and revolutionize your data-driven solutions.