Introduction
In this blog post, we will delve into the intricacies of a significant project titled "Big Data Analytics - Coursework." The project revolves around understanding, analyzing, and deriving insights from the UNSW-NB15 dataset, a rich collection of network traffic data designed for cybersecurity analysis.
Project Overview:
The project aims to conduct a comprehensive analysis of the UNSW-NB15 dataset, which encompasses a blend of real-world normal activities and synthetic contemporary attack behaviors. The dataset, generated using the IXIA PerfectStorm tool within the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS), is designed for big data analytics. The tasks involve understanding the dataset, querying and analyzing using Apache Hive, performing advanced analytics using PySpark, and documenting the entire process.
Tasks:
1. Understanding the Dataset: UNSW-NB15
The UNSW-NB15 dataset's raw network packets were generated using the IXIA PerfectStorm tool within the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS). This dataset was designed to combine real modern normal activities with synthetic contemporary attack behaviors. Tcpdump was employed to capture 100 GB of raw traffic, resulting in Pcap files. The dataset encompasses nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. To further analyze the data, the Argus and Bro-IDS tools were utilized, and twelve algorithms were developed, generating a total of 49 features along with their corresponding class labels.
a) The features are outlined in the table below.
b) The number of attacks and their respective sub-categories is described here.
c) In this coursework, we utilize a total of 10 million records stored in a CSV file (available for download). The file size is approximately 600 MB, which is large enough to warrant big data methodologies for analysis. As big data specialists, our first step is to understand the dataset's features before applying any modeling techniques. To view a subset of the dataset, you can import it into Hadoop HDFS and execute a Hive query to display the first 5-10 records for better comprehension; a PySpark equivalent is sketched below.
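For a quick first look, the minimal sketch below reads the CSV from HDFS with PySpark and displays the first ten records (an equivalent Hive query over an external table would work just as well). The HDFS path and the manually assigned column names are illustrative assumptions, and the snippet assumes the raw file has no header row.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unsw-nb15-peek").getOrCreate()

# Hypothetical HDFS location; the raw CSV is assumed to have no header row.
df = spark.read.csv(
    "hdfs:///data/unsw-nb15/UNSW-NB15.csv",
    header=False,
    inferSchema=True,
)

# Rename the first few positional columns (_c0, _c1, ...) to readable names
# taken from the feature table below; remaining columns keep their default names.
named = ["srcip", "sport", "dstip", "dsport", "proto", "state", "dur"]
for i, name in enumerate(named):
    df = df.withColumnRenamed(f"_c{i}", name)

df.show(10, truncate=False)          # first 10 records
print(f"Total rows: {df.count()}")
```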
Dataset Features:
No. | Name | Type | Description |
1 | srcip | nominal | Source IP address |
2 | sport | integer | Source port number |
3 | dstip | nominal | Destination IP address |
4 | dsport | integer | Destination port number |
5 | proto | nominal | Transaction protocol |
6 | state | nominal | Indicates the state and its dependent protocol |
7 | dur | Float | Record total duration |
8 | sbytes | Integer | Source to destination transaction bytes |
9 | dbytes | Integer | Destination to source transaction bytes |
10 | sttl | Integer | Source to destination time to live value |
11 | dttl | Integer | Destination to source time to live value |
12 | sloss | Integer | Source packets retransmitted or dropped |
13 | dloss | Integer | Destination packets retransmitted or dropped |
14 | service | nominal | Service used |
15 | Sload | Float | Source bits per second |
16 | Dload | Float | Destination bits per second |
17 | Spkts | integer | Source to destination packet count |
18 | Dpkts | integer | Destination to source packet count |
19 | swin | integer | Source TCP window advertisement value |
20 | dwin | integer | Destination TCP window advertisement value |
21 | stcpb | integer | Source TCP base sequence number |
22 | dtcpb | integer | Destination TCP base sequence number |
23 | smeansz | integer | Mean of the flow packet size transmitted by the source |
24 | dmeansz | integer | Mean of the flow packet size transmitted by the destination |
25 | trans_depth | integer | Represents the pipelined depth into the connection of HTTP request/response transaction |
26 | res_bdy_len | integer | Actual uncompressed content size of the data transferred from the server’s HTTP service |
27 | Sjit | Float | Source jitter (milliseconds) |
28 | Djit | Float | Destination jitter (milliseconds) |
29 | Stime | Timestamp | Record start time |
30 | Ltime | Timestamp | Record last time |
31 | Sintpkt | Float | Source interpacket arrival time (milliseconds) |
32 | Dintpkt | Float | Destination interpacket arrival time (milliseconds) |
33 | tcprtt | Float | TCP connection setup round-trip time, the sum of ’synack’ and ’ackdat’ |
34 | synack | Float | TCP connection setup time, the time between the SYN and the SYN_ACK packets |
35 | ackdat | Float | TCP connection setup time, the time between the SYN_ACK and the ACK packets |
36 | is_sm_ips_ports | Binary | If source and destination IP addresses equal and port numbers equal, this variable takes value 1 |
37 | ct_state_ttl | Integer | No. for each state according to specific range of values for source/destination time to live |
38 | ct_flw_http_mthd | Integer | No. of flows that have methods such as Get and Post in HTTP service |
39 | is_ftp_login | Binary | If the FTP session is accessed by user and password then 1 else 0 |
40 | ct_ftp_cmd | integer | No. of flows that have a command in FTP session |
41 | ct_srv_src | integer | No. of connections that contain the same service and source address in 100 connections |
42 | ct_srv_dst | integer | No. of connections that contain the same service and destination address in 100 connections |
43 | ct_dst_ltm | integer | No. of connections of the same destination address in 100 connections |
44 | ct_src_ltm | integer | No. of connections of the same source address in 100 connections |
45 | ct_src_dport_ltm | integer | No. of connections of the same source address and the destination port in 100 connections |
46 | ct_dst_sport_ltm | integer | No. of connections of the same destination address and the source port in 100 connections |
47 | ct_dst_src_ltm | integer | No. of connections of the same source and destination address in 100 connections |
48 | attack_cat | nominal | The name of each attack category |
49 | Label | binary | 0 for normal and 1 for attack records |
2. Big Data Query & Analysis using Apache Hive
This task involves utilizing Apache Hive to transform large raw data into actionable insights for end users. The process begins by thoroughly understanding the dataset. Subsequently, at least 4 Hive queries should be formulated (refer to the marking scheme). Suitable visualization tools should be applied to present the findings both numerically and graphically. A brief interpretation of the findings should also be provided.
Finally, screenshots of the outcomes, including tables and plots, along with the scripts/queries should be included in the report.
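As a hedged illustration of what one such query might look like, the sketch below runs a Hive-style aggregation through PySpark's Hive support; the table name `unsw_nb15` and the use of the `attack_cat` and `sbytes` columns are assumptions for demonstration, not requirements of the coursework.

```python
from pyspark.sql import SparkSession

# Assumes a Hive table named `unsw_nb15` already exists (e.g., an external table
# created over the CSV in HDFS) with the columns listed in the feature table above.
spark = (
    SparkSession.builder
    .appName("unsw-nb15-hive-queries")
    .enableHiveSupport()
    .getOrCreate()
)

# Example query: record count and average source-to-destination bytes per
# attack category, largest categories first.
result = spark.sql("""
    SELECT attack_cat,
           COUNT(*)              AS record_count,
           ROUND(AVG(sbytes), 2) AS avg_src_bytes
    FROM unsw_nb15
    GROUP BY attack_cat
    ORDER BY record_count DESC
""")

result.show(truncate=False)
```

The resulting DataFrame can be converted with `toPandas()` and passed to any plotting library to meet the numerical-plus-graphical presentation requirement.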
3. Advanced Analytics using PySpark
In this section, you will conduct advanced analytics using PySpark.
3.1. Analyze and Interpret Big Data
We need to learn and understand the data through at least four analytical methods (e.g., descriptive statistics, correlation, hypothesis testing, density estimation). Present your work both numerically and graphically, and apply tooltip text, legends, titles, and X-Y labels as appropriate to help end users gain insights.
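As a starting point, here is a minimal sketch of two of those methods (descriptive statistics and a Pearson correlation matrix) in PySpark; `df` and the lowercase numeric column names are assumptions carried over from the earlier loading sketch.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# `df` is assumed to be the UNSW-NB15 DataFrame loaded earlier, with these
# numeric feature columns (names are assumptions based on the feature table).
numeric_cols = ["dur", "sbytes", "dbytes", "spkts", "dpkts"]

# 1) Descriptive statistics: count, mean, stddev, min, and max for each column.
df.select(numeric_cols).describe().show()

# 2) Pearson correlation matrix over the same columns.
assembler = VectorAssembler(inputCols=numeric_cols, outputCol="features")
vector_df = assembler.transform(df.select(numeric_cols).dropna())
corr_matrix = Correlation.corr(vector_df, "features", "pearson").head()[0]
print(corr_matrix.toArray())  # can be rendered as a heatmap for the graphical view
```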
3.2. Design and Build a Classifier
a) Design and build a binary classifier over the dataset. Explain your algorithm and its configuration, present your findings in both numerical and graphical form, and evaluate the model's performance to verify its accuracy and effectiveness (one possible approach is sketched after this list).
b) Apply a multi-class classifier to classify the data into ten classes (categories): one normal and nine attack categories (Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms). Briefly explain your model with supportive statements on its parameters, accuracy, and effectiveness.
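The sketch below shows one possible (not prescribed) way to build both classifiers with PySpark MLlib: a random forest on the binary `label` column for part (a), then the same feature pipeline pointed at an indexed `attack_cat` column for the ten-class case in part (b). The feature columns, column names, split ratio, and `numTrees` setting are all illustrative assumptions.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler

# `df` is assumed to hold the UNSW-NB15 records with the numeric feature columns
# below, a binary `label` column (0 = normal, 1 = attack), and an `attack_cat` column.
feature_cols = ["dur", "sbytes", "dbytes", "sttl", "dttl", "sload", "dload"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train, test = df.dropna(subset=feature_cols).randomSplit([0.7, 0.3], seed=42)

# (a) Binary classifier: random forest on the 0/1 label.
rf_binary = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)
binary_model = Pipeline(stages=[assembler, rf_binary]).fit(train)
binary_preds = binary_model.transform(test)
acc = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy")
print("Binary accuracy:", acc.evaluate(binary_preds))

# (b) Multi-class classifier: index attack_cat into a numeric label (normal records
# are assumed to carry an empty or "Normal" category value) and reuse the pipeline.
indexer = StringIndexer(inputCol="attack_cat", outputCol="cat_label", handleInvalid="keep")
rf_multi = RandomForestClassifier(labelCol="cat_label", featuresCol="features", numTrees=50)
multi_model = Pipeline(stages=[indexer, assembler, rf_multi]).fit(train)
multi_preds = multi_model.transform(test)
acc_multi = MulticlassClassificationEvaluator(labelCol="cat_label", metricName="accuracy")
print("Multi-class accuracy:", acc_multi.evaluate(multi_preds))
```

In a full solution you would justify the choice of algorithm, tune its parameters, and report precision, recall, F1, and a confusion matrix alongside accuracy.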
4. Documentation
Document all your work. Your final report must follow the five sections detailed in the "format of final submission" section of the coursework brief. Your work must demonstrate an appropriate understanding of academic writing and integrity.
Conclusion
This project walks through the complete big data analytics workflow on the UNSW-NB15 dataset: understanding its 49 features and nine attack categories, querying it with Apache Hive, performing advanced analytics and building binary and multi-class classifiers with PySpark, and documenting the results. We encourage readers to explore the dataset further and to engage with the provided sample assignment.
Codersarts provides tailored assistance for your big data analytics project, following the tasks and objectives outlined in this blog post. Our team specializes in guiding you through each stage of the project, from understanding the dataset to implementing advanced analytics techniques.
We offer hands-on support in using Apache Hive and PySpark for data transformation, querying, and analysis. Our experts ensure efficient preprocessing and feature engineering to enhance the accuracy of your models. With a focus on coding best practices, we ensure the quality and reliability of your analytical solutions.
Codersarts facilitates thorough project evaluation, conducting quantitative assessments and offering insightful interpretations of your findings. Additionally, we provide services such as documentation review and problem-solving sessions to enhance the overall quality and success of your big data analytics endeavor.
If you require any assistance with the project discussed in this blog, or if you find yourself in need of similar support for other projects, please don't hesitate to reach out to us. Our team can be contacted at any time via email at contact@codersarts.com.