Big Data And Cloud Computing | Sample Assignment

Description

The aim of this assignment is to introduce a practical application of Big Data and Cloud Computing using a realistic big data problem. Students will implement a solution using an industry-leading cloud computing provider together with the distributed processing environment Apache Spark. This will involve selecting machine learning algorithms and methods appropriate to the problem.

Data Sets and Formats for the Assignment

There are several datasets available for this assignment. These have been installed on AWS S3. Further data are available on data.gov.uk should you want more detail.

The data sets are:

all_crimes18_hdr.txt.gz (14GB Compressed, 43x10^6 records)
LSOA_pop_v2.csv (2.4MB uncompressed)
postcodes.gz (0.6GB Compressed)
posttrans.csv (23.5 kB uncompressed)

All the files are in CSV format and may be compressed with gzip. Spark reads this compression format natively, so you can load the compressed files exactly as you would plain CSV files.
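To illustrate that a gzip-compressed file is an ordinary CSV once decompressed, here is a minimal plain-Python sketch for peeking at the first few rows outside Spark. The function name `read_csv_rows` and the sample coordinates are made up for illustration; this is not the Spark loading step the assignment requires.

```python
import csv
import gzip
import io
import itertools


def read_csv_rows(raw_bytes, limit=5):
    """Return up to `limit` rows (as dicts) from gzip-compressed CSV bytes."""
    # gzip.open accepts a file-like object as well as a file name.
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)  # first row is used as the header
        return list(itertools.islice(reader, limit))


# Tiny in-memory sample using the posttrans.csv header; the values are invented.
sample_csv = "Postcode,Lon,Lat\nNE1 8ST,-1.607,54.977\n"
compressed = gzip.compress(sample_csv.encode("utf-8"))
rows = read_csv_rows(compressed)
print(rows)
```

For the real files you would pass a file name to `gzip.open` instead of an in-memory buffer and keep `limit` small, since the crimes file is very large.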

Location measurement

In these datasets location is specified in several ways:

The ‘crime data’ uses ‘anonymized’ longitude and latitude and the LSOA code.
The LSOA dataset uses the LSOA code.
The postcodes dataset uses area-centred longitude and latitude, postcode, LSOA code, map grid reference and many other measures.

When considering which to use, you should bear in mind the level of detail needed:

The LSOA covers a mean of 1500 people. This means that LSOA will not have the same level of accuracy as the other measures, but it will be easier to handle.
Longitude and latitude are accurate to 2-3m. You will need to convert these into postcodes.
A full postcode (e.g. NE1 8ST, NE2 1XE) corresponds to approximately 6 households. You can generate a larger area by summing data over the first part (NE1, NE2).
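Summing over the first part of the postcode can be sketched as follows. This is a small plain-Python illustration; the helper names `outward_code` and `sum_by_outward` and the counts are invented for the example, and the same grouping would normally be expressed in Spark for the full dataset.

```python
from collections import Counter


def outward_code(postcode):
    """Return the first part of a full postcode, e.g. 'NE1' for 'NE1 8ST'."""
    cleaned = postcode.strip().upper()
    if " " in cleaned:
        return cleaned.split()[0]
    # Without a space, rely on the second part of a UK postcode
    # always being three characters long.
    return cleaned[:-3]


def sum_by_outward(counts_by_postcode):
    """Aggregate per-postcode counts into per-area totals."""
    totals = Counter()
    for postcode, count in counts_by_postcode.items():
        totals[outward_code(postcode)] += count
    return dict(totals)


# Invented counts, for illustration only.
example = {"NE1 8ST": 3, "NE1 7RU": 2, "NE2 1XE": 4}
print(sum_by_outward(example))
```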

The Crimes Data

all_crimes18_hdr.txt.gz contains about 43 million reported and logged crimes from 2010 to 2017. The data were downloaded from https://data.police.uk/data/. This site offers data by month and by force, so the individual files have been merged into one file for this assignment.

You can find out more about the data here (https://data.police.uk/about/#columns). Only ‘street’ files have been included. Outcomes are included.

The header row of the crimes data is:

'Crime ID', 'Month', 'Reported by', 'Falls within', 'Longitude', 'Latitude', 'Location', 'LSOA code', 'LSOA name', 'Crime type', 'Last outcome category'

Note that Longitude and Latitude are anonymized as described on the police web site here: https://data.police.uk/about/#location-anonymisation. Since the police use around 750,000 'anonymous' map points, it is unlikely that these coincide with the longitudes and latitudes given in the postcode dataset. For this reason, it is best to use the LSOA (Lower Layer Super Output Area, UK Office for National Statistics) as a region indication.

The file posttrans.csv will allow the translation of crimes’ longitude and latitude into actual postcodes.
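A minimal sketch of such a translation, assuming that posttrans.csv lists the same anonymized map points as the crimes file and that coordinates can be matched after rounding to six decimal places (both are assumptions, not confirmed by the data description). The helper names and sample values are invented for illustration.

```python
import csv
import io


def coord_key(lon, lat, places=6):
    """Build a lookup key from coordinate strings (the rounding is an assumption)."""
    return (round(float(lon), places), round(float(lat), places))


def build_lookup(posttrans_text):
    """Map coordinate keys to postcodes from posttrans-style CSV text."""
    reader = csv.DictReader(io.StringIO(posttrans_text))
    return {coord_key(row["Lon"], row["Lat"]): row["Postcode"] for row in reader}


def postcode_for(crime_row, lookup):
    """Return the postcode for a crime row, or None if it cannot be matched."""
    lon, lat = crime_row.get("Longitude"), crime_row.get("Latitude")
    if not lon or not lat:
        return None  # treat rows without usable coordinates as unmatched
    return lookup.get(coord_key(lon, lat))


# Invented sample values, for illustration only.
lookup = build_lookup("Postcode,Lon,Lat\nNE1 8ST,-1.607000,54.977000\n")
print(postcode_for({"Longitude": "-1.607", "Latitude": "54.977"}, lookup))
```

In Spark the same idea would be a join between the crimes data and posttrans.csv on the coordinate columns.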

Location Data.

The headers of the LSOA_pop_v2.csv file are:

"date","geography","geography code","Rural Urban","Variable: All usual residents; measures: Value","Variable: Males; measures: Value","Variable: Females; measures: Value","Variable: Lives in a household; measures: Value","Variable: Lives in a communal establishment; measures: Value","Variable: Schoolchild or full-time student aged 4 and over at their non term-time address; measures: Value","Variable: Area (Hectares); measures: Value","Variable: Density (number of persons per hectare); measures: Value"

Postcodes Data.

The headers of the postcodes.gz file are:

'Postcode','InUse?','Latitude','Longitude','Easting','Northing','GridRef','County','District','Ward','DistrictCode','WardCode','Country','CountyCode','Constituency','Introduced','Terminated','Parish','NationalPark','Population','Households','BuiltUpArea','Builtupsubdivision','Lowerlayersuperoutputarea','Rural/urban','Region','Altitude'

Posttrans Data.

The headers of the posttrans.csv file are:

Postcode,Lon,Lat

https://s3.amazonaws.com/kf7032-20.northumbria.ac.uk/posttrans.csv

Assignment Overview

The portfolio assignment is divided into components as follows:

Training Tasks (30%)

Semi-formative elements of the portfolio constitute 30% of the assessment for this module and include group, individual, and peer-assessed work.

Combined Big Data Product and Report: (70%)

Individual work – Combined Big Data Product and Report: “A Critical Assessment of the Big Data Approach to Crime Analysis”. This activity assesses module learning outcomes 1, 2, 3, 4 & 5. This practical element is the final module assessment.

Training Tasks

Training Task 1: Peer Reviewed Task

The objective of this task is to ensure that students have mastered the following skills, which are required for the final module assessment:

Process a data set using the recommended software environment for the module.
Explain the logical reasoning behind your code.

This work will be peer assessed, as recommended by the British Computer Society. That is, you will critically assess the work of fellow students (your peers) and THEY will assess yours.

In detail:

You will create a Jupyter notebook based on the scenario below (which is derived from weekly worksheets 1-4), explaining your code using notebook-embedded Markdown (i.e. formatted text, not just comments).
You will post your notebook to the module discussion board on Blackboard.
You will then mark (i.e. peer review) the submission preceding yours on the discussion board, and the one following it, using the marking scheme below, and post these mark sheets.
Your mark for this task will be the average of your peer marks.

Scenario:

Suppose you are a police department with a limited budget. You plan to reduce road-traffic accidents by a one-month targeted advertising campaign.

Using the given dataset, which gender, age group, and month would be the largest target group as indicated by positive breath tests?
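The kind of grouping this question calls for can be sketched in plain Python as below. The field names (`gender`, `age_band`, `month`, `result`) and the sample records are hypothetical, since the dataset's actual columns are not listed here; treat this as an outline of the counting step only.

```python
from collections import Counter


def largest_target_group(records):
    """Return ((gender, age_band, month), count) for the most common positive group."""
    positives = Counter(
        (r["gender"], r["age_band"], r["month"])
        for r in records
        if r["result"] == "positive"  # hypothetical label for a positive breath test
    )
    return positives.most_common(1)[0] if positives else None


# Invented records, for illustration only.
sample_records = [
    {"gender": "Male", "age_band": "25-29", "month": "Dec", "result": "positive"},
    {"gender": "Male", "age_band": "25-29", "month": "Dec", "result": "positive"},
    {"gender": "Female", "age_band": "30-39", "month": "Jun", "result": "positive"},
    {"gender": "Male", "age_band": "25-29", "month": "Dec", "result": "negative"},
]
print(largest_target_group(sample_records))
```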

Training Task 2: Group work participation Task

The objective of this task is to derive background study materials for the big data product to be used by the whole class. These may include (but are not limited to) reviewing the literature on crime and big data, examining published work on violent crime and its causes, technical approaches to crime and big data, relevant statistics and other computational methods. That is, to research the topic in general.

Working in teams of up to four students, each group will produce at least 2000 original words, plus 10 references to scientific conference or journal papers.

Since this is a group training task, your participation is assessed, rather than your content (Students will be able to receive staff feedback on content during taught sessions).

Group work participation Task Marking Scheme

Each group will score 6% of the module mark, proportionally reduced by the percentage of copied work as determined by Turnitin (above a threshold of 10%), by any shortfall in the word count below 2000, and by any shortfall in the number of references below 10.

Examples:

Group A submit a total of 2100 words, plus 15 references which have a Turnitin similarity score of 8% (due to random matches). Each group member will score 6%.

Group B submit a total of 1500 words, plus 5 references, which have a Turnitin similarity score of 20% (due to material copied from the Internet). Each group member will score:

(1500/2000) * ((110 - 20)/100) * (5/10) * 6% = 2.025%, i.e. approximately 2%
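The two examples can be reproduced with the short sketch below. It assumes that each factor is capped at 1, so exceeding 2000 words or 10 references, or staying under the 10% similarity threshold, earns no extra credit; this reading is inferred from the Group A example, and the function name `group_score` is invented for illustration.

```python
def group_score(words, references, similarity_pct,
                max_score=6.0, min_words=2000, min_refs=10, threshold=10.0):
    """Return the group mark (as a percentage of the module mark)."""
    word_factor = min(words / min_words, 1.0)      # shortfall in words
    ref_factor = min(references / min_refs, 1.0)   # shortfall in references
    # Only similarity above the threshold is penalised: (110 - s)/100 for s > 10.
    excess = max(similarity_pct - threshold, 0.0)
    similarity_factor = max((100.0 - excess) / 100.0, 0.0)
    return max_score * word_factor * ref_factor * similarity_factor


print(group_score(2100, 15, 8))   # Group A
print(group_score(1500, 5, 20))   # Group B
```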

Big Data Product: Weapons and Drugs

In the television documentary “Ross Kemp and the Armed Police”, broadcast on 6th September 2018 by ITV, multiple claims were made regarding violent crime in the UK.

These claims were:

Violent Crime is increasing
There are more firearms incidents per head in Birmingham than anywhere else in the UK
Crimes involving firearms are closely associated with drugs offences

In this assignment you will investigate these claims using real, publicly available data sets that will be made available to you and placed in Amazon S3. These include, but are not limited to:

Street Level Crime Data published by the UK Home Office. This dataset contains 19 million data rows giving a crime type, together with their location as a latitude and longitude.
Land Registry Price Paid Data: This gives the postcode of a property, the property type from an enumeration of D (Detached), S (Semi-Detached), T (Terraced), F (Flats/Maisonettes) and the price paid.
Postcode Data: This data set is based on material provided by the Ordnance Survey. It gives a latitude and longitude for every postcode. This is useful because it links the postcodes in the Land Registry Price Paid dataset to the latitude/longitude in the original crime dataset.

Specifics

Process the data prepared for you using Apache Spark.
Filter the dataset so that crimes refer to relevant events only.
Using appropriate visualization methods, statistics, and machine learning, determine whether the claims made by Ross Kemp were true, false, or could not be determined.
Explain the reasoning behind your code so that it is clear what each block is intended to achieve, and why.
Report critically on the advantages, disadvantages, and limitations of the methods used.
Your submission will be a Jupyter Notebook containing both code (typically Python), and explanatory text (Markdown) limited to 2500 words (plus references). References from scientific literature must be used and your discussion must be your own words. DO NOT CUT AND PASTE FROM THE INTERNET.
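Claims of the "per head" kind need crime counts normalised by population. A minimal plain-Python sketch of that step is shown below, assuming counts have already been aggregated per area (for example per LSOA code) and that the population comes from a column such as "Variable: All usual residents; measures: Value" in LSOA_pop_v2.csv. The function name, the per-1000 scaling, and the sample figures are illustrative choices, not part of the brief.

```python
def rate_per_1000(crime_counts, populations):
    """Return crimes per 1000 residents for each area with a known population."""
    rates = {}
    for area, population in populations.items():
        if population > 0:  # skip areas with no residents to avoid division by zero
            rates[area] = 1000.0 * crime_counts.get(area, 0) / population
    return rates


# Invented figures, for illustration only.
counts = {"area_a": 30, "area_b": 30}
residents = {"area_a": 1500, "area_b": 3000, "area_c": 0}
print(rate_per_1000(counts, residents))
```

Equal raw counts give different rates here, which is why comparisons between places should use rates rather than totals.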