Water Base Data Analysis Assignment | Sample assignment

Apr 19, 2022

Updated: May 10, 2022

Datasets

Download file data.zip from the CMM510 assessment area of CampusMoodle.

Unzip data.zip. It contains the following files:

sampleWater1.csv: contains the first dataset about water base analysis of samples. In the remainder of this document, this dataset is referred to as water1.
sampleWater2.csv: contains a second dataset about water base analysis of samples. In the remainder of this document, this dataset is referred to as water2.
testWater.csv: contains a third dataset about water base analysis of samples, which you will use for testing. In the remainder of this document, this dataset is referred to as testWater.

Each of the above files contains a dataset with details of analyses of the water base, using the following features (attributes):

siteIDScheme – either eionetMonitoringSiteCode or euMonitoringSiteCode.
WBCategory – the water base category (GW, LW, RW, TW).
determinandC – the determinand code (CAS_14798-03-9:328, CAS_7723-14-0:632, EEA_3131-01-9:700, EEA_3132-01-2:825).
analysed – fraction of the sample analysed (dissolved, SPM, total).
media – type of media monitored (sediment, water).
NSamples – the number of samples.
minValue – minimum sample value used.
meanValue – mean sample value used.
maxValue – maximum sample value used.
sd – standard deviation of the sample values.
method – CEN/ISO code of the analytical method used.
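
As a starting point for inspecting the files, the following is a minimal sketch only. It assumes the CSV column names match the feature names listed above, and it writes a tiny invented stand-in file (a subset of the columns, with made-up values) to a temporary path so that it runs without data.zip. Replace `path` with the location of sampleWater1.csv to use the real data.

```r
# Invented stand-in for sampleWater1.csv: a subset of the listed features,
# with made-up values and one deliberately missing sd.
toy <- data.frame(
  WBCategory = c("GW", "LW", "RW", "TW", "GW", "RW"),
  analysed   = c("total", "dissolved", "total", "SPM", "total", "dissolved"),
  media      = c("water", "water", "water", "sediment", "water", "water"),
  NSamples   = c(4, 12, 6, 3, 8, 10),
  minValue   = c(0.01, 0.10, 0.05, 0.20, 0.02, 0.04),
  meanValue  = c(0.05, 0.30, 0.12, 0.45, 0.06, 0.10),
  maxValue   = c(0.09, 0.70, 0.25, 0.80, 0.11, 0.19),
  sd         = c(0.02, 0.15, NA, 0.21, 0.03, 0.05)
)
path <- tempfile(fileext = ".csv")  # hypothetical path, stands in for the real file
write.csv(toy, path, row.names = FALSE)

# Load the file; text columns become factors, empty cells become NA.
water1 <- read.csv(path, stringsAsFactors = TRUE, na.strings = c("", "NA"))

str(water1)      # column types and sizes
summary(water1)  # ranges, quartiles and factor level counts

class_counts <- table(water1$WBCategory)     # class balance
missing_per_column <- colSums(is.na(water1)) # missing values per feature
print(class_counts)
print(missing_per_column)
```

The class counts and the missing-value counts are the kind of observations that usually drive later pre-processing decisions.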

TASK 1 Dataset Exploration and Classification

a. Load and inspect the three data files above. Use R code to explore and analyse the datasets, and highlight key observations and anything else of interest. [Word limit: 200, excluding code and/or plots]

b. Run two tree classifiers and one instance-based classifier on water1 and on a new dataset, water, that combines water1 and water2. The class is WBCategory. Note that you may have to pre-process the data files before you can use them for classification. Critically compare the performance of the 3 algorithms on the two datasets. In your explanations, include the performance metric(s), the evaluation method, the parameters used and the size of the datasets. Give details of any data pre-processing. [Word limit: 200, excluding code and/or results]
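
One possible shape for the evaluation is sketched below, under stated assumptions: it uses 10-fold cross-validation with accuracy as the metric, `rpart` as a single tree classifier and k-nearest neighbours (`knn` from the `class` package, k = 5) as the instance-based classifier. The data frame is synthetic, invented so the example runs on its own; the choice of algorithms, parameters and metric is illustrative and not prescribed by this specification. A second tree learner would slot into the same loop.

```r
library(rpart)  # decision tree (recommended package shipped with R)
library(class)  # k-nearest neighbours

set.seed(42)
n <- 200
# Synthetic stand-in for the pre-processed water data: two numeric features
# whose means depend on the class, so there is something to learn.
wb <- factor(sample(c("GW", "LW", "RW", "TW"), n, replace = TRUE))
water <- data.frame(
  WBCategory = wb,
  meanValue  = rnorm(n, mean = as.integer(wb)),
  sd         = rnorm(n, mean = 2 * as.integer(wb))
)

k_folds <- 10
fold <- sample(rep(seq_len(k_folds), length.out = n))  # random fold assignment

acc_tree <- numeric(k_folds)
acc_knn  <- numeric(k_folds)
for (i in seq_len(k_folds)) {
  train <- water[fold != i, ]
  test  <- water[fold == i, ]

  # Tree classifier with default rpart parameters.
  tree <- rpart(WBCategory ~ ., data = train, method = "class")
  pred_tree <- predict(tree, test, type = "class")
  acc_tree[i] <- mean(pred_tree == test$WBCategory)

  # k-NN is distance based, so scale features using training-fold statistics
  # and apply the same centring and scaling to the test fold.
  x_train <- scale(train[, c("meanValue", "sd")])
  x_test  <- scale(test[, c("meanValue", "sd")],
                   center = attr(x_train, "scaled:center"),
                   scale  = attr(x_train, "scaled:scale"))
  pred_knn <- knn(x_train, x_test, cl = train$WBCategory, k = 5)
  acc_knn[i] <- mean(pred_knn == test$WBCategory)
}

mean_accuracy <- c(tree = mean(acc_tree), knn = mean(acc_knn))
print(mean_accuracy)
```

Keeping the per-fold accuracies, rather than only their means, makes a later paired comparison of the algorithms possible.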

c. If one algorithm's performance was better than another's, discuss any circumstances under which the lower-performing algorithm may be preferred, and state the confidence level (if any) at which the difference in performance is not statistically significant. [Word limit: 100, excluding code and/or plots]
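
One way to approach the significance question, offered as a sketch rather than the required method, is a paired t-test on per-fold accuracies obtained from the same cross-validation folds. The accuracy vectors below are invented for illustration. Note that fold results are not fully independent, so the test should be read with some caution.

```r
# Hypothetical per-fold accuracies for two classifiers evaluated on the
# same 10 folds (invented numbers, not real results).
acc_a <- c(0.81, 0.78, 0.84, 0.80, 0.79, 0.83, 0.77, 0.82, 0.80, 0.79)
acc_b <- c(0.79, 0.77, 0.80, 0.78, 0.80, 0.81, 0.75, 0.79, 0.78, 0.78)

# Paired test: each fold contributes one difference acc_a - acc_b.
result  <- t.test(acc_a, acc_b, paired = TRUE)
p_value <- result$p.value

# The difference is significant at level alpha only if p_value < alpha,
# so it is not significant at confidence levels of 1 - p_value or higher.
threshold_confidence <- 1 - p_value

print(result)
print(threshold_confidence)
```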

d. Test the 3 models you trained using the water dataset on the testWater dataset and compare their performance. [Word limit: 100, excluding code and/or plots]
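
The hold-out test can be sketched as follows for a single model, assuming `rpart` as the classifier and using synthetic stand-ins for water and testWater (the helper `make_data` is hypothetical and exists only to make the example runnable). The point to note is that the training and test factors share the same levels, so the confusion matrix is square.

```r
library(rpart)

set.seed(1)
categories <- c("GW", "LW", "RW", "TW")

# Hypothetical helper that invents a dataset with class-dependent features.
make_data <- function(n) {
  wb <- factor(sample(categories, n, replace = TRUE), levels = categories)
  data.frame(
    WBCategory = wb,
    meanValue  = rnorm(n, mean = as.integer(wb)),
    maxValue   = rnorm(n, mean = 3 * as.integer(wb))
  )
}
water     <- make_data(300)  # stands in for water1 + water2
testWater <- make_data(100)  # stands in for testWater.csv

# Train on water only, then predict the unseen testWater rows.
model <- rpart(WBCategory ~ ., data = water, method = "class")
pred  <- predict(model, newdata = testWater, type = "class")

# Rows are the true classes, columns the predicted classes.
confusion <- table(actual = testWater$WBCategory, predicted = pred)
accuracy  <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

Any pre-processing applied to water has to be applied to testWater in exactly the same way before predicting.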

TASK 2 Clustering and Additional Insights

a. Cluster the water dataset using ONE clustering algorithm discussed in class, undertaking any pre-processing required. Discuss the ideal number of clusters and comment on what the clusters represent. Justify your choice of clustering algorithm and discuss whether the resulting clusters correlate with any attribute. [Word limit: 150, excluding code and/or plots]
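
As an illustration only, the sketch below assumes k-means is among the algorithms discussed in class and uses the within-cluster sum of squares (the "elbow" heuristic) to look for a reasonable number of clusters. The data are synthetic, with three invented groups, and the `label` factor stands in for whichever attribute the clusters are being compared against.

```r
set.seed(7)
# Synthetic stand-in: three well-separated groups in two numeric features.
x <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)
colnames(x) <- c("meanValue", "sd")
label <- factor(rep(c("A", "B", "C"), each = 50))  # invented attribute

# k-means uses Euclidean distance, so put the features on a common scale.
x_scaled <- scale(x)

# Total within-cluster sum of squares for k = 1..6; the k where the curve
# stops dropping sharply is a candidate for the number of clusters.
ks  <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(x_scaled, centers = k, nstart = 20)$tot.withinss
})
print(round(wss, 1))

# Fit the chosen k and cross-tabulate clusters against the attribute.
fit <- kmeans(x_scaled, centers = 3, nstart = 20)
cluster_vs_label <- table(cluster = fit$cluster, label = label)
print(cluster_vs_label)
```

Calling `plot(ks, wss, type = "b")` draws the elbow curve, and a cross-tabulation dominated by one cell per row suggests the clusters line up with that attribute.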

b. Undertake one further data mining activity of your choice using one or more of the datasets available with this coursework to demonstrate your understanding of data mining. The data mining activity you choose must have been covered in this module (CMM510). [Word limit: 200, excluding code and/or plots]

Submission

You are required to use the CMM510 CampusMoodle coursework drop box to submit the following:

All your code in either an Rmd or an R file. Ensure that your code is suitably labelled and commented.
A Word, PDF or HTML file containing all the information requested in Tasks 1 and 2, including:

a. All the settings used.

b. The R code used.

c. The results of running experiments, including any plots obtained.

d. Your observations and discussion regarding the results as required, ensuring that you abide by the word limit for each section.

Note: the tasks (and subtasks) should appear in the order in which they are described in this coursework specification. All tasks and subtasks should have a clear title/heading to identify them.

Software to be used

R and RStudio

Deliverables

Code – an R or Rmd file containing all the code, suitably commented.
Document – an HTML, PDF or Word document containing the code, results, explanations and discussions for all the tasks.

These and the associated word counts are discussed further on page 4.