Clustering
Use clustering code from scikit-learn
http://scikit-learn.org/stable/modules/clustering.html
1. Data
Run the kMeans example that illustrates the types of data shapes where kMeans performs well and where it does not.
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html#sphx-glr-auto-examples-cluster-plot-kmeans-assumptions-py
You will see 4 plots. The first shows the clustering of the original data; the other three show the clustering of the data after different transformations.
a. See in the code which transformations are applied; look them up to better understand what they do.
b. 10 Points Summarize very briefly in a table:
i. What each transformation does to the data.
ii. Each transformation makes the data much more difficult for kMeans. What makes
the transformed data more challenging for kMeans? This should be a simple and brief explanation based on how kMeans works (partitional clustering, centroid based); very briefly discuss how the distance to the nearest centroid is sometimes misaligned with the shape of the data. See the lecture notes for details on these 2 properties of kMeans.
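The three transformations discussed above can be reproduced with a short sketch. This follows the scikit-learn "kmeans assumptions" example (the transformation matrix values below are the ones used there); the variable names are illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Base data, as in the sklearn kMeans-assumptions example: 3 round blobs.
X, y = make_blobs(n_samples=1500, random_state=170)

# Anisotropic: a linear transformation stretches/shears the blobs, so
# Euclidean distance to a centroid no longer matches the cluster shape.
transformation = [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]]
X_aniso = np.dot(X, transformation)

# Different variance: blobs with unequal spread (cluster_std per blob).
X_varied, y_varied = make_blobs(
    n_samples=1500, cluster_std=[1.0, 2.5, 0.5], random_state=170)

# Unevenly sized: keep very different numbers of points per blob.
X_filtered = np.vstack(
    (X[y == 0][:500], X[y == 1][:100], X[y == 2][:10]))

print(X_aniso.shape, X_varied.shape, X_filtered.shape)
```

Plotting each array should reproduce the shapes shown in the example's four panels.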
c. 10 points Best K. Use the elbow method to determine the best k for each
shape of the data.
i. Add small elbow plots for the SSE vs K for each data shape in the table
you started in b.ii.
ii. Decide what the best K is for each data shape based on c.i. Show the
best K for each data shape in the table.
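For the elbow plots in c.i, one minimal sketch is to run kMeans for a range of k and record the SSE, which scikit-learn exposes as `inertia_`. The k range and output filename below are my own choices, not part of the assignment:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the plot is saved, not shown
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Original (untransformed) data shape; repeat for each transformation.
X, _ = make_blobs(n_samples=1500, random_state=170)

ks = range(1, 9)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # SSE = sum of squared distances to centroids

plt.plot(list(ks), sse, marker="o")
plt.xlabel("k")
plt.ylabel("SSE (inertia)")
plt.savefig("elbow_original.png")
```

The "elbow" is the k after which the SSE curve flattens; for the 3-blob data it should appear around k=3.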
2. kMeans
a. The code runs kMeans on the original data and on 3 transformations.
i. Try 5 different values of k (the number of desired clusters) for each run of kMeans.
1. Try increasing k to get a more granular analysis of the data.
2. Try decreasing k.
ii. Report your findings. Put the results in a summary table for easier comparison.
1. The table should have a row for each data shape, columns can be the 5
values of k that you tried.
2. 5 points Use small screenshots of the resulting clustering for each k/data
shape and put them into the table.
3. 5 points Put the SSE value for each k.
4. 10 points Summary
a. Summarize what you see in a few sentences below the table. In your
summary address how increasing/decreasing k changes how the
clusters cover the data. How does the SSE change? Does the smallest SSE always go with the best clustering?
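The loop over 2.i and 2.ii.3 can be sketched as below; I use the anisotropic transformation as the example data shape, and the particular five k values ([2, 3, 4, 5, 6]) are an illustrative choice, not prescribed by the assignment:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1500, random_state=170)
# Anisotropic transformation from the kMeans-assumptions example.
X_aniso = np.dot(X, [[0.60834549, -0.63667341], [-0.40887718, 0.85253229]])

# Try 5 values of k around the true number of blobs (3) and record SSE.
results = {}
for k in [2, 3, 4, 5, 6]:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_aniso)
    results[k] = km.inertia_

for k, sse in results.items():
    print(f"k={k}: SSE={sse:.1f}")
```

Note that SSE keeps dropping as k grows even past the "right" number of clusters, which is why the smallest SSE does not automatically mean the best clustering.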
3. DBScan.
Run the DBScan example that illustrates how it works
http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
a. Read about the clustering quality measures
i. Completeness/Homogeneity/V-measure
b. Try different combinations of the values of epsilon and min_samples.
c. Create a summary table.
d. 15 points Report 9 results: Three different values for epsilon and three
different values of min_samples. Use their combinations for your report.
First, experiment and try many different values - increase/decrease epsilon
and min_samples. See when you get very different results from the default
used in the example. Take notes, and report your results only for the default
values and for the 2 most interesting other settings that you find.
i. The table should contain min_samples and epsilon as rows/columns.
Add small screenshots of the results in the cells of the table. Also include the
Completeness/Homogeneity/V-measure for each screenshot.
ii. No text is necessary in the table.
e. 15 points Summary: Explain how your changes to these parameters change
the result. Base your explanation on the definition and role of these parameters during DBScan. Mention how the Completeness/Homogeneity/V-measure change.
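The eps/min_samples grid in (d) can be sketched as follows. The data generation matches the scikit-learn DBSCAN example; the three values chosen for each parameter are illustrative, and you should pick your own after experimenting:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import (completeness_score, homogeneity_score,
                             v_measure_score)
from sklearn.preprocessing import StandardScaler

# Same data setup as the sklearn plot_dbscan example.
X, labels_true = make_blobs(n_samples=750,
                            centers=[[1, 1], [-1, -1], [1, -1]],
                            cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

for eps in [0.1, 0.3, 0.9]:          # 0.3 is the example's default
    for min_samples in [5, 10, 20]:  # 10 is the example's default
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # Label -1 marks noise; exclude it from the cluster count.
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print(f"eps={eps}, min_samples={min_samples}: "
              f"clusters={n_clusters}, "
              f"H={homogeneity_score(labels_true, labels):.2f}, "
              f"C={completeness_score(labels_true, labels):.2f}, "
              f"V={v_measure_score(labels_true, labels):.2f}")
```

Each printed row corresponds to one cell of the 3x3 summary table; the screenshots come from plotting `labels` for each combination.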
4. DBScan.
Data Transformation
a. Apply the transformations from the kMeans example to the DBScan input data. Experiment and try many different values, increasing/decreasing epsilon and min_samples, to find the combination of these parameters that produces the best result.
b. 15 points Create a summary table. Report only the DBScan results for the best combination you found in (4.a). Explain why you think those parameter values work best (for example, why setting epsilon to a smaller value like 0.1 could be good for the Anisotropic transformation). What properties of the data distribution require a small epsilon?
c. 15 points Compare to the output of kMeans with the best parameter k for each transformation. Briefly explain, using your knowledge of how these algorithms work, why in some cases one is better than the other. What properties of the data distribution give different results for these 2 algorithms?
As an example, to illustrate what is required in this question: I applied the different-variance transformation to the DBScan data. With my initial guess for the values of epsilon=0.2, min_samples=10, I got the clustering in image Result 1. After changing the values of these two parameters in different combinations, I got the clustering in image Result 2. Note that the black nodes are noise and are not included in clustering, according to DBScan. This is more appropriate for the data.