News Article Clustering Analysis | Sample

Apr 26, 2022

Updated: May 10, 2022

Question 1

You are provided with a dataset, “NewsArticles.json”, containing unlabeled news articles on mixed topics: business, entertainment, politics, sports, and technology.

Load the dataset: res/NewsArticles.json

You are required to make a clustering-based model.

Carry out the following tasks:

Perform K-Means clustering on the above dataset and find the Sum of Squared Errors (SSE).
Use PCA to reduce the dataset to about 100 dimensions, then perform K-Means clustering on the reduced dataset and find the SSE.
Find the cluster with the highest count (before PCA), and report that count.
Find the cluster with the highest count (after PCA), and report that count.
Extract the top 50 words from each cluster in both cases, and print the last (50th) word from the cluster you think contains entertainment news articles (before PCA).
Extract the top 50 words from each cluster in both cases, and print the last (50th) word from the third cluster (after PCA).
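The first two tasks can be sketched roughly as follows. This is a minimal sketch, not the reference solution: it assumes scikit-learn is available, that the articles have already been read from the JSON file into a list of strings called `texts` (the file's internal structure is not described here), and that plain TF-IDF with default settings is an acceptable way to turn text into numbers. The helper name `cluster_sse` is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_sse(texts, n_clusters=5, n_components=None, seed=0):
    """Vectorize texts, optionally apply PCA, run K-Means; return (model, labels, sse)."""
    # Assumption: default TF-IDF stands in for the feature-extraction step.
    features = TfidfVectorizer().fit_transform(texts).toarray()
    if n_components is not None:
        # PCA cannot keep more components than min(n_samples, n_features).
        k = min(n_components, *features.shape)
        features = PCA(n_components=k, random_state=seed).fit_transform(features)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(features)
    # inertia_ is the within-cluster sum of squared distances to the centroids.
    return model, model.labels_, model.inertia_
```

Calling `cluster_sse(texts)` would give the SSE before PCA, and `cluster_sse(texts, n_components=100)` the SSE after PCA. The fixed `seed` only makes runs repeatable; a different seed may produce different clusters and a different SSE.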

Hint: In both of the above cases, use 5 as the number of clusters and compute the within-cluster Sum of Squared Errors.
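With the number of clusters fixed at five, the counting and top-word tasks reduce to bookkeeping over the cluster labels. The sketch below uses only the standard library and assumes `labels` is a sequence of cluster ids aligned with a list of article strings `texts`. Ranking words by raw frequency after a whitespace split is an assumption about what "top words" means; ranking by centroid weight would be another reasonable reading. Both function names are hypothetical.

```python
from collections import Counter, defaultdict


def largest_cluster(labels):
    """Return (cluster_id, count) for the most populated cluster."""
    return Counter(labels).most_common(1)[0]


def top_words(texts, labels, n=50):
    """Rank words per cluster by raw frequency (an assumed notion of 'top')."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        # Whitespace split only: no cleansing, stemming, or stop-word removal.
        counts[label].update(text.split())
    return {label: [w for w, _ in c.most_common(n)] for label, c in counts.items()}
```

The 50th word of a cluster is then `top_words(texts, labels)[cluster_id][49]`, provided that cluster contains at least 50 distinct words.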

NOTE:

1. Do not use any NLP concepts here for any kind of cleansing or preprocessing.

2. Write the code only in the solution() function and do not pass any arguments to the function. For the predefined stub, refer to stub.py.

Final Output Sample:

Output Format:

Perform the above operations and write your output to a file named output.csv, located at output/output.csv.
output.csv should contain the answer to each question on consecutive rows, one answer per row.
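One way to produce that file is sketched below, using only the standard library. The helper `write_answers` is hypothetical, and the body of `solution()` is a placeholder: the actual stub in stub.py is not shown here, so only the "no arguments" requirement is taken from the assignment.

```python
import csv
import os


def write_answers(answers, path="output/output.csv"):
    """Write one answer per row to the CSV file, creating the folder if needed."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for answer in answers:
            writer.writerow([answer])


def solution():
    # Hypothetical: fill this list with the computed answers in question order.
    answers = []
    write_answers(answers)
```

Writing each answer as a single-column row keeps the answers on consecutive rows, in the same order as the questions.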

Screenshot of Output

If you need a solution for this assignment or have a similar project, you can email us directly at contact@codersarts.com.
