Utilizing KMeans for Geospatial Data Analysis and Clustering

Introduction

Welcome to this tutorial on geospatial data analysis with Python! In the modern world, location data plays a crucial role in various applications, from urban planning to logistics and even marketing. Understanding patterns in geospatial data can provide valuable insights for decision-making. In this tutorial, we'll guide you through the process of loading, cleaning, analyzing, and visualizing geospatial data using Python. By the end of this tutorial, you will have the skills to apply clustering techniques, specifically KMeans, to uncover hidden patterns in location data and visualize these patterns.

What You Will Learn

In this tutorial, you will:

Load Geospatial Data: Understand how to load and explore coordinate data with Pandas, and how to look up place names for it with geopy.
Clean the Data: Learn techniques for cleaning and preparing geospatial data for analysis.
Analyze the Data with KMeans Clustering: Apply the KMeans clustering algorithm to identify patterns in geospatial data.
Visualize Clusters on a Map: Create informative and visually appealing maps to display the results of your analysis.

Prerequisites

Before starting this tutorial, you should have:

Basic Python Knowledge: Familiarity with Python programming, including working with libraries like Pandas.
Python Environment: A Python environment set up on your machine, such as Jupyter Notebook, Google Colab, or a local Python installation.
Installed Libraries: Ensure that you have the following Python libraries installed:
- Pandas
- geopy (for reverse geocoding)
- Scikit-learn
- Matplotlib (for static plots)
- Plotly and ipywidgets (for the interactive maps)

Loading Data

We begin by loading our dataset, which consists of latitude and longitude coordinates. We'll read the data from a CSV file and inspect its contents.

import pandas as pd

# Read data from a CSV file
data = pd.read_csv('data.csv', header=None, names=['latitude', 'longitude'])
data.head()

# Check the details of the data frame
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2050 entries, 0 to 2049
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   latitude   2050 non-null   float64
 1   longitude  2050 non-null   float64
dtypes: float64(2)
memory usage: 32.2 KB

Finding Locations

To find the state and country for each coordinate, we'll use the geopy library's Nominatim API. If you haven't installed geopy, uncomment the following line to install it:

# ! pip install geopy

Here is the function to retrieve location details:

from geopy.geocoders import Nominatim

def place(lat, lon):
    '''
    Retrieve state and country from coordinates.
    Returns (None, None) if the lookup fails or finds no address.
    '''
    geolocator = Nominatim(user_agent="geoapiExercises")
    try:
        # reverse() may raise on a network error or return None if nothing is found
        location = geolocator.reverse(f"{lat},{lon}")
        address = location.raw['address']
        state = address.get('state', '')
        country = address.get('country', '')
    except Exception:
        state, country = None, None
    return state, country

This Python code uses the geopy library to retrieve the state and country information from latitude and longitude coordinates.

Geopy and Nominatim: The code imports Nominatim from the geopy.geocoders module. Nominatim is a geocoding service provided by OpenStreetMap (OSM) that converts coordinates into human-readable addresses through reverse geocoding.
Function Definition: The function place(lat, lon) is defined to accept two parameters, lat (latitude) and lon (longitude), representing the geographic coordinates.
Geolocator Object: Inside the function, a Nominatim object named geolocator is created with a user_agent parameter set to "geoapiExercises". The user agent identifies the application making the request to the Nominatim API, so in your own project you should replace it with a name specific to your application.
Reverse Geocoding: The function calls geolocator.reverse() to perform reverse geocoding, which converts the latitude and longitude into a location object containing address details.
Extracting State and Country: The code attempts to extract the state and country from the location.raw['address'] dictionary. If successful, these values are returned; if an error occurs (e.g., if the location data is incomplete or not found), the function returns None for both state and country.
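
To see the extraction step in isolation, the sketch below applies the same address.get(...) logic to hand-written dictionaries shaped like location.raw. The helper name extract_state_country and the sample dictionaries are made up for illustration, and no network request is made.

```python
def extract_state_country(raw):
    """Return (state, country) from a dict shaped like geopy's location.raw.

    Hypothetical helper for illustration; it mirrors the logic inside place().
    """
    try:
        address = raw['address']
        state = address.get('state', '')
        country = address.get('country', '')
    except (TypeError, KeyError):
        # raw is None or has no 'address' entry: treat the lookup as failed
        state, country = None, None
    return state, country


# Hand-written sample responses (not real API output)
full = {'address': {'state': 'Lagos', 'country': 'Nigeria'}}
no_state = {'address': {'country': 'Djibouti'}}

print(extract_state_country(full))      # ('Lagos', 'Nigeria')
print(extract_state_country(no_state))  # ('', 'Djibouti')
print(extract_state_country(None))      # (None, None)
```

The second case is the one that produces blank state names: the country is found, but the address has no 'state' entry, so the default empty string is returned.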

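We apply this function to each coordinate in the data frame. Each call is a separate request to the Nominatim service, so expect the loop to take at least a couple of minutes: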

state, country = [], []

for lat, lon in zip(data['latitude'], data['longitude']):
    s, c = place(lat, lon)
    state.append(s)
    country.append(c)

# Add state and country to the dataframe
cdata = data.copy()
cdata['state'] = state
cdata['country'] = country

cdata.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1081 entries, 0 to 1080
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   latitude   1081 non-null   float64
 1   longitude  1081 non-null   float64
 2   state      1081 non-null   object 
 3   country    1081 non-null   object 
dtypes: float64(2), object(2)
memory usage: 42.2+ KB

This code block processes a DataFrame containing geographic coordinates and adds new columns for state and country information.

Initialization: Two empty lists, state and country, are initialized to store the state and country names corresponding to each coordinate.
Iterating Through Coordinates: The code uses a for loop to iterate over pairs of latitude and longitude values from the data DataFrame. The zip() function is used to combine the latitude and longitude columns into pairs that can be processed together.
Calling the place Function: For each pair of latitude (lat) and longitude (lon), the place() function (defined earlier) is called to retrieve the state (s) and country (c). These values are then appended to the respective state and country lists.
Copying the DataFrame: The code creates a copy of the original data DataFrame named cdata. This is done to avoid modifying the original DataFrame directly, preserving the initial data.
Adding New Columns: The lists state and country are added as new columns to the cdata DataFrame, corresponding to each row's geographic coordinates.
Displaying the DataFrame: The head() function is called on cdata to display the first few rows of the updated DataFrame, which now includes the state and country columns.
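DataFrame Summary: The summary shown above is the output of cdata.info() rather than head(). It lists four columns (latitude, longitude, state, and country), all non-null, across 1,081 rows, with a memory usage of approximately 42.2 KB. Because 2,050 rows were loaded originally, this summary appears to have been captured after the cleaning step described in the next section.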

Cleaning the Data

After adding state and country information, we remove the coordinates that could not be resolved. The place() function returns None for these, so they are treated as invalid:

# Remove rows with missing state or country
cdata = cdata.dropna()
cdata.info()

# Check and replace blank state names
cdata[cdata['state'] == '']

for i, row in cdata.iterrows():
    if row['state'] == '':
        cdata.loc[i, 'state'] = row['country'] + '_state'

cdata.head(20)

This code block handles missing or blank state and country data in the cdata DataFrame by cleaning and replacing values.

Removing Rows with Missing Data:
- The dropna() function is used on cdata to remove any rows where either the state or country columns have missing (NaN) values. This step ensures that all remaining rows have complete data for both state and country.
- The info() function is called to display information about the cleaned DataFrame, including the number of remaining rows and columns, and data types.
Identifying Blank State Names:
- The code identifies rows where the state column is an empty string (''). These rows are likely cases where the place() function couldn't find a state name but successfully retrieved a country name.
- The cdata[cdata['state'] == ''] expression is used to filter and inspect such rows.
Replacing Blank State Names:
- The iterrows() function is used to iterate over each row in the DataFrame. This allows for row-by-row processing, where the state column is checked for blank entries.
- If a row's state value is blank, it is replaced with the corresponding country name followed by '_state'. This provides a placeholder name indicating that the specific state information was unavailable.
Displaying Updated Data:
- Finally, the head(20) function is used to display the first 20 rows of the updated cdata DataFrame. This allows you to inspect the changes and verify that the blank state names have been replaced as intended.
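
The same cleaning can also be written without an explicit loop. The sketch below is an alternative under stated assumptions, not the code used in this tutorial: it works on a small made-up frame named toy that stands in for cdata, and uses a boolean mask with .loc to fill the blank states.

```python
import pandas as pd

# Toy frame standing in for cdata (values are made up)
toy = pd.DataFrame({
    'state': ['Lagos', '', None],
    'country': ['Nigeria', 'Djibouti', None],
})

# Drop rows where the lookup failed, as dropna() does above
toy = toy.dropna()

# Boolean mask marking rows whose state is an empty string
blank = toy['state'] == ''

# Fill only those rows with '<country>_state'
toy.loc[blank, 'state'] = toy.loc[blank, 'country'] + '_state'

print(toy)
```

On a frame of a few thousand rows either approach is fast enough; the mask version mainly avoids the row-by-row iteration.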

Save the cleaned data for further analysis:

cdata.to_csv('clean_data_x_lat.csv')

Exploratory Data Analysis

With the cleaned data, we can perform some exploratory analysis:

Top 10 States with Highest Patient Count

import matplotlib.pyplot as plt

print('Total confirmed cases:', len(cdata))

cdata.state.value_counts().sort_values(ascending=False)[:10][::-1].plot(kind='barh', width=0.8, alpha=0.8, color='crimson', fontsize=15, figsize=(12,8))
plt.show()

This code snippet generates a horizontal bar chart to visualize the top 10 states with the most confirmed cases from the cleaned DataFrame cdata.

Breakdown:

Printing Total Confirmed Cases:
- The code first prints the total number of confirmed cases, which is the number of rows in cdata as returned by len().
Counting Cases by State:
- The value_counts() method is called on the state column of cdata to count the number of occurrences (i.e., cases) for each state. This gives a series with state names as the index and the corresponding counts as the values.
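- The result is sorted in descending order using sort_values(ascending=False) to identify the states with the highest number of cases. (value_counts() already returns counts in descending order, so this call is a safeguard rather than a requirement.)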
Selecting Top 10 States:
- The slicing [:10] retrieves the top 10 states with the most cases. A horizontal bar chart draws the first row at the bottom, so [::-1] reverses the order to put the state with the most cases at the top of the chart.
Plotting the Bar Chart:
- The plot() function is used to create the horizontal bar chart (kind='barh').
- Additional parameters customize the appearance:
  - width=0.8: Sets the width of the bars.
  - alpha=0.8: Adjusts the transparency of the bars.
  - color='crimson': Specifies the color of the bars.
  - fontsize=15: Sets the font size for the labels.
  - figsize=(12,8): Sets the size of the figure.
Displaying the Plot:
- Finally, plt.show() is called to display the horizontal bar chart.
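
The counting, slicing, and reversing steps can be checked on a tiny example before plotting. The series below is made up for illustration; only the order of the index matters here. Since a horizontal bar chart draws the first row at the bottom, the reversed order puts the largest count at the top.

```python
import pandas as pd

# Made-up state labels: 'A' appears 3 times, 'B' twice, 'C' once
states = pd.Series(['A', 'B', 'A', 'C', 'A', 'B'], name='state')

counts = states.value_counts()   # sorted from most to least frequent
top2 = counts[:2]                # keep the two largest
reversed_top2 = top2[::-1]       # smallest first, largest last

print(list(top2.index))           # ['A', 'B']
print(list(reversed_top2.index))  # ['B', 'A']
```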

Top 10 Countries with Highest Patient Count

cdata.country.value_counts().sort_values(ascending=False).head(10)[::-1].plot(kind='barh', width=0.8, alpha=0.8, color='forestgreen', fontsize=15, figsize=(12,8))
plt.show()

This code snippet generates a horizontal bar chart to visualize the top 10 countries with the most confirmed cases from the cleaned DataFrame cdata.

Breakdown:

Counting Cases by Country:
- The value_counts() method is called on the country column of cdata to count the number of occurrences (i.e., cases) for each country. This gives a series with country names as the index and the corresponding counts as the values.
- The resulting series is sorted in descending order using sort_values(ascending=False) to list the countries with the highest number of cases first.
Selecting Top 10 Countries:
- The head(10) function is used to select the top 10 countries with the most cases. A horizontal bar chart draws the first row at the bottom, so the [::-1] operation reverses the order to put the country with the most cases at the top of the chart.
Plotting the Bar Chart:
- The plot() function is used to create the horizontal bar chart (kind='barh').
- Additional parameters customize the chart's appearance:
  - width=0.8: Sets the width of the bars.
  - alpha=0.8: Adjusts the transparency of the bars.
  - color='forestgreen': Specifies the color of the bars.
  - fontsize=15: Sets the font size for the labels.
  - figsize=(12,8): Specifies the size of the figure.
Displaying the Plot:
- plt.show() is called to render and display the horizontal bar chart.

Clustering and Visualization

Now, let's apply KMeans clustering to group the data and visualize the clusters on a map.

Applying KMeans

from sklearn.cluster import KMeans

def run_kmeans(k):
    '''
    Perform KMeans clustering on the data.
    '''
    data = cdata[['latitude', 'longitude']]
    kmeans = KMeans(k)
    kmeans.fit(data)
    new_data = data.copy()
    new_data['cluster'] = kmeans.labels_
    cc = kmeans.cluster_centers_
    return new_data, cc

# Run KMeans with 10 clusters
n, cc = run_kmeans(10)

This code snippet defines a helper that clusters the coordinates with KMeans and returns each point's cluster label along with the cluster centers.

Breakdown:

Importing KMeans:
- from sklearn.cluster import KMeans: This imports the KMeans class from the sklearn.cluster module, which is used for performing KMeans clustering.
Defining the run_kmeans Function:
- Input Parameter:
  - k: The number of clusters to form.
- Functionality:
  - Data Selection:
    - data = cdata[['latitude', 'longitude']]: Extracts the latitude and longitude columns from the DataFrame cdata to use as input for clustering.
  - KMeans Initialization:
    - kmeans = KMeans(k): Initializes a KMeans object with k clusters.
  - Fitting the Model:
    - kmeans.fit(data): Fits the KMeans model to the data.
  - Adding Cluster Labels:
    - new_data = data.copy(): Creates a copy of the data.
    - new_data['cluster'] = kmeans.labels_: Adds a new column 'cluster' to new_data containing the cluster labels assigned by KMeans.
  - Cluster Centers:
    - cc = kmeans.cluster_centers_: Retrieves the coordinates of the cluster centers.
- Returns:
  - new_data: The DataFrame with an additional column for cluster labels.
  - cc: The coordinates of the cluster centers.
Running KMeans Clustering:
- n, cc = run_kmeans(10): Calls the run_kmeans function with 10 clusters, storing the results in n (the DataFrame with cluster labels) and cc (the cluster centers).
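
To make the two outputs concrete, here is a minimal sketch on six made-up points forming two well-separated groups. The coordinates are invented, and n_init and random_state are set only so the sketch gives repeatable results; they are not part of the run_kmeans function above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up groups of (latitude, longitude) points, far apart
points = np.array([
    [6.5, 3.4], [6.6, 3.3], [6.4, 3.5],        # group near (6.5, 3.4)
    [-1.3, 36.8], [-1.2, 36.9], [-1.4, 36.7],  # group near (-1.3, 36.8)
])

# n_init and random_state are assumptions added for repeatability
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(points)

labels = kmeans.labels_            # one cluster id per input row
centers = kmeans.cluster_centers_  # one (latitude, longitude) row per cluster

print(labels)
print(centers)
```

Which group receives label 0 and which receives label 1 is arbitrary; what matters is that points in the same group share a label and that each row of centers is the mean of its group.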

Scatter Plot of Clusters

plt.figure(figsize=(12,6))
plt.scatter(n['latitude'], n['longitude'], c=n['cluster'])
plt.xlabel('Latitudes')
plt.ylabel('Longitudes')
plt.title('Clusters in the Data')
plt.show()

This code generates a scatter plot that visualizes how the data points are grouped into clusters based on their geographical coordinates, with different colors representing different clusters.

Plot Initialization:
- plt.figure(figsize=(12,6)): Creates a new figure for the plot with a size of 12 by 6 inches.
Scatter Plot Creation:
- plt.scatter(n['latitude'], n['longitude'], c=n['cluster']):
  - Plots a scatter plot where the x-coordinates are the latitudes and the y-coordinates are the longitudes from the DataFrame n.
  - c=n['cluster']: Colors the points based on their cluster labels. Each cluster is represented by a different color.
Adding Labels and Title:
- plt.xlabel('Latitudes'): Sets the label for the x-axis.
- plt.ylabel('Longitudes'): Sets the label for the y-axis.
- plt.title('Clusters in the Data'): Sets the title of the plot.
Displaying the Plot:
- plt.show(): Renders and displays the scatter plot.

Plotting Cluster Centers on a Map

We can visualize the clusters and their centers using various styles:

Style 1: Slider for K-value

from ipywidgets import interact

def submit(K_value):
    n, cc = run_kmeans(K_value)
    cc_df = pd.DataFrame(cc, columns=['latitude', 'longitude'])
    cc_df['cluster_count'] = n.groupby('cluster').size()
    
    import plotly.graph_objects as go

    fig = go.Figure()
    for i in range(len(cc_df)):
        df_sub = n[n['cluster'] == i]
        fig.add_trace(go.Scattergeo(
            locationmode='country names',
            lon=[cc_df['longitude'][i]],
            lat=[cc_df['latitude'][i]],
            text='cluster count :' + str(len(df_sub)),
            marker=dict(size=len(df_sub)/10, color='orangered', line_color='white', line_width=0.5, sizemode='area'),
            name='{0}'.format(i)))
    
    fig.update_layout(
        title_text='Geographical Distribution of COVID Cases',
        showlegend=True,
        width=700,
        height=800,
        geo=dict(
            scope='africa',
            landcolor="silver",
            countrycolor="darkslategray",
            showocean=True,
            oceancolor="darkslategray",
            projection_type="orthographic"
        )
    )
    fig.show()

print('Adjust the value of K by sliding')
interact(submit, K_value=(2, 51, 1), continuous_update=False)

This code provides an interactive way to explore how different numbers of clusters affect the geographical distribution of data points, with results visualized on an interactive map.

Interactive Widget:
- from ipywidgets import interact: Imports the interact function from ipywidgets to create interactive widgets in Jupyter notebooks.
Function Definition:
- def submit(K_value): Defines a function submit that takes the number of clusters, K_value, as input.
KMeans Clustering:
- n, cc = run_kmeans(K_value): Calls the run_kmeans function with K_value to perform clustering and obtain the cluster centers and labels.
Prepare Data for Plotting:
- cc_df = pd.DataFrame(cc, columns=['latitude', 'longitude']): Converts the cluster centers into a DataFrame with latitude and longitude columns.
- cc_df['cluster_count'] = n.groupby('cluster').size(): Adds a column for the count of points in each cluster.
Plotly Visualization:
- import plotly.graph_objects as go: Imports plotly's graph_objects for advanced plotting.
- fig = go.Figure(): Initializes a new figure for plotting.
- for i in range(len(cc_df)): Loops through each cluster.
  - df_sub = n[n['cluster'] == i]: Filters the data for the current cluster.
  - fig.add_trace(go.Scattergeo(...)): Adds a geographical scatter plot for each cluster center with size proportional to the number of points in that cluster.
- fig.update_layout(...): Configures the map layout, including title, legend, and geographical settings (scope, land color, ocean color, etc.).
Display Interactive Widget:
- print('Adjust the value of K by sliding'): Provides a prompt for adjusting the number of clusters.
- interact(submit, K_value=(2, 51, 1), continuous_update=False): Creates a slider widget that allows users to adjust K_value between 2 and 51. The submit function is called each time the slider value changes.
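
The cluster_count column relies on index alignment: groupby('cluster').size() is indexed by cluster label, and the DataFrame of centers has a default index of 0 to k-1, so each count lands next to its own center. A small sketch with made-up labels and centers (n_toy and cc_toy are illustrative names, not variables from the tutorial):

```python
import pandas as pd

# Made-up cluster labels for six points, and two made-up cluster centers
n_toy = pd.DataFrame({'cluster': [0, 1, 1, 0, 1, 1]})
cc_toy = [[6.5, 3.4], [-1.3, 36.8]]

cc_df = pd.DataFrame(cc_toy, columns=['latitude', 'longitude'])

# size() is indexed by cluster label (0, 1, ...), which lines up
# with cc_df's default row index 0..k-1
cc_df['cluster_count'] = n_toy.groupby('cluster').size()

print(cc_df)
```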

Style 2: Text Box for K-value

import ipywidgets as widgets
from ipywidgets import interact
import plotly.express as px  # needed for px.scatter_mapbox below

def submit(K_value):
    if K_value == '':
        K_value = 3
    K = int(K_value)
    n, cc = run_kmeans(K)
    cc_df = pd.DataFrame(cc, columns=['latitude', 'longitude'])
    cc_df['cluster_count'] = n.groupby('cluster').size()
    
    import plotly.graph_objects as go
    fig = px.scatter_mapbox(cc_df, lat=cc[:, 0], lon=cc[:, 1], hover_name=cc_df.index, hover_data=['cluster_count'],
                           color_discrete_sequence=["red"], zoom=3, height=300, size=(cc_df['cluster_count'] / len(cc_df)) * 0.2)
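    # Note: the Stamen styles may require a Stadia Maps token in recent Plotly versions;
    # "open-street-map" is an alternative that needs no token.
    fig.update_layout(mapbox_style="stamen-terrain", width=900, height=800,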
                      title_text='Geographical Distribution of COVID Cases',
                      showlegend=True)
    fig.update_layout(margin={"r": 0, "l": 0, "b": 0})
    fig.show()

print('Enter the value for K in the text box below:')
interact(submit, K_value=widgets.Text(value='3', description='K-value', disabled=False), continuous_update=False)

This code snippet allows users to dynamically input the number of clusters for KMeans and visualizes the results on an interactive map, with cluster sizes and counts displayed.

Function Definition:
- def submit(K_value): Defines the function submit, which now handles the number of clusters provided as input through a text box.
Handling Empty Input:
- if K_value == '': K_value = 3: Checks if the input value is empty; if so, it defaults to 3.
Convert Input to Integer:
- K = int(K_value): Converts the input string to an integer for use in clustering. Non-numeric input raises a ValueError at this point.
Run KMeans Clustering:
- n, cc = run_kmeans(K): Calls the run_kmeans function with K to get clustering results and cluster centers.
Prepare Data for Plotting:
- cc_df = pd.DataFrame(cc, columns=['latitude', 'longitude']): Creates a DataFrame with the cluster centers.
- cc_df['cluster_count'] = n.groupby('cluster').size(): Adds a column to the DataFrame to store the count of points in each cluster.
Plotly Visualization:
- import plotly.graph_objects as go: Imports Plotly's graph_objects for advanced plotting.
- fig = px.scatter_mapbox(...): Creates a scatter map using Plotly Express, which must be imported as px (import plotly.express as px) before the function runs. It plots the cluster centers on a map with sizes proportional to the number of points in each cluster and hover information about cluster count.
- fig.update_layout(...): Updates the layout of the map, including the map style, dimensions, title, and legend settings.
- fig.update_layout(margin={"r": 0, "l": 0, "b": 0}): Adjusts margins to ensure the plot uses the full space available.
Display Interactive Widget:
- print('Enter the value for K in the text box below:'): Prompts users to enter the number of clusters.
- interact(submit, K_value=widgets.Text(value='3', description='K-value', disabled=False), continuous_update=False): Creates a text box widget for user input. The submit function is called when the text box value is changed.
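
Because a text box accepts any string, it can help to validate the input before passing it to KMeans. The helper below is a hypothetical sketch, not part of the tutorial's code: the name parse_k, the default of 3, and the bounds 2 and 51 are assumptions chosen to match the default and slider range used in this tutorial.

```python
def parse_k(text, default=3, k_min=2, k_max=51):
    """Turn text-box input into a usable cluster count.

    Hypothetical helper: the default and bounds are assumptions that
    mirror the default of 3 and the 2..51 slider range used here.
    """
    try:
        k = int(text.strip())
    except (ValueError, AttributeError):
        # Empty, non-numeric, or non-string input falls back to the default
        return default
    # Clamp to the allowed range so KMeans always gets a sensible k
    return max(k_min, min(k, k_max))


print(parse_k(''))     # 3
print(parse_k('7'))    # 7
print(parse_k('abc'))  # 3
print(parse_k('100'))  # 51
```

Inside submit, a call such as K = parse_k(K_value) could then replace the empty-string check and the int() conversion.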

Style 3: Individual Points Colored by Cluster

from ipywidgets import interact
import plotly.express as px

def submit(K_value):
    n, cc = run_kmeans(K_value)
    # Attach place names so they can be shown on hover
    n['state'] = cdata['state']
    n['country'] = cdata['country']

    labels = n.groupby('cluster')['cluster'].count().to_frame()
    cc_df = pd.DataFrame(cc, columns = ['latitude','longitude'])
    cc_df['cluster_count'] = labels['cluster']

    import plotly.graph_objects as go

    # Plot every data point, colored by its cluster label
    fig = px.scatter_mapbox(n, lon=n['longitude'], lat=n['latitude'], color=n['cluster'], zoom=3, height=300,
                            hover_name='state= ' + n['state'] + '<br>' + 'country= ' + n['country'])

    fig.update_layout(mapbox_style="carto-darkmatter", width=900, height=800,
                      title_text='Geographical distribution of COVID cases',
                      showlegend=True)
    fig.update_layout(margin={"r":0,"l":0, "b":0})
    fig.show()

print('adjust the value of K by sliding:')
interact(submit, K_value=(2,51,1), continuous_update = False); # change the range of slider for K value here.

This code allows users to dynamically adjust the number of clusters for KMeans and view the geographical distribution of COVID cases on an interactive map, with additional information about the state and country for each data point.

Function Definition:
- def submit(K_value): Defines the function submit, which performs clustering and visualizes the results based on the input number of clusters, K_value.
Run KMeans Clustering:
- n, cc = run_kmeans(K_value): Executes the run_kmeans function to get the clustered data (n) and cluster centers (cc) using the specified number of clusters.
Add Additional Information:
- n['state'] = cdata['state']: Adds the state information to the clustered data (n).
- n['country'] = cdata['country']: Adds the country information to the clustered data.
Prepare Data for Plotting:
- labels = n.groupby('cluster')['cluster'].count().to_frame(): Creates a DataFrame with the count of data points in each cluster.
- cc_df = pd.DataFrame(cc, columns = ['latitude','longitude']): Creates a DataFrame with the cluster centers.
- cc_df['cluster_count'] = labels['cluster']: Adds the count of points in each cluster to the DataFrame of cluster centers (though cc_df is not used further in the code).
Plotly Visualization:
- import plotly.graph_objects as go: Imports Plotly's graph_objects for plotting.
- fig = px.scatter_mapbox(...): Creates a scatter mapbox plot using Plotly Express, showing the locations of clustered data. The color of each point is based on its cluster, and hover information displays the state and country.
Update Plot Layout:
- fig.update_layout(mapbox_style="carto-darkmatter", width=900, height=800, title_text='Geographical distribution of COVID cases', showlegend=True): Updates the map’s appearance and layout settings, including the map style, dimensions, title, and legend visibility.
- fig.update_layout(margin={"r":0,"l":0, "b":0}): Adjusts the margins to ensure the plot uses the full space available.
Display Interactive Widget:
- print('adjust the value of K by sliding:'): Prompts users to adjust the number of clusters using a slider.
- interact(submit, K_value=(2,51,1), continuous_update = False): Creates an interactive slider for users to select the number of clusters. The submit function is called whenever the slider value changes.
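
Adding the state and country columns works because the clustered frame keeps the same row index as cdata, even when that index has gaps left by dropna(). The sketch below uses a made-up two-row frame (cdata_toy and n_toy are illustrative names) to show the alignment and the hover text it produces.

```python
import pandas as pd

# Made-up stand-in for cdata; the index gap mimics rows removed by dropna()
cdata_toy = pd.DataFrame(
    {'latitude': [6.5, -1.3], 'longitude': [3.4, 36.8],
     'state': ['Lagos', 'Nairobi County'], 'country': ['Nigeria', 'Kenya']},
    index=[0, 2],
)

# The copy keeps the index [0, 2], so later assignments align row by row
n_toy = cdata_toy[['latitude', 'longitude']].copy()
n_toy['cluster'] = [0, 1]
n_toy['state'] = cdata_toy['state']
n_toy['country'] = cdata_toy['country']

# Same string concatenation as the hover_name argument above
hover = 'state= ' + n_toy['state'] + '<br>' + 'country= ' + n_toy['country']
print(hover.tolist())
```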

Putting it All Together

import pandas as pd

# read data from a csv file and load it into a pandas dataframe

data = pd.read_csv('data.csv', header=None, names=['latitude','longitude'])
data.head()

# check the details of the data frame
data.info()

"""**This dataframe only contains the coordinates of the places. We will check if all the coordinates are valid or not. To do this we will find the name of the state and the country corresponding to the coordinates provided to us.**

# Find out location
"""

# if you haven't already installed 'geopy' please uncomment the below line and run the cell
# ! pip install geopy

# import module
from geopy.geocoders import Nominatim

def place( lat,lon):
    '''
    This function takes in the coordinates of the place and returns the name of the state and the country

    Input:

    lon = Longitude
    lat = Latitude

    Output:
    State
    Country'''

    # initialize Nominatim API
    geolocator = Nominatim(user_agent="geoapiExercises")
    location = geolocator.reverse(str(lat)+","+str(lon))

    # reverse() returns None when no address is found for the coordinates
    if location is None:
        return None, None

    # 'state' can be absent for some places, so fall back to an empty string
    address = location.raw.get('address', {})
    state = address.get('state', '')
    country = address.get('country', '')

    return state , country

"""**running the cell below will take a couple of minutes**"""

# Find the state and country corresponding to each coordinate in the dataframe and store it in a list using the function
# defined above

state, country = [], []

for lat , lon in zip(data['latitude'], data['longitude']):
    s, c = place(lat, lon)
    state.append(s)
    country.append(c)


# Create a copy of the original dataframe to add some modifications
cdata = data.copy()

# Make a new column for state and the country in the copy of the dataframe created above

cdata['state'] = state
cdata['country'] = country

cdata.head()

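# Optional shortcut: reload the cleaned file saved by a previous run of this script
# (it is written by the to_csv call further down). On a first run the file
# does not exist yet, so the geocoded cdata from above is used instead.
import os

if os.path.exists('clean_data_x_lat.csv'):
    cdata = pd.read_csv('clean_data_x_lat.csv')
    cdata = cdata.drop('Unnamed: 0', axis=1)
cdata.head()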

"""# Filter out invalid coordinates

The 'state' and 'country' name corresponding to invalid coordinates will be of none type so will remove them.
"""

# count the number of rows with a missing (None/NaN) state or country

print(cdata['state'].isnull().sum())
print(cdata['country'].isnull().sum())

# remove the rows with missing values
cdata = cdata.dropna()
cdata.head()

# now check the details of the dataframe after removing the missing data
cdata.info()

# check the states with the most cases
cdata.state.value_counts().to_frame()

"""We see that a few states have a blank name. Let's check whether the corresponding country names are blank too."""

cdata[cdata['state']== '']

"""Since the corresponding country names are not blank, we will not drop the rows with a blank state name. Instead, we will replace the blank with the name {country_name}_state."""

# renaming the blank states

for i, row in cdata.iterrows():

    if row['state'] == '':
        cdata.loc[i, 'state'] = row['country'] + '_state'

cdata.head(20)

# save the cleaned dataset as csv file.
cdata.to_csv('clean_data_x_lat.csv')

"""# Exploratory data analysis

## Top 10 states with the highest patient count
"""

# Commented out IPython magic to ensure Python compatibility.
import matplotlib.pyplot as plt
# %matplotlib inline

print('Total confirmed cases:', len(cdata))

cdata.state.value_counts().sort_values(ascending = False)[:10][::-1].plot(kind='barh',width=0.8, alpha = 0.8,
                                                                          color= 'crimson', fontsize=15, figsize=(12,8))
plt.show()

"""## Top 10 countries with the highest patient count"""

cdata.country.value_counts().sort_values(ascending = False).head(10)[::-1].plot(kind='barh',width=0.8, alpha = 0.8,
                                                                          color= 'forestgreen', fontsize=15, figsize=(12,8))
plt.show()

"""# Making cluster and plotting them on a map"""

from sklearn.cluster import KMeans

def run_kmeans(k):
    '''
    This function takes a K-value and returns a dataframe with the cluster label of each row and an array of cluster centers.

    The K-means algorithm forms the clusters from two columns of cdata (the modified dataframe):
    'latitude' and 'longitude'.
    '''

    data = cdata[['latitude','longitude']]
    kmeans = KMeans(k)
    kmeans.fit(data)
    new_data = data.copy()
    new_data['cluster']= kmeans.labels_
    cc = kmeans.cluster_centers_
    #print(kmeans.labels_)
    #print(kmeans.cluster_centers_)

    return new_data, cc

### running k means for 10 cluster
n, cc=run_kmeans(10)
n.head()

# scatter plot showing the clusters in the data
plt.figure(figsize=(12,6))
plt.scatter(n['latitude'],n['longitude'], c=n['cluster'])
plt.xlabel('Latitudes')
plt.ylabel('Longitudes')
plt.title('plot showing the clusters in the data')
plt.show()

"""## Plotting Cluster centers on a Map

# TYPE 1: slider for K-value

# style 1
"""

from ipywidgets import interact

def submit(K_value):

    n, cc = run_kmeans(K_value)

    labels = n.groupby('cluster')['cluster'].count().to_frame()
    cc_df = pd.DataFrame(cc, columns = ['latitude','longitude'])
    cc_df['cluster_count'] = labels['cluster']

    import plotly.graph_objects as go

    fig = go.Figure()

    for i in range(len(cc_df)):
        df_sub = n[n['cluster']==i]
        fig.add_trace(go.Scattergeo(
            locationmode = 'country names',
            lon = [cc_df['longitude'][i]],
            lat = [cc_df['latitude'][i]],
            text = 'cluster count :'+ str(len(df_sub)),
            marker = dict(
                size = len(df_sub)/10,
                color = 'orangered',
                line_color='white',
                line_width=0.5,
                sizemode = 'area'
            ),name = '{0}'.format(i)))

    fig.update_layout(
        title_text = 'Geographical distribution of COVID cases<br>(Click legend to toggle traces)',
        showlegend = True,
        width = 700,
        height = 800,
        geo = dict(
            scope = 'africa',
            landcolor = "silver",
            countrycolor = "darkslategray" ,
            showocean = True,
            oceancolor ="darkslategray",
            projection_type = "orthographic"

        )
    )

    fig.show()

print('adjust the value of K by sliding')
interact(submit, K_value=(2,51,1), continuous_update = False); # change the range of slider for K value here.

"""# style 2"""

from ipywidgets import interact,interactive
import plotly.express as px

def submit(K_value):

    n, cc = run_kmeans(K_value)

    labels = n.groupby('cluster')['cluster'].count().to_frame()
    cc_df = pd.DataFrame(cc, columns = ['latitude','longitude'])
    cc_df['cluster_count'] = labels['cluster']

    # Plot each cluster center, sized by the number of points in its cluster
    fig = px.scatter_mapbox(cc_df, lat=cc[:,0], lon=cc[:,1], hover_name=cc_df.index, hover_data=['cluster_count'],
                        color_discrete_sequence=["red"], zoom=3, height=300, size = (cc_df['cluster_count']/len(cc_df))*0.2)

    fig.update_layout(mapbox_style="open-street-map", width= 900, height=800,
                     title_text= 'Geographical distribution of COVID cases',
                     showlegend = True)

    fig.update_layout(margin={"r":0,"l":0, "b":0})
    fig.show()

print('adjust the value of K by sliding:')
interact(submit, K_value=(2,51,1), continuous_update = False); # change the range of slider for K value here.

"""# style 3"""

from ipywidgets import interact, widgets
import plotly.express as px

def submit(K_value):

    # Cluster the data, then count the points assigned to each cluster
    n, cc = run_kmeans(K_value)

    labels = n.groupby('cluster')['cluster'].count().to_frame()
    cc_df = pd.DataFrame(cc, columns = ['latitude','longitude'])
    cc_df['cluster_count'] = labels['cluster']

    # One marker per cluster center, sized in proportion to its point count
    fig = px.scatter_mapbox(cc_df, lat=cc[:,0], lon=cc[:,1], hover_name=cc_df.index, hover_data=['cluster_count'],
                        color_discrete_sequence=["red"], zoom=3, size = (cc_df['cluster_count']/len(cc_df))*0.2)

    # Dark base map
    fig.update_layout(mapbox_style="carto-darkmatter", width= 900, height=800,
                     title_text= 'Geographical distribution of COVID cases',
                     showlegend = True)

    fig.update_layout(margin={"r":0,"l":0, "b":0})
    fig.show()


print('adjust the value of K by sliding:')
# interact() only builds widgets for the arguments of submit, so continuous_update must be set on the slider itself.
interact(submit, K_value=widgets.IntSlider(value=26, min=2, max=51, step=1, continuous_update=False)); # change the range of slider for K value here.

"""# style 4"""

from ipywidgets import interact, widgets
import plotly.express as px

def submit(K_value):

    # Cluster the data, then count the points assigned to each cluster
    n, cc = run_kmeans(K_value)

    labels = n.groupby('cluster')['cluster'].count().to_frame()
    cc_df = pd.DataFrame(cc, columns = ['latitude','longitude'])
    cc_df['cluster_count'] = labels['cluster']

    # One marker per cluster center, sized in proportion to its point count
    fig = px.scatter_mapbox(cc_df, lat=cc[:,0], lon=cc[:,1], hover_name=cc_df.index, hover_data=['cluster_count'],
                        color_discrete_sequence=["red"], zoom=3, size = (cc_df['cluster_count']/len(cc_df))*0.2)

    # Note: Stamen tiles moved to Stadia Maps in 2023, so this style may not render
    # without an API key; "open-street-map" is a key-free alternative.
    fig.update_layout(mapbox_style="stamen-terrain", width= 900, height=800,
                     title_text= 'Geographical distribution of COVID cases',
                     showlegend = True)

    fig.update_layout(margin={"r":0,"l":0, "b":0})
    fig.show()


print('adjust the value of K by sliding:')
# interact() only builds widgets for the arguments of submit, so continuous_update must be set on the slider itself.
interact(submit, K_value=widgets.IntSlider(value=26, min=2, max=51, step=1, continuous_update=False)); # change the range of slider for K value here.



"""# TYPE 2: Text box for K-value

# style 1
"""

import ipywidgets
from ipywidgets import interact,interactive, widgets

def submit(K_value):
    if K_value =='':
        K_value = 3 # default K value for the map

    K = int(K_value)
    n, cc = run_kmeans(K)

    labels = n.groupby('cluster')['cluster'].count().to_frame()
    cc_df = pd.DataFrame(cc, columns = ['latitude','longitude'])
    cc_df['cluster_count'] = labels['cluster']

    import plotly.graph_objects as go


    fig = go.Figure()

    for i in range(len(cc_df)):
        df_sub = n[n['cluster']==i]

        fig.add_trace(go.Scattergeo(
            locationmode = 'country names',
            lon = [cc_df['longitude'][i]],
            lat = [cc_df['latitude'][i]],
            text = 'cluster count :'+ str(len(df_sub)),
            marker = dict(
                size = len(df_sub)/20,
                color = 'orangered',
                line_color='white',
                line_width=0.5,
                sizemode = 'area'
            ),name = '{0}'.format(i)))

    fig.update_layout(
        title_text = 'Geographical distribution of COVID cases<br>(Click legend to toggle traces)',
        showlegend = True,
        width = 700,
        height = 800,
        geo = dict(
            scope = 'africa',
            landcolor = "silver",
            countrycolor = "darkslategray" ,
            showocean = True,
            oceancolor ="darkslategray",
            projection_type = "orthographic"

        )
    )

    fig.show()


print('Enter the value for K in the text box given below:')

interact(submit, K_value= widgets.Text(
    value='3',
    description='K-value',
    disabled=False,
    continuous_update=False # update only on Enter or when the box loses focus
    ));
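
The text box hands `submit` a raw string, so `int(K_value)` raises a `ValueError` for input such as `abc` or `3.5`, and nothing stops a value like `0` from reaching KMeans. A minimal sketch of a more forgiving conversion follows. The helper name `parse_k` is hypothetical, and the bounds of 2 and 51 are an assumption borrowed from the slider versions above.

```python
def parse_k(text, default=3, k_min=2, k_max=51):
    """Convert text-box input into a usable K, falling back to `default`."""
    try:
        k = int(str(text).strip())
    except ValueError:
        return default          # empty or non-numeric input
    # Clamp to the same range the slider versions allow
    return max(k_min, min(k, k_max))


for raw in ['', '7', ' 12 ', 'abc', '0', '500']:
    print(repr(raw), '->', parse_k(raw))
```

In `submit`, `K = parse_k(K_value)` would replace both the empty-string check and the bare `int(K_value)` call.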

"""# Style 2"""

from ipywidgets import interact, widgets
import plotly.express as px

def submit(K_value):

    if K_value =='':
        K_value = 3 # default K value for the map
    n, cc = run_kmeans(int(K_value))

    labels = n.groupby('cluster')['cluster'].count().to_frame()
    cc_df = pd.DataFrame(cc, columns = ['latitude','longitude'])
    cc_df['cluster_count'] = labels['cluster']

    # One marker per cluster center, sized in proportion to its point count
    fig = px.scatter_mapbox(cc_df, lat=cc[:,0], lon=cc[:,1], hover_name=cc_df.index, hover_data=['cluster_count'],
                        color_discrete_sequence=["red"], zoom=3, size = (cc_df['cluster_count']/len(cc_df))*0.2)

    # Note: Stamen tiles moved to Stadia Maps in 2023, so this style may not render
    # without an API key; "open-street-map" is a key-free alternative.
    fig.update_layout(mapbox_style="stamen-terrain", width= 900, height=800,
                     title_text= 'Geographical distribution of COVID cases',
                     showlegend = True)

    fig.update_layout(margin={"r":0,"l":0, "b":0})
    fig.show()


print('Enter the value for K in the text box given below:')
interact(submit, K_value= widgets.Text(
    value='3',
    description='K-value',
    disabled=False,
    continuous_update=False # update only on Enter or when the box loses focus
    ));

"""# Plotting all the data points instead of cluster centers

# Style 1
"""

from ipywidgets import interact, widgets
import plotly.express as px

def submit(K_value):

    n, cc = run_kmeans(K_value)
    # Attach state and country names from cdata for the hover text
    n['state'] = cdata['state']
    n['country'] = cdata['country']
    cluster = n.groupby('cluster')['cluster'].count().to_frame()
    cc_df = pd.DataFrame(cc, columns = ['latitude','longitude'])
    cc_df['cluster_count'] = cluster['cluster']

    # Plot every data point, coloured by its cluster label
    fig = px.scatter_mapbox(n, lon=n['longitude'], lat=n['latitude'], color = n['cluster'], zoom=3,
                           hover_name='state= '+n['state'] +'<br>'+'country= '+n['country'] )

    # Note: Stamen tiles moved to Stadia Maps in 2023, so this style may not render
    # without an API key; "open-street-map" is a key-free alternative.
    fig.update_layout(mapbox_style="stamen-terrain", width= 900, height=800,
                     title_text= 'Geographical distribution of COVID cases',
                     showlegend = True)

    fig.update_layout(margin={"r":0,"l":0, "b":0})
    fig.show()


print('adjust the value of K by sliding:')
# interact() only builds widgets for the arguments of submit, so continuous_update must be set on the slider itself.
interact(submit, K_value=widgets.IntSlider(value=26, min=2, max=51, step=1, continuous_update=False)); # change the range of slider for K value here.
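
Because `n['cluster']` is numeric, Plotly Express treats it as a continuous quantity and draws a colour bar instead of one legend entry per cluster. Converting the labels to strings is a common way to get discrete colours. The sketch below prepares such a column with pandas only; the column name `cluster_label` and the toy coordinates are assumptions for illustration, and the plotting call is left as a comment.

```python
import pandas as pd

# Toy frame standing in for the DataFrame returned by run_kmeans (values are made up)
n = pd.DataFrame({
    'latitude':  [9.08, 6.52, -1.29, 30.04],
    'longitude': [7.40, 3.38, 36.82, 31.24],
    'cluster':   [0, 0, 1, 2],
})

# String labels make Plotly Express assign one discrete colour per cluster
n['cluster_label'] = n['cluster'].astype(str)

# Numeric ordering for the legend, so '10' comes after '9' rather than after '1'
label_order = [str(k) for k in sorted(n['cluster'].unique())]
print(label_order)

# These could then be passed to the plotting call, for example:
# px.scatter_mapbox(n, lat='latitude', lon='longitude', color='cluster_label',
#                   category_orders={'cluster_label': label_order}, zoom=3)
```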

"""# style 2"""

from ipywidgets import interact, widgets
import plotly.graph_objects as go

def submit(K_value):

    n, cc = run_kmeans(K_value)

    # Attach state and country names from cdata for the hover text
    n['state'] = cdata['state']
    n['country'] = cdata['country']

    labels = n.groupby('cluster')['cluster'].count().to_frame()
    cc_df = pd.DataFrame(cc, columns = ['latitude','longitude'])
    cc_df['cluster_count'] = labels['cluster']

    n['text'] = 'State: '+n['state']+'<br>'+'Country: '+n['country']+'<br>'+'Cluster: '+n['cluster'].astype(str)

    # A single trace holding every data point, coloured by cluster label.
    # The original used name = '{0}'.format(i), but i is not defined in this function.
    fig = go.Figure(data=go.Scattergeo(
        locationmode = 'country names',
        lon = n['longitude'],
        lat = n['latitude'],
        text = n['text'],
        marker = dict(
            color = n['cluster'],
            line_width=0.5,
            sizemode= 'area'
        ),name = 'data points'))

    # The legend is hidden here, so the title no longer mentions toggling traces
    fig.update_layout(
        title_text = 'Geographical distribution of COVID cases',
        showlegend = False,
        width = 700,
        height = 800,
        geo = dict(
            scope = 'africa',
            landcolor = "silver",
            countrycolor = "darkslategray" ,
            showocean = True,
            oceancolor ="darkslategray",
            projection_type = "orthographic"
        )
    )

    fig.show()

print('adjust the value of K by sliding')
# interact() only builds widgets for the arguments of submit, so continuous_update must be set on the slider itself.
interact(submit, K_value=widgets.IntSlider(value=26, min=2, max=51, step=1, continuous_update=False)); # change the range of slider for K value here.

"""# style 3"""

from ipywidgets import interact, widgets
import plotly.express as px

def submit(K_value):

    n, cc = run_kmeans(K_value)
    # Attach state and country names from cdata for the hover text
    n['state'] = cdata['state']
    n['country'] = cdata['country']

    labels = n.groupby('cluster')['cluster'].count().to_frame()
    cc_df = pd.DataFrame(cc, columns = ['latitude','longitude'])
    cc_df['cluster_count'] = labels['cluster']

    # Plot every data point, coloured by its cluster label
    fig = px.scatter_mapbox(n, lon=n['longitude'], lat=n['latitude'], color = n['cluster'], zoom=3,
                           hover_name='state= '+n['state'] +'<br>'+'country= '+n['country'] )

    # Dark base map
    fig.update_layout(mapbox_style="carto-darkmatter", width= 900, height=800,
                     title_text= 'Geographical distribution of COVID cases',
                     showlegend = True)

    fig.update_layout(margin={"r":0,"l":0, "b":0})
    fig.show()


print('adjust the value of K by sliding:')
# interact() only builds widgets for the arguments of submit, so continuous_update must be set on the slider itself.
interact(submit, K_value=widgets.IntSlider(value=26, min=2, max=51, step=1, continuous_update=False)); # change the range of slider for K value here.



If you require any assistance with your Machine Learning projects, please do not hesitate to contact us. We have a team of experienced developers who specialize in Machine Learning and can provide you with the necessary support and expertise to ensure the success of your project. You can reach us through our website or by contacting us directly via email or phone.
