Building a Data Analysis Application with Python

Aug 23, 2024

Introduction

In this tutorial, we will walk through the steps of creating a GUI-based application for data analysis using Python. The application allows users to load, clean, and analyze data, as well as visualize results. We will use libraries such as pandas, matplotlib, seaborn, and tkinter to achieve this.

Prerequisites

Before we begin, make sure you have the following libraries installed:

Numpy
webbrowser
Tkinter
Matplotlib

Implementation

We start by importing the necessary Python libraries. These libraries will help us with data manipulation, visualization, and creating the graphical user interface (GUI).

import numpy as np
import pandas as pd
import re
import webbrowser
from tkinter import *
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg
from matplotlib.figure import Figure
import base64
from io import BytesIO

NumPy (numpy) and Pandas (pandas): Used for data manipulation and analysis.
re: Provides support for regular expressions, useful for text processing.
webbrowser: Allows opening URLs in a web browser.
Tkinter (tkinter): A standard Python library used to create Graphical User Interfaces (GUIs).
Matplotlib (matplotlib.pyplot) and Seaborn (seaborn): Libraries for creating visualizations.
IPython display (IPython.display): Used to display rich media, such as HTML, in Jupyter notebooks.
Base64 (base64) and BytesIO (io.BytesIO): Handle binary data and encoding images for embedding in HTML.

Defining Functions

Data Loading and Cleaning

def load_and_clean():
    filepath1 = t1.get()
    filepath2 = t2.get()
    filepath3 = t3.get()
    global inventory_df, inspections_df, violations_df

    # Load data
    inventory_df = pd.read_csv(filepath1)
    inspections_df = pd.read_csv(filepath2)
    violations_df = pd.read_csv(filepath3)

    # Clean the data
    inventory_df = inventory_df.rename(columns={" PROGRAM NAME": "PROGRAM NAME", "FACILITY  STATE": "FACILITY STATE"})
    inventory_df = inventory_df.dropna().drop_duplicates()
    inspections_df = inspections_df.dropna().drop_duplicates()
    violations_df = violations_df.rename(columns={"VIOLATION  STATUS": "VIOLATION STATUS"})
    violations_df = violations_df.dropna().drop_duplicates()

    print("Data loaded and initial cleaning done.")

Function Purpose: This function loads three CSV files provided by the user and performs basic data cleaning.
Loading Data: The data from the file paths entered in the GUI (t1, t2, t3) is loaded into Pandas DataFrames (inventory_df, inspections_df, violations_df).
Data Cleaning:
- Renaming Columns: The code renames columns that have extra spaces in their names.
- Handling Missing Data: The function removes any rows with missing values (dropna()).
- Removing Duplicates: The function removes duplicate rows (drop_duplicates()).

Data Preparation

def prepare_data():

    global merge2
    # Merge inventory and inspections data

    cols_to_use = list(inspections_df.columns.difference(inventory_df.columns))
    cols_to_use.append('FACILITY ID')
    merge = inventory_df.merge(inspections_df[cols_to_use].set_index('FACILITY ID'), on='FACILITY ID', how='inner',

                               suffixes=('_inventory', '_inspection'))

    merge = merge.dropna().drop_duplicates()
    merge = merge[merge['PROGRAM STATUS'] == 'ACTIVE']

    # Extract seating information
    merge['SEATINGS'] = merge['PE DESCRIPTION'].str.extract(r'\((.*)\)')
    merge['PE DESCRIPTION'] = merge['PE DESCRIPTION'].str.replace(r' (\(.*\))', '')

    # Merge with violations data
    merge2 = merge.merge(violations_df.set_index('SERIAL NUMBER'), on='SERIAL NUMBER', how='inner')
    print("Data prepared.")

    # Save prepared data
    merge2.to_json(filepath + "prepared_data.json", orient='split', compression='infer')

    print("File saved.")

Function Purpose: This function prepares the loaded data for analysis by merging and cleaning it further.
Merging Data:
- The function merges inventory_df and inspections_df on the FACILITY ID column.
- It ensures that only active programs are retained (merge['PROGRAM STATUS'] == 'ACTIVE').
Extracting Seating Information:
- SEATINGS: Extracts information in parentheses from the PE DESCRIPTION column using a regular expression.
- PE DESCRIPTION: Removes the extracted information from the original column.
Final Merge: The function merges the resulting DataFrame with the violations_df on SERIAL NUMBER.
Saving Data: The cleaned and merged data is saved as a JSON file (prepared_data.json).

Mean, Median, and Mode Calculation

def mean_median_mode(column_name):

    # Calculate mean, median, and mode

    mean = merge2.groupby([column_name])['SCORE'].agg('mean').to_frame()
    median = merge2.groupby([column_name])['SCORE'].agg('median').to_frame()
    mode = merge2.groupby([column_name])['SCORE'].agg(lambda x: x.value_counts().index[0]).to_frame()

    # Combine results into a single dataframe
    global data

    data = pd.concat([mean, median, mode], axis=1)
    data.columns = ['mean', 'median', 'mode']
    data.reset_index(level=0, inplace=True)
    # Display results in an HTML file

    html = data.to_html()
    with open("html_file.html", "w") as text_file:
        text_file.write(html)
    webbrowser.open_new_tab("html_file.html")

    print("Mean, median, mode displayed.")

Function Purpose: This function calculates the mean, median, and mode of the SCORE column for different categories (e.g., seating types or zip codes).
Calculating Statistics:
- Mean: The average score for each category.
- Median: The middle score when the scores are sorted.
- Mode: The most frequent score for each category.
Combining Results: The results are combined into a single DataFrame with columns for mean, median, and mode.
Displaying Results: The DataFrame is saved as an HTML file and opened in a web browser.

Plotting Graphs

def plot_graph():
    fig = plt.figure(figsize=(12, 6))

    merge2.groupby('VIOLATION CODE')['FACILITY ID'].agg(lambda x: x.value_counts().sum()).plot(kind='bar', title='Number of establishments that have committed each type of violation')

    tmpfile = BytesIO()

    fig.savefig(tmpfile, format='png')

    encoded = base64.b64encode(tmpfile.getvalue()).decode('utf-8')
    html = '<img src="data:image/png;base64,{}">'.format(encoded)
    with open('test.html', 'w') as f:
        f.write(html)

    webbrowser.open_new_tab('test.html')

Function Purpose: This function generates a bar chart that shows the number of establishments committing each type of violation.
Plotting the Graph:
- The data is grouped by VIOLATION CODE, and the count of FACILITY ID is plotted as a bar chart.
Embedding Image:
- The chart is saved as a PNG image in memory (BytesIO).
- The image is encoded in base64 and embedded into an HTML file.
Displaying the Plot: The HTML file is opened in a web browser to display the plot.

Creating the GUI

Let's now move on to the GUI section, where we will create the code for the interactive interface.

Screen Preparation

def prepare_screen():

    window2 = Toplevel(window)
    window2.title("Data Preparation")

    heading = Label(window2, text="Data Preparation", bg="plum4", fg="White", width="700", height="2", font=("Helvetica", 12), anchor="center")
    heading.pack()

    window2.configure(bg='gray80')
    window2.geometry("700x250")
    window2.attributes('-topmost', 'true')

    lbl4 = Label(window2, text="Click on the button below to prepare data for analysis and save file as JSON", bg='gray80', fg='black', font=("Helvetica", 10))
    lbl4.place(x=40, y=70)

    prepare_button = Button(window2, text='Prepare Data', bg="plum4", fg="white", font=("Helvetica", 10), command=prepare_data)
    prepare_button.place(x=240, y=120, width=180, height=40)

    next_button = Button(window2, text="Next", command=output_screen, fg="White", bg="plum4", font=("Helvetica", 10))
    next_button.place(x=300, y=180, width=100, height=25)

Purpose: This function creates a new window in the GUI for preparing data.
Creating a New Window (Toplevel): A new top-level window (window2) is created on top of the main window (window).
Adding Labels and Buttons:
- Heading: A label displaying "Data Preparation" at the top of the window.
- Instructions: A label providing instructions to the user.
- Buttons: Two buttons are added: one for preparing data (prepare_data) and another for proceeding to the next step (output_screen).

Running the Application

Main Window (Tk()): This creates the main window (window) for the application.
Window Title and Layout:
- The title is set to "Data Analysis Application."
- The main window's background is set to a light gray color (gray80), and it is configured to have a fixed size (700x350).

window = Tk()

window.title("Data Analysis Application")

heading = Label(window, text="Data Analysis Application", bg="plum4", fg="White", width="700", height="2", font=("Helvetica", 12), anchor="center")
heading.pack()

window.configure(bg='gray80')
window.geometry("700x350")

Labels and Input Fields:
- Labels prompt the user to enter the file paths for the Inventory, Inspections, and Violations files.
- Entry fields (t1, t2, t3) allow the user to input the file paths.

label1 = Label(window, text="Please enter file path to the Inventory File:", bg="gray80", font=("Helvetica", 10))
label1.place(x=20, y=70)

t1 = Entry(window, width=40)
t1.place(x=300, y=70)

label2 = Label(window, text="Please enter file path to the Inspections File:", bg="gray80", font=("Helvetica", 10))
label2.place(x=20, y=110)

t2 = Entry(window, width=40)
t2.place(x=300, y=110)

label3 = Label(window, text="Please enter file path to the Violations File:", bg="gray80", font=("Helvetica", 10))
label3.place(x=20, y=150)

t3 = Entry(window, width=40)
t3.place(x=300, y=150)

Buttons:
- The "Load Data" button triggers the load_and_clean() function to load and clean the data.
- The "Next" button triggers the prepare_screen() function to move to the data preparation screen.
window.mainloop(): This starts the Tkinter event loop, making the window responsive to user actions.

Button(window, text="Load Data", command=load_and_clean, fg="White", bg="plum4", font=("Helvetica", 10)).place(x=300, y=200, width=100, height=25)
Button(window, text="Next", command=prepare_screen, fg="White", bg="plum4", font=("Helvetica", 10)).place(x=300, y=250, width=100, height=25)

window.mainloop()

Putting it All Together

import numpy as np
import pandas as pd
import re
import webbrowser
from tkinter import *
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg
from matplotlib.figure import Figure
import base64
from io import BytesIO

#-----------------------------------------------FUNCTIONS-------------------------------------------------------------------

def load_and_clean():
    filepath1=t1.get()
    filepath2=t2.get()
    filepath3=t3.get()
    global inventory_df,inspections_df,violations_df
    #load
    inventory_df = pd.read_csv(filepath1)
    inspections_df = pd.read_csv(filepath2)
    violations_df = pd.read_csv(filepath3)

    #clean the data
    inventory_df=inventory_df.rename(columns={" PROGRAM NAME": "PROGRAM NAME","FACILITY  STATE":"FACILITY STATE"})
    inventory_df=inventory_df.dropna()
    inventory_df=inventory_df.drop_duplicates()

    inspections_df=inspections_df.dropna()
    inspections_df=inspections_df.drop_duplicates()

    violations_df=violations_df.rename(columns={"VIOLATION  STATUS":"VIOLATION STATUS"})
    violations_df=violations_df.dropna()
    violations_df=violations_df.drop_duplicates()
    print("Data loaded and initial cleaning done.")



def prepare_data():
    '''
    function to prepare the data to generate the output
    '''
    global merge2
    # merging and inventory and inspections data frame
    # also removing repetitive columns from the inspection_df and keeping only one common column (FACILITY ID) for merging
    cols_to_use = list(inspections_df.columns.difference(inventory_df.columns))
    cols_to_use.append('FACILITY ID')

    merge = inventory_df.merge(inspections_df[cols_to_use].set_index('FACILITY ID'), on='FACILITY ID', how='inner',
                              suffixes=('_inventory', '_inspection'))

    # removing missing and duplicate data from the merged dataframe
    merge = merge.dropna()
    merge = merge.drop_duplicates()

    # keeping rows with PROGRAM STATUS==ACTIVE
    merge = merge[merge['PROGRAM STATUS']=='ACTIVE']

    # extracting seat number and type into another column named SEATINGS
    merge['SEATINGS'] = merge['PE DESCRIPTION'].str.extract(r'\((.*)\)')
    merge['PE DESCRIPTION'] = merge['PE DESCRIPTION'].str.replace(r' (\(.*\))','')

    # Merging merge and violations on SERIAL NUMBER
    merge2 = merge.merge(violations_df.set_index('SERIAL NUMBER'), on='SERIAL NUMBER', how='inner')

    print("Data prepared.")

    merge2.to_json(filepath+"prepared_data.json", orient = 'split', compression = 'infer')
    print("file saved.")



def mean_median_mode(column_name):
    '''
    function to Produce the mean, mode and median for the inspection score per year:
        For each type of vendor’s seating
        For each ‘zip code’
    '''

    # group scores by zip code
    # mean
    mean = merge2.groupby([column_name])['SCORE'].agg('mean').to_frame()
    # median
    median = merge2.groupby([column_name])['SCORE'].agg('median').to_frame()
    # mode
    mode=merge2.groupby([column_name])['SCORE'].agg(lambda x:x.value_counts().index[0]).to_frame()

    # merging into a single dataframe
    global data
    data = pd.concat([mean,median,mode], axis=1)
    data.columns = ['mean','median','mode']
    data.reset_index(level=0, inplace=True)

    html=data.to_html()

    text_file=open("html_file.html","w")
    text_file.write(html)
    text_file.close()

    webbrowser.open_new_tab("html_file.html")
    print("mean, median, mode, displayed")


def plot_graph():
    fig = plt.figure(figsize=(12,6))

    merge2.groupby('VIOLATION CODE')['FACILITY ID'].agg(lambda x:x.value_counts().sum()).plot(kind='bar',title='Number of establishments that have committed each type of violation')
    tmpfile = BytesIO()
    fig.savefig(tmpfile, format='png')
    encoded = base64.b64encode(tmpfile.getvalue()).decode('utf-8')

    html = 'Some html head' + '<img src=\'data:image/png;base64,{}\'>'.format(encoded) + 'Some more html'

    with open('test.html','w') as f:
        f.write(html)


    webbrowser.open_new_tab('test.html')


def plot_correlation():
    data=merge2.groupby(['Zip Codes','FACILITY ID'])['VIOLATION CODE'].agg(lambda x:x.value_counts().sum()).to_frame()
    data.reset_index(level=0, inplace=True)
    data=data.rename(columns={'VIOLATION CODE':'VIOLATION COUNT'})
    data.reset_index(level=0, inplace=True)

    tpfile = BytesIO()
    fig = plt.figure()

    data.plot(kind='scatter',x='Zip Codes',y='VIOLATION COUNT',figsize=(12,6),title='Plot of number of violations committed per vendor and their zip code').get_figure().savefig(tpfile, format='png')



    encoded = base64.b64encode(tpfile.getvalue()).decode('utf-8')

    html = '' + '<img src=\'data:image/png;base64,{}\'>'.format(encoded) + ''

    with open('image.html','w') as f:
        f.write(html)


    webbrowser.open_new_tab('image.html')
    print('Scatter plot displayed')


#----------------------------------------GUI----------------------------------------------------------------------------------


def prepare_screen():


    def output_screen():
        window3=Toplevel(window2)
        heading=Label(window3,text="Generate Outputs",bg="plum4",fg="White",width="700",height="2",font=("Helvetica", 12),anchor="center")
        heading.pack()
        window3.configure(bg='gray80')
        window3.geometry("700x350")
        window3.attributes('-topmost', 'true')

        lbl5 = Label(window3,text="View mean, median mode inspection score for:",bg='gray80',fg='black',font=("Helvetica", 10))
        lbl5.place(x=40, y=70)

        seating_button = Button(window3, text='each type of vendor\'s seating',bg="plum4",fg="white",font=("Helvetica", 10),
                                command= lambda:mean_median_mode('SEATINGS'))
        seating_button.place(x=80,y=100,width=300,height=35)

        zip_button= Button(window3, text='each type of zip code',bg="plum4",fg="white",font=("Helvetica", 10),
                           command= lambda:mean_median_mode('Zip Codes'))
        zip_button.place(x=80,y=150,width=300,height=35)

        lbl6 =Label(window3,text="Click on 'Show Graph' button to view graph that displays the number \nof establishments that have committed each type of violation. ",
                     bg='gray80', fg='black', font=("Helvetica", 10),justify=CENTER)
        lbl6.place(x=40, y=230)

        graph_button= Button(window3, text='Show Graph',bg="plum4",fg="white",font=("Helvetica", 10),
                           command= plot_graph)
        graph_button.place(x=480,y=230,width=150,height=35)

        lbl7 =Label(window3,text="Click on 'View' button to get the scatter plot between number of \nviolations committed per vendor and their zip code ",
                     bg='gray80', fg='black', font=("Helvetica", 10),justify=CENTER)
        lbl7.place(x=40, y=280)

        corr_button= Button(window3, text='View',bg="plum4",fg="white",font=("Helvetica", 10),
                           command= plot_correlation)
        corr_button.place(x=480,y=280,width=150,height=35)



    window2=Toplevel(window)
    heading=Label(window2,text="Prepare and Save data",bg="plum4",fg="White",width="700",height="2",font=("Helvetica", 15),
                  anchor="center")
    heading.pack()
    window2.configure(bg='gray80')
    window2.geometry("700x350")
    window2.attributes('-topmost', 'true')


    message=Label(window2,text="The data has been loaded into the application and discrepancies have been removed.To prepare\n the final dataset please provide the path to the destination where you want to save \nthe prepared dataset and click \"Prepare and Save\" button given below.\nThe dataset will be stored as a json file.",
                 bg='gray80', fg='black', font=("Helvetica", 11),justify=CENTER)
    message.place(x=50,y=100)

    lbl4=Label(window2,text="Enter path for output:",bg='gray80', fg='black', font=("Helvetica", 11))
    lbl4.place(x=50, y=200)
    lbl4_var=StringVar()
    t4=Entry(window2,bd=2,textvariable=lbl4_var)
    t4.place(x=200,y=200,width=450)

    global filepath
    filepath=t4.get()
    prepare_button1= Button(window2, text='Prepare and Save',bg="plum4",fg="white",font=("Helvetica", 14),
                            command=lambda:[prepare_data(),output_screen()])
    prepare_button1.place(x=150,y=270,width=200,height=50)

    back_button = Button(window2, text="Back",bg="plum4",fg="white",font=("Helvetica", 14),command=window2.destroy)
    back_button.place(x=400,y=270,width=200,height=50)


window=Tk()

window.title('Application')
window.configure(bg='gray80')
heading=Label(text="Load and Clean initial data",bg="plum4",fg="White",width="700",height="2",font=("Helvetica", 15),anchor="center")
heading.pack()
window.geometry("700x350")

lbl1=Label(window,text="Enter path for inventory file:",bg='gray80', fg='black', font=("Helvetica", 11))
lbl1.place(x=50, y=100)
lbl1_var=StringVar()
t1=Entry(bd=2,textvariable=lbl1_var)
t1.place(x=235,y=100,width=450)

lbl2=Label(window,text="Enter path for inspection file:",bg='gray80', fg='black', font=("Helvetica", 11))
lbl2.place(x=42, y=150)
lbl2_var=StringVar()
t2=Entry(bd=2,textvariable=lbl2_var)
t2.place(x=235,y=150,width=450)

lbl3=Label(window,text="Enter path for violations file:", bg='gray80',fg='black', font=("Helvetica", 11))
lbl3.place(x=50, y=200)
lbl3_var=StringVar()
t3=Entry(bd=2,textvariable=lbl3_var)
t3.place(x=235,y=200,width=450)

load_button1= Button(window, text='Load and Clean',bg="plum4",fg="white",font=("Helvetica", 14),command=lambda:[load_and_clean(),prepare_screen()])
load_button1.place(x=150,y=270,width=200,height=50)

exit_button = Button(window, text="Exit",bg="plum4",fg="white",font=("Helvetica", 14),command=window.destroy)
exit_button.place(x=400,y=270,width=200,height=50)

window.mainloop()

If you require any assistance with your Machine Learning projects, please do not hesitate to contact us. We have a team of experienced developers who specialize in Machine Learning and can provide you with the necessary support and expertise to ensure the success of your project. You can reach us through our website or by contacting us directly via email or phone.