
Clustering Analysis using Mixture of Gaussians


Introduction

In this blog, we introduce a new project focusing on the project requirement titled "Clustering Analysis using Mixture of Gaussians". We'll walk you through the project requirements, highlighting the tasks at hand. Then, in the solution approach section, we'll delve into what we will provide, discussing the techniques we will apply and the steps we will take.


Let's get started!


Project Requirement


Aim 

The aim of this assignment is to become familiar with clustering using the Mixture of Gaussians model. 


1 Introduction 

For this lab, we will use Peterson and Barney's dataset of vowel formant frequencies. (For more information, see "Classification of Peterson & Barney's vowels using Weka"; a copy of this article is available on QMplus.)


More specifically, Peterson and Barney measured the fundamental frequency F0 and the first three formant frequencies (F1 − F3) of sustained English Vowels, using samples from various speakers. 

The dataset can be found on QMplus, in the files of Assignment 2, in the folder named “data”, in the file “PB data.npy”. Load the file. In your workspace, you will have 4 vectors (F0 − F3), containing the fundamental frequencies (F0, F1, F2 and F3) for each phoneme, and another vector “phoneme id” containing a number representing the id of the phoneme. The arrangement of the data is as follows:


phoneme ID   F0    F1    F2    F3
xxx          xxx   xxx   xxx   xxx
...          ...   ...   ...   ...
10           xxx   xxx   xxx   xxx
...          ...   ...   ...   ...
xxx          xxx   xxx   xxx   xxx



In the exercises that follow, we will use only the dataset associated with formants F1 and F2. 


2 MoG Modelling using the EM Algorithm 


Recall the following definition of a Mixture of Gaussians. Assuming our observed random vector is x, a MoG models p(x) as a sum of weighted Gaussians. More specifically: 

p(x) = Σ_{k=1}^{K} p(c_k) / ( (2π)^{D/2} det(Σ_k)^{1/2} ) · exp( −(1/2) (x − µ_k)^T Σ_k^{−1} (x − µ_k) )    (1)

where D is the dimension of the vector x ∈ R^D; µ_k, Σ_k and p(c_k) are the mean vector, covariance matrix and weight of the k-th Gaussian component; and K is the total number of Gaussian components used.
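To make equation (1) concrete, the density can be evaluated directly by summing the weighted component densities. The following is a minimal sketch; the function name and array layout are our own choices, not part of the assignment code:

```python
import numpy as np

def mog_pdf(x, mu, sigma, weights):
    """Evaluate the MoG density p(x) of equation (1) at one point.

    x       : (D,) observed vector
    mu      : (K, D) component means
    sigma   : (K, D, D) component covariance matrices
    weights : (K,) component weights p(c_k), summing to 1
    """
    D = x.shape[0]
    total = 0.0
    for k in range(len(weights)):
        diff = x - mu[k]
        # normalisation constant (2π)^{D/2} det(Σ_k)^{1/2}
        norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(sigma[k]))
        # quadratic form -(1/2)(x - µ_k)^T Σ_k^{-1} (x - µ_k)
        expo = -0.5 * diff @ np.linalg.inv(sigma[k]) @ diff
        total += weights[k] * np.exp(expo) / norm
    return total
```

With K = 1, D = 1, zero mean and unit variance, this reduces to the standard normal density, 1/√(2π) at x = 0.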


Task 1: 


Load the dataset into your workspace. We will only use the dataset for F1 and F2, arranged into a 2D matrix where the first column will be F1 and the second column will be F2. Produce a plot of F1 against F2. (You should be able to spot some clusters already in this scatter plot.)

Include in your report the corresponding lines of your code and the plot.
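A sketch of how Task 1 might look. We assume here that "PB data.npy" yields an (N, 5) array whose columns are [phoneme id, F0, F1, F2, F3]; the exact layout of the file on QMplus may differ, so adjust the indices accordingly. A synthetic stand-in array with the same assumed layout is used below so the snippet runs on its own:

```python
import numpy as np

# In the assignment you would load the real file instead:
# data = np.load("PB data.npy")

# Synthetic stand-in with the assumed [phoneme_id, F0, F1, F2, F3] layout:
rng = np.random.default_rng(0)
data = np.column_stack([
    rng.integers(1, 11, size=100),    # phoneme id in 1..10
    rng.normal(130, 20, size=100),    # F0 (Hz)
    rng.normal(500, 100, size=100),   # F1 (Hz)
    rng.normal(1500, 300, size=100),  # F2 (Hz)
    rng.normal(2500, 300, size=100),  # F3 (Hz)
])

phoneme_id = data[:, 0]
X = data[:, [2, 3]]  # keep only F1 (first column) and F2 (second column)

# Scatter plot of F1 against F2 (matplotlib assumed available):
# import matplotlib.pyplot as plt
# plt.scatter(X[:, 0], X[:, 1], c=phoneme_id, s=8)
# plt.xlabel("F1 (Hz)"); plt.ylabel("F2 (Hz)"); plt.show()
```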


Task 2: 

Train MoGs on the data for phonemes 1 and 2. You are provided with the Python files task 2.py and plot gaussians.py. Specifically, you are required to:


1. Look at the task 2.py code and understand what it is calculating. Pay particular attention to the initialisation of the means and covariances (also note that it is only estimating diagonal covariances).

2. Generate a dataset X phoneme 1 that contains only the F1 and F2 values for the first phoneme.

3. Run task 2.py on the dataset using K=3 Gaussians (run the code a number of times and note the differences). Save your MoG model: this should comprise the variables mu, s and p.

4. Run task 2.py on the dataset using K=6 Gaussians.

5. Repeat steps 2-4 for the second phoneme.


Include in your report the lines of code you wrote, and results that illustrate the learnt models. 
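The steps above can be sketched with a small diagonal-covariance EM routine. This is our own illustrative implementation, not the code of task 2.py (whose initialisation in particular may differ), but it estimates the same quantities mu, s and p:

```python
import numpy as np

def fit_diag_mog(X, K, n_iter=50, seed=0):
    """EM for a MoG with diagonal covariances (illustrative sketch).

    Returns mu (K, D) means, s (K, D) diagonal variances, p (K,) weights.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]  # init means at random data points
    s = np.tile(X.var(axis=0), (K, 1))       # init variances at overall variance
    p = np.full(K, 1.0 / K)                  # init equal weights
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] ∝ p_k N(x_n; mu_k, diag(s_k))
        log_r = np.empty((N, K))
        for k in range(K):
            diff2 = (X - mu[k]) ** 2 / s[k]
            log_r[:, k] = (np.log(p[k])
                           - 0.5 * np.sum(np.log(2 * np.pi * s[k]))
                           - 0.5 * diff2.sum(axis=1))
        log_r -= log_r.max(axis=1, keepdims=True)  # stabilise before exp
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and diagonal variances
        Nk = r.sum(axis=0)
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            s[k] = (r[:, k, None] * (X - mu[k]) ** 2).sum(axis=0) / Nk[k]
        p = Nk / N
    return mu, s, p
```

Because the initialisation is random, repeated runs can land in different local optima of the likelihood, which is exactly the behaviour step 3 asks you to observe.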

Task 3: 

Use the 2 MoGs (K=3) learnt in task 2 to build a classifier to discriminate between phonemes 1 and 2. Classify using the Maximum Likelihood (ML) criterion (feel free to reuse parts of the MoG code in task 2.py so that you calculate the likelihood of a data vector under each of the two MoG models) and calculate the misclassification error. Remember that a classification under ML compares p(x; θ1), where θ1 are the parameters of the MoG learnt for the first phoneme, with p(x; θ2), where θ2 are the parameters of the MoG learnt for the second phoneme.


Repeat this for K = 6 and compare the results. 

Include in your report the lines of code that you wrote, explanations of what the code does, and comments on the differences in classification performance.
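The ML comparison can be sketched as follows. We assume each learnt model is a tuple (mu, s, p) with diagonal variances, matching the variables saved in task 2; the function names are our own:

```python
import numpy as np

def diag_mog_loglik(X, mu, s, p):
    """log p(x; θ) for each row of X under a diagonal-covariance MoG.

    mu: (K, D) means, s: (K, D) variances, p: (K,) weights.
    """
    N, K = X.shape[0], len(p)
    log_comp = np.empty((N, K))
    for k in range(K):
        diff2 = (X - mu[k]) ** 2 / s[k]
        log_comp[:, k] = (np.log(p[k])
                          - 0.5 * np.sum(np.log(2 * np.pi * s[k]))
                          - 0.5 * diff2.sum(axis=1))
    # log-sum-exp over components gives the mixture log-likelihood
    m = log_comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()

def ml_classify(X, theta1, theta2):
    """ML rule: label 1 where p(x; θ1) >= p(x; θ2), else label 2."""
    ll1 = diag_mog_loglik(X, *theta1)
    ll2 = diag_mog_loglik(X, *theta2)
    return np.where(ll1 >= ll2, 1, 2)

# Misclassification error, given the true labels:
# error = np.mean(ml_classify(X, theta1, theta2) != true_labels)
```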


Task 4: 

Create a grid of points that spans the two datasets. Classify each point in the grid using one of your classifiers. That is, create a classification matrix, M, whose elements are either 1 or 2: M(i, j) is 1 if the point (x1(i), x2(j)) is classified as belonging to phoneme 1, and is 2 otherwise. Here x1 is a vector whose elements range between the minimum and maximum value of F1 for the first two phonemes, and x2 similarly for F2.


Display the classification matrix. Include the lines of code in your report and comment them.
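A sketch of the gridding logic. In the real task the classifier is the ML rule from Task 3; here a simple nearest-mean rule stands in, and the grid ranges and means are made-up values, so that the structure of the loop can be shown on its own:

```python
import numpy as np

# Assumed per-phoneme means for the stand-in classifier:
mu1 = np.array([400.0, 1200.0])
mu2 = np.array([700.0, 2200.0])

# x1 spans min..max of F1, x2 spans min..max of F2 (ranges assumed):
f1 = np.linspace(300.0, 800.0, 100)
f2 = np.linspace(900.0, 2500.0, 120)

M = np.empty((len(f1), len(f2)), dtype=int)
for i, x1 in enumerate(f1):
    for j, x2 in enumerate(f2):
        x = np.array([x1, x2])
        # Stand-in classifier; replace with the MoG ML rule from Task 3.
        M[i, j] = 1 if np.linalg.norm(x - mu1) <= np.linalg.norm(x - mu2) else 2

# Display, e.g.:
# import matplotlib.pyplot as plt
# plt.imshow(M.T, origin="lower", extent=[f1[0], f1[-1], f2[0], f2[-1]],
#            aspect="auto"); plt.xlabel("F1"); plt.ylabel("F2"); plt.show()
```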



Task 5: 

In the code of task 5.py, a MoG with full covariance matrices is fit to the data. Now, create a new dataset that will contain 3 columns, as follows:

X = [F1, F2, F1 + F2] (2) 


Fit a MoG model to the new data. What is the problem that you observe? Explain why. Suggest ways of overcoming the singularity problem and implement them. 

Include the lines of code in your report, and graphs/plots so as to support your observations. 
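The root of the problem can be demonstrated in a few lines. Because the third column F1 + F2 is an exact linear combination of the first two, the sample covariance matrix is singular, so det(Σ) = 0 and the density in equation (1) cannot be evaluated. One common remedy, sketched below with synthetic data and an illustrative ridge value, is to add a small constant to the diagonal before inverting:

```python
import numpy as np

# Synthetic F1/F2 columns standing in for the real formant data:
rng = np.random.default_rng(0)
F1 = rng.normal(500, 100, 200)
F2 = rng.normal(1500, 300, 200)
X = np.column_stack([F1, F2, F1 + F2])  # third column is linearly dependent

cov = np.cov(X, rowvar=False)
# The smallest eigenvalue is (numerically) zero: cov is singular, so the
# Gaussian normalisation term det(Σ)^{1/2} in equation (1) breaks down.

# Ridge regularisation: add a small constant to the diagonal.
eps = 1e-3  # illustrative value; in practice scale it to the data
cov_reg = cov + eps * np.eye(3)
inv = np.linalg.inv(cov_reg)  # now well defined
```

Other remedies include dropping the redundant column or constraining the model to diagonal covariances, as task 2.py does.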


Write a report about what you have done, along with relevant plots. Save the solution in a folder with your ID. Create and submit a .zip that contains: 

1. all of your code and 

2. a copy of your report. The report should be in .pdf format.



Solution Approach 


In this project, we employed a variety of methods and techniques to tackle the task at hand effectively. Let's delve into the key components of our solution approach:

  • Data Exploration and Visualization: We began by exploring and visualizing the Peterson and Barney's dataset of vowel formant frequencies. Understanding the data distribution and patterns was crucial for subsequent analysis.

  • Mixture of Gaussians (MoG) Model: Our primary modeling technique revolved around the Mixture of Gaussians model. This probabilistic model allowed us to represent complex data distributions as a weighted sum of Gaussian components.

  • EM Algorithm: To train the MoG model, we utilized the Expectation-Maximization (EM) algorithm. This iterative approach enabled us to estimate the model parameters, including means, covariances, and component weights, by maximizing the likelihood of the observed data.

  • Classifier Construction: Leveraging the trained MoG models, we constructed classifiers to discriminate between different phonemes. Using the Maximum Likelihood criterion, we calculated the likelihood of data vectors for each phoneme model and classified them accordingly.

  • Grid Classification: We extended our analysis by creating a grid of points spanning the datasets. Each point was classified using our trained classifiers, resulting in a classification matrix that provided insights into the distribution of phonemes in the feature space.

  • Handling Singularity Issues: In addressing singularity problems encountered during model fitting, we explored techniques to overcome these challenges. This included feature engineering and alternative covariance matrix estimation methods.

  • Visualization and Interpretation: Throughout the project, visualization played a pivotal role in interpreting results and gaining insights. We employed various plotting techniques to visualize data distributions, Gaussian components, classification results, and more.


Benefits of Our Approach

At CodersArts, we don't just meet project requirements; we excel in delivering tailored solutions that drive tangible results for our clients. Here's why partnering with us for your clustering analysis needs is a smart choice:


  1. Expertise in Machine Learning: Our team comprises seasoned experts in machine learning techniques, with a deep understanding of clustering analysis using advanced models like the Mixture of Gaussians. With our expertise, you can trust that your project is in capable hands, ensuring optimal outcomes.

  2. Customized Solutions: We recognize that every project is unique, and that's why we take a customized approach to meet your specific needs. Whether it's fine-tuning parameters or adapting methodologies, we tailor our solutions to align perfectly with your objectives.

  3. Efficient Project Execution: Time is of the essence, and we understand the importance of timely project delivery. With streamlined processes and efficient project management, we ensure that your project progresses smoothly from inception to completion, without compromising on quality.

  4. Insightful Analysis: Beyond just numbers and algorithms, we provide insightful analysis that goes beyond the surface. Our team digs deep into the data to uncover meaningful patterns and trends, empowering you with actionable insights to make informed decisions.

  5. Ongoing Support and Collaboration: Our commitment to your success extends beyond project completion. We foster collaborative partnerships with our clients, offering ongoing support and guidance to help you leverage the results of our analysis for continued growth and success.


If you require any assistance with the project discussed in this blog, or if you find yourself in need of similar support for other projects, please don't hesitate to reach out to us. Our team can be contacted at any time via email at contact@codersarts.com.
