In this blog post we will be discussing one particular NLP technique known as ‘Text clustering’. It is an unsupervised method and it provides valuable insights about the data (textual) without having to label it first. How it does so? Let’s find out.
Clustering is the process of making groups of similar objects. The objects in a group should be very similar to each other and each group should be very dissimilar. Text clustering is the task of grouping a set of unlabeled texts in such a way that texts in the same cluster are more similar to each other than to those in other clusters. Text clustering algorithms process text and determine if natural clusters (groups) exist in the data.
The text clustering algorithm works in five stages listed below:
Transformations on raw stream of free flow text
Feature extraction
Clustering using Euclidean Distances
Auto-Tagging based on Cluster Centers
We will discuss these steps one by one.
If you are familiar with clustering then you know that the process of clustering involves taking some data, applying the clustering algorithm of your choice and getting results. But when it comes to cluster similar texts together, the process is not so straightforward. To get a meaningful result isn’t that easy and requires a number of pre-processing steps.
Transformations on raw stream of free flow text
The first and most important step in any NLP model is to bring your text into a form that is predictable and analyzable so that machine learning algorithms can perform better.
The various text pre-processing steps are:
Tokenization: splitting the sentence into words.
Lower casing: converting a word to lower case (as the words ‘Apple’ and ‘apple’ would be considered different in vector space).
Stop words removal: Stop words are very commonly used words (a, an, the, etc.) in the documents. These words do not really signify any importance as they do not help in distinguishing two documents.
Stemming: It is a process of transforming a word to its root form. OR Lemmatization: It is used for the same purpose as stemming but lemmatization reduces the words to a word existing in the language. We can choose either stemming or lemmatization. The one disadvantage of stemming is that it sometimes reduces a word to a word which doesn’t make any sense. For example, it may reduce the word ‘apple’ to ‘appl’ which doesn’t represent anything meaningful. However, it is easier to build a stemmer than to build a lemmatizer as the latter requires deep linguistics knowledge in constructing dictionaries to look up the lemma of the word.
Feature Extraction
We are still not done with transforming our data. Why? Because the machine only understands 1 and 0 and what we have are words. What now? Here comes ‘Feature extraction’ to the rescue.
The mapping from textual data to real-valued vectors is called feature extraction (also known as vector space modelling). This vector form of the textual data can be easily used by machine learning algorithms to perform clustering or classification. There are a number of ways to perform feature extraction, such as:
Bag of words
CountVectorizer
Term Frequency-Inverse Document Frequency (TF-IDF)
Hashing vectorizer
Word embeddings: word2vec, ElMo or glove etc.
I have discussed these techniques in detail with example in my previous blog, you can check it out here.
Feature extraction helps to:
Remove uncorrelated or superfluous features
Reduce the dimensionality
shorten the time train the model
improve the accuracy of learning algorithm
Clustering
We can cluster different texts based on the feature vectors we generate. The similarity in text can be compared by measuring the distance between these feature vectors. Some of the metrics for computing similarity between two pieces of text are Jaccard coefficient, cosine similarity and Euclidean distance.
Clustering can be divided into two broad groups:
Hard Clustering: This groups items such that each item is assigned to only one cluster. For example, determining whether a comment is expressing a positive sentiment or a negative sentiment. In this case the comment would either belong to the negative cluster or the positive cluster. e.g. K-means clustering
Soft Clustering: Sometimes we don't need a binary answer. Soft clustering is about grouping items such that an item can belong to multiple clusters. e.g. Fuzzy C Means (FCM) is a soft clustering algorithm.
Some common type of clustering methods used are:
Hierarchical: these methods can be divided into two subcategories- In the divisive (“top down”) approach, we start with one cluster and split that into sub-clusters. Example algorithms include DIANA and MONA. In the agglomerative (“bottom up”) approach, each document starts as its own cluster and then we merge similar ones into bigger clusters. Examples include BIRCH and CURE.
Partitioning: organizes the data into non-hierarchical clusters, in contrast to hierarchical clustering defined above. These algorithms are efficient but sensitive to initial conditions and outliers. k-means is a popular algorithm but requires the right choice of k. Other examples are ISODATA and PAM.
Density: Instead of using a distance measure, we form clusters based on how many data points fall within a given radius. Density-based clustering connects areas of high example density into clusters. This allows for arbitrary-shaped distributions as long as dense areas can be connected. These algorithms have difficulty with data of varying densities and high dimensions. Further, by design, these algorithms do not assign outliers to clusters. DBSCAN is the most well-known algorithm.
Graph: Some algorithms have made use of knowledge graphs to assess document similarity. This addresses the problem of polysemy (ambiguity) and synonymy (similar meaning).
Probabilistic: A cluster of words belong to a topic and the task is to identify these topics. Words also have probabilities that they belong to a topic. Topic Modelling is a separate NLP task but it's similar to soft clustering. pLSA and LDA are example topic models.
Using one of the above algorithms we can cluster our feature vectors into groups based on similarity. By this point our text has been clustered. And we can extract the needed information by looking at the clusters.
Auto-Tagging based on Cluster Centres
The algorithm then generates cluster tags, known as cluster centers which represent the documents contained in these clusters.
Use cases
Document Retrieval: To improve recall, start by adding other documents from the same cluster.
Taxonomy Generation: Automatically generate hierarchical taxonomies for browsing content.
Fake News Identification: Detect if a news is genuine or fake.
Language Translation: Translation of a sentence from one language to another.
Spam Mail Filtering: Detect unsolicited and unwanted email/messages.
Customer Support Issue Analysis: Identify commonly reported support issues.
Common challenges involved in text clustering
Document clustering is being studied for many decades. It's far from trivial or a solved problem. The challenges include the following:
Selecting appropriate features of documents that should be used for clustering.
Selecting an appropriate similarity measure between documents.
Selecting an appropriate clustering method utilizing the above similarity measure.
Implementing the clustering algorithm in an efficient way that makes it feasible in terms of memory and CPU resources.
Finding ways of assessing the quality of the performed clustering.
If you need implementation for any of the topics mentioned above or assignment help on any of its variants, feel free contact us.