We will use data from fetch_20newsgroups, but you can create or use any similar data to test your classification.
from sklearn.datasets import fetch_20newsgroups
train_data = fetch_20newsgroups(subset='train', shuffle=True)
Similarly, we can load the test data.
test_data = fetch_20newsgroups(subset='test', shuffle=True)
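As a quick sanity check, we can look at how many documents each subset contains (the counts in the comment are what I would expect for this dataset):
print(len(train_data.data), len(test_data.data))  # 11314 training and 7532 test documents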
Now let's take a quick look at train_data.
If you inspect it, you will see that train_data behaves like a dictionary (it is a scikit-learn Bunch object), so before fitting the data to any classification model we need some preprocessing.
Let's look at train_data's keys more closely, because we need the text and the target labels from the data.
Among its keys, train_data has data, filenames, target_names and target, and we are interested in data, target_names and target.
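A minimal way to peek inside it (the keys listed in the comments are what the Bunch typically contains):
print(train_data.keys())         # includes 'data', 'filenames', 'target_names', 'target', 'DESCR'
print(train_data.data[0][:200])  # first 200 characters of the first post
print(train_data.target[:5])     # integer labels of the first five posts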
There are 20 classes, and every document is assigned to one of these 20 categories:
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
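The list above comes from train_data.target_names, and each entry in train_data.target is an index into it; for example:
# map the integer label of the first training document back to its category name
print(train_data.target_names[train_data.target[0]])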
Here we use CountVectorizer to create feature vectors.
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_data.data)
X_train_counts.shape
count_vect.fit_transform(train_data.data) returns a document-term matrix of shape [n_samples, n_features].
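To make the document-term matrix concrete, here is a tiny toy example (the two-sentence corpus is made up for illustration; get_feature_names_out needs scikit-learn >= 1.0, older versions call it get_feature_names):
from sklearn.feature_extraction.text import CountVectorizer
toy_corpus = ["the cat sat", "the cat sat on the mat"]
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_corpus)
print(toy_vect.get_feature_names_out())  # ['cat' 'mat' 'on' 'sat' 'the']
print(toy_counts.toarray())              # each row is a document, each column a word count
# [[1 0 0 1 1]
#  [1 1 1 1 2]]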
TF: Just counting the number of words in each document has one issue: it gives more weight to longer documents than to shorter ones. To avoid this, we can use the term frequency (TF), i.e. count(word) / total words, in each document.
TF-IDF: Finally, we can also reduce the weight of very common words (the, is, an, etc.) that occur in almost every document. This is called TF-IDF, i.e. Term Frequency times Inverse Document Frequency.
We can achieve both with the few lines of code below:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
This code outputs the dimensions of the document-term matrix: (11314, 130107).
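As a side note, scikit-learn also provides TfidfVectorizer, which combines CountVectorizer and TfidfTransformer into a single step; a minimal equivalent sketch:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer()
X_train_tfidf = tfidf_vect.fit_transform(train_data.data)
X_train_tfidf.shape  # same (11314, 130107) document-term matrix as before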
Running ML algorithms
There are various algorithms that can be used for text classification. We will start with the simplest one, Naive Bayes (NB).
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, train_data.target)
This will train the NB classifier on the training data we provided.
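To see the classifier in action on new text, we can run a couple of made-up sentences through the same transformations (the example documents are purely illustrative):
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)            # reuse the fitted vectorizer
X_new_tfidf = tfidf_transformer.transform(X_new_counts)  # reuse the fitted transformer
for doc, category in zip(docs_new, clf.predict(X_new_tfidf)):
    print('%r => %s' % (doc, train_data.target_names[category]))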
Performance of the NB classifier: now we will test the performance of the NB classifier on the test set.
import numpy as np
X_test_counts = count_vect.transform(test_data.data)       # transform (not fit_transform) with the fitted vectorizer
X_test_tfidf = tfidf_transformer.transform(X_test_counts)  # same for the fitted TF-IDF transformer
predicted = clf.predict(X_test_tfidf)
np.mean(predicted == test_data.target)
The accuracy we get is ~77.38%
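If you prefer to keep vectorization and classification together, the same workflow can be written as a scikit-learn Pipeline (a sketch; text_clf is just an illustrative name):
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(train_data.data, train_data.target)
predicted = text_clf.predict(test_data.data)
np.mean(predicted == test_data.target)  # should reproduce the same ~77.38% accuracy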
Update: If anyone tries a different algorithm, please share the results in the comment section; it will be useful for everyone.
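For example, swapping a linear classifier into the pipeline above is a one-line change (a sketch using SGDClassifier; the hyperparameters are illustrative and results will vary):
from sklearn.linear_model import SGDClassifier
svm_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', alpha=1e-3, random_state=42)),
])
svm_clf.fit(train_data.data, train_data.target)
np.mean(svm_clf.predict(test_data.data) == test_data.target)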
Please let me know if there are any mistakes; feedback is welcome ✌️
Recommend, comment, share if you liked this article