We will use data from fetch_20newsgroups, but you can create or use any similar data to test your classification.
from sklearn.datasets import fetch_20newsgroups
train_data = fetch_20newsgroups(subset='train', shuffle=True)
Similarly, we can load the test data.
test_data = fetch_20newsgroups(subset='test', shuffle=True)
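As a quick sanity check, we can look at how many documents each subset contains (the counts in the comment are what I would expect for this dataset):
print(len(train_data.data), len(test_data.data))  # 11314 training and 7532 test documents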
Now let's take a quick look at train_data.
If you inspect it, you will see that train_data behaves like a dictionary (it is a scikit-learn Bunch object), so before fitting the data to any classification model we need some preprocessing.
Let's look at train_data's keys more closely, because we need the text and the target labels from the data.
Among its keys, train_data has data, filenames, target_names and target, and we are interested in data, target_names and target.
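A minimal way to peek inside it (the keys listed in the comments are what the Bunch typically contains):
print(train_data.keys())         # includes 'data', 'filenames', 'target_names', 'target', 'DESCR'
print(train_data.data[0][:200])  # first 200 characters of the first post
print(train_data.target[:5])     # integer labels of the first five posts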
There are 20 classes, and every document is assigned to one of these 20 categories:
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
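The list above comes from train_data.target_names, and each entry in train_data.target is an index into it; for example:
# map the integer label of the first training document back to its category name
print(train_data.target_names[train_data.target[0]])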
Here we use CountVectorizer to create feature vectors.
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_data.data)
X_train_counts.shape
count_vect.fit_transform(train_data.data) returns a document-term matrix of shape [n_samples, n_features].
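To make the document-term matrix concrete, here is a tiny toy example (the two-sentence corpus is made up for illustration; get_feature_names_out needs scikit-learn >= 1.0, older versions call it get_feature_names):
from sklearn.feature_extraction.text import CountVectorizer
toy_corpus = ["the cat sat", "the cat sat on the mat"]
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_corpus)
print(toy_vect.get_feature_names_out())  # ['cat' 'mat' 'on' 'sat' 'the']
print(toy_counts.toarray())              # each row is a document, each column a word count
# [[1 0 0 1 1]
#  [1 1 1 1 2]]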
TF: Just counting the number of words in each document has one issue: it gives more weight to longer documents than to shorter ones. To avoid this, we can use the term frequency (TF), i.e. count(word) / total words, in each document.
TF-IDF: Finally, we can also reduce the weight of very common words (the, is, an, etc.) that occur in almost every document. This is called TF-IDF, i.e. Term Frequency times Inverse Document Frequency.
We can achieve both with the few lines of code below:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
This code outputs the dimensions of the document-term matrix: (11314, 130107).
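As a side note, scikit-learn also provides TfidfVectorizer, which combines CountVectorizer and TfidfTransformer into a single step; a minimal equivalent sketch:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer()
X_train_tfidf = tfidf_vect.fit_transform(train_data.data)
X_train_tfidf.shape  # same (11314, 130107) document-term matrix as before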
Running ML algorithms
There are various algorithms that can be used for text classification. We will start with the simplest one, Naive Bayes (NB).
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, train_data.target)
This will train the NB classifier on the training data we provided.
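To see the classifier in action on new text, we can run a couple of made-up sentences through the same transformations (the example documents are purely illustrative):
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)            # reuse the fitted vectorizer
X_new_tfidf = tfidf_transformer.transform(X_new_counts)  # reuse the fitted transformer
for doc, category in zip(docs_new, clf.predict(X_new_tfidf)):
    print('%r => %s' % (doc, train_data.target_names[category]))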
Performance of the NB classifier: now we will test the performance of the NB classifier on the test set.
import numpy as np
X_test_counts = count_vect.transform(test_data.data)       # transform (not fit_transform) with the fitted vectorizer
X_test_tfidf = tfidf_transformer.transform(X_test_counts)  # same for the fitted TF-IDF transformer
predicted = clf.predict(X_test_tfidf)
np.mean(predicted == test_data.target)
The accuracy we get is ~77.38%
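If you prefer to keep vectorization and classification together, the same workflow can be written as a scikit-learn Pipeline (a sketch; text_clf is just an illustrative name):
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(train_data.data, train_data.target)
predicted = text_clf.predict(test_data.data)
np.mean(predicted == test_data.target)  # should reproduce the same ~77.38% accuracy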
Update: If anyone tries a different algorithm, please share the results in the comment section; it will be useful for everyone.
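For example, swapping a linear classifier into the pipeline above is a one-line change (a sketch using SGDClassifier; the hyperparameters are illustrative and results will vary):
from sklearn.linear_model import SGDClassifier
svm_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', alpha=1e-3, random_state=42)),
])
svm_clf.fit(train_data.data, train_data.target)
np.mean(svm_clf.predict(test_data.data) == test_data.target)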
Please let me know if there are any mistakes; feedback is welcome ✌️
Recommend, comment, share if you liked this article