Introduction
In the field of Natural Language Processing (NLP), one exciting application is the development of predictive keyboards. These keyboards use computational techniques to make word predictions based on the text input provided by the user. In this blog post, we will explore the process of building a predictive keyboard using NLP concepts and Python programming.
Overview
The predictive keyboard we will create utilizes a unigram language model. This model keeps track of every unique word encountered and calculates the probability of each word occurring based on the training data. The keyboard predicts the most likely word given the currently entered word prefix.
Implementation Steps:
Creating a Map Class: We begin by implementing a Map class that inherits from a HashTable class. During development, Python's built-in dictionary object can stand in for the Map; once everything works, we swap in our own Map objects to implement the hash map.
DictEntry Class: Next, we define a DictEntry class that represents a word and its unigram probability. This class will have methods to retrieve the word and probability, as well as to match the word with a given pattern.
WordPredictor Class: The main component of our predictive keyboard is the WordPredictor class. This class learns to make word predictions based on the training data provided. It has methods to train the model using text files, count the number of words encountered, retrieve word counts, and build the word probabilities and prefix mapping.
Training the Model: Training the model involves parsing the training data, updating word counts, and computing word probabilities. The train() method reads words from a file, converts them to lowercase, and strips any non-alphabetic characters. The train_word() method trains the model on a single word. After training, the build() method is called to recompute word probabilities and prefix mappings.
Making Predictions: The WordPredictor class can predict the most probable word for a given word prefix using the prefix_to_entry map. The get_best() method retrieves the most likely DictEntry object for a given prefix.
Bonus
The bonus part of the project involves enhancing the predictive keyboard. This includes implementing features such as displaying the top N possible completions for the current word and incorporating a bigram language model to make predictions based on the prior word.
Classes to Define
Map
Implement a Map class that inherits from a HashTable class (see the class notes for much of this code). Your HashTable class should implement chaining for collision handling.
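The chaining idea can be sketched as follows. This is a minimal illustration, not the class-notes version: the bucket count, method names (`put`/`get`), and the tuple-per-entry layout are all assumptions you should adapt to your own HashTable code.

```python
class HashTable:
    """A tiny hash table that resolves collisions by chaining."""

    def __init__(self, size=1024):
        self.size = size
        # One list (chain) per slot; colliding keys share a chain.
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % self.size


class Map(HashTable):
    def put(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                    # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))          # new key joins the chain

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default
```

Two keys that hash to the same slot simply end up in the same chain, so lookups scan a short list rather than failing.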
Note: I recommend first solving this assignment using the built-in Python dictionary object. Once that is working, replace the dictionaries with your Map objects.
DictEntry
We need a class to represent a word and its unigram probability. Here is the API to DictEntry:
# create a new entry given a word and probability
__init__(word, prob) # string, float
# getter for the word
get_word() # returns string
# getter for the probability
get_prob() # returns float
# does this word match the given pattern?
match_pattern(pattern) # returns string
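A sketch of this API in Python might look like the following. The spec lists match_pattern() as returning a string, so one plausible reading (an assumption on my part) is that it returns the word when it begins with the given pattern and an empty string otherwise.

```python
class DictEntry:
    """A word paired with its unigram probability."""

    def __init__(self, word, prob):
        self.word = word
        self.prob = prob

    def get_word(self):
        return self.word

    def get_prob(self):
        return self.prob

    def match_pattern(self, pattern):
        # Assumed semantics: return the word if it starts with the
        # pattern, otherwise an empty string.
        return self.word if self.word.startswith(pattern) else ""
```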
WordPredictor
The brains of this whole operation is the class WordPredictor. This class learns how to make word predictions based on being shown training data. It can learn from all the words in a file (via the train() method) or from a single individual word (via the train_word() method). After new training data is added, the build() method must be called so the class can recompute the most likely word for all possible prefixes. Here is the API for the WordPredictor class:
# train the unigram model on all the words in the given file
train(training_file) # string
# train on just a single word
train_word(word) # string
# get the number of total words we've trained on
get_training_count() # returns integer
# get the number of times we've seen this word (0 if never)
get_word_count(word) # string. returns integer
# recompute the word probabilities and prefix mapping
build()
# return the most likely DictEntry object given a word prefix
get_best(prefix) # string. returns DictEntry
Training the Model
Model training occurs in the train() and train_word() methods. train() should parse out each word in the specified file on disk. If the file cannot be read, it should print out an error, "Could not open training file: file.txt". All training words should be converted to lowercase and stripped of any characters that are not a-z or the single apostrophe. During training, you need to update the instance variables:
word_to_count: Map of string (word key) to integer (count value) pairs
total: Integer
The word_to_count instance variable is a Map where the keys are the unique words encountered in the training data. The values are the integer count of how many times we've seen each word in the data. Only words seen in the training data will have an entry in the word_to_count Map. The total instance variable tracks how many words of training data we have seen. That is, total should equal the sum of all integer counts stored in your map. Training is cumulative for repeated calls to train() and train_word(), so you just keep increasing the counts stored in your word_to_count map and your total count of training words.
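The training logic above can be sketched like this. It uses a plain dictionary for word_to_count (as recommended earlier, before swapping in your own Map), and the regular expression for stripping characters is one reasonable choice, not the only one.

```python
import re


class WordPredictor:
    def __init__(self):
        self.word_to_count = {}     # word -> count (plain dict for now)
        self.total = 0              # sum of all counts
        self.prefix_to_entry = {}   # filled in by build()

    def train_word(self, word):
        # Lowercase, then keep only a-z and the single apostrophe.
        word = re.sub(r"[^a-z']", "", word.lower())
        if word:
            self.word_to_count[word] = self.word_to_count.get(word, 0) + 1
            self.total += 1

    def train(self, training_file):
        try:
            with open(training_file) as f:
                for line in f:
                    for token in line.split():
                        self.train_word(token)
        except OSError:
            print(f"Could not open training file: {training_file}")

    def get_training_count(self):
        return self.total

    def get_word_count(self, word):
        return self.word_to_count.get(word, 0)
```

Because train() just calls train_word() for every token, repeated calls accumulate counts, which gives the cumulative behavior the assignment asks for.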
Making Predictions
The class needs to be able to predict the most probable word for a given word prefix, given the statistics found during training. For example, after training on mobydick.txt, if the user types the prefix "wh", "wha", "whal" or "whale", the most likely word is "whale". The job of mapping word prefix strings to the most probable dictionary entry falls to the third instance variable:
prefix_to_entry: Map of string (prefix key) to DictEntry (value) pairs
A Map can map multiple keys to the same value. This is exactly what we need in this case since a set of strings (such as "wh", "wha", "whal" and "whale") may all need to map to a single DictEntry object. You need to ensure that each prefix maps to its most likely word. So even if the word "thespian" was encountered first in the training data, for any sensible training text, "th" should probably map to "the" and not "thespian".
Every time the build() method is called, you need to re-initialize the contents of prefix_to_entry. This ensures that you take into account any new words or updated counts present in word_to_count. Here is an overview of the algorithm you should be aiming for in build():
Loop over all possible (key, value) pairs in word_to_count.
For each pair: Calculate the probability of that word given its count and the number of training words. Create a new DictEntry object representing this word and its probability.
For each pair: Loop over all possible prefixes of the word, from a prefix of only the first letter to a "prefix" that is the entire word.
For each prefix, if the current key (word) has a probability strictly greater than what is currently stored for that prefix, replace the old value (a DictEntry object) with the new object.
This results in our map containing a mapping from every valid prefix of every training word to the most probable complete word under the unigram language model. Your get_best() method can be a one-liner!
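The algorithm above can be sketched as follows. It is written as free functions for brevity (in your class they would be the build() and get_best() methods), and a compact DictEntry is repeated here only so the sketch runs standalone.

```python
class DictEntry:
    def __init__(self, word, prob):
        self.word, self.prob = word, prob

    def get_word(self):
        return self.word

    def get_prob(self):
        return self.prob


def build(word_to_count, total):
    prefix_to_entry = {}                           # re-initialized every build
    for word, count in word_to_count.items():
        entry = DictEntry(word, count / total)     # unigram probability
        for i in range(1, len(word) + 1):          # "w", "wh", ..., full word
            prefix = word[:i]
            best = prefix_to_entry.get(prefix)
            # Keep the entry only if it is strictly more probable.
            if best is None or entry.get_prob() > best.get_prob():
                prefix_to_entry[prefix] = entry
    return prefix_to_entry


def get_best(prefix_to_entry, prefix):
    return prefix_to_entry.get(prefix)             # the promised one-liner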
Testing the Prediction Class
There is a fairly extensive test main() available in keyboard_test.py. The first part of main() tests your training routines, while the second half focuses on the prediction part. Here is the output you can use to compare against for correctness (not all output will match exactly; think about why):
bad1 = null
training words = 202
bad2 = null
Could not open training file: thisfiledoesnotexist.txt
training words = 202
count, the = 10
count, me = 5
count, zebra = 0
count, ishmael = 1
count, savage = 0
count, the = 32
count, me = 5
count, zebra = 0
count, ishmael = 1
count, savage = 1
a -> and 0.03917050691244239
ab -> about 0.004608294930875576
b -> bird 0.004608294930875576
be -> before 0.002304147465437788
t -> the 0.07373271889400922
th -> the 0.07373271889400922
archang -> archangelic 0.002304147465437788
training words = 434
before, b -> bird 0.004608294930875576
before, pn -> null
after, b -> bird 0.004576659038901602
after, pn -> pneumonoultramicroscopicsilicovolcanoconiosis 0.002288329519450801
training words = 437
a -> and 0.030079417288752873
ab -> about 0.001472985727769356
b -> but 0.008580499385064212
be -> be 0.0048908846494866
t -> the 0.06757143265738066
th -> the 0.06757143265738066
archang -> archangel 3.3368608719694154E-5
training words = 209778
elapsed (s) : 0.459
Random load test:
elapsed (s) : 3.447
Hit % = 30.0537
Bonus
Create a more advanced predictive keyboard that can show the top N possible completions for the current word. The value for N is passed as the first argument on the command-line. The interface should show up to N words that are consistent with the currently typed letters; fewer than N words may be shown for some prefixes. The predictions should be labeled with the numbers 0 up to N-1. If the user hits a number, the corresponding prediction (if any) is output. The return key should select the best prediction. Try this using the wiki_200k.txt file.
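One way to rank the top N completions is to skip prefix_to_entry (which stores only the single best word per prefix) and score every matching word directly from word_to_count. This is a sketch under that assumption; the function name and signature are my own.

```python
def top_n_completions(word_to_count, total, prefix, n):
    """Return up to n words starting with prefix, most probable first."""
    matches = [(count / total, word)
               for word, count in word_to_count.items()
               if word.startswith(prefix)]
    # Sort by descending probability, breaking ties alphabetically.
    matches.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in matches[:n]]
```

Scanning the whole vocabulary per keystroke is fine for this assignment's data sizes; a trie would be the natural next step if it became slow.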
Make your predictions depend on not only the current word prefix, but also on the prior word (i.e. a bigram language model). Note that you might not always be able to make a prediction using the prior word (e.g. at the start of the sentence or if the prior word was never seen in the training data). This is a common problem in language modeling called sparsity. One way to handle this problem is to backoff to a lower-order language model. In your case, this would mean backing off to the unigram model which can always make a prediction as long as the current prefix is consistent with some word in the training data.
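The backoff step can be sketched like this. Here bigram_to_entry is an assumed second map keyed by "prior_word prefix" strings (built during build() analogously to prefix_to_entry); the values can be DictEntry objects, though any value type works for the lookup logic itself.

```python
def get_best_bigram(bigram_to_entry, prefix_to_entry, prior_word, prefix):
    """Prefer the bigram prediction; back off to the unigram model."""
    entry = None
    if prior_word is not None:
        entry = bigram_to_entry.get(prior_word + " " + prefix)
    if entry is None:
        # Sparsity: no prior word, or this (prior, prefix) pair was
        # never seen in training, so fall back to the unigram map.
        entry = prefix_to_entry.get(prefix)
    return entry
```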
In this blog post, we explored the process of building a predictive keyboard using NLP techniques. We learned about unigram language models, hash maps, and how to train a predictive model based on training data. By following the implementation steps and testing the prediction class, we can create a fully functional predictive keyboard that can make accurate word predictions. This project demonstrates the power of NLP in enhancing user experience and can be extended further to incorporate more advanced features.
If you require assistance with a similar project or need help with any NLP-related projects, feel free to contact us. Our team of experts is well-versed in NLP and can provide the support you need.