Autocorrect and edit distance problem in Java

When you search in Google, it provides autocorrect for validating the keywords you enter into the input box. Behind the scene, it checks the input words against a dictionary. If it doesn’t find the keyword in the dictionary, it suggests a most likely replacement.

Here I introduce a simple solution to implement autocorrect. To do this, the program associates with every word in the dictionary a frequency, the percent of the time that word is expected to appear in a large document. When a word is misspelled (i.e. it is not found in the dictionary), the program suggests a “similar” word whose frequency is larger or equal to any other “similar” word.

For implementation, first you can implement the dictionary as a Trie. A Trie is a tree-based data structure designed to store items that are sequences of characters from an alphabet. Each Trie-Node stores a count and a sequence of Nodes, one for each element in the alphabet. Each Trie-Node has a single parent except for the root of the Trie which does not have a parent. A sequence of characters from the alphabet is stored in the Trie as a path in the Trie from the root to the last character of the sequence.

Edit Distances is the key part to define the similarity of two words. In computer science, it is a way to measure the number of different letters between two strings. There are four Edit Distances:
● Deletion Distance: 1 from “bird” are “ird”, “brd”, “bid”, and “bir”.
● Transposition Distance: 1 from “house” are “ohuse”, “huose”, “hosue” and “houes”.
● Alteration Distance: 1 from “top” are “aop”, “bop”, …, “zop”, “tap”, “tbp”, …, “tzp”, “toa”, “tob”, …, and “toz”.
● Insertion Distance: 1 from “ask” are “aask”, “bask”, “cask”, … “zask”
Edit distance can be efficiently implemented by using Dynamic programming.

edit distance dynamic programming

Now lets clarify the rules of “most similar” based on Edit distance:
1. It has the “closest” edit distance from the input string.
2. If multiple words have the same edit distance, the most similar word is the one with the closest edit distance that is found the most times in the dictionary.
3. If multiple words are the same edit distance and have the same count/frequency, the most similar word is the one that is first alphabetically.

Google Interview Question:
Implement a spell autocorrect with edit distance as in Google search.

Java Code:

Output:
edit distance:
bird -> ird: 1
ohuse -> house: 1
zopper -> top: 4
ask -> askhim: 3
suggest of aop is top
suggest of bloat is boat
suggest of reah is read

O Notation:
Time complexity: O(S),
S is number of words in dictionary
Space complexity: O(K*N),
K is number of edit distances, N is max number of similar words


Download SpellAutocorrector.zip

Comments are closed