Finding the Levenshtein distance in Python
Submitted by Poromenos on Thu, 17/01/2008 - 22:35
I discovered today that Moneygement won’t accept unicode characters when someone adds transactions by email because of the editdist module I used to check it. Since I don’t really need a fast function to do it (it’s all eight-letter words on average), I decided to write my own Python version of the function and am sharing it here if anyone needs it (because I haven’t found a Python implementation anywhere). It’s released under a BSD license with attribution, meaning that I’d like it if my name was mentioned where it is used :)
def levenshtein_distance(first, second):
"""Find the Levenshtein distance between two strings."""
if len(first) > len(second):
first, second = second, first
if len(second) == 0:
return len(first)
first_length = len(first) + 1
second_length = len(second) + 1
distance_matrix = [[0] * second_length for x in range(first_length)]
for i in range(first_length):
distance_matrix[i][0] = i
for j in range(second_length):
distance_matrix[0][j]=j
for i in xrange(1, first_length):
for j in range(1, second_length):
deletion = distance_matrix[i-1][j] + 1
insertion = distance_matrix[i][j-1] + 1
substitution = distance_matrix[i-1][j-1]
if first[i-1] != second[j-1]:
substitution += 1
distance_matrix[i][j] = min(insertion, deletion, substitution)
return distance_matrix[first_length-1][second_length-1]
"""Find the Levenshtein distance between two strings."""
if len(first) > len(second):
first, second = second, first
if len(second) == 0:
return len(first)
first_length = len(first) + 1
second_length = len(second) + 1
distance_matrix = [[0] * second_length for x in range(first_length)]
for i in range(first_length):
distance_matrix[i][0] = i
for j in range(second_length):
distance_matrix[0][j]=j
for i in xrange(1, first_length):
for j in range(1, second_length):
deletion = distance_matrix[i-1][j] + 1
insertion = distance_matrix[i][j-1] + 1
substitution = distance_matrix[i-1][j-1]
if first[i-1] != second[j-1]:
substitution += 1
distance_matrix[i][j] = min(insertion, deletion, substitution)
return distance_matrix[first_length-1][second_length-1]
Peter Norvig (of the AI book and Google fame) has an interesting article on spelling correction, which is a fairly typical use of levenshtein. In his case though he generates all of the words within edit distance 1 of a source word, which although very computationally intensive might be of interest as well.
His code, from the link, is below..
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):
n = len(word)
return set([word[0:i]+word[i+1:] for i in range(n)] + # deletion
[word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
[word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
[word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet]) # insertion
- reply
Submitted by Ross (not verified) on Wed, 23/04/2008 - 11:20.Oh, I remember reading that, it was indeed very interesting, thanks for the link.
---
Vidi, Vici, Veni.
- reply
Submitted by Poromenos on Wed, 23/04/2008 - 12:43.Just wanted to leave a link to a page I found with an implementation in multiple languages;
http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#Python
You might want to consider checking it out, and altering your use of "range" to either xrange() or enumerate().
As you mentioned, if it's comparing small stuff then it's a minimal impact but an iterator would be better then building the entire array into memory.
Mostly just for posterity sake and to say thanks for sharing!
- reply
Submitted by jay (not verified) on Fri, 27/02/2009 - 17:17.Hi, your code does not work, e.g.
print levenshtein_distance("abcdefghijk","cdefghijklm")
returns 2
4 would be correct.
The matrix is not correctly initialised.
- reply
Submitted by Stephan (not verified) on Thu, 23/04/2009 - 17:29.You are, of course, correct. The code has been fixed, thank you for the report.
- reply
Submitted by Poromenos on Thu, 23/04/2009 - 17:52.