Finding the Levenshtein distance in Python

I discovered today that Moneygement won’t accept unicode characters when someone adds transactions by email because of the editdist module I used to check it. Since I don’t really need a fast function to do it (it’s all eight-letter words on average), I decided to write my own Python version of the function and am sharing it here if anyone needs it (because I haven’t found a Python implementation anywhere). It’s released under a BSD license with attribution, meaning that I’d like it if my name was mentioned where it is used :)

def levenshtein_distance(first, second):
    """Find the Levenshtein distance between two strings."""
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    distance_matrix = [[0] * second_length for x in range(first_length)]
    for i in range(first_length):
       distance_matrix[i][0] = i
    for j in range(second_length):
       distance_matrix[0][j]=j
    for i in xrange(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i-1][j] + 1
            insertion = distance_matrix[i][j-1] + 1
            substitution = distance_matrix[i-1][j-1]
            if first[i-1] != second[j-1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length-1][second_length-1]

Related material

Peter Norvig (of the AI book and Google fame) has an interesting article on spelling correction, which is a fairly typical use of levenshtein. In his case though he generates all of the words within edit distance 1 of a source word, which although very computationally intensive might be of interest as well.

His code, from the link, is below..


alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
n = len(word)
return set([word[0:i]+word[i+1:] for i in range(n)] + # deletion
[word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
[word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
[word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet]) # insertion

Submitted by Ross (not verified) on Wed, 23/04/2008 - 11:20.
Re: Norvig

Oh, I remember reading that, it was indeed very interesting, thanks for the link.
---
Vidi, Vici, Veni.

Submitted by Poromenos on Wed, 23/04/2008 - 12:43.
Another option

Just wanted to leave a link to a page I found with an implementation in multiple languages;

http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#Python

You might want to consider checking it out, and altering your use of "range" to either xrange() or enumerate().

As you mentioned, if it's comparing small stuff then it's a minimal impact but an iterator would be better then building the entire array into memory.

Mostly just for posterity sake and to say thanks for sharing!

Submitted by jay (not verified) on Fri, 27/02/2009 - 17:17.
Hi, your code does not work,

Hi, your code does not work, e.g.
print levenshtein_distance("abcdefghijk","cdefghijklm")
returns 2
4 would be correct.
The matrix is not correctly initialised.

Submitted by Stephan (not verified) on Thu, 23/04/2009 - 17:29.
Re: Error

You are, of course, correct. The code has been fixed, thank you for the report.

Submitted by Poromenos on Thu, 23/04/2009 - 17:52.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Google ads

(You can disable these if you log in)