Computationally measuring similarity of terms with 6 algorithms

NLTK offers many ways to measure similarity and difference between terms. None is simpler to implement than the Levenshtein edit distance, but in many ways this algorithm is grossly insufficient, because it compares spelling only and takes no account of a word’s meaning or sense (at all!). For accuracy, I’ve found Wu-Palmer to be the most reliable all-around, though even it has some not-too-obvious limitations. This blog post shows how each algorithm stacks up when comparing the word yell with some semantically adjacent verbs. Python code is attached.
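
As a rough illustration of that contrast, here is a toy sketch in plain Python. It is not the NLTK-based code this post refers to. The `levenshtein` function is a standard dynamic-programming edit distance. The `wu_palmer` function applies the usual Wu-Palmer formula, 2 × depth(LCS) / (depth(a) + depth(b)), to a tiny hand-made hierarchy. `TOY_PARENT` and every node in it are hypothetical stand-ins invented for this example, not WordNet data. In NLTK itself, the corresponding calls are `nltk.edit_distance` and the `wup_similarity` method on WordNet synsets.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


# HYPOTHETICAL toy hierarchy (child -> parent). Assumed for illustration only;
# it is not taken from WordNet.
TOY_PARENT = {
    "communicate": "act",
    "utter": "communicate",
    "yell": "utter",
    "shout": "utter",
    "ring": "act",
    "bell": "ring",
}


def path_to_root(node: str) -> list:
    """Return [node, parent, ..., root] by following TOY_PARENT links."""
    path = [node]
    while path[-1] in TOY_PARENT:
        path.append(TOY_PARENT[path[-1]])
    return path


def depth(node: str) -> int:
    """Depth in the toy hierarchy, counting the root as depth 1."""
    return len(path_to_root(node))


def wu_palmer(a: str, b: str) -> float:
    """Wu-Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # The first node on a's path that is also an ancestor of b is the
    # deepest common ancestor (the least common subsumer, LCS).
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


for other in ("bell", "shout"):
    print(other, levenshtein("yell", other), round(wu_palmer("yell", other), 3))
```

In this toy setup, "bell" is only one edit away from "yell" but shares nothing with it except the root of the hierarchy, while "shout" is five edits away yet sits right next to "yell" under the same parent. The exact scores depend entirely on the made-up hierarchy, so treat them as a picture of the idea rather than as real measurements.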