gopalkoduri / string-matching

This Python code, given a term and a list of terms, returns the possible duplicates of that term in the list. The basic idea is to use edit distance and longest common subsequences, not just on immediate matches but also on the matches of the matches.

Repository contents: src, Modified BSD License, README

Purpose:
--------

This is a simple program for finding misspellings and alternative
spellings of a given word. It works with two string distance measures:
longest common subsequence and Damerau-Levenshtein distance.
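For readers unfamiliar with the two measures, here is a minimal sketch of
each in plain Python. It is not the code in src; the function names
lcs_length and osa_distance are made up for illustration, and the second
function implements the restricted (optimal string alignment) variant of
Damerau-Levenshtein distance, which may differ from the variant this
program uses.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the table
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching characters extend the best subsequence so far.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions, and transpositions of
    two adjacent characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ab" -> "ba".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

Either measure can be turned into a similarity score between 0 and 1 by
normalising with the string lengths; how this program normalises is not
stated here, so that step is left out of the sketch.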

Its core strength is that it searches not only for direct matches, but
also for the matches of those matches, and so on. This is particularly
useful in cases such as these:

Sowrashtram matches Sourashtram but not Saurashtram.
But Sourashtram matches Saurashtram.

Of course, loosening the threshold can help, but it also decreases
precision. The solution is therefore to search with "tight" parameters,
but with an extensive search mechanism.
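The idea can be sketched as follows. This is an illustration only, not
the code in src: find_duplicates and similar are hypothetical names,
difflib.SequenceMatcher from the standard library stands in for the
program's own similarity measures, and the meaning of depth (0 = direct
matches only, 1 = also matches of matches) is an assumption that may not
correspond to the recursion parameter.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold):
    """Stand-in similarity test; the real program uses its own measures."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def find_duplicates(term, terms, threshold=0.9, depth=1):
    """Collect matches of term, then matches of those matches, up to depth."""
    found = set()
    frontier = {term}  # terms whose neighbours are searched in this round
    for _ in range(depth + 1):
        next_frontier = set()
        for seed in frontier:
            for candidate in terms:
                if candidate == term or candidate in found:
                    continue
                if similar(seed, candidate, threshold):
                    found.add(candidate)
                    next_frontier.add(candidate)
        if not next_frontier:
            break  # nothing new was found, so deeper rounds cannot help
        frontier = next_frontier
    return sorted(found)


terms = ["Sourashtram", "Saurashtram", "Kalyani"]
direct_only = find_duplicates("Sowrashtram", terms, threshold=0.9, depth=0)
with_chaining = find_duplicates("Sowrashtram", terms, threshold=0.9, depth=1)
```

With the tight threshold of 0.9, direct_only contains only Sourashtram,
while with_chaining also picks up Saurashtram through Sourashtram, without
the threshold having to be loosened.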

Usage:
------

>>> import stringDuplicates
>>> terms = ["kopala", "gopal", "george", "mohammed", "arjuna"]
>>> stringDuplicates.stringDuplicates("gopala", terms, simThresh=0.8, recursion=1)
['gopal', 'kopala']

The two crucial parameters, following the description above, are
simThresh, the similarity threshold, and recursion, which controls how
far the search extends to matches of matches.


Contact Info:
-------------

Gopala Krishna Koduri
gopala.koduri -AT- gmail.com