
python-text-distance


A python implementation of a variety of text distance and similarity metrics.


Install

Requirements: Python >= 3.3

Install Command: pip install pytextdist


How to use

The functions in this package take two strings as input and return the distance/similarity metric between them. Preprocessing of the strings is built into the functions with recommended defaults; to change it, see Customize Preprocessing.


Modules

Edit Distance

By default, functions in this module treat a single character as the unit of editing.

Levenshtein Distance & Similarity: edit with insertion, deletion, and substitution

from pytextdist.edit_distance import levenshtein_distance, levenshtein_similarity

str_a = 'kitten'
str_b = 'sitting'
dist = levenshtein_distance(str_a,str_b)
simi = levenshtein_similarity(str_a,str_b)
print(f"Levenshtein Distance:{dist:.0f}\nLevenshtein Similarity:{simi:.2f}")

>> Levenshtein Distance:3
>> Levenshtein Similarity:0.57
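
The similarity above is consistent with 1 - distance / max(len_a, len_b) = 1 - 3/7 ≈ 0.57. For reference, a minimal dynamic-programming sketch of the distance itself (an illustration, not pytextdist's implementation):

def levenshtein(a, b):
    prev = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from the current prefix of a to ''
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein('kitten', 'sitting'))

>> 3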

Longest Common Subsequence Distance & Similarity: edit with insertion and deletion

from pytextdist.edit_distance import lcs_distance, lcs_similarity

str_a = 'kitten'
str_b = 'sitting'
dist = lcs_distance(str_a,str_b)
simi = lcs_similarity(str_a,str_b)
print(f"LCS Distance:{dist:.0f}\nLCS Similarity:{simi:.2f}")

>> LCS Distance:5
>> LCS Similarity:0.62
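
These values are consistent with distance = len_a + len_b - 2·LCS and similarity = 2·LCS / (len_a + len_b), where the longest common subsequence here is 'ittn' (length 4). A minimal sketch of the LCS-length computation (illustration only):

def lcs_length(a, b):
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            # extend the match or carry the best neighbor forward
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

l = lcs_length('kitten', 'sitting')
print(len('kitten') + len('sitting') - 2 * l, f"{2 * l / 13:.2f}")

>> 5 0.62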

Damerau-Levenshtein Distance & Similarity: edit with insertion, deletion, substitution, and transposition of two adjacent units

from pytextdist.edit_distance import damerau_levenshtein_distance, damerau_levenshtein_similarity

str_a = 'kitten'
str_b = 'sitting'
dist = damerau_levenshtein_distance(str_a,str_b)
simi = damerau_levenshtein_similarity(str_a,str_b)
print(f"Damerau-Levenshtein Distance:{dist:.0f}\nDamerau-Levenshtein Similarity:{simi:.2f}")

>> Damerau-Levenshtein Distance:3
>> Damerau-Levenshtein Similarity:0.57
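
Since 'kitten' → 'sitting' involves no adjacent transposition, the values match plain Levenshtein. A pair differing by one swap shows the difference (expected outputs of the standard metrics, shown for illustration):

from pytextdist.edit_distance import levenshtein_distance, damerau_levenshtein_distance

print(levenshtein_distance('abcd', 'abdc'))          # two substitutions
print(damerau_levenshtein_distance('abcd', 'abdc'))  # one transposition

>> 2
>> 1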

Hamming Distance & Similarity: edit with substitution; note that the Hamming metric only works for strings of the same length

from pytextdist.edit_distance import hamming_distance, hamming_similarity

str_a = 'kittens'
str_b = 'sitting'
dist = hamming_distance(str_a,str_b)
simi = hamming_similarity(str_a,str_b)
print(f"Hamming Distance:{dist:.0f}\nHamming Similarity:{simi:.2f}")

>> Hamming Distance:3
>> Hamming Similarity:0.57
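
Hamming distance simply counts the positions at which the strings differ; the similarity above is consistent with 1 - distance / len = 1 - 3/7 ≈ 0.57. A one-line sketch (illustration only):

print(sum(x != y for x, y in zip('kittens', 'sitting')))

>> 3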

Jaro & Jaro-Winkler Similarity: based on matching characters and transpositions; Jaro-Winkler additionally rewards a common prefix

from pytextdist.edit_distance import jaro_similarity, jaro_winkler_similarity

str_a = 'sitten'
str_b = 'sitting'
simi_j = jaro_similarity(str_a,str_b)
simi_jw = jaro_winkler_similarity(str_a,str_b)
print(f"Jaro Similarity:{simi_j:.2f}\nJaro-Winkler Similarity:{simi_jw:.2f}")

>> Jaro Similarity:0.85
>> Jaro-Winkler Similarity:0.91
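
Jaro-Winkler rewards a shared prefix on top of the Jaro score: sim_jw = sim_j + l * p * (1 - sim_j), with prefix length l capped at 4 and scaling factor p = 0.1 (the usual default; assumed here). The printed values are consistent with this:

sim_j = 0.85   # Jaro similarity from above
l, p = 4, 0.1  # shared prefix 'sitt'
print(f"{sim_j + l * p * (1 - sim_j):.2f}")

>> 0.91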

Vector Similarity

By default, functions in this module use unigrams (single words) to build vectors and calculate similarity. Set n to another value to use n-grams (n contiguous words) instead.

Cosine Similarity

from pytextdist.vector_similarity import cosine_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = cosine_similarity(phrase_a, phrase_b, n=1)
simi_2 = cosine_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Cosine Similarity:{simi_1:.2f}\nBigram Cosine Similarity:{simi_2:.2f}")

>> Unigram Cosine Similarity:0.65
>> Bigram Cosine Similarity:0.38
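
A sketch of the underlying computation, assuming n-gram count vectors over the default-preprocessed tokens (this reproduces the unigram value above):

from collections import Counter
from math import sqrt

def cosine(tokens_a, tokens_b):
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = lambda v: sqrt(sum(c * c for c in v.values()))
    return dot / (norm(va) * norm(vb))

a = ('for paperwork reduction act notice see your tax return instructions ' * 2).split()
b = 'for disclosure privacy act and paperwork reduction act notice see separate instructions form'.split()
print(f"{cosine(a, b):.2f}")

>> 0.65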

Jaccard Similarity

from pytextdist.vector_similarity import jaccard_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = jaccard_similarity(phrase_a, phrase_b, n=1)
simi_2 = jaccard_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Jaccard Similarity:{simi_1:.2f}\nBigram Jaccard Similarity:{simi_2:.2f}")

>> Unigram Jaccard Similarity:0.47
>> Bigram Jaccard Similarity:0.22
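
These values are consistent with Jaccard over the sets of unique n-grams, |A ∩ B| / |A ∪ B| (sketch, using the default-preprocessed unigrams):

A = set('for paperwork reduction act notice see your tax return instructions'.split())
B = set('for disclosure privacy act and paperwork reduction act notice see separate instructions form'.split())
print(f"{len(A & B) / len(A | B):.2f}")  # 7 shared out of 15 unique

>> 0.47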

Sorensen Dice Similarity

from pytextdist.vector_similarity import sorensen_dice_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = sorensen_dice_similarity(phrase_a, phrase_b, n=1)
simi_2 = sorensen_dice_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Sorensen Dice Similarity:{simi_1:.2f}\nBigram Sorensen Dice Similarity:{simi_2:.2f}")

>> Unigram Sorensen Dice Similarity:0.64
>> Bigram Sorensen Dice Similarity:0.36
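
These values are consistent with the Sorensen-Dice formula 2·|A ∩ B| / (|A| + |B|) over the sets of unique n-grams: the phrases share 7 of their 10 and 12 unique unigrams, giving 14/22 ≈ 0.64.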

Q-Gram Similarity

from pytextdist.vector_similarity import qgram_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = qgram_similarity(phrase_a, phrase_b, n=1)
simi_2 = qgram_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Q-Gram Similarity:{simi_1:.2f}\nBigram Q-Gram Similarity:{simi_2:.2f}")

>> Unigram Q-Gram Similarity:0.32
>> Bigram Q-Gram Similarity:0.15
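
These values are consistent with a generalized (multiset) Jaccard over the n-gram counters: the sum of minimum counts divided by the sum of maximum counts. A sketch using the default-preprocessed tokens:

from collections import Counter

def qgram_sim(ca, cb):
    keys = set(ca) | set(cb)
    return sum(min(ca[k], cb[k]) for k in keys) / sum(max(ca[k], cb[k]) for k in keys)

ca = Counter(('for paperwork reduction act notice see your tax return instructions ' * 2).split())
cb = Counter('for disclosure privacy act and paperwork reduction act notice see separate instructions form'.split())
print(f"{qgram_sim(ca, cb):.2f}")

>> 0.32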

Customize Preprocessing

All functions first apply pytextdist.preprocessing.phrase_preprocessing to clean the input strings and convert them to lists of tokens.

  • When grain="char" - remove specific characters from the string and convert it to a list of characters

    The following boolean parameters control which characters are removed/changed (all True by default); an example that overrides them appears at the end of this section:

    - ignore_non_alnumspc: whether to remove all characters that are not alphanumeric or space
    - ignore_space: whether to remove all spaces
    - ignore_numeric: whether to remove all numeric characters
    - ignore_case: whether to convert all alphabetic characters to lower case

    Example:

    from pytextdist.preprocessing import phrase_preprocessing
    
    before = 'AI Top-50'
    after = phrase_preprocessing(before, grain='char')
    print(after)
    
    >> ['a', 'i', 't', 'o', 'p']
    
  • When grain="word" - convert the string to a list of words and remove specific characters from the words

    The string is first split into words on single spaces; then the following boolean parameters control which characters are removed/changed from each word (all True by default):

    - ignore_non_alnumspc: whether to remove all characters that are not alphanumeric or space
    - ignore_numeric: whether to remove all numeric characters
    - ignore_case: whether to convert all alphabetic characters to lower case

    Example:

    from pytextdist.preprocessing import phrase_preprocessing
    
    before = 'AI Top-50'
    after = phrase_preprocessing(before, grain='word')
    print(after)
    
    >> ['ai', 'top']
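
    An example that overrides the default flags (here with grain='char'; this assumes the booleans listed above are keyword arguments of phrase_preprocessing, and the output shown is the expected result):

    from pytextdist.preprocessing import phrase_preprocessing

    before = 'AI Top-50'
    after = phrase_preprocessing(before, grain='char', ignore_space=False,
                                 ignore_numeric=False, ignore_case=False)
    print(after)

    >> ['A', 'I', ' ', 'T', 'o', 'p', '5', '0']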
    

Functions in the vector similarity module then apply pytextdist.preprocessing.ngram_counter to the list returned by pytextdist.preprocessing.phrase_preprocessing.

  • Convert a list of tokens to a counter of the n-grams

    The following parameter controls the n used for n-grams (1 by default):

    - n: number of contiguous items to use to form a sequence

    Example:

    from pytextdist.preprocessing import phrase_preprocessing, ngram_counter
    
    before = 'AI Top-50 Company'
    after = phrase_preprocessing(before, grain='word')
    print(after)
    ngram_cnt_1 = ngram_counter(after, n=1)
    print(ngram_cnt_1)
    ngram_cnt_2 = ngram_counter(after, n=2)
    print(ngram_cnt_2)
    
    >> ['ai', 'top', 'company']
    >> Counter({'ai': 1, 'top': 1, 'company': 1})
    >> Counter({'ai top': 1, 'top company': 1})
    
