
python-text-distance


A python implementation of a variety of text distance and similarity metrics.


Install

Requirements: Python >= 3.3

Install Command: pip install pytextdist


How to use

The functions in this package take two strings as input and return the distance/similarity metric between them. Preprocessing of the strings is built into the functions with recommended defaults; to change it, see Customize Preprocessing.


Modules

Edit Distance

By default, functions in this module treat a single character as the unit of editing.

Levenshtein Distance & Similarity: edit with insertion, deletion, and substitution

from pytextdist.edit_distance import levenshtein_distance, levenshtein_similarity

str_a = 'kitten'
str_b = 'sitting'
dist = levenshtein_distance(str_a,str_b)
simi = levenshtein_similarity(str_a,str_b)
print(f"Levenshtein Distance:{dist:.0f}\nLevenshtein Similarity:{simi:.2f}")

>> Levenshtein Distance:3
>> Levenshtein Similarity:0.57
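
The similarity above is consistent with 1 - distance / max(len_a, len_b) = 1 - 3/7 ≈ 0.57. For reference, a minimal dynamic-programming sketch of the distance itself (an illustration, not pytextdist's implementation):

def levenshtein(a, b):
    prev = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from the current prefix of a to ''
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein('kitten', 'sitting'))

>> 3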

Longest Common Subsequence Distance & Similarity: edit with insertion and deletion

from pytextdist.edit_distance import lcs_distance, lcs_similarity

str_a = 'kitten'
str_b = 'sitting'
dist = lcs_distance(str_a,str_b)
simi = lcs_similarity(str_a,str_b)
print(f"LCS Distance:{dist:.0f}\nLCS Similarity:{simi:.2f}")

>> LCS Distance:5
>> LCS Similarity:0.62
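
These values are consistent with distance = len_a + len_b - 2·LCS and similarity = 2·LCS / (len_a + len_b), where the longest common subsequence here is 'ittn' (length 4). A minimal sketch of the LCS-length computation (illustration only):

def lcs_length(a, b):
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            # extend the match or carry the best neighbor forward
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

l = lcs_length('kitten', 'sitting')
print(len('kitten') + len('sitting') - 2 * l, f"{2 * l / 13:.2f}")

>> 5 0.62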

Damerau-Levenshtein Distance & Similarity: edit with insertion, deletion, substitution, and transposition of two adjacent units

from pytextdist.edit_distance import damerau_levenshtein_distance, damerau_levenshtein_similarity

str_a = 'kitten'
str_b = 'sitting'
dist = damerau_levenshtein_distance(str_a,str_b)
simi = damerau_levenshtein_similarity(str_a,str_b)
print(f"Damerau-Levenshtein Distance:{dist:.0f}\nDamerau-Levenshtein Similarity:{simi:.2f}")

>> Damerau-Levenshtein Distance:3
>> Damerau-Levenshtein Similarity:0.57
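
Since 'kitten' → 'sitting' involves no adjacent transposition, the values match plain Levenshtein. A pair differing by one swap shows the difference (expected outputs of the standard metrics, shown for illustration):

from pytextdist.edit_distance import levenshtein_distance, damerau_levenshtein_distance

print(levenshtein_distance('abcd', 'abdc'))          # two substitutions
print(damerau_levenshtein_distance('abcd', 'abdc'))  # one transposition

>> 2
>> 1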

Hamming Distance & Similarity: edit with substitution; note that the Hamming metric only works for strings of the same length

from pytextdist.edit_distance import hamming_distance, hamming_similarity

str_a = 'kittens'
str_b = 'sitting'
dist = hamming_distance(str_a,str_b)
simi = hamming_similarity(str_a,str_b)
print(f"Hamming Distance:{dist:.0f}\nHamming Similarity:{simi:.2f}")

>> Hamming Distance:3
>> Hamming Similarity:0.57
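
Hamming distance simply counts the positions at which the strings differ; the similarity above is consistent with 1 - distance / len = 1 - 3/7 ≈ 0.57. A one-line sketch (illustration only):

print(sum(x != y for x, y in zip('kittens', 'sitting')))

>> 3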

Jaro & Jaro-Winkler Similarity: based on matching characters and transpositions; Jaro-Winkler additionally rewards a common prefix

from pytextdist.edit_distance import jaro_similarity, jaro_winkler_similarity

str_a = 'sitten'
str_b = 'sitting'
simi_j = jaro_similarity(str_a,str_b)
simi_jw = jaro_winkler_similarity(str_a,str_b)
print(f"Jaro Similarity:{simi_j:.2f}\nJaro-Winkler Similarity:{simi_jw:.2f}")

>> Jaro Similarity:0.85
>> Jaro-Winkler Similarity:0.91
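
Jaro-Winkler rewards a shared prefix on top of the Jaro score: sim_jw = sim_j + l * p * (1 - sim_j), with prefix length l capped at 4 and scaling factor p = 0.1 (the usual default; assumed here). The printed values are consistent with this:

sim_j = 0.85   # Jaro similarity from above
l, p = 4, 0.1  # shared prefix 'sitt'
print(f"{sim_j + l * p * (1 - sim_j):.2f}")

>> 0.91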

Vector Similarity

By default, functions in this module use unigrams (single words) to build vectors and calculate similarity. Set n to another value to use n-grams (n contiguous words) instead.

Cosine Similarity

from pytextdist.vector_similarity import cosine_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = cosine_similarity(phrase_a, phrase_b, n=1)
simi_2 = cosine_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Cosine Similarity:{simi_1:.2f}\nBigram Cosine Similarity:{simi_2:.2f}")

>> Unigram Cosine Similarity:0.65
>> Bigram Cosine Similarity:0.38
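
A sketch of the underlying computation, assuming n-gram count vectors over the default-preprocessed tokens (this reproduces the unigram value above):

from collections import Counter
from math import sqrt

def cosine(tokens_a, tokens_b):
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = lambda v: sqrt(sum(c * c for c in v.values()))
    return dot / (norm(va) * norm(vb))

a = ('for paperwork reduction act notice see your tax return instructions ' * 2).split()
b = 'for disclosure privacy act and paperwork reduction act notice see separate instructions form'.split()
print(f"{cosine(a, b):.2f}")

>> 0.65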

Jaccard Similarity

from pytextdist.vector_similarity import jaccard_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = jaccard_similarity(phrase_a, phrase_b, n=1)
simi_2 = jaccard_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Jaccard Similarity:{simi_1:.2f}\nBigram Jaccard Similarity:{simi_2:.2f}")

>> Unigram Jaccard Similarity:0.47
>> Bigram Jaccard Similarity:0.22
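
These values are consistent with Jaccard over the sets of unique n-grams, |A ∩ B| / |A ∪ B| (sketch, using the default-preprocessed unigrams):

A = set('for paperwork reduction act notice see your tax return instructions'.split())
B = set('for disclosure privacy act and paperwork reduction act notice see separate instructions form'.split())
print(f"{len(A & B) / len(A | B):.2f}")  # 7 shared out of 15 unique

>> 0.47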

Sorensen Dice Similarity

from pytextdist.vector_similarity import sorensen_dice_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = sorensen_dice_similarity(phrase_a, phrase_b, n=1)
simi_2 = sorensen_dice_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Sorensen Dice Similarity:{simi_1:.2f}\nBigram Sorensen Dice Similarity:{simi_2:.2f}")

>> Unigram Sorensen Dice Similarity:0.64
>> Bigram Sorensen Dice Similarity:0.36
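
These values are consistent with the Sorensen-Dice formula 2·|A ∩ B| / (|A| + |B|) over the sets of unique n-grams: the phrases share 7 of their 10 and 12 unique unigrams, giving 14/22 ≈ 0.64.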

Q-Gram Similarity

from pytextdist.vector_similarity import qgram_similarity

phrase_a = 'For Paperwork Reduction Act Notice, see your tax return instructions. For Paperwork Reduction Act Notice, see your tax return instructions.'
phrase_b = 'For Disclosure, Privacy Act, and Paperwork Reduction Act Notice, see separate instructions. Form 1040'
simi_1 = qgram_similarity(phrase_a, phrase_b, n=1)
simi_2 = qgram_similarity(phrase_a, phrase_b, n=2)
print(f"Unigram Q-Gram Similarity:{simi_1:.2f}\nBigram Q-Gram Similarity:{simi_2:.2f}")

>> Unigram Q-Gram Similarity:0.32
>> Bigram Q-Gram Similarity:0.15
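
These values are consistent with a generalized (multiset) Jaccard over the n-gram counters: the sum of minimum counts divided by the sum of maximum counts. A sketch using the default-preprocessed tokens:

from collections import Counter

def qgram_sim(ca, cb):
    keys = set(ca) | set(cb)
    return sum(min(ca[k], cb[k]) for k in keys) / sum(max(ca[k], cb[k]) for k in keys)

ca = Counter(('for paperwork reduction act notice see your tax return instructions ' * 2).split())
cb = Counter('for disclosure privacy act and paperwork reduction act notice see separate instructions form'.split())
print(f"{qgram_sim(ca, cb):.2f}")

>> 0.32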

Customize Preprocessing

All functions first apply pytextdist.preprocessing.phrase_preprocessing to clean the input strings and convert them to lists of tokens.

  • When grain="char" - remove specific characters from the string and convert it to a list of characters

    The following boolean parameters control which characters are removed/changed (all True by default); an example that overrides them appears at the end of this section:

    - ignore_non_alnumspc: whether to remove all characters that are not alphanumeric or space
    - ignore_space: whether to remove all spaces
    - ignore_numeric: whether to remove all numeric characters
    - ignore_case: whether to convert all alphabetic characters to lower case

    Example:

    from pytextdist.preprocessing import phrase_preprocessing
    
    before = 'AI Top-50'
    after = phrase_preprocessing(before, grain='char')
    print(after)
    
    >> ['a', 'i', 't', 'o', 'p']
    
  • When grain="word" - convert the string to a list of words and remove specific characters from the words

    The string is first split into words on single spaces; then the following boolean parameters control which characters are removed/changed from each word (all True by default):

    - ignore_non_alnumspc: whether to remove all characters that are not alphanumeric or space
    - ignore_numeric: whether to remove all numeric characters
    - ignore_case: whether to convert all alphabetic characters to lower case

    Example:

    from pytextdist.preprocessing import phrase_preprocessing
    
    before = 'AI Top-50'
    after = phrase_preprocessing(before, grain='word')
    print(after)
    
    >> ['ai', 'top']
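
    An example that overrides the default flags (here with grain='char'; this assumes the booleans listed above are keyword arguments of phrase_preprocessing, and the output shown is the expected result):

    from pytextdist.preprocessing import phrase_preprocessing

    before = 'AI Top-50'
    after = phrase_preprocessing(before, grain='char', ignore_space=False,
                                 ignore_numeric=False, ignore_case=False)
    print(after)

    >> ['A', 'I', ' ', 'T', 'o', 'p', '5', '0']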
    

Functions in the vector similarity module then apply pytextdist.preprocessing.ngram_counter to the list returned by pytextdist.preprocessing.phrase_preprocessing.

  • Convert a list of tokens to a counter of the n-grams

    The following parameter controls the n used for n-grams (1 by default):

    - n: number of contiguous items to use to form a sequence

    Example:

    from pytextdist.preprocessing import phrase_preprocessing, ngram_counter
    
    before = 'AI Top-50 Company'
    after = phrase_preprocessing(before, grain='word')
    print(after)
    ngram_cnt_1 = ngram_counter(after, n=1)
    print(ngram_cnt_1)
    ngram_cnt_2 = ngram_counter(after, n=2)
    print(ngram_cnt_2)
    
    >> ['ai', 'top', 'company']
    >> Counter({'ai': 1, 'top': 1, 'company': 1})
    >> Counter({'ai top': 1, 'top company': 1})
    
