Compute distance between the two texts.
TextDistance
TextDistance is a Python library for comparing the distance between two or more sequences using many algorithms.
Features:
 30+ algorithms
 Pure python implementation
 Simple usage
 Comparing more than two sequences
 Some algorithms have more than one implementation in one class.
 Optional numpy usage for maximum speed.
Algorithms
Edit based
Algorithm  Class  Functions

Hamming  Hamming  hamming
MLIPNS  MLIPNS  mlipns
Levenshtein  Levenshtein  levenshtein
Damerau-Levenshtein  DamerauLevenshtein  damerau_levenshtein
Jaro-Winkler  JaroWinkler  jaro_winkler, jaro
Strcmp95  StrCmp95  strcmp95
Needleman-Wunsch  NeedlemanWunsch  needleman_wunsch
Gotoh  Gotoh  gotoh
Smith-Waterman  SmithWaterman  smith_waterman
Token based
Algorithm  Class  Functions

Jaccard index  Jaccard  jaccard
Sørensen–Dice coefficient  Sorensen  sorensen, sorensen_dice, dice
Tversky index  Tversky  tversky
Overlap coefficient  Overlap  overlap
Tanimoto distance  Tanimoto  tanimoto
Cosine similarity  Cosine  cosine
Monge-Elkan  MongeElkan  monge_elkan
Bag distance  Bag  bag
Sequence based
Algorithm  Class  Functions

Longest common subsequence similarity  LCSSeq  lcsseq
Longest common substring similarity  LCSStr  lcsstr
Ratcliff-Obershelp similarity  RatcliffObershelp  ratcliff_obershelp
Compression based
Normalized compression distance with different compression algorithms.
Classic compression algorithms:
Algorithm  Class  Function

Arithmetic coding  ArithNCD  arith_ncd
RLE  RLENCD  rle_ncd
BWT RLE  BWTRLENCD  bwtrle_ncd
Normal compression algorithms:
Algorithm  Class  Function

Square Root  SqrtNCD  sqrt_ncd
Entropy  EntropyNCD  entropy_ncd
Work-in-progress algorithms that compare two strings as arrays of bits:
Algorithm  Class  Function

BZ2  BZ2NCD  bz2_ncd
LZMA  LZMANCD  lzma_ncd
ZLib  ZLIBNCD  zlib_ncd
See blog post for more details about NCD.
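To make the idea behind normalized compression distance concrete, here is a minimal sketch using the standard library's zlib. It uses the common textbook formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The helper names compressed_size and ncd are illustrative only, and the result is not guaranteed to match what ZLIBNCD or any other class in this library returns.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input."""
    return len(zlib.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Textbook normalized compression distance (illustrative, not the library's code)."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    # zlib output is never empty, so the denominator is always positive.
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    repetitive = b"the quick brown fox " * 50
    unrelated = bytes(range(256)) * 4
    # Concatenating a text with itself adds little new information,
    # so its NCD is expected to be lower than for unrelated data.
    print(ncd(repetitive, repetitive))
    print(ncd(repetitive, unrelated))
```

Values close to 0 suggest the inputs share most of their information; values close to 1 suggest they share little. Very short inputs are dominated by compressor overhead, so the measure is more meaningful on longer sequences.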
Phonetic
Algorithm  Class  Functions

MRA  MRA  mra
Editex  Editex  editex
Simple
Algorithm  Class  Functions

Prefix similarity  Prefix  prefix
Postfix similarity  Postfix  postfix
Length distance  Length  length
Identity similarity  Identity  identity
Matrix similarity  Matrix  matrix
Installation
Stable
Only pure python implementation:
pip install textdistance
With extra libraries for maximum speed:
pip install "textdistance[extras]"
With all libraries (required for benchmarking and testing):
pip install "textdistance[benchmark]"
With algorithm specific extras:
pip install "textdistance[Hamming]"
Algorithms with available extras: DamerauLevenshtein, Hamming, Jaro, JaroWinkler, Levenshtein.
Dev
Via pip:
pip install -e git+https://github.com/life4/textdistance.git#egg=textdistance
Or clone repo and install with some extras:
git clone https://github.com/life4/textdistance.git
cd textdistance
pip install -e ".[benchmark]"
Usage
All algorithms have 2 interfaces:
 Class with algorithm-specific params for customizing.
 Class instance with default params for quick and simple usage.
All algorithms have some common methods:
 .distance(*sequences): calculate the distance between sequences.
 .similarity(*sequences): calculate the similarity for sequences.
 .maximum(*sequences): maximum possible value for distance and similarity. For any sequence: distance + similarity == maximum.
 .normalized_distance(*sequences): normalized distance between sequences. The return value is a float between 0 and 1, where 0 means equal and 1 means totally different.
 .normalized_similarity(*sequences): normalized similarity for sequences. The return value is a float between 0 and 1, where 0 means totally different and 1 means equal.
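The relationship between these methods can be sketched in plain Python. The functions below are a hand-written illustration for Hamming distance, assuming that positions beyond the shorter sequence count as mismatches; they are not the library's implementation, and the function names are chosen here only to mirror the method names.

```python
def distance(s1, s2):
    """Hamming distance: number of positions where the sequences differ."""
    mismatches = sum(a != b for a, b in zip(s1, s2))
    # Assumption: leftover elements of the longer sequence count as mismatches.
    return mismatches + abs(len(s1) - len(s2))


def maximum(s1, s2):
    """Largest value the distance (or the similarity) can take for these inputs."""
    return max(len(s1), len(s2))


def similarity(s1, s2):
    """Defined so that distance + similarity == maximum."""
    return maximum(s1, s2) - distance(s1, s2)


def normalized_distance(s1, s2):
    """Distance scaled to [0, 1]; 0 means equal."""
    m = maximum(s1, s2)
    return distance(s1, s2) / m if m else 0.0


def normalized_similarity(s1, s2):
    """Similarity scaled to [0, 1]; 1 means equal."""
    return 1.0 - normalized_distance(s1, s2)


if __name__ == "__main__":
    print(distance("test", "text"))               # 1
    print(similarity("test", "text"))             # 3
    print(normalized_distance("test", "text"))    # 0.25
    print(normalized_similarity("test", "text"))  # 0.75
```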
Most common init arguments:
 qval: q-value for splitting sequences into q-grams. Possible values:
  1 (default): compare sequences by chars.
  2 or more: transform sequences to q-grams.
  None: split sequences by words.
 as_set: for token-based algorithms:
  True: t and ttt are equal.
  False (default): t and ttt are different.
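As a rough illustration of what these two arguments mean, the sketch below splits a string into q-grams and compares tokens either as sets or as multisets using a Jaccard-style ratio. The helpers tokenize and jaccard are hypothetical names written for this example; they only approximate the idea and may differ from the library's internals and exact results.

```python
from collections import Counter


def tokenize(sequence, qval=1):
    """Split a string into tokens according to qval."""
    if qval is None:
        return sequence.split()      # words
    if qval == 1:
        return list(sequence)        # single characters
    # Overlapping q-grams of length qval.
    return [sequence[i:i + qval] for i in range(len(sequence) - qval + 1)]


def jaccard(s1, s2, qval=1, as_set=False):
    """Jaccard-style similarity over tokens (illustrative only)."""
    t1, t2 = tokenize(s1, qval), tokenize(s2, qval)
    if as_set:
        a, b = set(t1), set(t2)
        union = len(a | b)
        return len(a & b) / union if union else 1.0
    # Multiset comparison: repeated tokens are counted.
    c1, c2 = Counter(t1), Counter(t2)
    union = sum((c1 | c2).values())
    return sum((c1 & c2).values()) / union if union else 1.0


if __name__ == "__main__":
    print(tokenize("test", qval=2))            # ['te', 'es', 'st']
    print(jaccard("t", "ttt", as_set=True))    # 1.0
    print(jaccard("t", "ttt", as_set=False))   # one shared token out of three
```

With as_set=True the repeated t collapses into a single element, so both inputs look identical; with as_set=False the repetition is kept and the inputs differ.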
Examples
For example, Hamming distance:
import textdistance
textdistance.hamming('test', 'text')
# 1
textdistance.hamming.distance('test', 'text')
# 1
textdistance.hamming.similarity('test', 'text')
# 3
textdistance.hamming.normalized_distance('test', 'text')
# 0.25
textdistance.hamming.normalized_similarity('test', 'text')
# 0.75
textdistance.Hamming(qval=2).distance('test', 'text')
# 2
All other algorithms have the same interface.
Articles
A few articles with examples of how to use textdistance in the real world:
 Guide to Fuzzy Matching with Python
 String similarity — the basic know your algorithms guide!
 Normalized compression distance
Extra libraries
For the main algorithms, textdistance tries to call known external libraries (fastest first) if they are available (installed on your system) and applicable (the external implementation can compare the given type of sequences). Install textdistance with extras to enable this feature.
You can disable this by passing the external=False argument on init:
import textdistance
hamming = textdistance.Hamming(external=False)
hamming('text', 'testit')
# 3
Supported libraries:
Algorithms:
 DamerauLevenshtein
 Hamming
 Jaro
 JaroWinkler
 Levenshtein
Benchmarks
Without extras installation:
algorithm  library  time 

DamerauLevenshtein  rapidfuzz  0.00312 
DamerauLevenshtein  jellyfish  0.00591 
DamerauLevenshtein  pyxdameraulevenshtein  0.03335 
DamerauLevenshtein  textdistance  0.83524 
Hamming  Levenshtein  0.00038 
Hamming  rapidfuzz  0.00044 
Hamming  jellyfish  0.00091 
Hamming  distance  0.00812 
Hamming  textdistance  0.03531 
Jaro  rapidfuzz  0.00092 
Jaro  jellyfish  0.00191 
Jaro  textdistance  0.07365 
JaroWinkler  rapidfuzz  0.00094 
JaroWinkler  jellyfish  0.00195 
JaroWinkler  textdistance  0.07501 
Levenshtein  rapidfuzz  0.00099 
Levenshtein  Levenshtein  0.00122 
Levenshtein  jellyfish  0.00254 
Levenshtein  pylev  0.15688 
Levenshtein  distance  0.28669 
Levenshtein  textdistance  0.53902 
Total: 24 libs.
Yeah, so slow. Use TextDistance in production only with extras.
TextDistance uses the benchmark results to optimize algorithm selection and tries to call the fastest external library first (if possible).
You can run benchmark manually on your system:
pip install "textdistance[benchmark]"
python3 -m textdistance.benchmark
TextDistance shows a benchmark results table for your system and saves the library priorities into the libraries.json file in TextDistance's folder. This file is then used by textdistance to call the fastest algorithm implementation. A default libraries.json is already included in the package.
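The "fastest external library first, pure Python as a fallback" idea can be sketched as follows. This is only an illustration of the pattern under simple assumptions: the function first_available, the candidate list format, and the module name used in the demo are all hypothetical and do not reflect how textdistance reads libraries.json or selects implementations internally.

```python
import importlib


def pure_hamming(s1, s2):
    """Pure-Python fallback: count differing positions plus the length difference."""
    return sum(a != b for a, b in zip(s1, s2)) + abs(len(s1) - len(s2))


def first_available(candidates, fallback):
    """Return the first importable callable from candidates, else the fallback.

    candidates is a list of (module_name, function_name) pairs,
    ordered from fastest to slowest (a hypothetical priority list).
    """
    for module_name, func_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # library is not installed, try the next one
        func = getattr(module, func_name, None)
        if callable(func):
            return func
    return fallback


if __name__ == "__main__":
    # The module below is deliberately a placeholder that does not exist,
    # so the pure-Python fallback is selected.
    hamming = first_available(
        [("some_fast_lib_that_is_not_installed", "hamming")],
        pure_hamming,
    )
    print(hamming("test", "text"))  # 1
```

Keeping the priority list as data rather than code is what allows a benchmark run to reorder it for a particular system.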
Running tests
All you need is task. See Taskfile.yml for the list of available commands. For example, to run tests including third-party library usage, execute task pytest-external:run.
Contributing
PRs are welcome!
 Found a bug? Fix it!
 Want to add more algorithms? Sure! Just make it with the same interface as other algorithms in the lib and add some tests.
 Can make something faster? Great! Just avoid external dependencies and remember that everything should work not only with strings.
 Something else that you think is good? Do it! Just make sure that CI passes and everything from the README is still applicable (interface, features, and so on).
 Have no time to code? Tell your friends and subscribers about textdistance. More users, more contributions, more amazing features.
Thank you :heart: