Abydos NLP/IR library
Abydos
Abydos is a library of phonetic algorithms, string distance measures & metrics, stemmers, and string fingerprinters including:
- Phonetic algorithms
Robert C. Russell’s Index
American Soundex
Refined Soundex
Daitch-Mokotoff Soundex
Kölner Phonetik
NYSIIS
Match Rating Algorithm
Metaphone
Double Metaphone
Caverphone
Alpha Search Inquiry System
Fuzzy Soundex
Phonex
Phonem
Phonix
SfinxBis
phonet
Standardized Phonetic Frequency Code
Statistics Canada
Lein
Roger Root
Oxford Name Compression Algorithm (ONCA)
Eudex phonetic hash
Haase Phonetik
Reth-Schek Phonetik
FONEM
Parmar-Kumbharana
Davidson’s Consonant Code
SoundD
PSHP Soundex/Viewex Coding
an early version of Henry Code
Norphone
Dolby Code
Phonetic Spanish
Spanish Metaphone
MetaSoundex
SoundexBR
NRL English-to-phoneme
Beider-Morse Phonetic Matching
- String distance metrics
Levenshtein distance
Optimal String Alignment distance
Levenshtein-Damerau distance
Hamming distance
Tversky index
Sørensen–Dice coefficient & distance
Jaccard similarity coefficient & distance
overlap similarity & distance
Tanimoto coefficient & distance
Minkowski distance & similarity
Manhattan distance & similarity
Euclidean distance & similarity
Chebyshev distance
cosine similarity & distance
Jaro distance
Jaro-Winkler distance (incl. the strcmp95 algorithm variant)
Longest common substring
Ratcliff-Obershelp similarity & distance
Match Rating Algorithm similarity
Normalized Compression Distance (NCD) & similarity
Monge-Elkan similarity & distance
Matrix similarity
Needleman-Wunsch score
Smith-Waterman score
Gotoh score
Length similarity
Prefix, Suffix, and Identity similarity & distance
Modified Language-Independent Product Name Search (MLIPNS) similarity & distance
Bag distance
Editex distance
Eudex distances
Sift4 distance
Baystat distance & similarity
Typo distance
Indel distance
Synoname
- Stemmers
the Lovins stemmer
the Porter and Porter2 (Snowball English) stemmers
Snowball stemmers for German, Dutch, Norwegian, Swedish, and Danish
CLEF German, German plus, and Swedish stemmers
Caumanns’ German stemmer
UEA-Lite Stemmer
Paice-Husk Stemmer
Schinke Latin stemmer
S stemmer
- String Fingerprints
string fingerprint
q-gram fingerprint
phonetic fingerprint
Pollock & Zamora’s skeleton key
Pollock & Zamora’s omission key
Cisłak & Grabowski’s occurrence fingerprint
Cisłak & Grabowski’s occurrence halved fingerprint
Cisłak & Grabowski’s count fingerprint
Cisłak & Grabowski’s position fingerprint
Synoname Toolcode
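To give a feel for what a string fingerprint does, here is a minimal plain-Python sketch of a key-collision style fingerprint. The function name `string_fingerprint` and the exact normalization steps (lowercasing, accent folding, punctuation stripping, sorting unique tokens) are assumptions made for illustration; this is not Abydos's own implementation, which may normalize differently.

```python
import string
import unicodedata


def string_fingerprint(phrase):
    """Return a normalized key: lowercase, accent-folded, unique sorted tokens."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", phrase.lower())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces so it cannot glue tokens together.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    tokens = folded.translate(table).split()
    # Deduplicate and sort so word order and repetition do not matter.
    return " ".join(sorted(set(tokens)))


print(string_fingerprint("Smith, John"))  # john smith
print(string_fingerprint("john  SMITH"))  # john smith
```

Two differently written variants of the same name collapse to the same key, which is what makes fingerprints useful for clustering near-duplicates.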
Installation
Required libraries:
NumPy
deprecation
Optional libraries (all available on PyPI, some available on conda or conda-forge):
To install Abydos (master) from GitHub source:
git clone https://github.com/chrislit/abydos.git --recursive
cd abydos
python setup.py install
If your default python command calls Python 2.7 but you want to install for Python 3, you may instead need to call:
python3 setup.py install
To install Abydos (latest release) from PyPI using pip:
pip install abydos
To install from conda-forge:
conda install -c conda-forge abydos
It should run on Python 3.5-3.8.
Testing & Contributing
To run the whole test suite, just call tox:
tox
The tox setup has the following environments: black, py37, doctest, regression, fuzz, pylint, pydocstyle, flake8, doc8, docs, sloccount, badges, & build. So if you only want to generate documentation (in HTML, EPUB, & PDF formats), just call:
tox -e docs
In order to only run & generate Flake8 reports, call:
tox -e flake8
Contributions such as bug reports, PRs, suggestions, desired new features, etc. are welcome through GitHub Issues & Pull Requests.
Release History
0.5.0 (2020-01-10) ecgtheow
doi:10.5281/zenodo.3603514
Changes:
Support for Python 2.7 was removed.
0.4.1 (2020-01-07) distant dietrich
doi:10.5281/zenodo.3600548
Changes:
Support for Python 3.4 was removed. (3.4 reached end-of-life on March 18, 2019)
Fuzzy intersections were corrected to avoid over-counting partial intersection instances.
Levenshtein can now return an optimal alignment.
- Added the following distance measures:
Indice de Similitude-Guth (ISG)
INClusion Programme
Guth
Victorian Panel Study (VPS) score
LIG3 similarity
Discounted Levenshtein
Relaxed Hamming
String subsequence kernel (SSK) similarity
Phonetic edit distance
Henderson-Heron dissimilarity
Raup-Crick similarity
Millar’s binomial deviance dissimilarity
Morisita similarity
Horn-Morisita similarity
Clark’s coefficient of divergence
Chao’s Jaccard similarity
Chao’s Dice similarity
Cao’s CY similarity (CYs) and dissimilarity (CYd)
- Added the following fingerprint classes:
Taft’s Consonant coding
Taft’s Extract - letter list
Taft’s Extract - position & frequency
L.A. County Sheriff’s System
Library of Congress Cutter table encoding
- Added the following phonetic algorithms:
Ainsworth’s grapheme-to-phoneme
PHONIC
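The 0.4.1 changes above mention that Levenshtein can return an optimal alignment. The general idea can be sketched in plain Python as a traceback through the dynamic-programming matrix. The function name `levenshtein_alignment`, the unit edit costs, and the use of `-` as the gap symbol are assumptions for this sketch, not the Abydos API.

```python
def levenshtein_alignment(src, tar):
    """Return (distance, aligned_src, aligned_tar) using unit edit costs."""
    rows, cols = len(src) + 1, len(tar) + 1
    # dist[i][j] holds the edit distance between src[:i] and tar[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            sub_cost = 0 if src[i - 1] == tar[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,  # deletion
                dist[i][j - 1] + 1,  # insertion
                dist[i - 1][j - 1] + sub_cost,  # match or substitution
            )

    # Walk back from the bottom-right cell to recover one optimal alignment.
    i, j = len(src), len(tar)
    top, bottom = [], []
    while i > 0 or j > 0:
        diagonal_ok = i > 0 and j > 0
        sub_cost = 0 if diagonal_ok and src[i - 1] == tar[j - 1] else 1
        if diagonal_ok and dist[i][j] == dist[i - 1][j - 1] + sub_cost:
            top.append(src[i - 1])
            bottom.append(tar[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            top.append(src[i - 1])
            bottom.append("-")  # gap in the target: a deletion
            i -= 1
        else:
            top.append("-")  # gap in the source: an insertion
            bottom.append(tar[j - 1])
            j -= 1
    return dist[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


distance, aligned_src, aligned_tar = levenshtein_alignment("kitten", "sitting")
print(distance)  # 3
print(aligned_src)
print(aligned_tar)
```

Several alignments can share the same minimal cost; this sketch prefers the diagonal move, so it returns just one of them.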
0.4.0 (2019-05-30) dietrich
doi:10.5281/zenodo.3235034
Version 0.4.0 focuses on distance measures, adding 211 new measures. Attempts were made to provide normalized versions of measures that did not inherently range from 0 to 1. The other major focus was the addition of 12 tokenizers, in service of expanding distance measure options.
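As an illustration of what normalizing a measure to the 0 to 1 range can look like, the sketch below divides a raw Levenshtein distance by the length of the longer string. This is one common choice and an assumption of this example; the function names are hypothetical, and Abydos may normalize individual measures differently.

```python
def levenshtein(src, tar):
    """Return the unit-cost Levenshtein distance, keeping only two rows."""
    prev = list(range(len(tar) + 1))
    for i, s_ch in enumerate(src, 1):
        curr = [i]
        for j, t_ch in enumerate(tar, 1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + (s_ch != t_ch),  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def normalized_distance(src, tar):
    """Scale the raw distance into [0, 1] by the longer string's length."""
    if not src and not tar:
        return 0.0  # two empty strings are identical
    return levenshtein(src, tar) / max(len(src), len(tar))


def similarity(src, tar):
    """Return the complement of the normalized distance."""
    return 1.0 - normalized_distance(src, tar)


print(levenshtein("kitten", "sitting"))  # 3
print(normalized_distance("kitten", "sitting"))  # 3/7
print(similarity("abc", "abc"))  # 1.0
```

The raw distance can never exceed the longer string's length under unit costs, which is why dividing by it keeps the result within 0 and 1.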
Changes:
Support for Python 3.3 was dropped.
Deprecated functions that merely wrap class methods to maintain API compatibility, for removal in 0.6.0
- Added methods to ConfusionTable to return:
its internal representation
false negative rate
false omission rate
positive & negative likelihood ratios
diagnostic odds ratio
error rate
prevalence
Jaccard index
D-measure
Phi coefficient
joint, actual, & predicted entropies
mutual information
proficiency (uncertainty coefficient)
information gain ratio
dependency
lift
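Several of the statistics listed above follow directly from the four cells of a confusion table. The sketch below uses the standard textbook definitions; the function `confusion_stats` and its dictionary keys are hypothetical names for illustration and do not reflect the ConfusionTable method names.

```python
def confusion_stats(tp, tn, fp, fn):
    """Compute a few confusion-table statistics from raw counts.

    Assumes every denominator is non-zero; no guarding is done here.
    """
    total = tp + tn + fp + fn
    tpr = tp / (tp + fn)  # true positive rate (recall)
    fpr = fp / (fp + tn)  # false positive rate
    fnr = fn / (fn + tp)  # false negative rate
    tnr = tn / (tn + fp)  # true negative rate (specificity)
    return {
        "false_negative_rate": fnr,
        "false_omission_rate": fn / (fn + tn),
        "error_rate": (fp + fn) / total,
        "prevalence": (tp + fn) / total,
        "pos_likelihood_ratio": tpr / fpr,
        "neg_likelihood_ratio": fnr / tnr,
        "diagnostic_odds_ratio": (tp * tn) / (fp * fn),
        "jaccard_index": tp / (tp + fp + fn),
    }


stats = confusion_stats(tp=120, tn=60, fp=20, fn=50)
for name, value in stats.items():
    print(name, round(value, 4))
```

The diagnostic odds ratio equals the positive likelihood ratio divided by the negative one, which is a quick sanity check on the output.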
Deprecated f-measure & g-measure from ConfusionTable for removal in 0.6.0
Added notes to indicate when functions, classes, & methods were added
- Added the following 12 tokenizers:
QSkipgrams
CharacterTokenizer
RegexpTokenizer, WhitespaceTokenizer, & WordpunctTokenizer
COrVClusterTokenizer, CVClusterTokenizer, & VCClusterTokenizer
SonoriPyTokenizer & LegaliPyTokenizer
NLTKTokenizer
SAPSTokenizer
Added the UnigramCorpus class & a facility for downloading data, such as pre-processed/trained data, from storage on GitHub
Added the Wåhlin phonetic encoding
- Added the following 211 similarity/distance/correlation measures:
ALINE
AMPLE
Anderberg
Andres & Marzo’s Delta
Average Linkage
AZZOO
Baroni-Urbani & Buser I & II
Batagelj & Bren
Baulieu I-XV
Benini I & II
Bennet
Bhattacharyya
BI-SIM
BLEU
Block Levenshtein
Brainerd-Robinson
Braun-Blanquet
Canberra
Chord
Clement
Cohen’s Kappa
Cole
Complete Linkage
Consonni & Todeschini I-V
Cormode’s LZ
Covington
Dennis
Dice Asymmetric I & II
Digby
Dispersion
Doolittle
Dunning
Eyraud
Fager & McGowan
Faith
Fellegi-Sunter
Fidelity
Fleiss
Fleiss-Levin-Paik
FlexMetric
Forbes I & II
Fossum
FuzzyWuzzy Partial String
FuzzyWuzzy Token Set
FuzzyWuzzy Token Sort
Generalized Fleiss
Gilbert
Gilbert & Wells
Gini I & II
Goodall
Goodman & Kruskal’s Lambda
Goodman & Kruskal’s Lambda-r
Goodman & Kruskal’s Tau A & B
Gower & Legendre
Guttman’s Lambda A & B
Gwet’s AC
Hamann
Harris & Lahey
Hassanat
Hawkins & Dotson
Hellinger
Higuera & Mico
Hurlbert
Iterative SubString
Jaccard-NM
Jensen-Shannon
Johnson
Kendall’s Tau
Kent & Foster I & II
Koppen I & II
Kuder & Richardson
Kuhns I-XII
Kulczynski I & II
Longest Common Prefix
Longest Common Suffix
Lorentzian
Maarel
Marking
Marking Metric
MASI
Matusita
Maxwell & Pilliner
McConnaughey
McEwen & Michael
MetaLevenshtein
Michelet
MinHash
Mountford
Mean Squared Contingency
Mutual Information
NCD with LZSS
NCD with PAQ9a
Ozbay
Pattern
Pearson’s Chi-Squared
Pearson & Heron II
Pearson II & III
Pearson’s Phi
Peirce
Positional Q-Gram Dice, Jaccard, & Overlap
Q-Gram
Quantitative Cosine, Dice, & Jaccard
Rees-Levenshtein
Roberts
Rogers & Tanimoto
Rogot & Goldberg
Rouge-L, -S, -SU, & -W
Russell & Rao
SAPS
Scott’s Pi
Shape
Shapira & Storer I
Sift4 Extended
Single Linkage
Size
Soft Cosine
SoftTF-IDF
Sokal & Michener
Sokal & Sneath I-V
Sorgenfrei
Steffensen
Stiles
Stuart’s Tau
Tarantula
Tarwid
Tetrachoric
TF-IDF
Tichy
Tulloss’s R, S, T, & U
Unigram Subtuple
Unknown A-M
Upholt
Warrens I-V
Weighted Jaccard
Whittaker
Yates’ Chi-Squared
YJHHR
Yujian & Bo
Yule’s Q, Q II, & Y
Four intersection types are now supported for all distance measures that are based on _TokenDistance. In addition to basic crisp intersections, soft, fuzzy, and group linkage intersections have been provided.
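For orientation, a basic crisp intersection counts only tokens that match exactly. The sketch below shows this for q-gram multisets and uses it in a Sørensen–Dice similarity. The helper names, the `$` and `#` padding symbols, and the default `q=2` are assumptions of this example; the soft, fuzzy, and group linkage variants are not sketched here.

```python
from collections import Counter


def qgrams(word, q=2):
    """Return the multiset of q-grams of a padded word."""
    padded = "$" * (q - 1) + word + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def crisp_intersection(src_tokens, tar_tokens):
    """Keep each shared token with the smaller of its two counts."""
    return src_tokens & tar_tokens


def dice_similarity(src, tar, q=2):
    """Sørensen–Dice similarity over q-gram multisets."""
    src_tokens, tar_tokens = qgrams(src, q), qgrams(tar, q)
    size = sum(src_tokens.values()) + sum(tar_tokens.values())
    if size == 0:
        return 1.0  # nothing to compare: treat as identical
    shared = sum(crisp_intersection(src_tokens, tar_tokens).values())
    return 2 * shared / size


print(dice_similarity("night", "nacht"))  # 0.5
```

Here "night" and "nacht" each yield six padded bigrams and share three of them, giving 2 * 3 / 12 = 0.5.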
0.3.6 (2018-11-17) classy carl
doi:10.5281/zenodo.1490537
Changes:
Most functions were encapsulated into classes.
Each class is broken out into its own file, with test files paralleling library files.
Documentation was converted from Sphinx markup to NumPy style.
A tutorial was written for each subpackage.
Documentation was cleaned up, with math markup corrections and many additional links.
0.3.5 (2018-10-31) cantankerous carl
doi:10.5281/zenodo.1463204
Version 0.3.5 focuses on refactoring the whole project. The API itself remains largely the same as in previous versions, but under the hood the modules have been split up. Essentially no new features are added (bugfixes aside) in this version.
Changes:
Refactored library and tests into smaller modules
Broke compression distances (NCD) out into separate functions
Adopted Black code style
Added pyproject.toml to use Poetry for packaging (but will continue using setuptools and setup.py for the present)
Minor bug fixes
0.3.0 (2018-10-15) carl
doi:10.5281/zenodo.1462443
Version 0.3.0 focuses on additional phonetic algorithms, but does add numerous distance measures, fingerprints, and even a few stemmers. Another focus was getting everything to build again (including docs) and to move to more standard modern tools (flake8, tox, etc.).
Changes:
Fixed implementation of Bag distance
Updated BMPM to version 3.10
Fixed Sphinx documentation on readthedocs.org
Split string fingerprints out of clustering into their own module
Added support for q-grams to skip n characters
- New phonetic algorithms:
Statistics Canada
Lein
Roger Root
Oxford Name Compression Algorithm (ONCA)
Eudex phonetic hash
Haase Phonetik
Reth-Schek Phonetik
FONEM
Parmar-Kumbharana
Davidson’s Consonant Code
SoundD
PSHP Soundex/Viewex Coding
an early version of Henry Code
Norphone
Dolby Code
Phonetic Spanish
Spanish Metaphone
MetaSoundex
SoundexBR
NRL English-to-phoneme
- New string fingerprints:
Cisłak & Grabowski’s occurrence fingerprint
Cisłak & Grabowski’s occurrence halved fingerprint
Cisłak & Grabowski’s count fingerprint
Cisłak & Grabowski’s position fingerprint
Synoname Toolcode
- New distance measures:
Minkowski distance & similarity
Manhattan distance & similarity
Euclidean distance & similarity
Chebyshev distance & similarity
Eudex distances
Sift4 distance
Baystat distance & similarity
Typo distance
Indel distance
Synoname
- New stemmers:
UEA-Lite Stemmer
Paice-Husk Stemmer
Schinke Latin stemmer
S stemmer
Eliminated ._compat submodule in favor of six
Transitioned from PEP8 to flake8, etc.
Phonetic algorithms now consistently use max_length=-1 to indicate that there should be no length limit
Added example notebooks in binder directory
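The max_length=-1 convention noted in the changes above can be illustrated with a simplified American Soundex encoder. The function `soundex` below is a hypothetical stand-in written for this example, not the Abydos implementation. In particular, skipping zero padding when there is no length limit is an assumption of this sketch.

```python
# Map each coded consonant to its Soundex digit.
_CODES = {}
for _digit, _letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), 1):
    for _letter in _letters:
        _CODES[_letter] = str(_digit)


def soundex(word, max_length=4):
    """Encode a word; max_length=-1 means no length limit."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    last = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        digit = _CODES.get(ch, "")
        if digit and digit != last:
            code += digit
        last = digit  # vowels reset the run, so a repeated code counts again
    if max_length == -1:
        return code  # no limit: neither truncate nor pad (assumption)
    return code[:max_length].ljust(max_length, "0")


print(soundex("Robert"))  # R163
print(soundex("Ashcraft"))  # A261
print(soundex("Ashcraft", max_length=-1))  # A2613
```

Using a single sentinel value for "no limit" across encoders means callers can switch algorithms without special-casing each one.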
0.2.0 (2015-05-27) berthold
Added Caumanns’ German stemmer
Added Lovins’ English stemmer
Updated Beider-Morse Phonetic Matching to 3.04
Added Sphinx documentation
0.1.1 (2015-05-12) albrecht
First Beta release to PyPI