A fork of Katsuma Narisawa's Python implementation of SimString, a simple and efficient algorithm for approximate string matching. Uses mypyc to improve speed.
simstring
A Python implementation of SimString, a simple and efficient algorithm for approximate string matching.
Docs are here
Features
With this library, you can extract the strings or texts that reach a given similarity to a query from a large collection of strings or texts. This is useful when developing language-processing applications.
This library supports a variety of similarity functions, such as cosine similarity and Jaccard similarity, and supports word n-grams and character n-grams as features. You can also implement your own feature extractor easily.
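To make the two feature types concrete, here is a minimal standalone sketch of character and word n-gram extraction. It does not use the library: the function names `character_ngrams` and `word_ngrams` are hypothetical, and the library's own extractors may differ in detail (for example by padding the string or by tokenizing differently).

```python
def character_ngrams(text: str, n: int = 2) -> list[str]:
    """Return every contiguous run of n characters in `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text: str, n: int = 2) -> list[tuple[str, ...]]:
    """Return every contiguous run of n whitespace-separated words."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


if __name__ == "__main__":
    print(character_ngrams("fooo", 2))
    print(word_ngrams("You are so cool.", 2))
```

A string shorter than `n` yields no features in this sketch, which is one reason real extractors often pad their input.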
SimString has the following features:
- Fast algorithm for approximate string retrieval.
- 100% exact retrieval. Although some algorithms allow misses (false negatives) for faster query response, SimString is guaranteed to achieve 100% correct retrieval with fast query response.
- Unicode support.
- Extensibility. You can implement your own feature extractor easily.
- No Japanese support.
Please see this paper for more details.
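One way to read "exact retrieval" is that the result set should equal what an exhaustive scan would return for the same features, measure, and threshold. The sketch below is that slow reference scan, not the SimString algorithm and not the library's code. The helper names are hypothetical, and it assumes unpadded character bigrams treated as a set, with cosine similarity.

```python
import math


def bigrams(text: str) -> set[str]:
    """Unpadded character bigrams of `text`, as a set (an assumption)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def cosine(x: set[str], y: set[str]) -> float:
    """Set-based cosine similarity; 0.0 when either set is empty."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))


def brute_force_search(query: str, strings: list[str], threshold: float) -> list[str]:
    """Compare the query against every string and keep those at or above the threshold."""
    query_features = bigrams(query)
    return [s for s in strings if cosine(query_features, bigrams(s)) >= threshold]


if __name__ == "__main__":
    print(brute_force_search("foo", ["foo", "bar", "fooo"], 0.8))
```

The scan costs one comparison per stored string. An exact index aims to return the same set without touching every entry.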
Install
pip install simstring-fast
Usage
from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure import CosineMeasure
from simstring.database import DictDatabase
from simstring.searcher import Searcher
db = DictDatabase(CharacterNgramFeatureExtractor(2))
db.add('foo')
db.add('bar')
db.add('fooo')
searcher = Searcher(db, CosineMeasure())
results = searcher.search('foo', 0.8)
print(results)
# => ['foo', 'fooo']
To use a different feature extractor, similarity measure, or database, simply swap in the corresponding class. You can also substitute your own classes.
from simstring.feature_extractor.word_ngram import WordNgramFeatureExtractor
from simstring.measure import JaccardMeasure
from simstring.database import DictDatabase
from simstring.searcher import Searcher
db = DictDatabase(WordNgramFeatureExtractor(2))
db.add('You are so cool.')
searcher = Searcher(db, JaccardMeasure())
results = searcher.search('You are cool.', 0.8)
print(results)
Supported String Similarity Measures
- Cosine
- Dice
- Jaccard
- Overlap
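For reference, the textbook set-based definitions of these four measures can be sketched as follows. This is a standalone illustration over Python sets of features. The function names are hypothetical, and the library's measure classes may handle details such as empty inputs differently.

```python
import math


def cosine(x: set, y: set) -> float:
    """|X ∩ Y| / sqrt(|X| * |Y|)"""
    return len(x & y) / math.sqrt(len(x) * len(y)) if x and y else 0.0


def dice(x: set, y: set) -> float:
    """2 * |X ∩ Y| / (|X| + |Y|)"""
    return 2 * len(x & y) / (len(x) + len(y)) if x and y else 0.0


def jaccard(x: set, y: set) -> float:
    """|X ∩ Y| / |X ∪ Y|"""
    return len(x & y) / len(x | y) if x and y else 0.0


def overlap(x: set, y: set) -> float:
    """|X ∩ Y| / min(|X|, |Y|)"""
    return len(x & y) / min(len(x), len(y)) if x and y else 0.0


if __name__ == "__main__":
    x = {"a", "b", "c"}
    y = {"b", "c", "d", "e"}
    for measure in (cosine, dice, jaccard, overlap):
        print(measure.__name__, round(measure(x, y), 3))
```

All four return 1.0 for identical non-empty sets and 0.0 for disjoint ones. They differ in how strongly a size mismatch between the two feature sets is penalized, so the same threshold is not equally strict across measures.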
Download files
Built Distributions
Hashes for simstring_fast-0.1.4-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 3846a6186af4ed49f3c0be4910d37e438ae9fbbad6979077575dc2311a573327
MD5 | 1bd56b3aeba9bcc0475f04730963a456
BLAKE2b-256 | 1856f5a9d4dde2d68a5e479c8045e35f5d2cd99f453afa1193a2e99ebab9e126
Hashes for simstring_fast-0.1.4-cp310-cp310-manylinux_2_31_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 3844d6e8542ef3706b6de8c868fa290e69a56bc6e77ca86890b16161f91d735e
MD5 | 4fd8b92803076edacc0cdf4950b502de
BLAKE2b-256 | 14937678804feab881fd747eb1282b413127716d24052ec5c811cae87ab5cf6e