A fork of the Python implementation of SimString by Katsuma Narisawa. SimString is a simple and efficient algorithm for approximate string matching. This fork uses mypyc to improve speed.
simstring
A Python implementation of SimString, a simple and efficient algorithm for approximate string matching.
Docs are here
Features
With this library, you can extract strings or texts that have a certain similarity to a query from a large collection of strings or texts. It is useful when developing applications related to language processing.
This library supports a variety of similarity functions, such as cosine similarity and Jaccard similarity, and supports word n-grams and character n-grams as features. You can also implement your own feature extractor easily.
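To make the two feature types concrete, here is a minimal pure-Python sketch of character and word n-gram extraction. The function names `char_ngrams` and `word_ngrams` are illustrative and are not part of this library; the library's own extractors may pad, tokenize, or deduplicate differently.

```python
from typing import List, Tuple


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text: str, n: int = 2) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive whitespace-separated words."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


print(char_ngrams("fooo"))           # ['fo', 'oo', 'oo']
print(word_ngrams("You are cool."))  # [('You', 'are'), ('are', 'cool.')]
```

Character n-grams tolerate small spelling differences, while word n-grams compare longer texts at the phrase level.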
SimString has the following features:
- Fast algorithm for approximate string retrieval.
- 100% exact retrieval. Although some algorithms allow misses (false negatives) in exchange for faster query response, SimString is guaranteed to achieve 100% correct retrieval with fast query response.
- Unicode support.
- Extensibility. You can implement your own feature extractor easily.
- No Japanese support.
Please see this paper for more details.
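Exact retrieval and speed are compatible because a similarity threshold implies hard bounds on which stored strings can match. For cosine similarity over feature sets X (query) and Y (candidate), requiring |X ∩ Y| / sqrt(|X| |Y|) >= alpha, together with |X ∩ Y| <= min(|X|, |Y|), gives alpha² |X| <= |Y| <= |X| / alpha², and a candidate must share at least ceil(alpha * sqrt(|X| |Y|)) features with the query. The sketch below computes these bounds; the function names are illustrative and are not the library's API.

```python
import math
from typing import Tuple


def cosine_size_bounds(query_size: int, alpha: float) -> Tuple[int, int]:
    """Smallest and largest candidate feature-set sizes that can reach alpha."""
    smallest = math.ceil(alpha * alpha * query_size)
    largest = math.floor(query_size / (alpha * alpha))
    return smallest, largest


def cosine_min_overlap(query_size: int, candidate_size: int, alpha: float) -> int:
    """Minimum number of shared features needed to reach alpha."""
    return math.ceil(alpha * math.sqrt(query_size * candidate_size))


print(cosine_size_bounds(4, 0.8))     # (3, 6)
print(cosine_min_overlap(4, 5, 0.8))  # 4
```

Strings whose feature-set size falls outside the bounds can be skipped without computing any similarity, and none of the skipped strings could have matched.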
Install
pip install simstring-fast
Usage
from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure.cosine import CosineMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher
db = DictDatabase(CharacterNgramFeatureExtractor(2))
db.add('foo')
db.add('bar')
db.add('fooo')
searcher = Searcher(db, CosineMeasure())
results = searcher.search('foo', 0.8)
print(results)
# => ['foo', 'fooo']
To use a different feature extractor, measure, or database, simply swap in the corresponding class. You can also substitute your own classes.
from simstring.feature_extractor.word_ngram import WordNgramFeatureExtractor
from simstring.measure.jaccard import JaccardMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher
db = DictDatabase(WordNgramFeatureExtractor(2))
db.add('You are so cool.')
searcher = Searcher(db, JaccardMeasure())
results = searcher.search('You are cool.', 0.8)
print(results)
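The output of this second example is not shown. As a rough guide, the sketch below computes the Jaccard similarity of the two sentences by hand, assuming plain whitespace tokenization and unpadded word bigrams. The helpers `word_bigrams` and `jaccard` are hypothetical, and the library's `WordNgramFeatureExtractor` may tokenize or pad differently, so its score can differ.

```python
def word_bigrams(text):
    """Set of adjacent word pairs, split on whitespace (assumed tokenization)."""
    words = text.split()
    return set(zip(words, words[1:]))


def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y|, defined as 0.0 when both sets are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


stored = word_bigrams("You are so cool.")
query = word_bigrams("You are cool.")
print(jaccard(query, stored))  # 0.25
```

Under these assumptions only one of four distinct bigrams is shared, so the score is well below the 0.8 threshold and the stored sentence would not be returned.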
Supported String Similarity Measures
- Cosine
- Dice
- Jaccard
- Overlap
- Left Overlap
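For reference, the standard set-based definitions of the first four measures can be sketched as follows. These are textbook formulas written as plain functions, not the library's measure classes, and they assume both feature sets are non-empty. Left Overlap is omitted because its definition is not given here.

```python
import math


def cosine(x: set, y: set) -> float:
    return len(x & y) / math.sqrt(len(x) * len(y))


def dice(x: set, y: set) -> float:
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x: set, y: set) -> float:
    return len(x & y) / len(x | y)


def overlap(x: set, y: set) -> float:
    return len(x & y) / min(len(x), len(y))


x = {"ab", "bc", "cd"}
y = {"bc", "cd", "de", "ef"}  # shares 2 features with x
print(round(cosine(x, y), 3))   # 0.577
print(round(dice(x, y), 3))     # 0.571
print(round(jaccard(x, y), 3))  # 0.4
print(round(overlap(x, y), 3))  # 0.667
```

For the same pair of sets, Jaccard is the strictest of the four and overlap the most lenient, so a threshold that works for one measure usually needs adjusting for another.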
Supported database backends
- dictionary
- diskcache (sqlite)
- redis (in development #37)
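In the SimString approach, a backend can be pictured as an inverted index keyed first by feature-set size and then by feature, so a search only touches strings whose size is compatible with the threshold. The class `TinyIndex` below is a hypothetical in-memory illustration of that layout using unpadded character n-grams; it is not the library's `DictDatabase`.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: feature-set size -> feature -> set of strings."""

    def __init__(self, n: int = 2):
        self.n = n
        self.index = defaultdict(lambda: defaultdict(set))

    def features(self, text: str) -> set:
        # Unpadded character n-grams, deduplicated (an assumption of this sketch).
        return {text[i:i + self.n] for i in range(len(text) - self.n + 1)}

    def add(self, text: str) -> None:
        feats = self.features(text)
        for feature in feats:
            self.index[len(feats)][feature].add(text)

    def lookup(self, size: int, feature: str) -> set:
        # .get avoids creating empty entries for sizes or features never seen.
        return self.index.get(size, {}).get(feature, set())


idx = TinyIndex(2)
for s in ("foo", "bar", "fooo"):
    idx.add(s)

print(sorted(idx.lookup(2, "oo")))  # ['foo', 'fooo']
print(sorted(idx.lookup(3, "oo")))  # []
```

A disk-backed or networked backend would store the same mapping in a persistent store instead of nested dictionaries.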
Built Distributions
Hashes for simstring_fast-0.2.7-cp311-cp311-manylinux_2_35_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 92645ff3e59cc6fa2c8a3ff6814a8a6e67aafe6853b57e87f81074de54e1c4eb
MD5 | 9db71c79837140fbdc868fc8abe7ec34
BLAKE2b-256 | 682b433a10661accf9e067298f29ce2bede91a57180f5668940f825b056bb294
Hashes for simstring_fast-0.2.7-cp310-cp310-manylinux_2_35_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | b08b59a4061c8cc94cb9ff13f76e7eb4191a19777e765a71a1506548f0e05d3e
MD5 | a9d3ced394b462074949eccdbad2d766
BLAKE2b-256 | 7e674a7a82240eaedd879041fe3547555b50eca42d2e4080b7d8df0048ca658b
Hashes for simstring_fast-0.2.7-cp39-cp39-manylinux_2_35_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 50a399612d69cd296c93a67f8a0119a3d4b6a742c55c097f9eacc180f0a18fc8
MD5 | c2005f75c6e97f24db76a8b29efca7da
BLAKE2b-256 | 03ff57d00006b241816f70b09da2ed0ebca588859fa3f63735ac8e66fef669da
Hashes for simstring_fast-0.2.7-cp38-cp38-manylinux_2_35_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | ad0176ec610aa23a148eed21e1f42dc04d459b17201869a4aa805a5e91d6a615
MD5 | 8b2a026b04028eca65088f656c67c727
BLAKE2b-256 | 998fe8623bbe7f43f8b656f9fac7b2090f74b6e1281718ca4a8ea2e4a9edfa14