A fork of Katsuma Narisawa's Python implementation of SimString, a simple and efficient algorithm for approximate string matching. Uses mypyc to improve speed.
simstring
A Python implementation of SimString, a simple and efficient algorithm for approximate string matching.
Docs are here
Features
With this library, you can extract the strings or texts that have a certain similarity to a query from a large collection of strings or texts. It can help when you develop applications related to language processing.
This library supports a variety of similarity functions, such as cosine similarity and Jaccard similarity, and supports word n-grams and character n-grams as features. You can also implement your own feature extractor easily.
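To make the notion of "features" concrete, here is a minimal sketch of character and word n-gram extraction in plain Python. It does not use the library; the function names `char_ngrams` and `word_ngrams` and the `$` padding character are assumptions chosen for illustration, and the library's own extractors may pad or tokenize differently.

```python
def char_ngrams(text, n=2, pad="$"):
    """Return the character n-grams of `text`.

    The string is padded with n-1 copies of `pad` on each side so that
    the first and last characters also appear in n full n-grams.
    (The padding character is an assumption of this sketch.)
    """
    padded = pad * (n - 1) + text + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def word_ngrams(text, n=2):
    """Return the word n-grams of `text`, splitting on whitespace."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


print(char_ngrams("foo"))
# ['$f', 'fo', 'oo', 'o$']
print(word_ngrams("You are so cool."))
# [('You', 'are'), ('are', 'so'), ('so', 'cool.')]
```

Character n-grams tolerate small spelling differences within a word, while word n-grams compare longer texts at the level of phrases.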
SimString has the following features:
- Fast algorithm for approximate string retrieval.
- 100% exact retrieval. Although some algorithms allow misses (false negatives) for faster query response, SimString is guaranteed to achieve 100% correct retrieval with fast query response.
- Unicode support.
- Extensibility. You can implement your own feature extractor easily.
- No Japanese support.
Please see this paper for more details.
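The exactness guarantee comes from turning the similarity threshold into necessary conditions on feature-set sizes, so that candidates can be pruned without losing any true match. The sketch below shows those conditions for cosine similarity as described in the SimString paper. The function names are hypothetical and are not part of this library's API.

```python
import math


def cosine_size_bounds(query_size, alpha):
    """Range of feature-set sizes a string must have to reach cosine >= alpha.

    Follows from |X & Y| <= min(|X|, |Y|): a candidate Y can only match
    a query X if alpha^2 * |X| <= |Y| <= |X| / alpha^2.
    """
    lo = math.ceil(alpha * alpha * query_size)
    hi = math.floor(query_size / (alpha * alpha))
    return lo, hi


def cosine_min_overlap(query_size, candidate_size, alpha):
    """Minimum number of shared features needed for cosine >= alpha."""
    return math.ceil(alpha * math.sqrt(query_size * candidate_size))


# A query with 4 features and threshold 0.8 (illustrative numbers):
print(cosine_size_bounds(4, 0.8))     # (3, 6)
print(cosine_min_overlap(4, 5, 0.8))  # 4
```

Only strings whose feature count falls inside the size range need to be examined, and each must share at least the minimum number of features with the query. Because both conditions are necessary for a match, filtering on them never discards a correct result.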
Install
pip install simstring-fast
Usage
from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure.cosine import CosineMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher
db = DictDatabase(CharacterNgramFeatureExtractor(2))
db.add('foo')
db.add('bar')
db.add('fooo')
searcher = Searcher(db, CosineMeasure())
results = searcher.search('foo', 0.8)
print(results)
# => ['foo', 'fooo']
To use a different feature extractor, measure, or database, simply swap in the corresponding classes. You can also substitute your own classes.
from simstring.feature_extractor.word_ngram import WordNgramFeatureExtractor
from simstring.measure.jaccard import JaccardMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher
db = DictDatabase(WordNgramFeatureExtractor(2))
db.add('You are so cool.')
searcher = Searcher(db, JaccardMeasure())
results = searcher.search('You are cool.', 0.8)
print(results)
Supported String Similarity Measures
- Cosine
- Dice
- Jaccard
- Overlap
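For reference, the four measures can be written as functions over two feature sets. This is a minimal sketch using the standard set-based definitions, not the library's own classes. The two feature sets are hand-written for illustration, and the `oo#2` token (a second occurrence of the same bigram kept as a distinct feature) is an assumption of this example.

```python
import math


# Each function assumes both feature sets are non-empty.
def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))


def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x, y):
    return len(x & y) / len(x | y)


def overlap(x, y):
    return len(x & y) / min(len(x), len(y))


# Illustrative feature sets (4 and 5 features, 4 of them shared).
x = {"$f", "fo", "oo", "o$"}
y = {"$f", "fo", "oo", "oo#2", "o$"}

print(round(cosine(x, y), 3))   # 0.894
print(round(dice(x, y), 3))     # 0.889
print(jaccard(x, y))            # 0.8
print(overlap(x, y))            # 1.0
```

The same pair of strings scores differently under each measure, so a threshold such as 0.8 should be chosen with the measure in mind.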