A fork of Katsuma Narisawa's Python implementation of SimString, a simple and efficient algorithm for approximate string matching. Uses mypyc to improve speed.
simstring
A Python implementation of SimString, a simple and efficient algorithm for approximate string matching.
Docs are here
Features
With this library, you can extract the strings or texts that are similar to a query, at or above a chosen similarity threshold, from a large collection of strings or texts. It is useful when developing language-processing applications.
This library supports a variety of similarity functions, such as cosine similarity and Jaccard similarity, and supports word n-grams and character n-grams as features. You can also implement your own feature extractor easily.
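To make the notion of n-gram features concrete, here is a minimal standalone sketch in plain Python. It does not use this library's classes; the function names `character_ngrams` and `word_ngrams` are hypothetical, and details such as padding or handling of repeated n-grams in the library's own extractors may differ.

```python
def character_ngrams(text, n=2):
    """Return the character n-grams of `text`, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n=2):
    """Return the word n-grams of `text`, splitting on whitespace."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


print(character_ngrams("fooo"))
# ['fo', 'oo', 'oo']
print(word_ngrams("You are so cool."))
# [('You', 'are'), ('are', 'so'), ('so', 'cool.')]
```

A string shorter than `n` yields an empty list in this sketch, so very short inputs carry no features.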
SimString has the following features:
- Fast algorithm for approximate string retrieval.
- 100% exact retrieval. Although some algorithms allow misses (false negatives) for faster query response, SimString is guaranteed to achieve 100% correct retrieval with fast query response.
- Unicode support.
- Extensibility. You can implement your own feature extractor easily.
- No Japanese support.
Please see this paper for more details.
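One way to read the exactness guarantee: the result set should be the same one an exhaustive scan over every stored string would produce. The sketch below is such a reference scan, written in plain Python with character bigram sets and Jaccard similarity. It is an illustration only, not how SimString works internally; the names `bigrams`, `jaccard`, and `exhaustive_search` are hypothetical, and strings are assumed to have at least two characters.

```python
def bigrams(text):
    """Set of character bigrams (repeated bigrams collapse into one)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(x, y):
    """Jaccard similarity of two non-empty sets."""
    return len(x & y) / len(x | y)


def exhaustive_search(strings, query, threshold):
    """Compare the query against every string; nothing can be missed."""
    q = bigrams(query)
    return [s for s in strings if jaccard(bigrams(s), q) >= threshold]


print(exhaustive_search(["foo", "bar", "fooo", "food"], "foo", 0.6))
# ['foo', 'fooo', 'food']
```

A scan like this costs time proportional to the size of the collection for every query, which is the cost an indexed approach is meant to avoid while returning the same results.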
Install
pip install simstring-fast
Usage
from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure.cosine import CosineMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher
db = DictDatabase(CharacterNgramFeatureExtractor(2))
db.add('foo')
db.add('bar')
db.add('fooo')
searcher = Searcher(db, CosineMeasure())
results = searcher.search('foo', 0.8)
print(results)
# => ['foo', 'fooo']
To use a different feature extractor, measure, or database, simply swap in the corresponding class. You can also substitute your own classes.
from simstring.feature_extractor.word_ngram import WordNgramFeatureExtractor
from simstring.measure import JaccardMeasure
from simstring.database import DictDatabase
from simstring.searcher import Searcher
db = DictDatabase(WordNgramFeatureExtractor(2))
db.add('You are so cool.')
searcher = Searcher(db, JaccardMeasure())
results = searcher.search('You are cool.', 0.8)
print(results)
Supported String Similarity Measures
- Cosine
- Dice
- Jaccard
- Overlap
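For reference, the four measures above have standard set-based definitions. The sketch below computes them on two feature sets in plain Python; it is independent of this library's measure classes, the function names are hypothetical, and both sets are assumed to be non-empty.

```python
import math


def cosine(x, y):
    """|X ∩ Y| / sqrt(|X| * |Y|)"""
    return len(x & y) / math.sqrt(len(x) * len(y))


def dice(x, y):
    """2 * |X ∩ Y| / (|X| + |Y|)"""
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y|"""
    return len(x & y) / len(x | y)


def overlap(x, y):
    """|X ∩ Y| / min(|X|, |Y|)"""
    return len(x & y) / min(len(x), len(y))


x = {"ab", "bc", "cd"}
y = {"bc", "cd", "de", "ef"}
for measure in (cosine, dice, jaccard, overlap):
    print(measure.__name__, round(measure(x, y), 3))
# cosine 0.577
# dice 0.571
# jaccard 0.4
# overlap 0.667
```

For the same pair of sets, Jaccard is never larger than Dice and Overlap is never smaller than Cosine, so a given threshold is stricter or looser depending on the measure chosen.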