A fork of Katsuma Narisawa's Python implementation of SimString, a simple and efficient algorithm for approximate string matching. This fork is compiled with mypyc to improve speed.
simstring
A Python implementation of SimString, a simple and efficient algorithm for approximate string matching.
Docs are here
Features
With this library, you can extract the strings or texts that meet a given similarity threshold against a query from a large collection of strings or texts. This helps when you develop language-processing applications.
This library supports a variety of similarity functions, such as cosine similarity and Jaccard similarity, and supports both word n-grams and character n-grams as features. You can also implement your own feature extractor easily.
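To make the notion of "features" concrete, here is a minimal, library-independent sketch of character and word n-gram extraction. The function names and the single-space padding are assumptions chosen for illustration; they are not taken from this package's source, and the built-in extractors may differ in detail.

```python
from typing import List, Tuple


def character_ngrams(text: str, n: int = 2, pad: str = " ") -> List[str]:
    """Return the character n-grams of `text`, padded once on each side.

    The padding character is an assumption made for this sketch.
    """
    padded = f"{pad}{text}{pad}"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def word_ngrams(text: str, n: int = 2) -> List[Tuple[str, ...]]:
    """Return the word n-grams of `text`, splitting on whitespace."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


print(character_ngrams("foo"))
print(word_ngrams("You are so cool."))
```

Two strings are then compared by how many of these features they share, which is what the similarity functions measure.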
SimString has the following features:
- Fast algorithm for approximate string retrieval.
- 100% exact retrieval. Although some algorithms allow misses (false negatives) in exchange for faster query response, SimString is guaranteed to return every string that meets the similarity threshold, while still responding quickly.
- Unicode support.
- Extensibility. You can implement your own feature extractor easily.
- No Japanese support.
Please see this paper for more details.
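The "fast" and "exact" properties can coexist because a similarity threshold bounds how many features a matching string may have. The sketch below illustrates this for set-based cosine similarity: if the query has |X| features and the threshold is alpha, any match must have between alpha²·|X| and |X|/alpha² features, so everything else can be skipped without missing a result. This is a simplified illustration only. All function names are hypothetical, the padding is an assumption, and the real algorithm additionally uses an inverted index rather than a linear scan.

```python
import math


def features(text, n=2):
    """Set of character n-grams, padded with one space on each side (assumed)."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def cosine(x, y):
    """Set-based cosine similarity: |X & Y| / sqrt(|X| * |Y|)."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))


def brute_force_search(strings, query, alpha):
    """Reference answer: score every stored string against the query."""
    q = features(query)
    return [s for s in strings if cosine(q, features(s)) >= alpha]


def size_filtered_search(strings, query, alpha):
    """Same answer, but skip strings whose feature count rules them out."""
    q = features(query)
    eps = 1e-9  # widen the window slightly to guard against float rounding
    min_size = math.ceil(alpha * alpha * len(q) - eps)
    max_size = math.floor(len(q) / (alpha * alpha) + eps)
    results = []
    for s in strings:
        f = features(s)
        # The size check is cheap; the similarity is only computed for survivors.
        if min_size <= len(f) <= max_size and cosine(q, f) >= alpha:
            results.append(s)
    return results


strings = ["foo", "bar", "fooo", "f", "food"]
print(size_filtered_search(strings, "foo", 0.8))
print(brute_force_search(strings, "foo", 0.8))
```

Both functions return the same list for any input, because the size window only discards strings that could not have reached the threshold anyway.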
Install
pip install simstring-fast
Usage
from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure.cosine import CosineMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher
db = DictDatabase(CharacterNgramFeatureExtractor(2))
db.add('foo')
db.add('bar')
db.add('fooo')
searcher = Searcher(db, CosineMeasure())
results = searcher.search('foo', 0.8)
print(results)
# => ['foo', 'fooo']
To use a different feature extractor, similarity measure, or database, simply swap in the corresponding classes. You can also substitute your own classes.
from simstring.feature_extractor.word_ngram import WordNgramFeatureExtractor
from simstring.measure import JaccardMeasure
from simstring.database import DictDatabase
from simstring.searcher import Searcher
db = DictDatabase(WordNgramFeatureExtractor(2))
db.add('You are so cool.')
searcher = Searcher(db, JaccardMeasure())
results = searcher.search('You are cool.', 0.8)
print(results)
Supported String Similarity Measures
- Cosine
- Dice
- Jaccard
- Overlap
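For reference, the four measures have standard set-based definitions, sketched below in plain Python. These functions are illustrative and independent of the library's measure classes, whose internals may differ (for example in how duplicate features are handled). Both inputs are assumed to be non-empty sets.

```python
import math


def cosine(x: set, y: set) -> float:
    """|X & Y| / sqrt(|X| * |Y|)"""
    return len(x & y) / math.sqrt(len(x) * len(y))


def dice(x: set, y: set) -> float:
    """2 * |X & Y| / (|X| + |Y|)"""
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x: set, y: set) -> float:
    """|X & Y| / |X | Y|"""
    return len(x & y) / len(x | y)


def overlap(x: set, y: set) -> float:
    """|X & Y| / min(|X|, |Y|)"""
    return len(x & y) / min(len(x), len(y))


x = {"ab", "bc", "cd"}
y = {"bc", "cd", "de", "ef"}
for measure in (cosine, dice, jaccard, overlap):
    print(measure.__name__, round(measure(x, y), 3))
```

For the same pair of feature sets, Jaccard is never larger than Dice, Dice is never larger than cosine, and cosine is never larger than overlap, so a given threshold is strictest under Jaccard and most permissive under overlap.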
Built Distributions
Hashes for simstring_fast-0.1.5-cp311-cp311-manylinux_2_31_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | abf0603ff35e0be56f951d7b7adeddf70582fa9fc067fbdb267cfc799e18da63
MD5 | 61d8173125402f530146c64e4e61cb08
BLAKE2b-256 | 3bd13879a3ae5b93a200f42cca9ee2a92743e84f284e29669ad8e59f530b9da3
Hashes for simstring_fast-0.1.5-cp310-cp310-manylinux_2_31_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | bc5febca21c914a3cf1d3210ddbb49f40afd48eef649d173dd1b479438f5809f
MD5 | 242c78ed359e9b0bbac5bebea7f2cd2c
BLAKE2b-256 | db7e7ac524079440201b7635403f2818dff6b07958e3eaba1ea5ca0c79392198
Hashes for simstring_fast-0.1.5-cp39-cp39-manylinux_2_31_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 69d35fd050c62b9eecfd4c3cc280192541556ca7f3875469c1e47b009b39a9bd
MD5 | 2f6099491856b622694baaed8f7ca31d
BLAKE2b-256 | ccafad817c692d60df50b5d7a2201b45ab0e16356d3ba5e5c47447fe372e51af
Hashes for simstring_fast-0.1.5-cp38-cp38-manylinux_2_31_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 1ec98c32c0094a23c3d315b412c0fef68fe6431a2473051a9ab9dbb695abd4df
MD5 | 8971e98168d8175d28b3014773c6b7d5
BLAKE2b-256 | bba22125e9a260c88f922169e42a5dd9fed0727fb1a52c6da28f22b369c71942