A fork of Katsuma Narisawa's Python implementation of SimString, a simple and efficient algorithm for approximate string matching. Uses mypyc to improve speed.
simstring
A Python implementation of SimString, a simple and efficient algorithm for approximate string matching.
Docs are here
Features
With this library, you can extract the strings or texts that reach a given similarity to a query from a large collection of strings or texts. This is useful when developing language-processing applications.
This library supports a variety of similarity functions, such as cosine similarity and Jaccard similarity, and supports word n-grams and character n-grams as features. You can also implement your own feature extractor easily.
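To make the two feature types concrete, here is a minimal standalone sketch of character and word n-gram extraction. It does not use the library; the function names and the `$` padding at the string boundaries are assumptions made for illustration, and the library's own extractors may behave differently.

```python
def character_ngrams(text, n=2, pad="$"):
    """Return the character n-grams of `text` as a list.

    The text is padded with n-1 `pad` characters on each side so that
    prefixes and suffixes get their own n-grams. The padding scheme is
    an assumption of this sketch, not necessarily the library's.
    """
    padded = pad * (n - 1) + text + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def word_ngrams(text, n=2):
    """Return the word n-grams of `text` as a list of tuples.

    Words are split on whitespace; no padding is added here.
    """
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


print(character_ngrams("foo"))
print(word_ngrams("You are so cool."))
```

Character n-grams tolerate small spelling differences inside a word, while word n-grams compare longer texts at the level of word sequences.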
SimString has the following features:
- Fast algorithm for approximate string retrieval.
- 100% exact retrieval. Although some algorithms trade accuracy for faster query response by allowing missed or spurious matches, SimString is guaranteed to return exactly the strings that meet the similarity threshold, with fast query response.
- Unicode support.
- Extensibility. You can implement your own feature extractor easily.
- No Japanese support.
Please see this paper for more details.
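One way to see how retrieval can be fast and still exact: for a similarity threshold, the size of the query's feature set bounds the size of any matching string's feature set, so strings outside that range can be skipped without losing a true match. The sketch below shows these bounds for cosine similarity, as I understand them from the SimString paper. The function names are hypothetical and this is not the library's code.

```python
import math


def cosine_size_bounds(query_size, alpha):
    """Smallest and largest feature-set sizes that can still reach
    cosine similarity >= alpha against a query with `query_size` features.
    """
    min_size = math.ceil(alpha * alpha * query_size)
    max_size = math.floor(query_size / (alpha * alpha))
    return min_size, max_size


def cosine_min_overlap(query_size, candidate_size, alpha):
    """Minimum number of shared features needed for cosine >= alpha."""
    return math.ceil(alpha * math.sqrt(query_size * candidate_size))


# A query with 4 features and a threshold of 0.8:
print(cosine_size_bounds(4, 0.8))
print(cosine_min_overlap(4, 5, 0.8))
```

Only strings whose feature count falls inside the size range need to be examined, and a candidate is kept only if it shares at least the minimum number of features with the query. Floating-point rounding at exact boundaries is ignored in this sketch.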
Install
pip install simstring-fast
Usage
from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure.cosine import CosineMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher
db = DictDatabase(CharacterNgramFeatureExtractor(2))
db.add('foo')
db.add('bar')
db.add('fooo')
searcher = Searcher(db, CosineMeasure())
results = searcher.search('foo', 0.8)
print(results)
# => ['foo', 'fooo']
To use a different feature extractor, measure, or database, simply swap in the corresponding class. You can also substitute your own classes.
from simstring.feature_extractor.word_ngram import WordNgramFeatureExtractor
from simstring.measure.jaccard import JaccardMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher
db = DictDatabase(WordNgramFeatureExtractor(2))
db.add('You are so cool.')
searcher = Searcher(db, JaccardMeasure())
results = searcher.search('You are cool.', 0.8)
print(results)
Supported String Similarity Measures
- Cosine
- Dice
- Jaccard
- Overlap
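For reference, the four measures can be written over plain Python sets of features. This sketch uses the textbook set-based definitions with hypothetical function names. It assumes non-empty feature sets and is not taken from the library's measure classes.

```python
import math


def cosine(x, y):
    """|X & Y| / sqrt(|X| * |Y|)"""
    return len(x & y) / math.sqrt(len(x) * len(y))


def dice(x, y):
    """2 * |X & Y| / (|X| + |Y|)"""
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x, y):
    """|X & Y| / |X | Y|"""
    return len(x & y) / len(x | y)


def overlap(x, y):
    """|X & Y| / min(|X|, |Y|)"""
    return len(x & y) / min(len(x), len(y))


# Two small feature sets sharing three features.
x = {"a", "b", "c", "d"}
y = {"b", "c", "d", "e", "f"}
for measure in (cosine, dice, jaccard, overlap):
    print(measure.__name__, round(measure(x, y), 3))
```

For the same pair of sets, Jaccard gives the lowest score and overlap the highest, so the same threshold is stricter or looser depending on the measure chosen.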
Download files
Built Distributions
Hashes for simstring_fast-0.2.6-cp311-cp311-manylinux_2_35_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 41a753b8dacbcecd2ec18a2d1c431f457055cb9223019a5e515e95db0a170369
MD5 | 60dbf457e9f2ded223bbd767e0531d50
BLAKE2b-256 | 8ec2e2f5f75aeadef9fb1073dcc2855eb1e5bcf714f81ea78891427d447139ff
Hashes for simstring_fast-0.2.6-cp310-cp310-manylinux_2_35_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 6e68366dd82313d003164280af03501d7447f4efc71c0517a2e27e615fe6bafa
MD5 | 6717c56b3b5f035cd72f68cf6196d2b1
BLAKE2b-256 | c635944f09445cd4c12a55186afd98fbd7db68ad6873c223e7f39d94aa59dbc0
Hashes for simstring_fast-0.2.6-cp39-cp39-manylinux_2_35_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 45cee39e711932ab6d4d54cc478f2712e556c15e2de26916bc4854933f5ba724
MD5 | 8d4a99ddc4cec5c4fb5472df10d22a12
BLAKE2b-256 | 669438dd3dea5506d72ada8051bbce3c8dacf6a840f5c1a7915f3738d226284f
Hashes for simstring_fast-0.2.6-cp38-cp38-manylinux_2_35_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 13d0188465bd88745c5d47c700592ac8b37d99301dbd9bf5438add12c23249ab
MD5 | e4bccc6ce383126eeb30a604e285189a
BLAKE2b-256 | 964cecec15d645ee37703f80347115ca263cc1ac2cb24a202a58efe5c8dab863