simstring
A Python implementation of SimString, a simple and efficient algorithm for approximate string matching.
References
- SimString website: http://www.chokkan.org/software/simstring/
- SimString reference implementation (C++): https://github.com/chokkan/simstring
- SimString paper: http://www.aclweb.org/anthology/C10-1096
Install
pip install simstring
Usage
from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure.cosine import CosineMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher
db = DictDatabase(CharacterNgramFeatureExtractor(2))
db.add('foo')
db.add('bar')
db.add('fooo')
searcher = Searcher(db, CosineMeasure())
results = searcher.search('foo', 0.8)
print(results)
# => ['foo', 'fooo']
To use a different feature extractor, similarity measure, or database, simply swap in the corresponding classes. You can also substitute your own classes.
from simstring.feature_extractor.word_ngram import WordNgramFeatureExtractor
from simstring.measure.jaccard import JaccardMeasure
from simstring.database.mongo import MongoDatabase
from simstring.searcher import Searcher
db = MongoDatabase(WordNgramFeatureExtractor(2))
db.add('You are so cool.')
searcher = Searcher(db, JaccardMeasure())
results = searcher.search('You are cool.', 0.8)
print(results)
Supported String Similarity Measures
- Cosine
- Dice
- Jaccard
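As a rough illustration of what these three measures compute, here is a minimal standalone sketch using the standard set-based definitions over character bigrams. It does not use this package's classes: the helper names (`char_ngrams`, `cosine`, `dice`, `jaccard`) are hypothetical, and the bigram extraction is simplified (no padding, no handling of repeated n-grams), so scores may differ from what the library returns.

```python
import math


def char_ngrams(text, n=2):
    """Return the set of contiguous character n-grams in text.

    Simplified assumption: no boundary padding, and repeated n-grams
    collapse into one set element.
    """
    return {text[i:i + n] for i in range(len(text) - n + 1)}


# Each measure assumes both feature sets are non-empty.
def cosine(x, y):
    # |X ∩ Y| / sqrt(|X| * |Y|)
    return len(x & y) / math.sqrt(len(x) * len(y))


def dice(x, y):
    # 2 * |X ∩ Y| / (|X| + |Y|)
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x, y):
    # |X ∩ Y| / |X ∪ Y|
    return len(x & y) / len(x | y)


a = char_ngrams("night")  # {'ni', 'ig', 'gh', 'ht'}
b = char_ngrams("nacht")  # {'na', 'ac', 'ch', 'ht'}
print(cosine(a, b), dice(a, b), jaccard(a, b))
```

The two strings share one bigram out of four each, so Cosine and Dice both give 0.25, while Jaccard divides by the seven distinct bigrams in the union and gives a lower score. For the same pair of strings, Jaccard is never higher than Dice, which is worth keeping in mind when choosing a threshold.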
Run Tests
python -m unittest discover tests