A fork of Katsuma Narisawa's Python implementation of SimString, a simple and efficient algorithm for approximate string matching. Uses mypyc to improve speed.
simstring
A Python implementation of SimString, a simple and efficient algorithm for approximate string matching.
Docs are here
Features
With this library, you can extract the strings or texts that meet a given similarity threshold from a large collection of strings or texts. This is useful when developing language-processing applications.
This library supports a variety of similarity functions, such as cosine similarity and Jaccard similarity, and supports word n-grams and character n-grams as features. You can also implement your own feature extractor easily.
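As a rough illustration of what character and word n-gram features look like, here is a minimal standalone sketch. The helper names `character_ngrams` and `word_ngrams` are hypothetical and are not part of this library; the library's own extractors may differ in detail, for example by padding the string or by treating repeated n-grams differently.

```python
from typing import List


def character_ngrams(text: str, n: int = 2) -> List[str]:
    """Return all contiguous character n-grams of `text` (no padding)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text: str, n: int = 2) -> List[str]:
    """Return all contiguous word n-grams of `text`, splitting on whitespace."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(character_ngrams("fooo", 2))
# ['fo', 'oo', 'oo']
print(word_ngrams("You are so cool.", 2))
# ['You are', 'are so', 'so cool.']
```

A string shorter than `n` yields an empty feature list in this sketch.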
SimString has the following features:
- Fast algorithm for approximate string retrieval.
- 100% exact retrieval. Although some algorithms allow misses (false negatives) in exchange for faster query response, SimString is guaranteed to achieve 100% correct retrieval with fast query response.
- Unicode support.
- Extensibility. You can implement your own feature extractor easily.
- No Japanese support.
Please see this paper for more details.
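To make the "exact retrieval" guarantee concrete: an exact method should return the same strings as an exhaustive scan that scores every stored string against the query. The sketch below is such a brute-force baseline, not the SimString algorithm itself. The function names are hypothetical, and the choice of unpadded character bigrams with Jaccard similarity is an assumption made only for illustration.

```python
from typing import Iterable, List, Set


def bigrams(text: str) -> Set[str]:
    """Set of unpadded character bigrams of `text`."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(x: Set[str], y: Set[str]) -> float:
    """Jaccard similarity of two feature sets; 0.0 if both are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def brute_force_search(query: str, strings: Iterable[str], threshold: float) -> List[str]:
    """Score every candidate against the query and keep those at or above the threshold."""
    query_features = bigrams(query)
    return [s for s in strings if jaccard(query_features, bigrams(s)) >= threshold]


print(brute_force_search("foo", ["foo", "bar", "fooo"], 0.8))
# ['foo', 'fooo']
```

The scan costs one similarity computation per stored string, which is the cost a fast retrieval algorithm aims to avoid while still returning the same result set.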
Install
pip install simstring-fast
Usage
from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure import CosineMeasure
from simstring.database import DictDatabase
from simstring.searcher import Searcher
db = DictDatabase(CharacterNgramFeatureExtractor(2))
db.add('foo')
db.add('bar')
db.add('fooo')
searcher = Searcher(db, CosineMeasure())
results = searcher.search('foo', 0.8)
print(results)
# => ['foo', 'fooo']
To use a different feature extractor, similarity measure, or database, simply swap in the corresponding classes. You can also substitute your own implementations.
from simstring.feature_extractor.word_ngram import WordNgramFeatureExtractor
from simstring.measure import JaccardMeasure
from simstring.database import DictDatabase
from simstring.searcher import Searcher
db = DictDatabase(WordNgramFeatureExtractor(2))
db.add('You are so cool.')
searcher = Searcher(db, JaccardMeasure())
results = searcher.search('You are cool.', 0.8)
print(results)
Supported String Similarity Measures
- Cosine
- Dice
- Jaccard
- Overlap
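For reference, the four measures can be written in their standard set-based forms, where X and Y are the feature sets of the two strings. This is a standalone sketch with hypothetical function names, not the library's own measure classes, and it assumes both feature sets are non-empty.

```python
import math
from typing import Set


def cosine(x: Set[str], y: Set[str]) -> float:
    """|X ∩ Y| / sqrt(|X| * |Y|)"""
    return len(x & y) / math.sqrt(len(x) * len(y))


def dice(x: Set[str], y: Set[str]) -> float:
    """2 * |X ∩ Y| / (|X| + |Y|)"""
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x: Set[str], y: Set[str]) -> float:
    """|X ∩ Y| / |X ∪ Y|"""
    return len(x & y) / len(x | y)


def overlap(x: Set[str], y: Set[str]) -> float:
    """|X ∩ Y| / min(|X|, |Y|)"""
    return len(x & y) / min(len(x), len(y))


x = {"fo", "oo"}
y = {"fo", "oo", "ob"}
for measure in (cosine, dice, jaccard, overlap):
    print(measure.__name__, round(measure(x, y), 3))
# cosine 0.816
# dice 0.8
# jaccard 0.667
# overlap 1.0
```

For the same pair of feature sets, Jaccard gives the lowest score and overlap the highest, so a threshold such as 0.8 is stricter under Jaccard than under the other measures.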