
A fork of Katsuma Narisawa's Python implementation of SimString, a simple and efficient algorithm for approximate string matching. Uses mypyc to improve speed.


simstring

License: MIT


A Python implementation of SimString, a simple and efficient algorithm for approximate string matching.

Docs are here

Features

With this library, you can extract strings or texts that reach a given similarity to a query from a large collection of strings or texts. It is useful when you develop applications related to language processing.

This library supports a variety of similarity functions, such as cosine similarity and Jaccard similarity, and supports word n-grams and character n-grams as features. You can also implement your own feature extractor easily.
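To make the two feature types concrete, here is a minimal sketch of character and word n-gram extraction in plain Python. The function names and the `$` padding marker are assumptions made for illustration; they are not taken from this package, whose own extractors may pad or tokenize differently.

```python
def char_ngrams(text, n=2, pad="$"):
    """Return the character n-grams of text (illustrative helper).

    One pad marker is added on each side so that the first and last
    characters also appear in an n-gram. The marker is an assumption.
    """
    padded = pad + text + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def word_ngrams(text, n=2):
    """Return the word n-grams of text as tuples, splitting on whitespace."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


print(char_ngrams("foo"))
# ['$f', 'fo', 'oo', 'o$']
print(word_ngrams("You are so cool."))
# [('You', 'are'), ('are', 'so'), ('so', 'cool.')]
```

Each string is then compared by how many of these features it shares with the query.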

SimString has the following features:

  • Fast algorithm for approximate string retrieval.
  • 100% exact retrieval. Although some algorithms allow misses (false negatives) for faster query response, SimString is guaranteed to achieve 100% correct retrieval with fast query response.
  • Unicode support.
  • Extensibility. You can implement your own feature extractor easily.
  • No Japanese support.

Please see this paper for more details.
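The exact-retrieval guarantee in the list above rests on bounds that the SimString algorithm derives from the similarity threshold. The sketch below shows those bounds for cosine similarity: given a query with a known number of features and a threshold alpha, only candidates whose feature count falls in a fixed range can match, and each must share a minimum number of features with the query. The function names are hypothetical, and the code illustrates the idea only; it is not this package's internal implementation.

```python
import math


def cosine_size_bounds(query_size, alpha):
    """Smallest and largest candidate feature counts that can reach alpha.

    For cosine similarity, a candidate Y can match a query X only if
    ceil(alpha^2 * |X|) <= |Y| <= floor(|X| / alpha^2).
    """
    lower = math.ceil(alpha * alpha * query_size)
    upper = math.floor(query_size / (alpha * alpha))
    return lower, upper


def cosine_min_overlap(query_size, candidate_size, alpha):
    """Minimum number of shared features needed for cosine >= alpha."""
    return math.ceil(alpha * math.sqrt(query_size * candidate_size))


# A query with 4 features at threshold 0.8 can only match strings
# that have between 3 and 6 features.
print(cosine_size_bounds(4, 0.8))     # (3, 6)
# A 5-feature candidate must share at least 4 features with the query.
print(cosine_min_overlap(4, 5, 0.8))  # 4
```

Because no string outside these bounds can reach the threshold, skipping them speeds up the search without losing any matching result.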

Install

pip install simstring-fast

Usage

from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure.cosine import CosineMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher

db = DictDatabase(CharacterNgramFeatureExtractor(2))
db.add('foo')
db.add('bar')
db.add('fooo')

searcher = Searcher(db, CosineMeasure())
results = searcher.search('foo', 0.8)
print(results)
# => ['foo', 'fooo']
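For intuition about why 'foo' and 'fooo' are returned but 'bar' is not, the following brute-force check recomputes the similarities without the library. It assumes character bigrams padded with `$` and treats the features as a plain set, so repeated bigrams collapse; the helper names are hypothetical, and the library's own feature handling may differ in detail.

```python
import math


def bigrams(text):
    """Set of padded character bigrams (illustrative; '$' padding is assumed)."""
    padded = "$" + text + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def cosine(x, y):
    """Cosine similarity between two feature sets."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))


def brute_force_search(strings, query, threshold):
    """Compare the query against every string; slow but easy to verify."""
    query_features = bigrams(query)
    return [s for s in strings if cosine(query_features, bigrams(s)) >= threshold]


print(brute_force_search(["foo", "bar", "fooo"], "foo", 0.8))
# ['foo', 'fooo']
```

'bar' shares no bigram with 'foo', so its similarity is 0 and it falls below the 0.8 threshold.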

To use a different feature extractor, measure, or database, simply replace these classes. You can also substitute your own classes.

from simstring.feature_extractor.word_ngram import WordNgramFeatureExtractor
from simstring.measure.jaccard import JaccardMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher

db = DictDatabase(WordNgramFeatureExtractor(2))
db.add('You are so cool.')

searcher = Searcher(db, JaccardMeasure())
results = searcher.search('You are cool.', 0.8)
print(results)

Supported String Similarity Measures

  • Cosine
  • Dice
  • Jaccard
  • Overlap
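As a reference for how the four measures above differ, here is a small sketch of their standard set-based definitions in plain Python. The function names are chosen for illustration and are not the package's measure classes; both inputs are assumed to be non-empty feature sets.

```python
import math


def cosine(x, y):
    """|X ∩ Y| / sqrt(|X| * |Y|)"""
    return len(x & y) / math.sqrt(len(x) * len(y))


def dice(x, y):
    """2 * |X ∩ Y| / (|X| + |Y|)"""
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y|"""
    return len(x & y) / len(x | y)


def overlap(x, y):
    """|X ∩ Y| / min(|X|, |Y|)"""
    return len(x & y) / min(len(x), len(y))


x = {"a", "b", "c"}
y = {"b", "c", "d", "e"}
for measure in (cosine, dice, jaccard, overlap):
    print(measure.__name__, round(measure(x, y), 3))
# cosine 0.577
# dice 0.571
# jaccard 0.4
# overlap 0.667
```

For the same pair of sets, Jaccard gives the lowest score and overlap the highest, so a threshold such as 0.8 is stricter or looser depending on the measure.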

Download files

Source Distribution

  • simstring_fast-0.2.6.tar.gz (7.2 kB)

Built Distributions

  • simstring_fast-0.2.6-cp311-cp311-manylinux_2_35_x86_64.whl (234.3 kB): CPython 3.11, manylinux (glibc 2.35+), x86-64
  • simstring_fast-0.2.6-cp310-cp310-manylinux_2_35_x86_64.whl (235.6 kB): CPython 3.10, manylinux (glibc 2.35+), x86-64
  • simstring_fast-0.2.6-cp39-cp39-manylinux_2_35_x86_64.whl (235.4 kB): CPython 3.9, manylinux (glibc 2.35+), x86-64
  • simstring_fast-0.2.6-cp38-cp38-manylinux_2_35_x86_64.whl (231.7 kB): CPython 3.8, manylinux (glibc 2.35+), x86-64
