A fork of Katsuma Narisawa's Python implementation of SimString, a simple and efficient algorithm for approximate string matching. Uses mypyc to improve speed.

simstring

A Python implementation of SimString, a simple and efficient algorithm for approximate string matching.

Features

With this library, you can extract the strings or texts that meet a given similarity threshold from a large collection of strings or texts. It is useful when developing language-processing applications.

This library supports a variety of similarity functions, such as cosine similarity and Jaccard similarity, and supports word n-grams and character n-grams as features. You can also implement your own feature extractor easily.
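To make the two feature types concrete, here is a minimal standalone sketch of character and word n-gram extraction. It does not import the library: the function names, the `$` padding character, and the `features` method on the small class are illustrative assumptions, not the library's confirmed internals.

```python
from typing import List


def char_ngrams(text: str, n: int = 2, pad: str = "$") -> List[str]:
    """Return character n-grams, padding both ends so edges form full n-grams."""
    padded = pad * (n - 1) + text + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def word_ngrams(text: str, n: int = 2) -> List[str]:
    """Return word n-grams, splitting the text on whitespace."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


class LowercaseCharNgramExtractor:
    """Hypothetical custom extractor: case-insensitive character n-grams."""

    def __init__(self, n: int = 2) -> None:
        self.n = n

    def features(self, text: str) -> List[str]:
        # Normalize case first, then reuse the plain character n-gram helper.
        return char_ngrams(text.lower(), self.n)


print(char_ngrams("foo"))                # ['$f', 'fo', 'oo', 'o$']
print(word_ngrams("You are so cool."))   # ['You are', 'are so', 'so cool.']
print(LowercaseCharNgramExtractor(2).features("Foo"))  # ['$f', 'fo', 'oo', 'o$']
```

Padding is one common way to let the first and last characters contribute their own n-grams; whether the library pads in exactly this way should be checked against its source.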

SimString has the following features:

Fast algorithm for approximate string retrieval.
100% exact retrieval. Although some algorithms allow misses (false negatives) in exchange for faster query response, SimString is guaranteed to return every string that meets the threshold while still responding quickly.
Unicode support.
Extensibility. You can implement your own feature extractor easily.
No Japanese support.

Please see this paper for more details.
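The speed and exactness claims above rest on a pruning idea from the SimString formulation: for a given measure and threshold, only strings whose feature-set size falls in a bounded range, and which share at least a minimum number of features with the query, can possibly match. The sketch below writes out those bounds for cosine similarity. The function names are hypothetical and this is not the library's code.

```python
import math
from typing import Tuple


def cosine_size_bounds(query_size: int, alpha: float) -> Tuple[int, int]:
    """Feature-set sizes a candidate must have to reach cosine >= alpha.

    Since |X & Y| <= min(|X|, |Y|), cosine >= alpha forces
    alpha^2 * |X| <= |Y| <= |X| / alpha^2.
    """
    smallest = math.ceil(alpha * alpha * query_size)
    largest = math.floor(query_size / (alpha * alpha))
    return smallest, largest


def cosine_min_overlap(query_size: int, cand_size: int, alpha: float) -> int:
    """Minimum number of shared features needed for cosine >= alpha."""
    return math.ceil(alpha * math.sqrt(query_size * cand_size))


print(cosine_size_bounds(4, 0.8))     # (3, 6)
print(cosine_min_overlap(4, 5, 0.8))  # 4
```

With a query of 4 features and a threshold of 0.8, candidates with fewer than 3 or more than 6 features can be skipped without computing any similarity, which is why pruning does not cost any correct results.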

Install

pip install simstring-fast

Usage

from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure.cosine import CosineMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher

db = DictDatabase(CharacterNgramFeatureExtractor(2))
db.add('foo')
db.add('bar')
db.add('fooo')

searcher = Searcher(db, CosineMeasure())
results = searcher.search('foo', 0.8)
print(results)
# => ['foo', 'fooo']
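As a point of reference for what a threshold search returns, the following is a brute-force sketch that compares the query against every stored string. It is a standalone illustration, not how the library searches: the helper names are hypothetical, features are treated as plain sets of `$`-padded character bigrams, and no index or pruning is used.

```python
import math
from typing import Iterable, List, Set


def bigrams(text: str) -> Set[str]:
    """Set of character bigrams of the text, padded with '$' at both ends."""
    padded = "$" + text + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def cosine(x: Set[str], y: Set[str]) -> float:
    """Cosine similarity between two feature sets (0.0 if either is empty)."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))


def brute_force_search(query: str, strings: Iterable[str], alpha: float) -> List[str]:
    """Return every string whose similarity to the query is at least alpha."""
    q = bigrams(query)
    return [s for s in strings if cosine(q, bigrams(s)) >= alpha]


print(brute_force_search("foo", ["foo", "bar", "fooo"], 0.8))
# ['foo', 'fooo']
```

Because the sketch uses sets, the repeated `oo` in `fooo` collapses into one feature; an implementation that keeps repeated n-grams distinct would score `fooo` slightly lower, though still above 0.8 here.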

To use a different feature extractor, measure, or database, simply swap in the corresponding class. You can also substitute your own classes.

from simstring.feature_extractor.word_ngram import WordNgramFeatureExtractor
from simstring.measure.jaccard import JaccardMeasure
from simstring.database.mongo import MongoDatabase
from simstring.searcher import Searcher

db = MongoDatabase(WordNgramFeatureExtractor(2))
db.add('You are so cool.')

searcher = Searcher(db, JaccardMeasure())
results = searcher.search('You are cool.', 0.8)
print(results)

Supported String Similarity Measures

Cosine
Dice
Jaccard
Overlap
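For reference, the four measures can be written in their standard set-based form as below. This is a standalone sketch with hypothetical function names, assuming both feature sets are non-empty; it is not the library's implementation.

```python
import math
from typing import Set


def cosine(x: Set[str], y: Set[str]) -> float:
    return len(x & y) / math.sqrt(len(x) * len(y))


def dice(x: Set[str], y: Set[str]) -> float:
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x: Set[str], y: Set[str]) -> float:
    return len(x & y) / len(x | y)


def overlap(x: Set[str], y: Set[str]) -> float:
    return len(x & y) / min(len(x), len(y))


# Two small feature sets sharing 2 features ("bc" and "cd").
x = {"ab", "bc", "cd"}
y = {"bc", "cd", "de", "ef"}

print(round(cosine(x, y), 3))   # 0.577
print(round(dice(x, y), 3))     # 0.571
print(round(jaccard(x, y), 3))  # 0.4
print(round(overlap(x, y), 3))  # 0.667
```

The same pair of strings can fall above a threshold under one measure and below it under another, so the threshold passed to a search is only meaningful relative to the chosen measure.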

Run Tests

docker-compose run main bash -c 'source activate simstring && python -m pytest'

Benchmark

SWIG bindings of simstring achieve:
About 1 ms to search strings from 5797 strings (company names).
About 14 ms to search strings from 235544 strings (unabridged dictionary).
However, there are some odd bugs in the original implementation, and its results do not always agree with the implementation here.
Adding mypyc halved the benchmark time on my system; your mileage may vary.

search from `dev/data/company_names.txt`

$ python dev/benchmark.py
benchmark for using dict as database
## benchmarker:         release 4.0.1 (for python)
## python version:      3.7.0
## python compiler:     GCC 7.2.0
## python platform:     Linux-4.9.87-linuxkit-aufs-x86_64-with-debian-9.4
## python executable:   /opt/conda/envs/simstring/bin/python
## cpu model:           Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz  # 3300.000 MHz
## parameters:          loop=1, cycle=1, extra=0

##                        real    (total    = user    + sys)
initialize database(5797 lines)    0.1227    0.1200    0.1200    0.0000
search text(5797 times)    6.9719    6.9400    6.8900    0.0500

## Ranking                real
initialize database(5797 lines)    0.1227  (100.0) ********************
search text(5797 times)    6.9719  (  1.8)

## Matrix                 real    [01]    [02]
[01] initialize database(5797 lines)    0.1227   100.0  5680.9
[02] search text(5797 times)    6.9719     1.8   100.0

benchmark for using Mongo as database
## benchmarker:         release 4.0.1 (for python)
## python version:      3.7.0
## python compiler:     GCC 7.2.0
## python platform:     Linux-4.9.87-linuxkit-aufs-x86_64-with-debian-9.4
## python executable:   /opt/conda/envs/simstring/bin/python
## cpu model:           Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz  # 3300.000 MHz
## parameters:          loop=1, cycle=1, extra=0

##                        real    (total    = user    + sys)
initialize database(5797 lines)    4.5762    2.4900    1.9200    0.5700
search text(5797 times)  177.8401   60.9100   47.2500   13.6600

## Ranking                real
initialize database(5797 lines)    4.5762  (100.0) ********************
search text(5797 times)  177.8401  (  2.6) *

## Matrix                 real    [01]    [02]
[01] initialize database(5797 lines)    4.5762   100.0  3886.2
[02] search text(5797 times)  177.8401     2.6   100.0

search from `dev/data/unabridged_dictionary.txt`

$ python dev/benchmark.py
benchmark for using dict as database
## benchmarker:         release 4.0.1 (for python)
## python version:      3.7.0
## python compiler:     GCC 7.2.0
## python platform:     Linux-4.9.87-linuxkit-aufs-x86_64-with-debian-9.4
## python executable:   /opt/conda/envs/simstring/bin/python
## cpu model:           Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz  # 3300.000 MHz
## parameters:          loop=1, cycle=1, extra=0

##                        real    (total    = user    + sys)
initialize database(235544 lines)    2.2576    2.2300    2.1200    0.1100
search text(10000 times)  141.0302  140.6400  139.9600    0.6800

## Ranking                real
initialize database(235544 lines)    2.2576  (100.0) ********************
search text(10000 times)  141.0302  (  1.6)

## Matrix                 real    [01]    [02]
[01] initialize database(235544 lines)    2.2576   100.0  6246.8
[02] search text(10000 times)  141.0302     1.6   100.0
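The `real` column above reports total seconds for all searches in a run. A short worked calculation turns those totals into an average time per search; the figures are copied from the output above, and the helper name is hypothetical.

```python
def per_search_ms(total_seconds: float, searches: int) -> float:
    """Average wall-clock time per search, in milliseconds."""
    return total_seconds / searches * 1000


# (label, total real seconds, number of searches) taken from the runs above.
runs = [
    ("dict, company_names.txt", 6.9719, 5797),
    ("Mongo, company_names.txt", 177.8401, 5797),
    ("dict, unabridged_dictionary.txt", 141.0302, 10000),
]

for label, total, searches in runs:
    print(f"{label}: {per_search_ms(total, searches):.2f} ms per search")
# dict, company_names.txt: 1.20 ms per search
# Mongo, company_names.txt: 30.68 ms per search
# dict, unabridged_dictionary.txt: 14.10 ms per search
```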

Release history

0.3.0 (Oct 24, 2023)
0.2.7 (Sep 1, 2023)
0.2.6 (May 12, 2023)
0.2.4 (May 12, 2023)
0.2.0 (May 12, 2023)
0.1.7 (Feb 14, 2023)
0.1.6 (Feb 14, 2023)
0.1.5 (Feb 13, 2023)
0.1.4 (Nov 22, 2022)
0.1.0 (Jun 13, 2022)
0.0.2 (Jun 13, 2022), this version
0.0.1 (Dec 14, 2020)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

simstring-fast-0.0.2.tar.gz (10.6 kB)

Uploaded Jun 13, 2022 Source

Built Distribution

simstring_fast-0.0.2-cp37-cp37m-win_amd64.whl (179.1 kB)

Uploaded Jun 13, 2022 CPython 3.7m Windows x86-64

Hashes for simstring-fast-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`1ee52662ef6be86e75ffc274e50a0c51c840e299fa000f13816846378eb4dec5`
MD5	`abb44f58f47cdba902b76d3979ffc452`
BLAKE2b-256	`5b9f9dee5a02f9e41564f431126664b951bf17ded59874d906c60a14ba3956b8`

Hashes for simstring_fast-0.0.2-cp37-cp37m-win_amd64.whl
Algorithm	Hash digest
SHA256	`ba639a1d6666cd6ddfa8d7ada4f06f2f318d4c507280628e1923f0a13fae8ec1`
MD5	`494445e1e7ba9c0d3cda487330635df0`
BLAKE2b-256	`4265e01b3d8c5ff73108756a4680fe14a1861e7ab72a30aa2f516fba74e60f71`
