Character trigram fuzzy set.

Project description

Character trigram fuzzy set implementation providing cosine similarity-based fuzzy matching.

This library does that one thing on iterables of strings. Anything beyond that (Levenshtein distance, scoring, bigram fallback, etc.) is left as an exercise for the reader.
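To make the idea of trigram-based cosine matching concrete, here is a minimal sketch in plain Python. It is not this library's implementation: the helper names `trigrams`, `cosine_similarity`, and `best_match` are hypothetical, and the linear scan over candidates is an assumption chosen for readability rather than speed.

```python
from collections import Counter
from math import sqrt


def trigrams(text):
    """Count the overlapping 3-character substrings of text (hypothetical helper)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_similarity(a, b):
    """Cosine of the angle between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # strings shorter than three characters have no trigrams
    return dot / (norm_a * norm_b)


def best_match(query, candidates):
    """Return the candidate whose trigram vector is closest to the query's."""
    query_grams = trigrams(query)
    return max(candidates, key=lambda c: cosine_similarity(query_grams, trigrams(c)))


print(best_match('bryan', ['brian', 'bryant', 'ryan']))
```

Here 'bryan' and 'bryant' share three trigrams (bry, rya, yan), giving a similarity of about 0.87, while 'bryan' and 'brian' share none and score 0. A real implementation would typically precompute the trigram vectors once rather than on every lookup.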

Usage

import os.path
from timeit import timeit
import requests

# Retrieve a file containing around 470,000 English words,
# downloading it only if it is not already cached locally
url = 'https://github.com/dwyl/english-words/raw/master/words.txt'
words_path = os.path.expanduser('~/words.txt')
if not os.path.isfile(words_path):
    r = requests.get(url, stream=True)
    r.raise_for_status()  # fail early rather than saving an error page
    with open(words_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

# Build the fuzzy set from the word list and look up a query
import charactertrigramfuzzyset as ctfs
with open(words_path, 'r') as f:
    items = [line.rstrip() for line in f]
fs = ctfs.CharacterTrigramFuzzySet(items)
fs.get('bryan')

# Profiling: timeit returns the total seconds for 1000 calls, so the result reads
# directly as milliseconds per call (generally around 10-20 ms on my machine)
timeit("fs.get('bryan')", setup='''
import charactertrigramfuzzyset as ctfs
items = [line.rstrip() for line in open('{words_path}', 'r')]
fs = ctfs.CharacterTrigramFuzzySet(items)
'''.format(words_path=words_path), number=1000)
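Since `timeit` reports a total rather than a per-call figure, a small wrapper can make the conversion explicit. The following is a sketch only: `per_call_ms` is a hypothetical helper, and the statement being timed is a stand-in workload so the example runs without the word list or the library installed.

```python
from timeit import repeat


def per_call_ms(stmt, setup='pass', number=1000, rounds=3, namespace=None):
    """Return the best average time per call in milliseconds (hypothetical helper)."""
    totals = repeat(stmt, setup=setup, repeat=rounds, number=number, globals=namespace)
    # Each total covers `number` calls; the minimum is the least noisy round.
    return min(totals) / number * 1000.0


# Stand-in workload; with the library installed, one could instead pass
# "fs.get('bryan')" together with namespace={'fs': fs}.
ms = per_call_ms('sorted(range(100))', number=1000)
print('{:.4f} ms per call'.format(ms))
```

Taking the minimum over several rounds follows the usual advice for `timeit`, since slower rounds mostly reflect interference from other processes.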

Release history

0.0.2 (this version)

0.0.1

Download files

Download the file for your platform.

Filename (size)                                           File type  Python version  Upload date
charactertrigramfuzzyset-0.0.2-py3-none-any.whl (2.8 kB)  Wheel      3.6             Apr 29, 2018
charactertrigramfuzzyset-0.0.2.tar.gz (3.4 kB)            Source     None            Apr 29, 2018
