
Find similar words and sentences using Levenshtein distance

Project description

Similar string search using Levenshtein distance

Install

From source

git clone https://github.com/lovit/levenshtein_finder
cd levenshtein_finder
python setup.py install

From PyPI

pip install levenshtein_finder

Usage

Define Tokenizer

levenshtein_finder provides two basic tokenizers.

In [1]

from levenshtein_finder import CharacterTokenizer, BigramTokenizer

char_tokenizer = CharacterTokenizer()
bigram_tokenizer = BigramTokenizer()

char_tokenizer.tokenize("abc de")  # ['a', 'b', 'c', ' ', 'd', 'e']
bigram_tokenizer.tokenize("abc de")  # ['ab', 'bc', 'c ', ' d', 'de']

char_tokenizer.train("path/to/text")  # or
char_tokenizer.train(["list", "of", "words"])
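
For intuition, the bigram tokenization above can be sketched in a few lines of plain Python (an illustrative reimplementation, not the library's code):

# Illustrative sketch of overlapping character bigrams (not the library's code)
def bigram_tokenize(text):
    return [text[i : i + 2] for i in range(len(text) - 1)]

bigram_tokenize("abc de")  # ['ab', 'bc', 'c ', ' d', 'de']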

When you use a WordPiece tokenizer from Hugging Face's tokenizers library, wrap the tokenizer instance with WordpieceTokenizersWrapper.

In [2]

from tokenizers import BertWordPieceTokenizer
from levenshtein_finder import WordpieceTokenizersWrapper

tokenizer = WordpieceTokenizersWrapper(BertWordPieceTokenizer())
tokenizer.train("path/to/text")
tokenizer.tokenize("abc")  # ["ab", "#c"]
tokenizer.encode("abc")  # [53, 2]
tokenizer.decode(tokenizer.tokenize("abc"))  # "abc"
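
The wrapper adapts the Hugging Face tokenizer to the tokenize/encode/decode interface the finder expects. A hypothetical sketch of that adapter (the actual WordpieceTokenizersWrapper implementation may differ):

# Hypothetical adapter sketch; the real WordpieceTokenizersWrapper may differ.
class WordpieceWrapperSketch:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer               # e.g. BertWordPieceTokenizer()

    def train(self, path):
        self.tokenizer.train([path])             # tokenizers' train() takes file paths

    def tokenize(self, text):
        return self.tokenizer.encode(text).tokens  # subword strings

    def encode(self, text):
        return self.tokenizer.encode(text).ids     # integer token ids

    def decode(self, tokens):
        return "".join(t.lstrip("#") for t in tokens)  # drop wordpiece continuation marks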

Set the tokenizer that LevenshteinFinder uses. The default is CharacterTokenizer.

In [3]

from levenshtein_finder import LevenshteinFinder

finder = LevenshteinFinder()
finder = LevenshteinFinder(BigramTokenizer())
finder = LevenshteinFinder(WordpieceTokenizersWrapper(BertWordPieceTokenizer()))

Indexing

In [4]

finder = LevenshteinFinder()
finder.indexing("path/to/text")

# when you use CharacterTokenizer or BigramTokenizer, you can also index a list of words
finder.indexing(["list", "of", "words"])
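
Indexing is what makes search fast: presumably the finder builds an inverted index from tokens to the strings that contain them, so exact distances are only computed for strings sharing tokens with the query (the "1st candidates" in the verbose output below). A minimal sketch of that idea, under the stated assumption:

from collections import defaultdict

# Sketch of inverted-index candidate retrieval (assumed internals,
# not necessarily the library's actual implementation).
def build_index(words, tokenize):
    index = defaultdict(set)
    for idx, word in enumerate(words):
        for token in tokenize(word):
            index[token].add(idx)    # token -> ids of words containing it
    return index

def first_candidates(query, index, tokenize):
    candidates = set()
    for token in tokenize(query):
        candidates |= index.get(token, set())
    return candidates                # only these get an exact distance computation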

Searching

Basic search. The returned similars is a dict of the form {term: distance}.

In [5]

finder.search("query")
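
You can also cap the allowed edit distance with max_distance (used again in the sentence demo below):

finder.search("query", max_distance=1)  # keep only terms within edit distance 1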

If you want to inspect the search details, set verbose=True.

In [6]

finder.search("분식회계", verbose=True)

Out [6]

query               : 분식회계
tokens              : ['분', '식', '회', '계']
num data            : 132864
num 1st candidates  : 10137
num final candidates: 7
num similars        : 4
elapsed time        : 0.00542903 sec
[{'idx': 84243, 'data': '분식회계', 'distance': 0},
 {'idx': 36211, 'data': '분식회', 'distance': 1},
 {'idx': 113942, 'data': '분식회계설', 'distance': 1},
 {'idx': 114122, 'data': '분석회계', 'distance': 1}]

Decomposing Korean into jamo (NFKD)

To decompose a Korean character into its jamo components (e.g., 분 -> [ᄇ, ᅮ, ᆫ]), use Normalizers.

In [7]

from levenshtein_finder import Normalizers

normalizer = Normalizers.create_normalizer(unicodedata=True)
tokenizer = CharacterTokenizer(normalizer=normalizer)
finder = LevenshteinFinder(tokenizer=tokenizer)
finder.indexing("text.txt")
finder.search("분식회계", verbose=True)

Out [7]

query               : 분식회계
tokens              : ['ᄇ', 'ᅮ', 'ᆫ', 'ᄉ', 'ᅵ', 'ᆨ', 'ᄒ', 'ᅬ', 'ᄀ', 'ᅨ']
num data            : 132864
num 1st candidates  : 125925
num final candidates: 4
num similars        : 2
elapsed time        : 0.0863643 sec
[{'idx': 84243, 'data': '분식회계', 'distance': 0},
 {'idx': 114122, 'data': '분석회계', 'distance': 1}]
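
The same decomposition is reproducible with Python's standard library, which is what the unicodedata=True option suggests:

import unicodedata

# NFKD splits each Hangul syllable into its jamo components.
list(unicodedata.normalize("NFKD", "분식회계"))
# ['ᄇ', 'ᅮ', 'ᆫ', 'ᄉ', 'ᅵ', 'ᆨ', 'ᄒ', 'ᅬ', 'ᄀ', 'ᅨ']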

Demo

Searching similar words

LevenshteinFinder needs about 0.005 seconds.

In [8]

import json
with open("data/nouns_from_financial_news.json") as f:
    words = list(json.load(f).keys())

finder = LevenshteinFinder()
finder.indexing(words)
finder.search("분식회계", verbose=True)

Out [8]

elapsed time        : 0.00542903 sec

Brute-force distance computation over the same words needs about 1.495 seconds.

In [9]

import time
from levenshtein_finder import levenshtein

query = '분식회계'

begin_time = time.time()
distance = {word: levenshtein(word, query) for word in words}
search_time = time.time() - begin_time
print(f"search time = {search_time:.3f} sec")

similars = sorted(filter(lambda x: x[1] <= 1, distance.items()), key=lambda x: x[1])
elapsed_time = time.time() - begin_time
print(f"total elapsed time = {elapsed_time:.3f} sec")
print(similars)

Out [9]

search time = 1.480 sec
total elapsed time = 1.495 sec
[('분식회계', 0), ('분식회', 1), ('분식회계설', 1), ('분석회계', 1)]
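
For reference, the distance used above follows the classic dynamic-programming recurrence. A minimal sketch (levenshtein_finder's own implementation may differ):

# Textbook two-row DP for Levenshtein distance (illustrative sketch).
def levenshtein_sketch(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (cs != ct),  # substitution
            ))
        prev = curr
    return prev[-1]

levenshtein_sketch("분식회계", "분석회계")  # 1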

Searching similar strings

Find similar sentences in the NSMC corpus (200,002 sentences).

In [10]

finder = LevenshteinFinder()
finder.indexing("~/Korpora/nsmc/corpus.txt")
finder.search("음악이 주가 된 최고의 음악영화", max_distance=1, verbose=True)

Out [10]

query               : 음악이 주가 된 최고의 음악영화
tokens              : ['음', '악', '이', ' ', '주', '가', ' ', '된', ' ', '최', '고', '의', ' ', '음', '악', '영', '화']
num data            : 200002
num 1st candidates  : 189663
num final candidates: 1
num similars        : 1
elapsed time        : 0.123322 sec
[{'idx': 6, 'data': '음악이 주가 된, 최고의 음악영화', 'distance': 1}]
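
Assuming idx is the line offset in the indexed file (consistent with the output above), the matched sentence can be recovered directly:

import os

# Hypothetical lookup of a match by its idx (one sentence per line assumed).
path = os.path.expanduser("~/Korpora/nsmc/corpus.txt")
with open(path) as f:
    lines = [line.rstrip("\n") for line in f]
lines[6]  # '음악이 주가 된, 최고의 음악영화'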

Download files

Download the file for your platform.

Source Distribution

levenshtein_finder-1.0.1.tar.gz (7.1 kB)

Uploaded Source

Built Distribution

levenshtein_finder-1.0.1-py3-none-any.whl (10.6 kB)

Uploaded Python 3

File details

Details for the file levenshtein_finder-1.0.1.tar.gz.

File metadata

  • Download URL: levenshtein_finder-1.0.1.tar.gz
  • Upload date:
  • Size: 7.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.7.9

File hashes

Hashes for levenshtein_finder-1.0.1.tar.gz

  • SHA256: 31539ac616f41f87d765501befa55ddf5390eadb0f4405397aba64c86d83dc1c
  • MD5: ecee4a09253dbb85a41526363dc201c6
  • BLAKE2b-256: b1a2a752dbf3eb467dd99402056508e680ab22f4256558bd21bdc32f40d41b84

File details

Details for the file levenshtein_finder-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: levenshtein_finder-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 10.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.7.9

File hashes

Hashes for levenshtein_finder-1.0.1-py3-none-any.whl

  • SHA256: be895db740fac66a185a8fbe79bd81a7be48caa0548aabfc0c0b81a60c8fafa6
  • MD5: 56ecd3935e269b2ab57d121bd29054cc
  • BLAKE2b-256: 627d23cfd25ac805f4ea30adbabeb8c5cb518a7d29e44eb1275b52f306667314
