Python SymSpell

symspellpy

symspellpy is a Python port of SymSpell v6.5. The original SymSpell v6.5 provides much higher speed and lower memory consumption; this port, however, has not been optimized for speed. Unit tests from the original project are implemented to ensure the accuracy of the port.

Usage

Installing the symspellpy module

pip install -U symspellpy

Copying the frequency dictionary to your project

Copy frequency_dictionary_en_82_765.txt and frequency_bigramdictionary_en_243_342.txt (found in the inner symspellpy directory) to your project directory so you end up with the following layout:

project_dir
  +-frequency_dictionary_en_82_765.txt
  +-frequency_bigramdictionary_en_243_342.txt
  \-project.py

Adding new terms

  • Use load_dictionary(corpus=<path/to/dictionary.txt>, term_index=<term_index>, count_index=<count_index>), where term_index is the zero-based column number of the terms and count_index is the zero-based column number of the counts/frequencies. dictionary.txt should contain one entry per line:
<term> <count>
<term> <count>
...
<term> <count>

  • Append <term> <count> to the provided frequency_dictionary_en_82_765.txt
  • Use the method create_dictionary_entry(key=<term>, count=<count>)
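
To make term_index and count_index concrete, here is a minimal sketch of how one line of such a file could be split into a term and a count. parse_frequency_line is a hypothetical helper written for illustration, not part of symspellpy, and the sample lines and counts are made up.

```python
def parse_frequency_line(line, term_index, count_index, separator=" "):
    """Split one dictionary line and pick the term and count columns.

    Hypothetical helper for illustration only; not symspellpy's code.
    Returns None for lines that are too short or have a non-integer count.
    """
    parts = line.rstrip().split(separator)
    if len(parts) <= max(term_index, count_index):
        return None
    try:
        return parts[term_index], int(parts[count_index])
    except ValueError:
        return None


# Made-up sample lines in the "<term> <count>" layout.
lines = ["hello 120", "world 95", "malformed"]
entries = [parse_frequency_line(line, term_index=0, count_index=1)
           for line in lines]
print(entries)
```

With term_index=0 and count_index=1, the first column is read as the term and the second as its count; a line that lacks the count column is skipped in this sketch.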

Sample usage (create_dictionary)

from symspellpy.symspellpy import SymSpell  # import the module

def main():
    # maximum edit distance per dictionary precalculation
    max_edit_distance_dictionary = 2
    prefix_length = 7
    # create object
    sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)

    # create dictionary using corpus.txt
    corpus_path = "<path/to/corpus.txt>"  # replace with the path to your corpus
    if not sym_spell.create_dictionary(corpus_path):
        print("Corpus file not found")
        return

    for key, count in sym_spell.words.items():
        print("{} {}".format(key, count))

if __name__ == "__main__":
    main()

corpus.txt should contain:

abc abc-def abc_def abc'def abc qwe qwe1 1qwe q1we 1234 1234

Expected output:

abc 4
def 2
abc'def 1
qwe 1
qwe1 1
1qwe 1
q1we 1
1234 2
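
The counts above suggest that hyphens and underscores split words while apostrophes and digits do not. The sketch below reproduces the expected output with a regular expression chosen to match that behaviour. The pattern and the count_words helper are assumptions made for illustration, and symspellpy's own tokenizer may differ.

```python
import re
from collections import Counter

# Assumed token pattern: runs of letters, digits, and apostrophes.
# Hyphens, underscores, and whitespace act as separators.
TOKEN_PATTERN = re.compile(r"(?:[^\W_]|['’])+")


def count_words(text):
    """Count lowercased tokens in text, keeping first-seen order."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))


corpus = "abc abc-def abc_def abc'def abc qwe qwe1 1qwe q1we 1234 1234"
for word, count in count_words(corpus).items():
    print("{} {}".format(word, count))
```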

Sample usage (lookup and lookup_compound)

Using project.py (code is more verbose than required to allow explanation of method arguments)

import pkg_resources

from symspellpy.symspellpy import SymSpell, Verbosity  # import the module

def main():
    # maximum edit distance per dictionary precalculation
    max_edit_distance_dictionary = 2
    prefix_length = 7
    # create object
    sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
    # load dictionary
    dictionary_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_dictionary_en_82_765.txt")
    bigram_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_bigramdictionary_en_243_342.txt")
    # term_index is the column of the term and count_index is the
    # column of the term frequency
    if not sym_spell.load_dictionary(dictionary_path, term_index=0,
                                     count_index=1):
        print("Dictionary file not found")
        return
    if not sym_spell.load_bigram_dictionary(bigram_path, term_index=0,
                                            count_index=2):
        print("Bigram dictionary file not found")
        return

    # lookup suggestions for single-word input strings
    input_term = "memebers"  # misspelling of "members"
    # max edit distance per lookup
    # (max_edit_distance_lookup <= max_edit_distance_dictionary)
    max_edit_distance_lookup = 2
    suggestion_verbosity = Verbosity.CLOSEST  # TOP, CLOSEST, ALL
    suggestions = sym_spell.lookup(input_term, suggestion_verbosity,
                                   max_edit_distance_lookup)
    # display suggestion term, edit distance, and term frequency
    for suggestion in suggestions:
        print("{}, {}, {}".format(suggestion.term, suggestion.distance,
                                  suggestion.count))

    # lookup suggestions for multi-word input strings (supports compound
    # splitting & merging)
    input_term = ("whereis th elove hehad dated forImuch of thepast who "
                  "couqdn'tread in sixtgrade and ins pired him")
    # max edit distance per lookup (per single word, not per whole input string)
    max_edit_distance_lookup = 2
    suggestions = sym_spell.lookup_compound(input_term,
                                            max_edit_distance_lookup)
    # display suggestion term, edit distance, and term frequency
    for suggestion in suggestions:
        print("{}, {}, {}".format(suggestion.term, suggestion.distance,
                                  suggestion.count))

if __name__ == "__main__":
    main()

Expected output:

members, 1, 226656153

where is the love he had dated for much of the past who couldn't read in six grade and inspired him, 9, 0
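
The distance of 1 reported for "members" and the effect of the verbosity setting can be illustrated with a small standalone sketch. It uses a plain Levenshtein distance and string stand-ins for the Verbosity values. Both functions are hypothetical, the library's actual distance algorithm and ranking may differ, and every count except the one for "members" is invented.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def filter_suggestions(candidates, verbosity):
    """Rank (term, distance, count) tuples for a given verbosity.

    verbosity is "TOP", "CLOSEST", or "ALL" (stand-ins for the enum).
    """
    # Smallest distance first, then highest count.
    ranked = sorted(candidates, key=lambda c: (c[1], -c[2]))
    if not ranked or verbosity == "ALL":
        return ranked
    closest = [c for c in ranked if c[1] == ranked[0][1]]
    return closest[:1] if verbosity == "TOP" else closest


# Toy dictionary: the count for "members" comes from the expected output
# above; the other two counts are invented for illustration.
toy_counts = {"members": 226656153, "member": 1000, "embers": 10}
max_edit_distance_lookup = 2
candidates = [(term, levenshtein("memebers", term), count)
              for term, count in toy_counts.items()]
candidates = [c for c in candidates if c[1] <= max_edit_distance_lookup]

for verbosity in ("TOP", "CLOSEST", "ALL"):
    print(verbosity, filter_suggestions(candidates, verbosity))
```

Under these toy counts, TOP and CLOSEST both return only "members", while ALL also lists the two candidates at distance 2.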

Sample usage (word_segmentation)

Using project.py (code is more verbose than required to allow explanation of method arguments)

import pkg_resources

from symspellpy.symspellpy import SymSpell  # import the module

def main():
    # maximum edit distance per dictionary precalculation
    max_edit_distance_dictionary = 0
    prefix_length = 7
    # create object
    sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
    # load dictionary
    dictionary_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_dictionary_en_82_765.txt")
    bigram_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_bigramdictionary_en_243_342.txt")
    # term_index is the column of the term and count_index is the
    # column of the term frequency
    if not sym_spell.load_dictionary(dictionary_path, term_index=0,
                                     count_index=1):
        print("Dictionary file not found")
        return
    if not sym_spell.load_bigram_dictionary(bigram_path, term_index=0,
                                            count_index=2):
        print("Bigram dictionary file not found")
        return

    # a sentence without any spaces
    input_term = "thequickbrownfoxjumpsoverthelazydog"

    result = sym_spell.word_segmentation(input_term)
    # display the segmented string, the edit distance sum, and the log
    # probability sum
    print("{}, {}, {}".format(result.corrected_string, result.distance_sum,
                              result.log_prob_sum))

if __name__ == "__main__":
    main()

Expected output:

the quick brown fox jumps over the lazy dog, 8, -34.491167981910635
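
One way to picture word segmentation is as a dynamic program that picks the split with the highest summed log probability. The sketch below is a simplified illustration under that assumption, not symspellpy's implementation: segment is a hypothetical function, it performs no spelling correction, and the word counts are invented.

```python
import math


def segment(text, word_counts):
    """Split text into dictionary words, maximizing the summed log probability.

    Returns (segmented_string, log_prob_sum), or None if text cannot be
    split into known words.
    """
    total = sum(word_counts.values())
    max_len = max(len(word) for word in word_counts)
    # best[i] = (score, start) for the best split of text[:i];
    # start is where the last word of that split begins.
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_counts and best[start][0] > -math.inf:
                score = best[start][0] + math.log10(word_counts[word] / total)
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return " ".join(reversed(words)), best[-1][0]


# Invented counts for illustration only.
toy_counts = {"the": 500, "quick": 20, "brown": 15, "fox": 10,
              "jumps": 5, "over": 60, "lazy": 8, "dog": 25}
result = segment("thequickbrownfoxjumpsoverthelazydog", toy_counts)
print(result[0])
print(result[0].count(" "))
```

The sketch inserts 8 spaces into the sample sentence, which is consistent with the distance sum of 8 above if each inserted space counts as one edit (an assumption about how the library scores it). The log probability it returns depends on the invented counts and will not match the value above.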

Transferring casing

To transfer the casing (e.g. uppercase/lowercase) from the original phrase to the typo-corrected one, set the transfer_casing boolean flag of the lookup() and lookup_compound() methods to True:

lookup_compound():

suggestions = sym_spell.lookup_compound(input_term,
                                        max_edit_distance_lookup,
                                        transfer_casing=True)

lookup():

suggestions = sym_spell.lookup(input_term,
                               suggestion_verbosity,
                               max_edit_distance_lookup,
                               transfer_casing=True)
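
As a rough illustration of what transferring casing means, the sketch below aligns the original and corrected strings with difflib and copies the original's casing onto the correction. This transfer_casing function is a hypothetical standalone helper, not the library's implementation, and it assumes text whose length does not change when lowercased.

```python
from difflib import SequenceMatcher


def transfer_casing(original, corrected):
    """Copy the upper/lower casing of original onto corrected.

    Illustrative only: characters shared by both strings keep the original's
    casing, and inserted or replaced characters borrow the casing of the
    original character at the same position.
    """
    if not original:
        return corrected
    matcher = SequenceMatcher(None, original.lower(), corrected.lower())
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same letters in both strings: reuse the original's casing.
            pieces.append(original[i1:i2])
        elif tag in ("insert", "replace"):
            reference = original[min(i1, len(original) - 1)]
            piece = corrected[j1:j2]
            pieces.append(piece.upper() if reference.isupper()
                          else piece.lower())
        # "delete": characters exist only in original, so nothing is emitted.
    return "".join(pieces)


print(transfer_casing("MEMEBERS", "members"))
print(transfer_casing("Memebers", "members"))
```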

CHANGELOG

6.5.2 (2019-10-23)


  • Modified load_bigram_dictionary to allow dictionary entries to be split into only 2 parts when using a custom separator
  • Added dictionary files to wheels so pkg_resources could be used to access them

6.5.1 (2019-10-08)


  • Added separator argument to allow user to choose custom separator for load_dictionary

6.5.0 (2019-09-21)


  • Added load_bigram_dictionary and bigram dictionary frequency_bigramdictionary_en_243_342.txt
  • Updated lookup_compound algorithm
  • Added Levenshtein to compute edit distance
  • Added save_pickle_stream and load_pickle_stream to save/load SymSpell data alongside other structure (contribution by marcoffee)

6.3.9 (2019-08-06)


  • Added transfer_casing to lookup and lookup_compound
  • Fixed prefix length check in _edits_prefix

6.3.8 (2019-03-21)


  • Implemented delete_dictionary_entry
  • Improved performance by using python builtin hashing
  • Added versioning of the pickle

6.3.7 (2019-02-18)


  • Fixed include_unknown in lookup
  • Removed unused initial_capacity argument
  • Improved _get_str_hash performance
  • Implemented save_pickle and load_pickle to avoid having to create the dictionary every time

6.3.6 (2019-02-11)


  • Added create_dictionary() feature

6.3.5 (2019-01-14)


  • Fixed lookup_compound() to return the correct distance

6.3.4 (2019-01-04)


  • Added <self._replaced_words = dict()> to track number of misspelled words
  • Added ignore_token to word_segmentation() to ignore words with regular expression

6.3.3 (2018-12-05)


  • Added word_segmentation() feature

6.3.2 (2018-10-23)


  • Added encoding option to load_dictionary()

6.3.1 (2018-08-30)


  • Created a package for symspellpy

6.3.0 (2018-08-13)

