Python SymSpell

symspellpy

symspellpy is a Python port of SymSpell v6.3, which provides much higher speed and lower memory consumption. Unit tests from the original project are implemented to ensure the accuracy of the port.

Please note that the port has not been optimized for speed.

Usage

Installing the symspellpy module

pip install -U symspellpy

Copying the frequency dictionary to your project

Copy frequency_dictionary_en_82_765.txt (found in the inner symspellpy directory) to your project directory so you end up with the following layout:

project_dir
  +-frequency_dictionary_en_82_765.txt
  \-project.py

Adding new terms

  • Use load_dictionary(corpus=<path/to/dictionary.txt>, term_index=<term_index>, count_index=<count_index>). dictionary.txt should contain:
<term> <count>
<term> <count>
...
<term> <count>

where term_index is the zero-based column number of the terms and count_index is the zero-based column number of the counts (frequencies).

  • Append <term> <count> to the provided frequency_dictionary_en_82_765.txt
  • Use the method create_dictionary_entry(key=<term>, count=<count>)
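To make the role of term_index and count_index concrete, here is a minimal sketch of reading a file in the <term> <count> layout. It is not symspellpy's own loader: the helper read_frequency_file and the sample entries are hypothetical and exist only for illustration.

```python
import os
import tempfile


def read_frequency_file(path, term_index, count_index, encoding="utf-8"):
    """Parse a whitespace-separated frequency file into {term: count}.

    Illustrative only; this is not symspellpy's own loader.
    """
    entries = {}
    with open(path, "r", encoding=encoding) as infile:
        for line in infile:
            parts = line.split()
            # skip blank or short lines that lack either column
            if len(parts) <= max(term_index, count_index):
                continue
            try:
                count = int(parts[count_index])
            except ValueError:
                continue  # the count column is not an integer
            entries[parts[term_index]] = count
    return entries


# write a tiny sample file in the "<term> <count>" layout, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "dictionary.txt")
    with open(sample_path, "w", encoding="utf-8") as outfile:
        outfile.write("symspellpy 500\nmembers 226656153\n")
    SAMPLE_ENTRIES = read_frequency_file(sample_path, term_index=0,
                                         count_index=1)

print(SAMPLE_ENTRIES)
```

With term_index=0 and count_index=1, the first column is read as the term and the second as its count, which matches the layout of the provided frequency dictionary.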

Sample usage (lookup and lookup_compound)

Using project.py (code is more verbose than required to allow explanation of method arguments)

import os

from symspellpy.symspellpy import SymSpell, Verbosity  # import the module

def main():
    # create object
    initial_capacity = 83000
    # maximum edit distance per dictionary precalculation
    max_edit_distance_dictionary = 2
    prefix_length = 7
    sym_spell = SymSpell(initial_capacity, max_edit_distance_dictionary,
                         prefix_length)
    # load dictionary
    dictionary_path = os.path.join(os.path.dirname(__file__),
                                   "frequency_dictionary_en_82_765.txt")
    term_index = 0  # column of the term in the dictionary text file
    count_index = 1  # column of the term frequency in the dictionary text file
    if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
        print("Dictionary file not found")
        return

    # lookup suggestions for single-word input strings
    input_term = "memebers"  # misspelling of "members"
    # max edit distance per lookup
    # (max_edit_distance_lookup <= max_edit_distance_dictionary)
    max_edit_distance_lookup = 2
    suggestion_verbosity = Verbosity.CLOSEST  # TOP, CLOSEST, ALL
    suggestions = sym_spell.lookup(input_term, suggestion_verbosity,
                                   max_edit_distance_lookup)
    # display suggestion term, term frequency, and edit distance
    for suggestion in suggestions:
        print("{}, {}, {}".format(suggestion.term, suggestion.count,
                                  suggestion.distance))

    # lookup suggestions for multi-word input strings (supports compound
    # splitting & merging)
    input_term = ("whereis th elove hehad dated forImuch of thepast who "
                  "couqdn'tread in sixtgrade and ins pired him")
    # max edit distance per lookup (per single word, not per whole input string)
    max_edit_distance_lookup = 2
    suggestions = sym_spell.lookup_compound(input_term,
                                            max_edit_distance_lookup)
    # display suggestion term, term frequency, and edit distance
    for suggestion in suggestions:
        print("{}, {}, {}".format(suggestion.term, suggestion.count,
                                  suggestion.distance))

if __name__ == "__main__":
    main()
Expected output:

members, 226656153, 1

where is the love he had dated for much of the past who couldn't read in six grade and inspired him, 300000, 10
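The edit distance of 1 reported for "memebers" can be checked by hand: deleting the extra "e" yields "members". The sketch below computes a plain Levenshtein distance to confirm this. It is an assumption that this metric agrees with the library's for this input; the two may differ when adjacent characters are transposed.

```python
def levenshtein(source, target):
    """Return the Levenshtein edit distance between two strings."""
    # previous_row[j] holds the distance between the already processed
    # prefix of source and target[:j]
    previous_row = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current_row = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # substitution/match
            ))
        previous_row = current_row
    return previous_row[-1]


print(levenshtein("memebers", "members"))  # prints 1
```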

Sample usage (word_segmentation)

Using project.py (code is more verbose than required to allow explanation of method arguments)

import os

from symspellpy.symspellpy import SymSpell  # import the module

def main():
    # create object
    initial_capacity = 83000
    # maximum edit distance per dictionary precalculation
    max_edit_distance_dictionary = 0
    prefix_length = 7
    sym_spell = SymSpell(initial_capacity, max_edit_distance_dictionary,
                         prefix_length)
    # load dictionary
    dictionary_path = os.path.join(os.path.dirname(__file__),
                                   "frequency_dictionary_en_82_765.txt")
    term_index = 0  # column of the term in the dictionary text file
    count_index = 1  # column of the term frequency in the dictionary text file
    if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
        print("Dictionary file not found")
        return

    # a sentence without any spaces
    input_term = "thequickbrownfoxjumpsoverthelazydog"

    result = sym_spell.word_segmentation(input_term)
    # display corrected string, sum of edit distances, and sum of log
    # probabilities
    print("{}, {}, {}".format(result.corrected_string, result.distance_sum,
                              result.log_prob_sum))

if __name__ == "__main__":
    main()
Expected output:

the quick brown fox jumps over the lazy dog, 8, -34.491167981910635
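The idea behind segmenting text that has no spaces can be illustrated with a small dynamic-programming sketch. This is not the library's implementation: it handles only exact dictionary words (no spelling correction), and WORD_COUNTS holds hypothetical toy counts rather than the real frequency dictionary. The 8 in the expected output is consistent with counting each of the eight inserted spaces as one edit.

```python
import math

# Hypothetical toy counts, for illustration only.
WORD_COUNTS = {
    "the": 1000, "quick": 50, "brown": 40, "fox": 30, "jumps": 20,
    "over": 200, "lazy": 25, "dog": 60,
}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LENGTH = max(len(word) for word in WORD_COUNTS)


def segment(text):
    """Split text into known words, maximising the summed log10 probability.

    Returns (segmented_string, log_prob_sum), or None if no split exists.
    """
    # best[i] = (log_prob_sum, words) for the best split of text[:i]
    best = [None] * (len(text) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_WORD_LENGTH), end):
            word = text[start:end]
            if best[start] is None or word not in WORD_COUNTS:
                continue
            score = best[start][0] + math.log10(WORD_COUNTS[word] / TOTAL)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    if best[-1] is None:
        return None
    return " ".join(best[-1][1]), best[-1][0]


segmented, log_prob_sum = segment("thequickbrownfoxjumpsoverthelazydog")
print(segmented)             # the quick brown fox jumps over the lazy dog
print(segmented.count(" "))  # 8 inserted spaces
```

The log probability sum from this sketch will not match the library's value, since it depends on the toy counts used here.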

CHANGELOG

6.3.5 (2019-01-14)


  • Fixed lookup_compound() to return the correct distance

6.3.4 (2019-01-04)


  • Added <self._replaced_words = dict()> to track number of misspelled words
  • Added ignore_token to word_segmentation() to ignore words that match a regular expression

6.3.3 (2018-12-05)


  • Added word_segmentation() feature

6.3.2 (2018-10-23)


  • Added encoding option to load_dictionary()

6.3.1 (2018-08-30)


  • Created a package for symspellpy

6.3.0 (2018-08-13)

