Python SymSpell
symspellpy
symspellpy is a Python port of SymSpell v6.3, which provides much higher speed and lower memory consumption. Unit tests from the original project are implemented to ensure the accuracy of the port.
Please note that the port has not been optimized for speed.
Usage
Installing the symspellpy module
pip install -U symspellpy
Copying the frequency dictionary to your project
Copy frequency_dictionary_en_82_765.txt (found in the inner symspellpy directory) to your project directory so you end up with the following layout:
project_dir
+-frequency_dictionary_en_82_765.txt
\-project.py
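The copy step can also be scripted. The following is a minimal sketch using only the standard library, assuming the dictionary file sits directly inside the installed package directory as described above. The helper names find_package_dir and copy_frequency_dictionary are hypothetical and are not part of symspellpy.

```python
import importlib.util
import shutil
from pathlib import Path

DICTIONARY_NAME = "frequency_dictionary_en_82_765.txt"


def find_package_dir(package_name="symspellpy"):
    """Return the directory of an installed package, or None if it is missing."""
    try:
        spec = importlib.util.find_spec(package_name)
    except (ImportError, ValueError):
        return None
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


def copy_frequency_dictionary(package_dir, project_dir, name=DICTIONARY_NAME):
    """Copy the bundled dictionary into project_dir and return the new path."""
    source = Path(package_dir) / name
    if not source.is_file():
        raise FileNotFoundError("dictionary not found: {}".format(source))
    target = Path(project_dir) / name
    shutil.copyfile(str(source), str(target))
    return target
```

A typical call would pass the result of find_package_dir() as package_dir and the project directory as project_dir, after checking that find_package_dir() did not return None.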
Adding new terms
- Use load_dictionary(corpus=<path/to/dictionary.txt>, term_index=<term_index>, count_index=<count_index>). dictionary.txt should contain:
  <term> <count>
  <term> <count>
  ...
  <term> <count>
  with term_index indicating the column number of terms and count_index indicating the column number of counts/frequency.
- Append <term> <count> to the provided frequency_dictionary_en_82_765.txt
- Use the method create_dictionary_entry(key=<term>, count=<count>)
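To make the roles of term_index and count_index concrete, here is a small stand-in parser for the <term> <count> line format. It is a sketch under stated assumptions, not symspellpy's own loader: the function name parse_frequency_lines is hypothetical, malformed lines are simply skipped, and counts of repeated terms are summed.

```python
def parse_frequency_lines(lines, term_index=0, count_index=1):
    """Collect {term: count} from whitespace-separated dictionary lines."""
    words = {}
    for line in lines:
        parts = line.split()
        # skip lines that lack either of the requested columns
        if len(parts) <= max(term_index, count_index):
            continue
        try:
            count = int(parts[count_index])
        except ValueError:
            continue  # skip lines whose count column is not an integer
        term = parts[term_index]
        # assumption: repeated terms accumulate their counts
        words[term] = words.get(term, 0) + count
    return words


# illustrative lines in "<term> <count>" order
sample = ["the 23135851162", "of 13151942776", "malformed", "symspellpy 5"]
print(parse_frequency_lines(sample))

# the same data with the columns swapped: "<count> <term>"
print(parse_frequency_lines(["5 symspellpy"], term_index=1, count_index=0))
```

With a file whose columns are in a different order, only the two index arguments change; the file itself does not need to be rewritten.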
Sample usage (create_dictionary)
from symspellpy.symspellpy import SymSpell  # import the module

def main():
    # maximum edit distance per dictionary precalculation
    max_edit_distance_dictionary = 2
    prefix_length = 7
    # create object
    sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
    # create dictionary using corpus.txt
    corpus_path = "<path/to/corpus.txt>"  # replace with the path to your corpus
    if not sym_spell.create_dictionary(corpus_path):
        print("Corpus file not found")
        return
    # print each term with its count
    for key, count in sym_spell.words.items():
        print("{} {}".format(key, count))

if __name__ == "__main__":
    main()
corpus.txt should contain:
abc abc-def abc_def abc'def abc qwe qwe1 1qwe q1we 1234 1234
Expected output:
abc 4
def 2
abc'def 1
qwe 1
qwe1 1
1qwe 1
q1we 1
1234 2
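The expected output shows that hyphens and underscores split words while apostrophes do not. The sketch below reproduces those counts with the standard library only. The regular expression is an assumption chosen to match the output above, and count_words is a hypothetical helper, not the library's implementation.

```python
import re

# Assumption: a word is a run of letters, digits, and apostrophes; hyphens,
# underscores, and whitespace act as separators.
WORD_PATTERN = re.compile(r"(?:[^\W_]|['’])+")


def count_words(text):
    """Count lowercase words in order of first appearance."""
    counts = {}
    for word in WORD_PATTERN.findall(text.lower()):
        counts[word] = counts.get(word, 0) + 1
    return counts


corpus = "abc abc-def abc_def abc'def abc qwe qwe1 1qwe q1we 1234 1234"
for word, count in count_words(corpus).items():
    print("{} {}".format(word, count))
```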
Sample usage (lookup and lookup_compound)
Using project.py (code is more verbose than required to allow explanation of method arguments)
import os

from symspellpy.symspellpy import SymSpell, Verbosity  # import the module

def main():
    # maximum edit distance per dictionary precalculation
    max_edit_distance_dictionary = 2
    prefix_length = 7
    # create object
    sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
    # load dictionary
    dictionary_path = os.path.join(os.path.dirname(__file__),
                                   "frequency_dictionary_en_82_765.txt")
    term_index = 0  # column of the term in the dictionary text file
    count_index = 1  # column of the term frequency in the dictionary text file
    if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
        print("Dictionary file not found")
        return

    # lookup suggestions for single-word input strings
    input_term = "memebers"  # misspelling of "members"
    # max edit distance per lookup
    # (max_edit_distance_lookup <= max_edit_distance_dictionary)
    max_edit_distance_lookup = 2
    suggestion_verbosity = Verbosity.CLOSEST  # TOP, CLOSEST, ALL
    suggestions = sym_spell.lookup(input_term, suggestion_verbosity,
                                   max_edit_distance_lookup)
    # display suggestion term, edit distance, and term frequency
    for suggestion in suggestions:
        print("{}, {}, {}".format(suggestion.term, suggestion.distance,
                                  suggestion.count))

    # lookup suggestions for multi-word input strings (supports compound
    # splitting & merging)
    input_term = ("whereis th elove hehad dated forImuch of thepast who "
                  "couqdn'tread in sixtgrade and ins pired him")
    # max edit distance per lookup (per single word, not per whole input string)
    max_edit_distance_lookup = 2
    suggestions = sym_spell.lookup_compound(input_term,
                                            max_edit_distance_lookup)
    # display suggestion term, edit distance, and term frequency
    for suggestion in suggestions:
        print("{}, {}, {}".format(suggestion.term, suggestion.distance,
                                  suggestion.count))

if __name__ == "__main__":
    main()
Expected output:
members, 1, 226656153
where is the love he had dated for much of the past who couldn't read in six grade and inspired him, 9, 300000
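To illustrate what the edit distance and the TOP, CLOSEST, and ALL verbosity levels mean, here is a brute-force sketch that does not use symspellpy. It assumes the optimal string alignment variant of Damerau-Levenshtein distance (insert, delete, substitute, transpose adjacent characters). The names osa_distance and brute_force_lookup are hypothetical, and every count in the tiny dictionary except the one for "members" is made up. SymSpell itself avoids this exhaustive scan, so the sketch shows the result semantics only.

```python
def osa_distance(a, b):
    """Edit distance allowing insert, delete, substitute, adjacent transpose."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # transposition
        prev2, prev = prev, cur
    return prev[len(b)]


def brute_force_lookup(term, words, max_distance, verbosity="closest"):
    """Return (term, distance, count) tuples filtered by verbosity."""
    hits = []
    for candidate, count in words.items():
        distance = osa_distance(term, candidate)
        if distance <= max_distance:
            hits.append((candidate, distance, count))
    # smallest distance first, then highest count
    hits.sort(key=lambda hit: (hit[1], -hit[2]))
    if verbosity == "all" or not hits:
        return hits
    closest = [hit for hit in hits if hit[1] == hits[0][1]]
    return closest[:1] if verbosity == "top" else closest


# hypothetical counts, except "members" (taken from the expected output)
words = {"members": 226656153, "member": 1000, "embers": 10}
for term, distance, count in brute_force_lookup("memebers", words, 2):
    print("{}, {}, {}".format(term, distance, count))
```

With verbosity="all" the two distance-2 candidates are returned as well, ordered by distance and then by count.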
Sample usage (word_segmentation)
Using project.py (code is more verbose than required to allow explanation of method arguments)
import os

from symspellpy.symspellpy import SymSpell  # import the module

def main():
    # maximum edit distance per dictionary precalculation
    max_edit_distance_dictionary = 0
    prefix_length = 7
    # create object
    sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
    # load dictionary
    dictionary_path = os.path.join(os.path.dirname(__file__),
                                   "frequency_dictionary_en_82_765.txt")
    term_index = 0  # column of the term in the dictionary text file
    count_index = 1  # column of the term frequency in the dictionary text file
    if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
        print("Dictionary file not found")
        return

    # a sentence without any spaces
    input_term = "thequickbrownfoxjumpsoverthelazydog"
    result = sym_spell.word_segmentation(input_term)
    # display segmented string, edit distance sum, and log probability sum
    print("{}, {}, {}".format(result.corrected_string, result.distance_sum,
                              result.log_prob_sum))

if __name__ == "__main__":
    main()
Expected output:
the quick brown fox jumps over the lazy dog, 8, -34.491167981910635
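The idea behind segmentation can be sketched as a dynamic program that picks the split with the highest summed log probability. The version below is a simplified stand-in, not the library's algorithm: it only accepts exact dictionary words and performs no spelling correction. The function name segment and all counts are hypothetical. The distance sum of 8 in the expected output appears consistent with the eight spaces inserted between the nine words.

```python
import math


def segment(text, word_counts, total_count):
    """Return (log_prob_sum, words) for the best split, or None if no split exists."""
    max_word_length = max(len(word) for word in word_counts)
    # best[i] holds (log_prob_sum, words) for the best split of text[:i]
    best = [None] * (len(text) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_length), end):
            if best[start] is None:
                continue
            word = text[start:end]
            if word not in word_counts:
                continue
            score = best[start][0] + math.log10(word_counts[word] / total_count)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[-1]


# hypothetical word counts for illustration only
counts = {"the": 500, "quick": 20, "brown": 30, "fox": 10,
          "jumps": 5, "over": 100, "lazy": 8, "dog": 40}
total = sum(counts.values())
log_prob_sum, words = segment("thequickbrownfoxjumpsoverthelazydog", counts, total)
print(" ".join(words))
print(len(words) - 1)  # number of spaces inserted
```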
Transferring casing
To transfer the casing (e.g. uppercase/lowercase) from the original phrase to the typo-corrected one, use the transfer_casing boolean flag of the lookup() and the lookup_compound() methods:
lookup_compound():
suggestions = sym_spell.lookup_compound(input_term,
                                        max_edit_distance_lookup,
                                        transfer_casing=True)
lookup():
suggestions = sym_spell.lookup(input_term,
                               suggestion_verbosity,
                               max_edit_distance_lookup,
                               transfer_casing=True)
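As a rough picture of what transferring casing involves, the sketch below copies upper and lower case character by character. It is a simplified illustration that only handles the case where the original and corrected strings have the same length. The function name transfer_casing_same_length is hypothetical, and the library's own handling of strings of different lengths is not shown here.

```python
def transfer_casing_same_length(original, corrected):
    """Apply the casing of each character in original to corrected."""
    if len(original) != len(corrected):
        raise ValueError("both strings must have the same length")
    return "".join(
        new.upper() if old.isupper() else new.lower()
        for old, new in zip(original, corrected)
    )


# "Teh QUICK" is the cased input, "the quick" the lowercase correction
print(transfer_casing_same_length("Teh QUICK", "the quick"))
```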
CHANGELOG
6.3.9 (2019-08-06)
- Added transfer_casing to lookup and lookup_compound
- Fixed prefix length check in _edits_prefix
6.3.8 (2019-03-21)
- Implemented delete_dictionary_entry
- Improved performance by using Python built-in hashing
- Added versioning of the pickle
6.3.7 (2019-02-18)
- Fixed include_unknown in lookup
- Removed unused initial_capacity argument
- Improved _get_str_hash performance
- Implemented save_pickle and load_pickle to avoid having to create the dictionary every time
6.3.6 (2019-02-11)
- Added create_dictionary() feature
6.3.5 (2019-01-14)
- Fixed lookup_compound() to return the correct distance
6.3.4 (2019-01-04)
- Added <self._replaced_words = dict()> to track number of misspelled words
- Added ignore_token to word_segmentation() to ignore words with regular expression
6.3.3 (2018-12-05)
- Added word_segmentation() feature
6.3.2 (2018-10-23)
- Added encoding option to load_dictionary()
6.3.1 (2018-08-30)
- Create a package for symspellpy
6.3.0 (2018-08-13)
- Ported SymSpell v6.3