A manual spell checker built on pyenchant that allows you to swiftly correct misspelled words.
Project description
Manual Spell Checker
A manual spell checker built on pyenchant that allows you to swiftly correct misspelled words.
Why does this exist?
While I was working on a text based multi-class classification competition, I noticed that the data contained a lot of misspelled words, errors which automated spell check packages out there couldn't fix. This was because the data had been compiled based on a survey of people who weren't native English speakers. As the there weren't many samples in the dataset (~1000), I decided to write some code for automated detection of spelling errors which I could then fix manually, and thus, this package was born.
How to install?
pip install manual_spellchecker
Features
- All features as provided by pyenchant
- Quickly analyze and get a list of all misspelled words
- Can replace, skip and delete misspelled words
- Use your favourite tokenizer for splitting words
- Replaced misspelled words via provided suggestions by simply typing in their indices
- Can checkpoint current set of corrections
- Contexualized pretty printing for easy visual correction (works on both command line and notebook)
Functions and Parameters
# Initialize the spell checking object
__init__(dataframe, column_names, tokenizer=None, num_n_words_dis=5, save_path=None)
- dataframe - Takes a pandas dataframe as input
- column_names - Pass the column name(s) upon which you want to perform spelling correction
- tokenizer=None - Pass your favourite tokenizer like nltk or spacy, etc. (Default: splits on space)
- num_n_words_dis=5 - This decides how many neighbouring words to display on either side of the error
- save_path=None - If a save path is provided, the final corrected dataframe is saved as a csv. (Default: the dataframe is not saved but returned)
# For quick analysis of all the misspelled words
spell_check()
# Returns a list of all the misspelled words
get_all_errors()
# Starts the process of correcting erroneous words
correct_words()
Important Note:
- Type -999 into the input box to stop the error correction and save the current progress (if save_path is provided)
- Simply press enter if you want to skip the current word
- Type in "" or '' in the input box to delete a misspelled word
Usage
How to import?
from manual_spellchecker import spell_checker
Quick analysis of the total number of errors
# Read the data
df = pd.read_csv("Train.csv")
# Initialize the model
ob = spell_checker(df, "text")
# Quick analysis
ob.spell_check()
Multiple columns can be passed for spelling correction
# Read the data
df = pd.read_csv("Train.csv")
# Initialize the model
ob = spell_checker(df, ["text", "label"])
# Quick analysis
ob.spell_check()
You can pass your own tokenizers
# Import nltk's word tokenizer
from nltk import word_tokenize
# Read the data
df = pd.read_csv("Train.csv")
# Initialize the model
ob = spell_checker(df, "text", word_tokenize)
# Quick analysis
ob.spell_check()
Get a list of all the errors
# Read the data
df = pd.read_csv("Train.csv")
# Initialize the model
ob = spell_checker(df, "text")
# Quick analysis. This needs to be performed before getting all errors
ob.spell_check()
# Returns a list of all errors
ob.get_all_errors()
Make corrections
# Read the data
df = pd.read_csv("Train.csv")
# Initialize the model
ob = spell_checker(df, "text")
# Start corrections
ob.correct_words()
To save
df = pd.read_csv("Train.csv")
# Initialize the model
ob = spell_checker(df, "text", save_path="correct_train_data.csv")
Future Ideas
- Will be adding automated, contextual error corrections
Feature Request
Drop me an email at atif.hit.hassan@gmail.com if you want any particular feature
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for manual_spellchecker-1.1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | e5f4d294df604f5360d9bfbc379693ba417c8a844befa1e12c9ae5abf8899fd7 |
|
MD5 | 9ef8a779f6e444d4c32414c2ee4e19f4 |
|
BLAKE2b-256 | 24370811d905f260950aba4e5056ff56cd265dcb6a7739841c93f435fe6f6bbe |