A package that splits misspelled words semantically

Project description

French word segmentation

When extracting text with open-source OCR engines such as Tesseract, words are often run together because of poor extraction quality.

For example, instead of extracting "Très bon service", one might get "Très bonservice". When doing feature engineering with BOW, TF-IDF, or even word2vec models, the algorithm will then treat "bonservice" as a single feature, which it is not.
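The effect on a bag-of-words representation can be seen with a small sketch. The `bag_of_words` helper below is a hypothetical stand-in (plain whitespace tokenization with the standard library), not part of this package:

```python
from collections import Counter


def bag_of_words(text):
    """Count whitespace-separated, lower-cased tokens (hypothetical helper)."""
    return Counter(text.lower().split())


clean = bag_of_words("Très bon service")
noisy = bag_of_words("Très bonservice")

print(sorted(clean))  # ['bon', 'service', 'très']
print(sorted(noisy))  # ['bonservice', 'très']

# The two texts now share only one feature, although they say the same thing.
shared = set(clean) & set(noisy)
print(shared)  # {'très'}
```

The linked token hides both "bon" and "service" from the model, so the noisy text looks far less similar to the clean one than it should.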

To deal with this problem, I built a module that performs semantic word segmentation without any predefined corpus.


Use the package manager pip to install fr-word-segment.

pip install fr-word-segment


from wordseg import segment_token

# suppose that a French spellchecker flags this token as misspelled
token = "soitmoinscompliqué"

# apply segmentation function on the given token
result = segment_token(token)

# show results
print("raw token is {}".format(token)) # "soitmoinscompliqué"
print("processed token is {}".format(result)) # "soit moins compliqué"
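In practice the segmentation would probably be applied only to the tokens a spellchecker flags. The sketch below shows one way to wire that up. `repair_text`, `demo_is_misspelled`, and `demo_segment` are hypothetical names introduced for illustration; the two `demo_*` functions are hard-coded stand-ins so that the example runs without any dependency:

```python
def repair_text(text, is_misspelled, segment):
    """Re-segment only the tokens flagged as misspelled (hypothetical helper)."""
    fixed = []
    for token in text.split():
        fixed.append(segment(token) if is_misspelled(token) else token)
    return " ".join(fixed)


# Stand-ins for illustration only: a tiny vocabulary and a lookup table
# replace a real spellchecker and a real segmenter.
known_words = {"très", "bon", "service"}
demo_splits = {"bonservice": "bon service"}


def demo_is_misspelled(token):
    return token.lower() not in known_words


def demo_segment(token):
    return demo_splits.get(token, token)


print(repair_text("Très bonservice", demo_is_misspelled, demo_segment))
# Très bon service
```

Assuming `segment_token` takes a string and returns a string as in the usage example, it could be passed as the `segment` argument, with a real French spellchecker supplying `is_misspelled`.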


Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.



Download files

Files for fr-word-segment, version 0.1.0
fr_word_segment-0.1.0-py3-none-any.whl (8.9 kB), file type: Wheel, Python version: py3
fr_word_segment-0.1.0.tar.gz (6.9 kB), file type: Source, Python version: None
