A package that splits misspelled words semantically

French word segmentation

When extracting text with open-source OCR engines such as Tesseract, words often come out glued together because of imperfect OCR quality.

For example, instead of "Très bon service", the OCR output might be "Très bonservice". When doing feature engineering with bag-of-words (BOW), TF-IDF, or even word2vec models, the algorithm then treats "bonservice" as a single distinct feature, which it is not.
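The effect on a bag-of-words vocabulary can be illustrated with a minimal sketch. It uses only the Python standard library and a naive whitespace tokenizer; the `bag_of_words` helper is a hypothetical name introduced for this illustration and is not part of fr_word_segment.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count the tokens."""
    return Counter(text.lower().split())


clean = bag_of_words("Très bon service")   # what the document really says
noisy = bag_of_words("Très bonservice")    # what the OCR returned

# Tokens that appear in only one of the two vocabularies
only_in_noisy = set(noisy) - set(clean)
only_in_clean = set(clean) - set(noisy)

print(sorted(only_in_noisy))
print(sorted(only_in_clean))
```

The glued token becomes a vocabulary entry of its own, while the two real words it hides are lost as features.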

To deal with this problem, I built a module that performs semantic word segmentation without any predefined corpus.

Installation

Use the package manager pip to install fr_word_segment.

pip3 install fr-word-segment
python3 -m spacy download fr

Usage

from fr_word_segment import wordseg
# suppose that a French spellchecker has flagged this token as misspelled
token = "soitmoinscompliqué"

# apply segmentation function on the given token
result = wordseg.segment_token(token)

# show results
print("raw token is {}".format(token)) # "soitmoinscompliqué"
print("processed token is {}".format(result)) # "soit moins compliqué"
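Since `segment_token` works on one token at a time, a whole sentence can be repaired by re-segmenting only the tokens that a spellchecker flags. The following is a minimal sketch of that loop, not part of the package: `repair_text`, `is_misspelled`, and `fake_segment` are hypothetical names, and the stand-in functions exist only so the example runs without fr_word_segment or a spellchecker installed.

```python
def repair_text(text, is_misspelled, segment):
    """Re-segment only the tokens flagged as misspelled; keep the rest."""
    repaired = []
    for token in text.split():
        repaired.append(segment(token) if is_misspelled(token) else token)
    return " ".join(repaired)


# Stand-ins (assumptions for this sketch only): a tiny word list acting
# as the spellchecker, and a lookup table acting as the segmenter.
known_words = {"pour", "que", "ce", "soit", "moins", "compliqué"}
fake_splits = {"soitmoinscompliqué": "soit moins compliqué"}


def is_misspelled(token):
    return token.lower() not in known_words


def fake_segment(token):
    return fake_splits.get(token, token)


fixed = repair_text("pour que ce soitmoinscompliqué", is_misspelled, fake_segment)
print(fixed)
```

In real use, `wordseg.segment_token` would presumably be passed as the `segment` argument, assuming it returns the segmented string as the usage example above suggests.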

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT
