
masakhanePreprocessor is an effective language-first preprocessing tool for African languages


masakhanePreprocessor

An effective language-first preprocessing tool for African languages (🔧 Beta version).

We build on the clean-text preprocessor.

How to Use

Install:

git clone https://github.com/masakhane-io/masakhanePreprocessor.git
cd masakhanePreprocessor
pip install .

Preprocessor

You only need to specify your language, and the preprocessor loads the appropriate preprocessing settings for you.

You initialize the Preprocessor in Python as follows:

from masakhanePreprocessor import Preprocessor

my_prep = Preprocessor(lang='ig')

You can also pass additional parameters directly:

my_prep = Preprocessor(lang='ig',
              lower=True,
              strip_punctuation=True,
              strip_symbols=True)

preprocess_str

To preprocess a string, use the preprocess_str method:

clean_text = my_prep.preprocess_str('''Dịka● ndọrọndọrọọchịchị maka ntuliaka ọkwa Gọvanọ
                                       Anambra steeti si na-aga nke afọ 2021, ndị nọ.''')

You get the following as output: Dịka ndọrọndọrọọchịchị maka ntuliaka ọkwa Gọvanọ Anambra steeti si na-aga nke afọ 2021 ndị nọ

Notice how the ● character has been removed, along with the comma and the full stop, while the hyphen (-), which is an important part of Igbo, remains untouched.
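The behaviour described above can be approximated with the standard library alone. The following is a minimal sketch, not the package's actual implementation: the function name strip_symbols_keep_hyphen and the KEEP set are hypothetical, and it assumes that "symbols" and "punctuation" correspond to the Unicode S* and P* categories.

```python
import unicodedata

# Assumption: characters listed here are meaningful in the language
# and must survive cleaning (the hyphen in Igbo, for example).
KEEP = {"-"}


def strip_symbols_keep_hyphen(text, keep=KEEP):
    """Drop Unicode punctuation and symbols, except characters in `keep`."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # Major classes "P" (punctuation) and "S" (symbol) are removed.
        if ch not in keep and category[0] in ("P", "S"):
            continue
        kept.append(ch)
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join("".join(kept).split())


print(strip_symbols_keep_hyphen("si na-aga● nke afọ 2021, ndị nọ."))
```

Keeping the protected characters in a per-language set is one way to make such a cleaner language-first: only the set changes from one language to the next.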

preprocess_file

To preprocess a file use the preprocess_file function:

# output_path is optional. If unspecified, the parent directory of the input file is used.
my_prep.preprocess_file('ig.txt',
                        output_path=None)

On successful completion you get this message: Clean file(s) saved successfully to xxxxxxx/ig_CLEAN.txt
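The naming rule suggested by that message can be sketched with pathlib. This is only an illustration under stated assumptions: the helper default_output_path is hypothetical, it treats output_path as a directory, and it infers the _CLEAN suffix from the success message rather than from the package's source.

```python
from pathlib import Path


def default_output_path(input_path, output_path=None):
    """Return where a cleaned copy of `input_path` would be written."""
    src = Path(input_path)
    # Assumption: with no output_path, the input file's parent directory is used.
    out_dir = Path(output_path) if output_path is not None else src.parent
    # Assumption: "_CLEAN" is appended to the file stem, keeping the extension.
    return out_dir / f"{src.stem}_CLEAN{src.suffix}"


print(default_output_path("corpus/ig.txt").as_posix())
```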

Properties of the preprocessing tool

  1. Language-first. It can:

    • map any African language name provided to its language code, so you can write Preprocessor(lang='yoruba') using just the name.
    • map any language code to its BCP47 variant, so it does not matter whether you use yo or yor.
  2. Simple to use
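The language-first lookup can be pictured as a small alias table. The sketch below is illustrative only: resolve_lang and the _ALIASES table are hypothetical names, the table covers just three languages, and the package's real mapping may be organised differently.

```python
# Assumption: each BCP47 tag maps to the names and codes accepted for it
# (ISO 639-1 code, ISO 639-3 code, English name).
_ALIASES = {
    "yo": ("yo", "yor", "yoruba"),
    "ig": ("ig", "ibo", "igbo"),
    "ha": ("ha", "hau", "hausa"),
}

# Invert the table once so that every alias points at its tag.
_LOOKUP = {alias: tag for tag, aliases in _ALIASES.items() for alias in aliases}


def resolve_lang(lang):
    """Map a language name or code to its BCP47 tag, ignoring case."""
    key = lang.strip().lower()
    try:
        return _LOOKUP[key]
    except KeyError:
        raise ValueError(f"Unsupported language: {lang!r}") from None


print(resolve_lang("Yoruba"), resolve_lang("yor"), resolve_lang("yo"))
```

Normalising the input once, at construction time, means every later step can rely on a single canonical tag.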

Contribution

We are open to and grateful for ideas to make this better. You can propose ideas as issues or pull requests.


With 💙 From The Contributors
