masakhanePreprocessor is an effective language-first preprocessing tool for African languages
masakhanePreprocessor
An effective language-first preprocessing tool for African languages (🔧 Beta version).
We build on the clean-text preprocessor.
How to Use
Install:
git clone https://github.com/masakhane-io/masakhanePreprocessor.git
cd masakhanePreprocessor
pip install .
Preprocessor
You only need to specify your language, and it loads the appropriate preprocessing style for you!
You initialize the Preprocessor in Python as follows:
from masakhanePreprocessor import Preprocessor
my_prep = Preprocessor(lang='ig')
You can also directly include some additional parameters you want:
my_prep = Preprocessor(lang='ig',
                       lower=True,
                       strip_punctuation=True,
                       strip_symbols=True)
preprocess_str
To preprocess a string, use the preprocess_str function:
clean_text = my_prep.preprocess_str('''Dịka● ndọrọndọrọọchịchị maka ntuliaka ọkwa Gọvanọ
Anambra steeti si na-aga nke afọ 2021, ndị nọ.''')
You get the following as output:
Dịka ndọrọndọrọọchịchị maka ntuliaka ọkwa Gọvanọ Anambra steeti si na-aga nke afọ 2021 ndị nọ
Notice how the ● character has been removed, but the - character, which is an important part of Igbo, remains untouched.
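To illustrate the idea of removing symbols while protecting characters that matter to a language, here is a minimal sketch. It is not the library's implementation: the function name strip_symbols_keep and its keep parameter are hypothetical, and the approach (filtering on Unicode general categories) is an assumption chosen for illustration.

```python
import unicodedata


def strip_symbols_keep(text: str, keep: str = "-") -> str:
    """Drop Unicode punctuation (P*) and symbol (S*) characters, except those in `keep`.

    Hypothetical helper for illustration only; not part of masakhanePreprocessor.
    """
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # Keep protected characters, and anything that is not punctuation or a symbol.
        if ch in keep or category[0] not in ("P", "S"):
            kept.append(ch)
    # Collapse any whitespace runs left behind by removed characters.
    return " ".join("".join(kept).split())


sample = "Dịka● ndọrọndọrọọchịchị si na-aga nke afọ 2021, ndị nọ."
print(strip_symbols_keep(sample))
```

Passing a different keep string would protect other characters, which is roughly what a per-language configuration could look like.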
preprocess_file
To preprocess a file, use the preprocess_file function:
# Specify the output path. If unspecified, the parent directory of the input file is used.
my_prep.preprocess_file('ig.txt', output_path=None)
On successful completion you get this message:
Clean file(s) saved successfully to xxxxxxx/ig_CLEAN.txt
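The naming pattern in that message (ig.txt becomes ig_CLEAN.txt) can be sketched as follows. This is an assumption about how such a path might be derived, not the library's code: the helper default_clean_path is hypothetical, and it treats output_path as a directory, which the README does not confirm.

```python
from pathlib import Path
from typing import Optional, Union


def default_clean_path(input_path: Union[str, Path],
                       output_path: Optional[Union[str, Path]] = None) -> Path:
    """Build '<stem>_CLEAN<suffix>' in output_path, or next to the input file.

    Hypothetical helper for illustration only; assumes output_path is a directory.
    """
    src = Path(input_path)
    # Fall back to the input file's parent directory when no output path is given.
    directory = Path(output_path) if output_path is not None else src.parent
    return directory / f"{src.stem}_CLEAN{src.suffix}"


print(default_clean_path("data/ig.txt"))
print(default_clean_path("data/ig.txt", output_path="cleaned"))
```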
Properties of the preprocessing tool
- Language-first. It can:
  - map any African language name provided to its language code. You can write Preprocessor(lang='yoruba') using just the name.
  - map any language code to its BCP47 variant. So even if you use yo or yor, it does not matter.
- Simple to use
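The language-first mapping described above can be pictured with a small lookup sketch. The table and the function normalize_lang are hypothetical and cover only three languages for illustration; the library's actual table and internals are assumed to differ.

```python
# Hypothetical table: BCP47 tag -> (English name, ISO 639-3 code).
_LANGUAGES = {
    "yo": ("yoruba", "yor"),
    "ig": ("igbo", "ibo"),
    "ha": ("hausa", "hau"),
}

# Reverse index so that names and three-letter codes resolve to the same tag.
_ALIASES = {alias: tag for tag, aliases in _LANGUAGES.items() for alias in aliases}


def normalize_lang(lang: str) -> str:
    """Return the BCP47 tag for a language name or code (illustrative only)."""
    key = lang.strip().lower()
    if key in _LANGUAGES:
        return key
    try:
        return _ALIASES[key]
    except KeyError:
        raise ValueError(f"Unsupported language: {lang!r}") from None


# All three spellings resolve to the same tag.
print(normalize_lang("yoruba"), normalize_lang("yor"), normalize_lang("yo"))
```

Resolving every spelling to one tag up front means the rest of a pipeline only has to deal with a single identifier per language.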
Contribution
We are open to and grateful for ideas to make this better. You can propose ideas as issues or pull requests.
With 💙 From The Contributors