masakhanePreprocessor is an effective language-first preprocessing tool for African languages
masakhanePreprocessor
An effective language-first preprocessing tool for African languages (🔧 Beta version).
We build on the clean-text preprocessor.
How to Use
Install:
git clone https://github.com/masakhane-io/masakhanePreprocessor.git
cd masakhanePreprocessor
pip install .
Preprocessor
You only need to specify your language, and it loads the relevant preprocessing style for you.
Initialize the Preprocessor in Python as follows:
from masakhanePreprocessor import Preprocessor
my_prep = Preprocessor(lang='ig')
You can also pass additional parameters directly:
my_prep = Preprocessor(lang='ig',
                       lower=True,
                       strip_punctuation=True,
                       strip_symbols=True)
preprocess_str
To preprocess a string, use the preprocess_str function:
clean_text = my_prep.preprocess_str('''Dịka● ndọrọndọrọọchịchị maka ntuliaka ọkwa Gọvanọ
Anambra steeti si na-aga nke afọ 2021, ndị nọ.''')
You get the following as output:
Dịka ndọrọndọrọọchịchị maka ntuliaka ọkwa Gọvanọ Anambra steeti si na-aga nke afọ 2021 ndị nọ
Notice how the ● character has been removed, but the - (hyphen), which is an important part of Igbo, remains untouched.
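The behaviour shown above (symbols and punctuation removed, the hyphen kept) can be illustrated with a short standalone sketch. This is not the library's implementation, which builds on clean-text; the function name strip_symbols_keep_hyphen and the KEEP set are hypothetical and exist only for this illustration.

```python
import unicodedata

# Hypothetical allow-list: punctuation that should survive cleaning.
KEEP = {"-"}

def strip_symbols_keep_hyphen(text: str) -> str:
    """Drop Unicode symbols and punctuation, except characters in KEEP."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)  # e.g. 'So' for ●, 'Po' for ','
        if ch not in KEEP and category[0] in ("P", "S"):
            continue  # skip symbols and punctuation
        kept.append(ch)
    # Collapse whitespace runs left behind by the removed characters.
    return " ".join("".join(kept).split())

print(strip_symbols_keep_hyphen("Dịka● si na-aga nke afọ 2021, ndị nọ."))
```

Keeping an explicit allow-list is one simple way to make the cleaning language-aware: a different language could supply a different KEEP set.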
preprocess_file
To preprocess a file, use the preprocess_file function:
# output_path: specify the output path. If unspecified, the parent directory of the input file is used.
my_prep.preprocess_file('ig.txt',
                        output_path=None)
On successful completion you get this message:
Clean file(s) saved successfully to xxxxxxx/ig_CLEAN.txt
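The success message suggests that the cleaned file keeps the input name with a _CLEAN suffix and, by default, lands next to the input file. A minimal sketch of that naming rule follows. It is an assumption based only on the message above: the helper clean_output_path and its output_dir parameter are hypothetical, and the sketch treats the output path as a directory, which the README does not confirm.

```python
from pathlib import Path
from typing import Optional

def clean_output_path(input_path: str, output_dir: Optional[str] = None) -> Path:
    """Build '<name>_CLEAN<ext>' in output_dir, or beside the input file."""
    src = Path(input_path)
    # Fall back to the input file's parent directory when no directory is given.
    target_dir = Path(output_dir) if output_dir is not None else src.parent
    return target_dir / f"{src.stem}_CLEAN{src.suffix}"

print(clean_output_path("data/ig.txt").name)
```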
Properties of the preprocessing tool
- Language-first. It can:
  - map any African language name provided to its language code. You can write Preprocessor(lang='yoruba') using just the name.
  - map any language code to its BCP47 variant. So even if you use yo or yor, it does not matter.
- Simple to use
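The language-first mapping described in the list above can be sketched as a small lookup. The tables and the normalize_lang function below are hypothetical and cover only two languages; the library's actual tables and internals are not shown in this README.

```python
# Illustrative tables only (assumed for this sketch, not taken from the library).
NAME_TO_CODE = {"yoruba": "yo", "igbo": "ig"}   # language name -> two-letter code
ISO3_TO_CODE = {"yor": "yo", "ibo": "ig"}       # ISO 639-3 -> two-letter code

def normalize_lang(lang: str) -> str:
    """Map a language name or code to a single canonical two-letter code."""
    key = lang.strip().lower()
    if key in NAME_TO_CODE:
        return NAME_TO_CODE[key]
    if key in ISO3_TO_CODE:
        return ISO3_TO_CODE[key]
    if key in NAME_TO_CODE.values():
        return key  # already a canonical code
    raise ValueError(f"Unsupported language: {lang!r}")

for value in ("Yoruba", "yor", "yo"):
    print(value, "->", normalize_lang(value))
```

Normalizing once at construction time means the rest of the pipeline only ever has to handle one code per language.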
Contribution
We are open to and grateful for ideas to make this better. You can propose ideas as issues or pull requests.
With 💙 From The Contributors
Hashes for masakhanePreprocessor-0.0.6.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 6bb7d30b011c3831c201da74d98de5e41cb5c23517f3cd089317fec4cdfb6833
MD5 | c71dee0741ea56b3cef9d0835648c1d2
BLAKE2b-256 | 62bd0392b904d16a0918941ecf176c110ad777da78b62b8decb89a840591a8fd
Hashes for masakhanePreprocessor-0.0.6-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b03b24c037e93f3ac230e07725d1ea9987b70c8a8f4b65ebbd53e0b5250aa21f
MD5 | 95d57b4ad1c512daadb7ab44bd6a26e0
BLAKE2b-256 | db933b26cb1fff410c73e2089e0d02c8bb2e872f0fe7c979ea44baf31a055846
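If you download one of the files manually, the published SHA256 digest can be checked with Python's standard hashlib module. The sketch below is a generic verification helper, not part of masakhanePreprocessor; the function name sha256_of_file is hypothetical.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

To use it, compare sha256_of_file('masakhanePreprocessor-0.0.6.tar.gz') with the SHA256 value listed for that file; a mismatch means the download should not be trusted.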