Pre-filtering step for obvious noise based on rules, poor language based on general language modelling and vulgar language based on specific language modelling

bicleaner-hardrules

Bicleaner hard-rules (bicleaner-hardrules) is a pre-filtering step for obvious noise based on rules, poor language based on general language modelling and vulgar language based on specific language modelling. It is part of Bicleaner.

Installation & Requirements

Bicleaner hard-rules is written in Python and can be installed using pip:

python3.7 -m pip install bicleaner-hardrules

Bicleaner hard-rules requires the KenLM Python bindings with support for 7-gram language models. You can install them by running the following commands:

git clone https://github.com/kpu/kenlm
cd kenlm
python3.7 -m pip install . --install-option="--max_order 7"
mkdir -p build && cd build
cmake .. -DKENLM_MAX_ORDER=7 -DCMAKE_INSTALL_PREFIX:PATH=/your/prefix/path
make -j all install

Since v1.3, hard-rules uses FastSpell, which requires python-dev and libhunspell-dev:

sudo apt install python-dev libhunspell-dev

Also note that Hunspell language packages must be installed by hand if you are going to work with one of the languages listed as similar, e.g.:

sudo apt-get install hunspell-es

or downloaded from an external source, such as https://github.com/wooorm/dictionaries/tree/main/dictionaries

You can also provide the path to the Hunspell dictionary directories by using the dictpath attribute in {/YOUR/INSTALLATION/PATH}/config/hunspell.yaml (for example, venv/lib/python3.7/site-packages/fastspell/config/hunspell.yaml) if you are installing from PyPI or with setup.py, or in /config/hunspell.yaml if you are running the code directly. The default path is /usr/share/hunspell.

The remaining modules will be automatically downloaded and installed/upgraded (if required) with the first command. After installation, a binary file (bicleaner-hardrules) will be located in your python/installation/prefix/bin directory. This is usually $HOME/.local/bin or /usr/local/bin/.
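
To confirm that the executable ended up on your PATH after installation, a quick check along the following lines may help. This is only a sketch based on the Python standard library; the helper name find_hardrules is hypothetical and not part of Bicleaner.

```python
import shutil


def find_hardrules(name="bicleaner-hardrules"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_hardrules()
    if path is None:
        # The install prefix is usually $HOME/.local/bin or /usr/local/bin.
        print("bicleaner-hardrules not found on PATH")
    else:
        print("found:", path)
```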

Cleaning

bicleaner-hardrules aims at detecting obvious noisy sentence pairs in a parallel corpus. Sentences that are considered noisy will be tagged with a 0 and the rest will be tagged with a 1.
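
As a rough illustration of what rule-based tagging means, the sketch below applies three of the simplest rules listed under "Understanding annotated output" (no_empty, not_too_long, not_too_short) to a sentence pair. It is not the tool's actual implementation: the function names are hypothetical, and applying each rule to both sides of the pair and counting words by whitespace are assumptions.

```python
def first_failed_rule(source, target):
    """Return "keep", or the name of the first rule that the pair fails."""
    for sentence in (source, target):
        if not sentence.strip():
            return "no_empty"
        if len(sentence) > 1024:           # more than 1024 characters
            return "not_too_long"
        if len(sentence.split()) < 3:      # fewer than 3 whitespace-separated words
            return "not_too_short"
    return "keep"


def tag(source, target):
    """Tag a pair with 1 (kept) or 0 (considered noisy)."""
    return 1 if first_failed_rule(source, target) == "keep" else 0


if __name__ == "__main__":
    print(tag("This is a fine sentence.", "Esta es una buena frase."))
    print(first_failed_rule("Hello there", "Hola a todos"))
```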

By default, the input file (the parallel corpus to be classified) must contain at least two columns:

  • col1: Source sentence
  • col2: Target sentence

but the source and target sentence column indices can be customized with the --scol and --tcol flags, in case you have more columns.

The generated output file will contain the same lines and columns that the original input file had, adding an extra column containing the Bicleaner hard-rules tag.
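
Once a corpus has been tagged, a typical next step is to keep only the pairs tagged with 1. The following is a minimal sketch, assuming the tag is appended as the last tab-separated column and that --annotated_output was not used; the function name keep_clean is hypothetical.

```python
def keep_clean(lines):
    """Yield lines tagged 1, with the trailing tag column removed."""
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if columns[-1] == "1":
            yield "\t".join(columns[:-1])


if __name__ == "__main__":
    tagged = [
        "Hello world, how are you?\tHola mundo, ¿cómo estás?\t1\n",
        "%s {{ }}\t%s {{ }}\t0\n",
    ]
    for pair in keep_clean(tagged):
        print(pair)
```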

This tool can be run with

bicleaner-hardrules [-h]
                    [--annotated_output]
                    -s SOURCE_LANG
                    -t TARGET_LANG
                    [--tmp_dir TMP_DIR]
                    [-b BLOCK_SIZE]
                    [-p PROCESSES]
                    [--run_all]
                    [--disable_lang_ident]
                    [--disable_minimal_length]
                    [--scol SCOL]
                    [--tcol TCOL]
                    [--disable_lm_filter]
                    [--disable_porn_removal]
                    [--metadata METADATA]
                    [--lm_threshold LM_THRESHOLD]
                    [-q]
                    [--debug]
                    [--logfile LOGFILE]
                    [input]
                    [output]

Parameters

  • positional arguments:

    • input: Tab-separated files to be classified (default line format: URL1 URL2 SOURCE_SENTENCE TARGET_SENTENCE [EXTRA_COLUMNS], tab-separated). When input is -, reads standard input.
    • output: Output of the classification (default: standard output). When output is -, writes standard output.
  • Optional:

    • --annotated_output: Adds an extra column with each sentence's evaluation ("keep" if the sentence is good, otherwise the reason for rejecting it) (default: False)
    • --metadata METADATA: Training metadata (YAML file), generated by bicleaner-train or downloaded as part of a language pack. You just need to untar the language pack for the pair of languages that you want to clean. The tar file contains the YAML metadata file. There's a script that can download and unpack available language packs. As an example, if you are planning to clean an English to Czech file, use:
    $ ./utils/download-pack.sh en cs ./models
    

    to download the English-Czech language pack to the ./models directory and unpack it.

    • -S SOURCE_TOKENIZER_COMMAND: Source language tokenizer full command (including flags if needed). If not given, Sacremoses tokenizer is used (with escape=False option).
    • -T TARGET_TOKENIZER_COMMAND: Target language tokenizer full command (including flags if needed). If not given, Sacremoses tokenizer is used (with escape=False option).
    • --scol SCOL: Source sentence column (starting in 1) (default: 3)
    • --tcol TCOL: Target sentence column (starting in 1) (default: 4)
    • --tmp_dir TMP_DIR: Temporary directory in which this program creates its temporary files (default: default system temp dir, defined by the environment variable TMPDIR in Unix)
    • -b BLOCK_SIZE, --block_size BLOCK_SIZE: Sentence pairs per block (default: 10000)
    • -p PROCESSES, --processes PROCESSES: Number of processes to use (default: all CPUs minus one)
    • --lm_threshold LM_THRESHOLD: Threshold for language model fluency scoring. All sentence pairs whose LM fluency score falls below the threshold are removed (classifier score set to 0), unless the option --keep_lm_result is set. (default: 0.5)
    • -A or --run_all: Run all rules for each sentence instead of stopping at first discard (default: False)
    • -c CONFIG.yml or --config CONFIG.yml: Rules configuration file (default: None)
    • --disable_hardrules: Disables the bicleaner_hardrules filtering (only bicleaner_classify is applied) (default: False)
    • --disable_lm_filter: Disables LM filtering.
    • --disable_porn_removal: Disables porn removal.
    • --disable_minimal_length: Don't apply minimal length rule (default: False).
    • -h, --help: show this help message and exit
  • Logging:

    • -q, --quiet: Silent logging mode (default: False)
    • --debug: Debug logging mode (default: False)
    • --logfile LOGFILE: Store log to a file (default: stderr)
    • -v, --version: show version of this script and exit
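
If you call the tool from a Python pipeline, the command line can be assembled from the flags documented above. This is a minimal sketch, not an official API: build_command is a hypothetical helper, and it only uses the options -s, -t, --scol, --tcol and --annotated_output plus the two positional arguments.

```python
import subprocess


def build_command(source_lang, target_lang, input_path="-", output_path="-",
                  scol=None, tcol=None, annotated=False):
    """Build the argument list for a bicleaner-hardrules invocation."""
    cmd = ["bicleaner-hardrules", "-s", source_lang, "-t", target_lang]
    if scol is not None:
        cmd += ["--scol", str(scol)]   # source sentence column, starting at 1
    if tcol is not None:
        cmd += ["--tcol", str(tcol)]   # target sentence column, starting at 1
    if annotated:
        cmd.append("--annotated_output")
    cmd += [input_path, output_path]   # "-" means standard input / output
    return cmd


def run_hardrules(*args, **kwargs):
    """Run the tool; requires bicleaner-hardrules to be installed."""
    return subprocess.run(build_command(*args, **kwargs), check=True)
```

For example, build_command("en", "es", "corpus.en-es.raw", "corpus.en-es.classified", scol=1, tcol=2) yields a list that can be passed directly to subprocess.run.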

Example

bicleaner-hardrules  \
        -s en -t es  \
        corpus.en-es.raw  \
        corpus.en-es.classified

This will read the corpus.en-es.raw file, tag it and write the result to corpus.en-es.classified. Each line of the new file will contain the same content as the input file, plus a column with the tag assigned by Bicleaner hard-rules.

Automatic test

We included a small test corpus and a script to check that your Bicleaner classifier is working as expected. In order to use it, just run:

python3.7 -m pytest -s tests/hardrules_test.py

This will download the required language pack, classify the provided test corpus, and check the resulting classification scores. If everything went as expected, the output will be "1 passed in XX.XX seconds". All downloaded data will be removed at the end of the testing session.

Understanding annotated output

When using the --annotated_output flag, an extra column with each sentence's evaluation is added to the output. If the evaluation is keep, the sentence is good and passed all filters. Any other value in the extra column means that the sentence should be rejected and indicates the reason why. The possible rejection values and their meanings are listed below:

no_empty	Sentence is empty
not_too_long	Sentence is more than 1024 characters long
not_too_short	Sentence is less than 3 words long
length_ratio	The length ratio between the source sentence and target sentence (in bytes) is too low or too high
no_identical	Alphabetic content in source sentence and target sentence is identical
no_literals	Unwanted literals: "Re:", "{{", "%s", "}}", "+++", "***", '=\"'
no_only_numbers	The ratio of numeric characters in source sentence is too high
no_urls	There are URLs
no_breadcrumbs	There are more than 2 breadcrumb characters in the sentence
no_glued_words	There are words in the sentence containing too many uppercased characters between lowercased characters
no_repeated_words	There are words repeated consecutively
no_unicode_noise	Too many characters from unwanted unicode in source sentence
no_space_noise	Too many consecutive spaces in sentence
no_paren	Too many parentheses or brackets in sentence
no_escaped_unicode	There are escaped unicode characters in sentence
no_bad_encoding	Source sentence or target sentence contains mojibake
no_titles	All words in source sentence or target sentence are uppercased or in titlecase
no_wrong_language	Sentence is not in the desired language
no_porn	Source sentence or target sentence contains text identified as porn
lm_filter	The sentence pair has low fluency score from the language model
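
To see which rules discard the most sentence pairs, the annotated output can be summarised with a few lines of Python. This is a sketch under the assumption that the evaluation is the last tab-separated column of each output line; count_reasons is a hypothetical helper.

```python
from collections import Counter


def count_reasons(lines):
    """Count how often each evaluation value ("keep" or a rule name) appears."""
    counts = Counter()
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        counts[columns[-1]] += 1
    return counts


if __name__ == "__main__":
    annotated = [
        "A good sentence here.\tUna buena frase aquí.\t1\tkeep\n",
        "http://example.com\thttp://example.com\t0\tno_urls\n",
        "\tVacío en origen\t0\tno_empty\n",
    ]
    for reason, n in count_reasons(annotated).most_common():
        print(reason, n)
```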

Training classifiers

In case you need to train a new classifier (e.g. because it is not available in the language packs provided at bicleaner-data), you can use bicleaner-train. bicleaner-train is a Python3 tool that allows you to train a classifier that predicts whether a pair of sentences are mutual translations and discards sentence pairs that are too noisy. Visit our Wiki for a detailed example of Bicleaner training.

Citation

If you find Bicleaner useful, please consider citing the following papers:

V. M. Sánchez-Cartagena, M. Bañón, S. Ortiz-Rojas and G. Ramírez-Sánchez,
"Prompsit's submission to WMT 2018 Parallel Corpus Filtering shared task",
in Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers.
Brussels, Belgium: Association for Computational Linguistics, October 2018

@InProceedings{prompsit:2018:WMT,
  author    = { V\'{i}ctor M. S\'{a}nchez-Cartagena and Marta Ba{\~n}\'{o}n and Sergio Ortiz-Rojas and Gema Ram\'{i}rez-S\'{a}nchez},
  title     = {Prompsit's submission to WMT 2018 Parallel Corpus Filtering shared task},
  booktitle = {Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers},
  month     = {October},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics}
}

Gema Ramírez-Sánchez, Jaume Zaragoza-Bernabeu, Marta Bañón and Sergio Ortiz Rojas
"Bifixer and Bicleaner: two open-source tools to clean your parallel data.",
in Proceedings of the 22nd Annual Conference of the European Association for Machine Translation.
Lisboa, Portugal: European Association for Machine Translation, November 2020

@InProceedings{prompsit:2020:EAMT,
  author    = {Gema Ram\'{i}rez-S\'{a}nchez and Jaume Zaragoza-Bernabeu and Marta Ba{\~n}\'{o}n and Sergio Ortiz-Rojas},
  title     = {Bifixer and Bicleaner: two open-source tools to clean your parallel data.},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation},
  pages	    = {291--298},
  isbn      = {978-989-33-0589-8},
  year	    = {2020},
  month     = {November},
  address   = {Lisboa, Portugal},
  publisher = {European Association for Machine Translation}
}

Connecting Europe Facility

All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.
