Monolingual corpus fluency filter

monocleaner

Monocleaner is a Python tool that aims to detect disfluent sentences in a monolingual corpus. Each sentence is assigned a fluency score between 0 and 1, with higher scores indicating more fluency. In addition to a continuous score, several handwritten rules assign a score of 0 to obviously poor sentences.

Although a training tool (monocleaner-train) is provided, you may prefer to use the ready-to-use language packages. Please visit https://github.com/bitextor/monocleaner-data/releases/latest or use monocleaner-download to download the latest language packages.

Citation

If you find Monocleaner useful, please consider citing the following paper:

V. M. Sánchez-Cartagena, M. Bañón, S. Ortiz-Rojas and G. Ramírez-Sánchez,
"Prompsit's submission to WMT 2018 Parallel Corpus Filtering shared task",
in Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers.
Brussels, Belgium: Association for Computational Linguistics, October 2018

@InProceedings{prompsit:2018:WMT,
  author    = { V\'{i}ctor M. S\'{a}nchez-Cartagena and Marta Ba{\~n}\'{o}n and Sergio Ortiz-Rojas and Gema Ram\'{i}rez-S\'{a}nchez},
  title     = {Prompsit's submission to WMT 2018 Parallel Corpus Filtering shared task},
  booktitle = {Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers},
  month     = {October},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  year      = {2018}
}

Installation & Requirements

Monocleaner can be installed using pip:

python3.7 -m pip install monocleaner

Monocleaner requires the KenLM Python bindings with support for 7-gram language models. You can install them by running the following commands:

git clone https://github.com/kpu/kenlm
cd kenlm
python3.7 -m pip install . --install-option="--max_order 7"
mkdir -p build && cd build
cmake .. -DKENLM_MAX_ORDER=7 -DCMAKE_INSTALL_PREFIX:PATH=/your/prefix/path
make -j all install

The remaining modules required by Monocleaner are downloaded and installed or upgraded automatically, if needed, by the pip install monocleaner command above.

After installation, two executables (monocleaner-train and monocleaner) will be located in the bin directory of your Python installation prefix. This is usually $HOME/.local/bin or /usr/local/bin/.

Monocleaner uses FastSpell, which requires python-dev and libhunspell-dev:

sudo apt install python-dev libhunspell-dev

Also note that Hunspell language packages must be installed by hand if you are going to work with one of the languages listed as similar, e.g.:

sudo apt-get install hunspell-es

or downloaded from an external source, such as https://github.com/wooorm/dictionaries/tree/main/dictionaries

You can also provide the path to the Hunspell dictionary directories with the dictpath attribute in {/YOUR/INSTALLATION/PATH}/config/hunspell.yaml (for example, venv/lib/python3.7/site-packages/fastspell/config/hunspell.yaml) if you installed from PyPI or with setup.py, or in /config/hunspell.yaml if you are running the code directly. The default path is /usr/share/hunspell.
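
Before working with a language that needs Hunspell, it can help to check which dictionaries are actually present in the dictionary directory. The following is a minimal sketch and not part of Monocleaner or FastSpell: the helper name list_hunspell_dicts is hypothetical, and it relies only on the fact that a Hunspell dictionary consists of a .aff and a .dic file sharing the same base name.

```python
from pathlib import Path


def list_hunspell_dicts(dictpath="/usr/share/hunspell"):
    """Return sorted base names that have both a .aff and a .dic file.

    Hypothetical helper for illustration; not part of Monocleaner or FastSpell.
    """
    root = Path(dictpath)
    if not root.is_dir():
        # Directory missing: report no dictionaries instead of failing.
        return []
    aff = {p.stem for p in root.glob("*.aff")}
    dic = {p.stem for p in root.glob("*.dic")}
    # A dictionary is usable only when both files are present.
    return sorted(aff & dic)


if __name__ == "__main__":
    # Default path taken from the text above; pass another path if dictpath was changed.
    print(list_hunspell_dicts())
```

If the language you need is missing from the printed list, install the corresponding Hunspell package or point dictpath at a directory that contains it.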

Scoring

monocleaner aims to detect disfluent sentences in a monolingual corpus. Each sentence is assigned a fluency score between 0 and 1, with higher scores indicating more fluency. In addition to a continuous score, several handwritten hardrules assign a score of 0 to obviously poor sentences.

The input file (monolingual corpus) must be plain text with one sentence per line. The generated output file will contain the same lines with an additional column holding the Monocleaner fluency score.

This tool can be run with:

monocleaner [-h]
            [--disable_minimal_length]
            [--disable_hardrules]
            [--score_only]
            [--annotated_output]
            [--add_lang_ident]
            [--debug]
            [-q]
            model_dir [input] [output]

If input and output are omitted, it will read from stdin and write to stdout.

Parameters

  • Positional arguments:
    • model_dir: Directory where the model is stored.
    • input: Input text file, one sentence per line. When omitted together with output, input is read from stdin.
    • output: Output tab-separated text file with the monocleaner score added. When omitted, output is written to stdout.
  • Optional arguments:
    • --score_only: Only output one column which is the monocleaner score (default: False)
    • --add_lang_ident: Add another column with the identified language if it's not disabled.
    • --disable_hardrules: Disables the hardrules filtering (only monocleaner fluency scoring is applied) (default: False)
    • --disable_minimal_length: Don't apply minimal length rule (default: False).
  • Logging:
    • -q, --quiet: Silent logging mode (default: False)
    • --debug: Debug logging mode (default: False)
    • -v, --version: show version of this script and exit
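
When Monocleaner is called from a Python pipeline, the arguments listed above can be assembled programmatically. The sketch below is one possible wrapper, not an API provided by Monocleaner: the function names build_monocleaner_cmd and run_monocleaner are hypothetical, and only flags documented in this section are used.

```python
import subprocess


def build_monocleaner_cmd(model_dir, input_path=None, output_path=None,
                          score_only=False, disable_hardrules=False,
                          add_lang_ident=False):
    """Assemble the argument list for the monocleaner command line.

    Hypothetical helper for illustration; it only builds the list.
    """
    cmd = ["monocleaner"]
    if score_only:
        cmd.append("--score_only")
    if disable_hardrules:
        cmd.append("--disable_hardrules")
    if add_lang_ident:
        cmd.append("--add_lang_ident")
    cmd.append(model_dir)
    # input and output are optional positionals, so output needs input before it.
    if input_path is not None:
        cmd.append(input_path)
        if output_path is not None:
            cmd.append(output_path)
    elif output_path is not None:
        raise ValueError("output_path requires input_path")
    return cmd


def run_monocleaner(*args, **kwargs):
    """Run monocleaner; subprocess raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_monocleaner_cmd(*args, **kwargs), check=True)
```

Keeping the command construction separate from the call makes it possible to log or inspect the exact command before anything is executed.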

Example

monocleaner models/es mono.es.txt mono.es.scored.txt

This will use the Spanish model located at models/es, read the file mono.es.txt, and write its sentences to mono.es.scored.txt with the monocleaner score added as an extra column.
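
A typical next step is to keep only the sentences above a chosen fluency score. The following is a minimal sketch under stated assumptions: the score is taken to be the last tab-separated column (output produced without --score_only and without --add_lang_ident), the function name filter_by_score is hypothetical, and the 0.5 threshold and the sample scores are invented for illustration, not recommended values.

```python
import io


def filter_by_score(lines, threshold=0.5):
    """Yield the text of each scored line whose score is >= threshold.

    Assumes the score is the last tab-separated column of every line.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so that tabs inside the text are preserved.
        text, _, score = line.rpartition("\t")
        if float(score) >= threshold:
            yield text


# Invented sample standing in for a scored file such as mono.es.scored.txt.
sample = io.StringIO("Esta es una frase correcta.\t0.93\nasdf qwer zxcv\t0.00\n")
kept = list(filter_by_score(sample))
```

With a real file, pass the open file object instead of the sample, for example filter_by_score(open("mono.es.scored.txt", encoding="utf-8")). Sentences that the hardrules scored as 0 are dropped by any positive threshold.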


Connecting Europe Facility

All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.
