Parallel corpus classifier, indicating the likelihood of a pair of sentences being mutual translations or not (neural version)

Bicleaner AI

Bicleaner AI (bicleaner-ai-classify) is a Python tool that detects noisy sentence pairs in a parallel corpus. It scores each pair with the likelihood that the two sentences are mutual translations: values close to 1 indicate a likely translation, values close to 0 an unlikely one. Sentence pairs considered very noisy are scored with 0.

Although a training tool (bicleaner-ai-train) is provided, you may prefer the ready-to-use language packages. Use bicleaner-ai-download to download the latest language packages, or visit the GitHub releases for lite models and the Hugging Face Hub for full models (since v2.0). See our docs for a detailed example of Bicleaner training.

If you find Bicleaner AI useful, please consider citing us.

What is New?

v3.4 Hardrules is now optional

The Bicleaner Hardrules package and the training features are now optional. See the installation options below for how to proceed.

v3.0.0 Improving Multilinguality!

New improved multilingual models for zero-shot classification.

Previous news

v2.0.0, March 10, 2023

Model accuracy improvements and HF integration! See CHANGELOG.

v1.0.0, June 6, 2021

Bicleaner AI is a Bicleaner fork that uses neural networks. It comes with two types of models, lite models for fast scoring and full models for high performance. Lite models use A Decomposable Attention Model for Natural Language Inference (Parikh et al.). Full models use fine-tuned XLMRoberta (Unsupervised Cross-lingual Representation Learning at Scale).

The use of XLMRoberta and the 1:10 positive-to-negative ratio were inspired by the winning paper of the WMT20 Parallel Corpus Filtering Task (Filtering noisy parallel corpus using transformers with proxy task learning).

Installation & Requirements

Python >= 3.8
pip >= 23.0
CUDA >= 11.2 (for training and inference with full models)

Minimal installation

Bicleaner AI is written in Python and can be installed using pip.

pip install bicleaner-ai

Note that with this installation you must pass the --disable_hardrules option when running bicleaner-ai-classify; otherwise, install the hardrules feature.

Optional features

Since version 3.4, Hardrules and the training features are optional. If you only use the Bicleaner AI classifier with hardrules disabled, you can skip the optional features.

All features

To install all features, as in versions before 3.4, first follow the Hardrules steps described below and then install Bicleaner AI with:

pip install bicleaner-ai[all]

CUDA

Bicleaner AI (through TensorFlow) uses the system's CUDA installation by default, but CUDA can instead be installed in the Python environment if that is more convenient:

pip install bicleaner-ai[and-cuda]

Although this method makes the installation heavier, it is more reliable with respect to TensorFlow and CUDA compatibility, and more convenient if the user does not have system permissions.

Hardrules

If you need Bicleaner Hardrules, or want Bicleaner AI's default options to behave as they did before 3.4, install the hardrules feature. Hardrules uses KenLM and FastSpell. The KenLM Python bindings must be built with support for 7-gram language models, and FastSpell requires cyhunspell, which has to be installed manually. You can install all the requirements by running the following commands:

sudo apt install python3-dev build-essential autoconf autopoint libtool
pip install git+https://github.com/MSeal/cython_hunspell@2.0.3
pip install --config-settings="--build-option=--max_order=7" https://github.com/kpu/kenlm/archive/master.zip
pip install bicleaner-ai[hardrules]

Serbo-Croatian transliteration

For Serbo-Croatian languages, models work better with transliteration. To be able to score transliterated text, install the optional feature:

pip install bicleaner-ai[transliterate]

Note that this won't transliterate the output text; transliteration is used only for scoring.

Train

If you want to train models, install the train feature:

pip install bicleaner-ai[train]

After installation, three binary files (bicleaner-ai-train, bicleaner-ai-classify, bicleaner-ai-download) will be located in your python/installation/prefix/bin directory. This is usually $HOME/.local/bin or /usr/local/bin/.

TensorFlow

TensorFlow 2 will be installed as a dependency; GPU support is required for training. pip will install the latest supported TensorFlow version, but older versions >=2.16 are also supported and can be installed if your machine does not meet the TensorFlow CUDA requirements. See this table for CUDA and TensorFlow version compatibility. If you want a different TensorFlow version, you can downgrade using:

pip install tensorflow==2.16

TensorFlow logging messages are suppressed by default. To see them, explicitly set the TF_CPP_MIN_LOG_LEVEL environment variable. For example:

TF_CPP_MIN_LOG_LEVEL=0 bicleaner-ai-classify

WARNING: If you are experiencing slowdowns because Bicleaner AI is not running on the GPU, check those logs to see whether TensorFlow is loading all the libraries correctly.

Cleaning

Getting started

bicleaner-ai-classify detects noisy sentence pairs in a parallel corpus. It scores each pair with the likelihood that the two sentences are mutual translations: values close to 1 indicate a likely translation, values close to 0 an unlikely one. Sentence pairs considered very noisy are scored with 0.

By default, the input file (the parallel corpus to be classified) is expected to have at least four columns:

col1: URL 1
col2: URL 2
col3: Source sentence
col4: Target sentence

The column indexes of the source and target sentences can be customized with the --scol and --tcol flags. URLs are not mandatory.

The generated output file will contain the same lines and columns as the input file, plus an extra column containing the Bicleaner AI classifier score.
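A common next step is to keep only the pairs whose score reaches some threshold. The sketch below is a hypothetical post-processing helper, not part of Bicleaner AI. It assumes the score is the last tab-separated column, as described above, and the 0.5 threshold is an arbitrary illustration rather than a recommended value.

```python
import io


def filter_by_score(lines, threshold=0.5):
    """Yield lines whose last tab-separated column is a score >= threshold.

    Hypothetical helper: `threshold` is an arbitrary example value.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        # The classifier score is assumed to be the last column.
        score = float(line.rsplit("\t", 1)[-1])
        if score >= threshold:
            yield line


if __name__ == "__main__":
    # Made-up scored rows, for illustration only.
    scored = io.StringIO(
        "url1\turl2\tHello world\tBonjour le monde\t0.92\n"
        "url1\turl2\tHello world\tAcheter maintenant\t0.03\n"
    )
    for kept in filter_by_score(scored):
        print(kept)
```

The function works on any iterable of lines, so it can be fed an open file object as well as an in-memory list.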

Download a model

Bicleaner AI has two types of models: full and lite. Full models are recommended, as they provide much higher quality. If speed is a hard constraint for you, lite models could be an option (take a look at the speed comparison).

See available full models here and available lite models here.

You can download the model with:

bicleaner-ai-download en fr full

This will download the bitextor/bicleaner-ai-full-en-fr model from HuggingFace and store it in the cache directory.

Or you can download a lite model with:

bicleaner-ai-download en fr lite ./bicleaner-models

This will download and store the en-fr lite model at ./bicleaner-models/en-fr.

Since version 2.3.0, full models can also be downloaded to a local path instead of the HF cache directory. In that case, to use the model, provide the local path instead of the HF identifier.

To learn more about how the HF cache works, please read the official documentation.

Classifying

To classify a tab-separated file containing English sentences in the first column and French sentences in the second column, use:

bicleaner-ai-classify  \
        --scol 1 --tcol 2  \
        corpus.en-fr.tsv  \
        corpus.en-fr.classified.tsv  \
        bitextor/bicleaner-ai-full-en-fr

where --scol and --tcol indicate the columns of the source and target sentences, corpus.en-fr.tsv is the input file, corpus.en-fr.classified.tsv is the output file, and bitextor/bicleaner-ai-full-en-fr is the HuggingFace model name. Each line of the output file will contain the same content as the input file, plus a column with the score given by the Bicleaner AI classifier.

Note that, to use a lite model, you need to provide the model path in your local file system instead of a HuggingFace model name.
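If you drive the classifier from a Python script, the command line can be assembled programmatically. This is a minimal sketch, not an official API: build_classify_command is a hypothetical helper name, and it only uses flags and positional arguments that appear in this document (--scol, --tcol, --disable_hardrules, input, output, model).

```python
import shlex


def build_classify_command(input_path, output_path, model,
                           scol=3, tcol=4, disable_hardrules=False):
    """Return the argument list for a bicleaner-ai-classify call.

    Hypothetical helper; `model` is either a HuggingFace identifier
    (full models) or a local directory (lite models).
    """
    cmd = ["bicleaner-ai-classify", "--scol", str(scol), "--tcol", str(tcol)]
    if disable_hardrules:
        # Needed when the optional hardrules feature is not installed.
        cmd.append("--disable_hardrules")
    cmd += [input_path, output_path, model]
    return cmd


if __name__ == "__main__":
    cmd = build_classify_command(
        "corpus.en-fr.tsv",
        "corpus.en-fr.classified.tsv",
        "bitextor/bicleaner-ai-full-en-fr",
        scol=1,
        tcol=2,
    )
    # Print the shell-quoted command; to run it for real, use
    # subprocess.run(cmd, check=True) with bicleaner-ai installed.
    print(shlex.join(cmd))
```

Building a list of arguments rather than a single string avoids shell quoting problems when file names contain spaces.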

Multilingual models

Multilingual full models are available. They can potentially work with any language that XLMR supports (currently only paired with English). For a further explanation of how to train a multilingual model or how our models perform, take a look here and here.

WARNING: multilingual models will disable the hardrules that expect a language parameter. You can, however, overwrite the language code in the model configuration with the -s/--source_lang or -t/--target_lang options when classifying. For example, when scoring English-Icelandic data, use:

bicleaner-ai-classify \
    --scol 1 --tcol 2 \
    -t is \
    corpus.en-is.tsv \
    corpus.en-is.classified.tsv \
    bitextor/bicleaner-ai-full-en-xx

Usage

Full description of the command-line parameters:

usage: bicleaner-ai-classify [-h] [-s SOURCE_LANG] [-t TARGET_LANG] [-S SOURCE_TOKENIZER_COMMAND] [-T TARGET_TOKENIZER_COMMAND] [--header] [--scol SCOL] [--tcol TCOL] [-b BLOCK_SIZE] [-p PROCESSES] [--batch_size BATCH_SIZE]
                             [--tmp_dir TMP_DIR] [--score_only] [--calibrated] [--raw_output] [--lm_threshold LM_THRESHOLD] [--disable_hardrules] [--disable_lm_filter] [--disable_porn_removal] [--disable_minimal_length]
                             [--run_all_rules] [--rules_config RULES_CONFIG] [--offline] [--auth_token AUTH_TOKEN] [-q] [--debug] [--logfile LOGFILE] [-v]
                             input [output] model

positional arguments:
  input                 Tab-separated files to be classified
  output                Output of the classification (default: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>)
  model                 Path to model directory or HuggingFace Hub model identifier (such as 'bitextor/bicleaner-ai-full-en-fr')

options:
  -h, --help            show this help message and exit

Optional:
  -s SOURCE_LANG, --source_lang SOURCE_LANG
                        Overwrite model config source language (default: None)
  -t TARGET_LANG, --target_lang TARGET_LANG
                        Overwrite model config target language (default: None)
  -S SOURCE_TOKENIZER_COMMAND, --source_tokenizer_command SOURCE_TOKENIZER_COMMAND
                        Source language (SL) tokenizer full command (default: None)
  -T TARGET_TOKENIZER_COMMAND, --target_tokenizer_command TARGET_TOKENIZER_COMMAND
                        Target language (TL) tokenizer full command (default: None)
  --header              Input file will be expected to have a header, and the output will have a header as well (default: False)
  --scol SCOL           Source sentence column (starting in 1). The name of the field is expected instead of the position if --header is set (default: 3)
  --tcol TCOL           Target sentence column (starting in 1). The name of the field is expected instead of the position if --header is set (default: 4)
  -b BLOCK_SIZE, --block_size BLOCK_SIZE
                        Sentence pairs per block (default: 10000)
  -p PROCESSES, --processes PROCESSES
                        Option no longer available, please set BICLEANER_AI_THREADS environment variable (default: None)
  --batch_size BATCH_SIZE
                        Sentence pairs per block (default: 32)
  --tmp_dir TMP_DIR     Temporary directory where creating the temporary files of this program (default: /tmp)
  --score_only          Only output one column which is the bicleaner score (default: False)
  --calibrated          Output calibrated scores (default: False)
  --raw_output          Return raw output without computing positive class probability. (default: False)
  --lm_threshold LM_THRESHOLD
                        Threshold for language model fluency scoring. All TUs whose LM fluency score falls below the threshold will are removed (classifier score set to 0), unless the option --keep_lm_result set. (default: 0.5)
  --disable_hardrules   Disables the bicleaner_hardrules filtering (only bicleaner_classify is applied) (default: False)
  --disable_lm_filter   Disables LM filtering (default: False)
  --disable_porn_removal
                        Don't apply porn removal (default: False)
  --disable_minimal_length
                        Don't apply minimal length rule (default: False)
  --run_all_rules       Run all rules of Hardrules instead of stopping at first discard (default: False)
  --rules_config RULES_CONFIG
                        Hardrules configuration file (default: None)
  --offline             Don't try to download the model, instead try directly to load from local storage (default: False)
  --auth_token AUTH_TOKEN
                        Auth token for the Hugging Face Hub (default: None)

Logging:
  -q, --quiet           Silent logging mode (default: False)
  --debug               Debug logging mode (default: False)
  --logfile LOGFILE     Store log to a file (default: <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>)
  -v, --version         Show version of the package and exit

Training models

Bicleaner AI provides a command-line tool to train your own model, in case available models do not fit your needs. Please go to our training documentation for a quick start and further details.

Setting the number of threads

The --processes option is no longer available. To set the maximum number of threads/processes used during training or classifying, set the BICLEANER_AI_THREADS environment variable to the desired value. For example:

BICLEANER_AI_THREADS=12 bicleaner-ai-classify ...

If the variable is not set, the program will use all the available CPU cores.
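When launching the tool from Python instead of a shell, the same variable can be passed through the child process environment. The following is a sketch under that assumption; env_with_threads is a hypothetical helper name, and only the BICLEANER_AI_THREADS variable comes from this document.

```python
import os


def env_with_threads(threads, base_env=None):
    """Return a copy of the environment with BICLEANER_AI_THREADS set.

    Hypothetical helper: the caller's environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["BICLEANER_AI_THREADS"] = str(threads)
    return env


if __name__ == "__main__":
    env = env_with_threads(12)
    print(env["BICLEANER_AI_THREADS"])
    # Pass `env` to the child process when running the classifier, e.g.
    # subprocess.run(command, env=env, check=True)
```

Copying the environment keeps the setting local to the child process, so other tools started from the same script are not affected.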

Speed

A comparison of the speed in number of sentences per second between different types of models and hardware:

model	speed CPUx1	speed GPUx1
full	1.78 rows/sec	200 rows/sec
lite	600 rows/sec	10,000 rows/sec

CPU: Intel Core i9-9960X single core (lite model batch 16, full model batch 1)
GPU: Nvidia V100 (lite model batch 2048, full model batch 16)
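To get a feel for what these figures mean for a given corpus, the rough estimate below divides the number of rows by the throughput from the table. It is only back-of-the-envelope arithmetic: it assumes the reported rows/sec hold for your data and hardware, and estimated_hours is a hypothetical helper name.

```python
# Throughput figures copied from the table above (rows per second).
ROWS_PER_SECOND = {
    ("full", "cpu"): 1.78,
    ("full", "gpu"): 200.0,
    ("lite", "cpu"): 600.0,
    ("lite", "gpu"): 10_000.0,
}


def estimated_hours(num_rows, model, device):
    """Estimate scoring time in hours for `num_rows` sentence pairs."""
    return num_rows / ROWS_PER_SECOND[(model, device)] / 3600.0


if __name__ == "__main__":
    for model, device in ROWS_PER_SECOND:
        hours = estimated_hours(1_000_000, model, device)
        print(f"{model:4s} {device}: {hours:.2f} h per million rows")
```

Actual times depend on sentence length, batch size, and hardware, so treat the result as an order-of-magnitude guide.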

Citation

J. Zaragoza-Bernabeu, M. Bañón, G. Ramírez-Sánchez, S. Ortiz-Rojas,
"Bicleaner AI: Bicleaner Goes Neural",
in Proceedings of the 13th Language Resources and Evaluation Conference.
Marseille, France: Language Resources and Evaluation Conference, June 2022

@inproceedings{zaragoza-bernabeu-etal-2022-bicleaner,
    title = {"Bicleaner {AI}: Bicleaner Goes Neural"},
    author = {"Zaragoza-Bernabeu, Jaume  and
      Ram{\'\i}rez-S{\'a}nchez, Gema  and
      Ba{\~n}{\'o}n, Marta  and
      Ortiz Rojas, Sergio"},
    booktitle = {"Proceedings of the Thirteenth Language Resources and Evaluation Conference"},
    month = jun,
    year = {"2022"},
    address = {"Marseille, France"},
    publisher = {"European Language Resources Association"},
    url = {"https://aclanthology.org/2022.lrec-1.87"},
    pages = {"824--831"},
    abstract = {"This paper describes the experiments carried out during the development of the latest version of Bicleaner, named Bicleaner AI, a tool that aims at detecting noisy sentences in parallel corpora. The tool, which now implements a new neural classifier, uses state-of-the-art techniques based on pre-trained transformer-based language models fine-tuned on a binary classification task. After that, parallel corpus filtering is performed, discarding the sentences that have lower probability of being mutual translations. Our experiments, based on the training of neural machine translation (NMT) with corpora filtered using Bicleaner AI for two different scenarios, show significant improvements in translation quality compared to the previous version of the tool which implemented a classifier based on Extremely Randomized Trees."},
}

Connecting Europe Facility

All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.

Release history

3.6

Feb 23, 2026

3.5

Jan 29, 2026

3.4

Jan 19, 2026

3.3.1

Jan 16, 2026

3.3.0

Jun 12, 2025

3.2.1

May 16, 2025

3.2.0

May 14, 2025

3.1.0

Jul 26, 2024

3.0.1

Apr 16, 2024

3.0.0

Apr 16, 2024

2.3.2

Aug 21, 2023

2.3.1

Aug 9, 2023

2.3

Jul 11, 2023

2.2.2

Jun 9, 2023

2.2.1

May 8, 2023

2.2.0

Mar 27, 2023

2.1.1

Mar 14, 2023

2.1.0

Mar 10, 2023

2.0.0

Mar 10, 2023

2.0.0rc2 pre-release

Jan 18, 2023

1.0.1

Jun 16, 2021

1.0

Jun 14, 2021
