Machine Translation Corpus Cleaning and Processing

MTCleanse: Machine Translation Corpus Cleaning

MTCleanse is a toolkit for cleaning and preprocessing parallel corpora used to train neural machine translation (NMT) systems. Built for researchers, language technologists, and MT practitioners, it addresses the "garbage in, garbage out" problem: a translation model can only be as good as the sentence pairs it is trained on.

By systematically removing noise, detecting misalignments, filtering problematic sentence pairs, and handling outliers, MTCleanse improves the quality of training data, which in turn helps produce more accurate and robust translation models.

Features

  • Clean parallel text datasets with configurable parameters
  • Remove noise such as URLs, emails, and control characters
  • Filter texts based on length constraints
  • Detect and remove statistical outliers
  • Filter by domain using sentence embeddings
  • Export cleaned data in various formats (text files, JSON)
  • Report comprehensive statistics on the cleaning process
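
To make the noise-removal feature concrete, here is a minimal sketch of what stripping URLs, emails, and control characters from a single segment can look like. It is an illustration only: the `strip_noise` function and the regular expressions are assumptions for this example, not MTCleanse's actual implementation, and real-world URL and email patterns are usually more involved.

```python
import re
import unicodedata

# Illustrative patterns only (assumptions); MTCleanse's own rules may differ.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
WHITESPACE_RE = re.compile(r"\s+")


def strip_noise(text: str) -> str:
    """Remove URLs, email addresses, and control characters from one segment."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    # Replace control characters (Unicode category "Cc") with a space.
    text = "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in text)
    # Collapse the whitespace left behind by the substitutions above.
    return WHITESPACE_RE.sub(" ", text).strip()
```

In a parallel corpus the same function would be applied to both the source and the target side of each pair, so that the two sides stay aligned line by line.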

Installation

pip install mtcleanse

Or install from source:

git clone https://github.com/yourusername/mtcleanse.git
cd mtcleanse
pip install -e .

Quick Start

from mtcleanse.cleaning import ParallelTextCleaner

# Initialize with default settings
cleaner = ParallelTextCleaner()

# Clean parallel text files
cleaner.clean_files(
    source_file="source.en",
    target_file="target.fr",
    output_source="clean_source.en",
    output_target="clean_target.fr"
)

# Or clean text directly
source_texts = ["Hello world", "This is a test"]
target_texts = ["Bonjour le monde", "C'est un test"]
clean_source, clean_target, stats = cleaner.clean_texts(source_texts, target_texts)
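
When the texts live on disk but you want to pass lists to `clean_texts`, the files first have to be loaded as line-aligned segments. The sketch below uses only the standard library and assumes the common convention that line N of the source file translates line N of the target file. The `read_parallel` helper is hypothetical and not part of MTCleanse.

```python
def _read_lines(path: str) -> list[str]:
    # Iterating over the file splits only on real newlines; str.splitlines()
    # would also split on characters such as U+2028 and could break alignment.
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(source_path: str, target_path: str) -> tuple[list[str], list[str]]:
    """Load two line-aligned files into equal-length lists of segments."""
    source = _read_lines(source_path)
    target = _read_lines(target_path)
    if len(source) != len(target):
        raise ValueError(
            f"Line count mismatch: {len(source)} source vs {len(target)} target"
        )
    return source, target
```

Failing early on a line-count mismatch is a deliberate choice here: a single missing line shifts every later pair and silently corrupts the rest of the corpus.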

Command Line Interface

MTCleanse also provides a command-line interface:

mtcleanse-clean --source source.en --target target.fr --output-source clean_source.en --output-target clean_target.fr

Configuration

You can customize the cleaning process with various parameters:

cleaner = ParallelTextCleaner({
    "min_chars": 10,
    "max_chars": 500,
    "min_words": 3,
    "max_words": 50,
    "enable_domain_filtering": True,
    "domain_contamination": 0.2
})

# This method returns the cleaned data and the statistics
clean_source, clean_target, stats = cleaner.clean_texts(
    source_texts=["Hello world", "This is a test"],
    target_texts=["Bonjour le monde", "C'est un test"]
)

# This method saves the cleaned data to disk and generates an HTML report
cleaner.clean_file(
    source_file="source.en",
    target_file="target.fr",
    output_source="clean_source.en",
    output_target="clean_target.fr",
    html_report="report.html"
)
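
As a rough picture of what the length parameters above control, the following sketch filters sentence pairs using the same keys (`min_chars`, `max_chars`, `min_words`, `max_words`). It rests on several assumptions that the library may not share: bounds are inclusive, words are split on whitespace, and a pair is kept only if both sides pass. The function names are hypothetical.

```python
def within_limits(text: str, cfg: dict) -> bool:
    """Check one segment against inclusive character and word bounds."""
    n_chars = len(text)
    n_words = len(text.split())  # whitespace tokenization (an assumption)
    return (
        cfg["min_chars"] <= n_chars <= cfg["max_chars"]
        and cfg["min_words"] <= n_words <= cfg["max_words"]
    )


def filter_by_length(source_texts, target_texts, cfg):
    """Keep only the pairs where both source and target are within limits."""
    kept = [
        (s, t)
        for s, t in zip(source_texts, target_texts)
        if within_limits(s, cfg) and within_limits(t, cfg)
    ]
    clean_source = [s for s, _ in kept]
    clean_target = [t for _, t in kept]
    return clean_source, clean_target
```

Filtering the pair as a unit, rather than each side independently, keeps the two output lists aligned. Whitespace tokenization is a poor fit for languages written without spaces, so character limits tend to be the more portable setting.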

Development

Setting up the development environment

# Clone the repository
git clone https://github.com/yourusername/mtcleanse.git
cd mtcleanse

# Install in development mode with development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pip install pre-commit
pre-commit install

# Run tests
pytest tests/ --cov=mtcleanse

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
