Machine Translation Corpus Cleaning and Processing
MTCleanse: Machine Translation Corpus Cleaning
MTCleanse is a toolkit for cleaning and preprocessing parallel corpora used to train neural machine translation (NMT) systems. Built for researchers, language technologists, and MT practitioners, it addresses the "garbage in, garbage out" problem: a model trained on noisy sentence pairs tends to produce noisy translations.
By removing noise, detecting misalignments, filtering problematic sentence pairs, and handling outliers, MTCleanse aims to improve the quality of the training data and, in turn, the accuracy and robustness of the models trained on it.
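To make the idea of misalignment and outlier detection concrete, here is a minimal standalone sketch that flags sentence pairs whose source/target length ratio is far from the rest of the corpus. It is not MTCleanse's implementation: the function name `ratio_outlier_mask`, the median/MAD scoring, and the `threshold` value are all assumptions chosen for illustration.

```python
import math
import statistics


def ratio_outlier_mask(source_texts, target_texts, threshold=3.5):
    """Return one boolean per pair: True if its length ratio looks like an outlier.

    Hypothetical helper for illustration only, not part of MTCleanse.
    """
    if len(source_texts) != len(target_texts):
        raise ValueError("source and target must have the same number of lines")

    # Log of the character-length ratio; the +1 avoids division by zero
    # and log(0) on empty lines.
    log_ratios = [
        math.log((len(src) + 1) / (len(tgt) + 1))
        for src, tgt in zip(source_texts, target_texts)
    ]
    if len(log_ratios) < 2:
        return [False] * len(log_ratios)

    # Median and median absolute deviation (MAD) are robust to the very
    # outliers we are trying to find, unlike mean and standard deviation.
    median = statistics.median(log_ratios)
    deviations = [abs(r - median) for r in log_ratios]
    mad = statistics.median(deviations)
    if mad == 0:
        # No spread to measure against; flag nothing.
        return [False] * len(log_ratios)

    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # under a normal distribution.
    return [0.6745 * d / mad > threshold for d in deviations]
```

A pair with a five-character source and a two-hundred-character target would be flagged in a corpus where most pairs have roughly equal lengths. Real corpora usually need a language-pair-specific notion of "normal" ratio, which is why the sketch measures deviation from the corpus median instead of from 1.0.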
Features
- Clean parallel text datasets with configurable parameters
- Remove noise such as URLs, emails, and control characters
- Filter texts based on length constraints
- Detect and remove statistical outliers
- Domain-based filtering using sentence embeddings
- Export cleaned data in various formats (text files, JSON)
- Comprehensive statistics on the cleaning process
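As an illustration of the noise-removal feature listed above, the following standalone sketch strips URLs, email addresses, and control characters from a single line using only the standard library. The regular expressions and the function name `strip_noise` are assumptions made for this example; MTCleanse's own rules may differ.

```python
import re
import unicodedata

# Deliberately simple patterns, good enough for a sketch.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
WHITESPACE_RE = re.compile(r"\s+")


def strip_noise(text):
    """Remove URLs, emails, and control characters, then normalise whitespace.

    Hypothetical helper for illustration only, not part of MTCleanse.
    """
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    # Drop Unicode control characters (category "Cc") but keep tab, newline,
    # and carriage return so they can be collapsed to single spaces below.
    text = "".join(
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
    )
    return WHITESPACE_RE.sub(" ", text).strip()
```

In a parallel corpus the same function would be applied to both sides of each pair, and pairs that become empty on either side would then be dropped.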
Installation
pip install mtcleanse
Or install from source:
git clone https://github.com/yourusername/mtcleanse.git
cd mtcleanse
pip install -e .
Quick Start
from mtcleanse.cleaning import ParallelTextCleaner
# Initialize with default settings
cleaner = ParallelTextCleaner()
# Clean parallel text files
cleaner.clean_files(
    source_file="source.en",
    target_file="target.fr",
    output_source="clean_source.en",
    output_target="clean_target.fr",
)
# Or clean text directly; clean_texts returns the cleaned data and the statistics
source_texts = ["Hello world", "This is a test"]
target_texts = ["Bonjour le monde", "C'est un test"]
clean_source, clean_target, stats = cleaner.clean_texts(source_texts, target_texts)
Command Line Interface
MTCleanse also provides a command-line interface:
mtcleanse-clean --source source.en --target target.fr --output-source clean_source.en --output-target clean_target.fr
Configuration
You can customize the cleaning process with various parameters:
cleaner = ParallelTextCleaner({
    "min_chars": 10,
    "max_chars": 500,
    "min_words": 3,
    "max_words": 50,
    "enable_domain_filtering": True,
    "domain_contamination": 0.2,
})

# This method returns the cleaned data and the statistics
clean_source, clean_target, stats = cleaner.clean_texts(
    source_texts=["Hello world", "This is a test"],
    target_texts=["Bonjour le monde", "C'est un test"],
)

# This method saves the cleaned data to disk and generates an HTML report
cleaner.clean_file(
    source_file="source.en",
    target_file="target.fr",
    output_source="clean_source.en",
    output_target="clean_target.fr",
    html_report="report.html",
)
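To show what the length-related settings above (`min_chars`, `max_chars`, `min_words`, `max_words`) plausibly mean for a sentence pair, here is a small standalone filter. It is a sketch under assumptions, not MTCleanse's code: it assumes the limits apply to both sides of a pair, that words are whitespace-separated, and that a pair is dropped if either side violates a limit. The names `filter_by_length` and `EXAMPLE_LIMITS` and the keys of the returned statistics are hypothetical.

```python
# Values copied from the configuration example; these are not
# necessarily the library's defaults.
EXAMPLE_LIMITS = {"min_chars": 10, "max_chars": 500, "min_words": 3, "max_words": 50}


def within_limits(text, limits):
    """Check one side of a pair against character and word limits."""
    n_chars = len(text)
    n_words = len(text.split())  # assumes whitespace-separated words
    return (
        limits["min_chars"] <= n_chars <= limits["max_chars"]
        and limits["min_words"] <= n_words <= limits["max_words"]
    )


def filter_by_length(source_texts, target_texts, limits=None):
    """Keep only pairs where both sides satisfy the limits.

    Hypothetical helper for illustration only, not part of MTCleanse.
    Returns (kept_sources, kept_targets, stats).
    """
    if len(source_texts) != len(target_texts):
        raise ValueError("source and target must have the same number of lines")
    limits = {**EXAMPLE_LIMITS, **(limits or {})}

    kept_src, kept_tgt = [], []
    for src, tgt in zip(source_texts, target_texts):
        if within_limits(src, limits) and within_limits(tgt, limits):
            kept_src.append(src)
            kept_tgt.append(tgt)

    stats = {
        "input_pairs": len(source_texts),
        "kept_pairs": len(kept_src),
        "removed_pairs": len(source_texts) - len(kept_src),
    }
    return kept_src, kept_tgt, stats
```

Under these assumptions, the pair "Hello world" / "Bonjour le monde" would be removed because the source has only two words, while "This is a test" / "C'est un test" would be kept. Whitespace splitting is a poor word count for languages written without spaces, so character limits are the more portable of the two constraints.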
Development
Setting up the development environment
# Clone the repository
git clone https://github.com/yourusername/mtcleanse.git
cd mtcleanse
# Install in development mode with development dependencies
pip install -e ".[dev]"
# Install pre-commit hooks
pip install pre-commit
pre-commit install
# Run tests
pytest tests/ --cov=mtcleanse
License
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
File details
Details for the file mtcleanse-0.2.1.tar.gz.
File metadata
- Download URL: mtcleanse-0.2.1.tar.gz
- Upload date:
- Size: 28.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4d70b84d6f8950e82ed6c0b3f73b59d1e6fdde571692c192b94d4c842ced5a3c |
| MD5 | 299a653cd22f5b1bd123f4848054a1dc |
| BLAKE2b-256 | dc9c577e40bf73285ba2cd1694ff13a1ca379654aa2f5ba2e38d893fb33cec07 |
File details
Details for the file mtcleanse-0.2.1-py3-none-any.whl.
File metadata
- Download URL: mtcleanse-0.2.1-py3-none-any.whl
- Upload date:
- Size: 30.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3bb65285dc78c647ff5a4bcfbd628eb9b0404d714157c29fcbab832f691bbbb1 |
| MD5 | 7abaea717ded25d758b22b4ae9ffa5c6 |
| BLAKE2b-256 | 467c1d27d09bc72855389cfd0e11478b88c6e14bbec43e4f7ea4ba4d26a604a6 |