Machine Translation Corpus Cleaning and Processing

MTCleanse: Machine Translation Corpus Cleaning and Processing

MTCleanse is a Python library for cleaning and processing parallel text datasets. It is aimed mainly at preparing machine translation corpora, but it also suits other NLP tasks that use aligned text.

Features

  • Clean parallel text datasets with configurable parameters
  • Remove noise such as URLs, emails, and control characters
  • Filter texts based on length constraints
  • Detect and remove statistical outliers
  • Domain-based filtering using sentence embeddings
  • Export cleaned data in various formats (text files, JSON)
  • Comprehensive statistics on the cleaning process
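
To make the noise-removal feature concrete, here is a minimal standard-library sketch of stripping URLs, emails, and control characters from a single string. The regular expressions and the function name strip_noise are assumptions chosen for illustration; they are not taken from MTCleanse, whose own rules may differ.

```python
import re
import unicodedata

# Hypothetical patterns for illustration; MTCleanse's own rules may differ.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
WHITESPACE_RE = re.compile(r"\s+")


def _replace_control(ch: str) -> str:
    """Map one character: drop control characters, keep everything else."""
    if unicodedata.category(ch) != "Cc":
        return ch
    # Whitespace-like controls (tab, newline) become a plain space so that
    # neighbouring words do not get glued together; other controls vanish.
    return " " if ch.isspace() else ""


def strip_noise(text: str) -> str:
    """Remove URLs, emails, and control characters, then collapse spaces."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = "".join(_replace_control(ch) for ch in text)
    return WHITESPACE_RE.sub(" ", text).strip()
```

In a parallel corpus, a function like this would be applied to both sides of each pair before any length or outlier filtering.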

Installation

pip install mtcleanse

Or install from source:

git clone https://github.com/yourusername/mtcleanse.git
cd mtcleanse
pip install -e .

Quick Start

from mtcleanse.cleaning import ParallelTextCleaner

# Initialize with default settings
cleaner = ParallelTextCleaner()

# Clean parallel text files
cleaner.clean_files(
    source_file="source.en",
    target_file="target.fr",
    output_source="clean_source.en",
    output_target="clean_target.fr"
)

# Or clean text directly
source_texts = ["Hello world", "This is a test"]
target_texts = ["Bonjour le monde", "C'est un test"]
clean_source, clean_target = cleaner.clean_texts(source_texts, target_texts)
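
If the corpus is stored as two line-aligned files, the lists for clean_texts can be built with a small loader like the one below. This is a sketch that assumes one sentence per line and UTF-8 encoding; read_parallel is a hypothetical helper and is not part of MTCleanse.

```python
def read_parallel(source_path: str, target_path: str) -> tuple[list[str], list[str]]:
    """Read two line-aligned files and return their lines as parallel lists."""
    with open(source_path, encoding="utf-8") as src:
        source = [line.rstrip("\n") for line in src]
    with open(target_path, encoding="utf-8") as tgt:
        target = [line.rstrip("\n") for line in tgt]
    # A parallel corpus is only usable if line i of one file translates
    # line i of the other, so a length mismatch is treated as an error.
    if len(source) != len(target):
        raise ValueError(
            f"line count mismatch: {len(source)} source vs {len(target)} target"
        )
    return source, target
```

The two returned lists can then be passed to cleaner.clean_texts(source_texts, target_texts) as in the example above.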

Command Line Interface

MTCleanse also provides a command-line interface:

mtcleanse-clean \
    --source source.en \
    --target target.fr \
    --output-source clean_source.en \
    --output-target clean_target.fr

Configuration

Pass a configuration dictionary to ParallelTextCleaner to customize the cleaning process:

cleaner = ParallelTextCleaner({
    "min_chars": 10,
    "max_chars": 500,
    "min_words": 3,
    "max_words": 50,
    "enable_domain_filtering": True,
    "domain_contamination": 0.2
})
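
As a rough illustration of what the length parameters could mean, the sketch below filters sentence pairs with the same four limits. It assumes that the bounds are inclusive, that words are split on whitespace, and that a pair is kept only when both sides pass. These semantics, and the names within_limits and filter_pairs, are assumptions and are not confirmed by the MTCleanse documentation.

```python
def within_limits(text, min_chars=10, max_chars=500, min_words=3, max_words=50):
    """Return True if the text satisfies both character and word bounds."""
    n_chars = len(text)
    n_words = len(text.split())  # whitespace tokenization (assumption)
    return min_chars <= n_chars <= max_chars and min_words <= n_words <= max_words


def filter_pairs(source_texts, target_texts, **limits):
    """Keep only pairs where both sides pass, preserving alignment."""
    if len(source_texts) != len(target_texts):
        raise ValueError("source and target must have the same length")
    kept = [
        (s, t)
        for s, t in zip(source_texts, target_texts)
        if within_limits(s, **limits) and within_limits(t, **limits)
    ]
    # Unzip back into two lists that are still aligned index by index.
    return [s for s, _ in kept], [t for _, t in kept]
```

Filtering pairs rather than individual sentences matters here: dropping a line from only one side would shift every later translation out of alignment.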

Development

Setting up the development environment

# Clone the repository
git clone https://github.com/yourusername/mtcleanse.git
cd mtcleanse

# Install in development mode with development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ --cov=mtcleanse

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Download files

Source Distribution

mtcleanse-0.1.0.tar.gz (18.1 kB, source)

Built Distribution

mtcleanse-0.1.0-py3-none-any.whl (20.0 kB, Python 3)

File details

Details for the file mtcleanse-0.1.0.tar.gz.

File metadata

  • Download URL: mtcleanse-0.1.0.tar.gz
  • Size: 18.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for mtcleanse-0.1.0.tar.gz
Algorithm Hash digest
SHA256 5ab8cc740b5f7993ff8d3ae7fa0c826de565801602afa90f8e792953db7e29ae
MD5 d176cbb80cd2663e32e12279f5921e95
BLAKE2b-256 089638e12746d114d73a9c564c5d820608bfb9af38c0127b061574c3db1594ab

File details

Details for the file mtcleanse-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: mtcleanse-0.1.0-py3-none-any.whl
  • Size: 20.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for mtcleanse-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 92cf3f61b7130cd2e6d938ae2bba12d9c851284790683cb34c27b31f097355c4
MD5 9172b661a2d2a5963c9e2492822eb449
BLAKE2b-256 1ea013d992362c39441bffd233a55ff22bebcdc1e3453d770c56329dd250ae4e
