Library to re-segment source sentences to match the segmentation of a reference

Source Resegmenter

source_resegmenter is a Python library for re-segmenting a text into lines so that its segmentation matches that of a reference text in another language.

The repository is tested with Python 3.11. It may also work with other Python versions, but we do not guarantee compatibility with them. See the Installation section for how to install the project and the Usage section for how to use it.

Installation

You can install the latest stable version from PyPI:

pip install source_resegmenter

Or, to install from source:

git clone https://github.com/hlt-mt/source_resegmenter.git
cd source_resegmenter
pip install .

For development (with docs and testing tools):

pip install -e ".[dev]"

Usage

This library assumes that three plain-text (.txt) files are available:

  1. The source text to be re-segmented, whose segmentation into lines has to be changed to match that of the reference file;
  2. The reference text, with which the source text is to be aligned at the line level;
  3. A backtranslation of the reference text into the source language, aligned at the line level with the reference text.
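
The third requirement, line-level alignment between the reference and its backtranslation, can be checked before running the tool. The following is a minimal sketch and is not part of source_resegmenter: the helper names count_lines and check_alignment are hypothetical, and it assumes UTF-8 encoded files with one segment per line.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with Path(path).open(encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_alignment(reference_path, backtranslation_path):
    """Return the shared line count, or raise ValueError if the files differ."""
    n_ref = count_lines(reference_path)
    n_bt = count_lines(backtranslation_path)
    if n_ref != n_bt:
        raise ValueError(
            f"reference has {n_ref} lines, backtranslation has {n_bt}"
        )
    return n_ref
```

Equal line counts are only a necessary condition: the check cannot tell whether line i of the backtranslation actually translates line i of the reference.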

Once these three txt files are available, the tool can be run from the command line as:

source_resegmenter --source-texts asr_audio_1.en --reference-texts audio_1_ref.it \
    --backtranslation-texts mt_audio_1_ref.en --output resegm_audio_1.en
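
To call the tool from a Python script, for example when processing several recordings, one option is to wrap the command above with subprocess. This sketch uses only the command-line flags shown above. build_command and resegment are hypothetical helper names, not part of the library's API, and the sketch assumes the source_resegmenter executable is on the PATH.

```python
import subprocess


def build_command(source, reference, backtranslation, output):
    """Build the argument list for the documented command line."""
    return [
        "source_resegmenter",
        "--source-texts", str(source),
        "--reference-texts", str(reference),
        "--backtranslation-texts", str(backtranslation),
        "--output", str(output),
    ]


def resegment(source, reference, backtranslation, output):
    """Run the tool; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(
        build_command(source, reference, backtranslation, output),
        check=True,
    )


# Example call, using the file names from the command above:
# resegment("asr_audio_1.en", "audio_1_ref.it",
#           "mt_audio_1_ref.en", "resegm_audio_1.en")
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.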

Contributing

Contributions from interested researchers and developers are greatly appreciated.

You can create an issue in case of problems with the code, questions, or feature requests. You are also more than welcome to create a pull request that addresses any issue.

Licence

source_resegmenter is licensed under the Apache License, Version 2.0.

Credits

If you use this library, please cite:

@inproceedings{cettolo-et-al-2025-xlr-segmenter,
    title={{How to Evaluate Speech Translation with Source-Aware Neural MT Metrics}},
    author={Cettolo, Mauro and Gaido, Marco and Negri, Matteo and Papi, Sara and Bentivogli, Luisa},
    booktitle = "",
    address = "",
    year={2025}
}
