Library to re-segment source sentences to match the segmentation of a reference

Source Resegmenter

source_resegmenter is a Python library that re-segments a text into lines so that its segmentation matches that of a reference text in another language.

The repository is tested with Python 3.11. It may also work with other Python versions, but compatibility with them is not guaranteed. See the Installation section for how to install the project and the Usage section for how to run it.

Installation

You can install the latest stable version from PyPI:

pip install source_resegmenter

Or, to install from source:

git clone https://github.com/hlt-mt/source_resegmenter.git
cd source_resegmenter
pip install .

For development (with docs and testing tools):

pip install -e ".[dev]"

Usage

This library expects three plain-text files:

  1. The source text to be re-segmented, whose segmentation into lines has to be refined to match that of the reference file;
  2. The reference text, with which the source text is to be aligned at the line level;
  3. A backtranslation of the reference text into the source language, aligned at the line level with the reference text.
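
Since the backtranslation has to be aligned line by line with the reference, it can be worth checking that the two files have the same number of lines before running the tool. The following is a minimal sketch of such a check, not part of source_resegmenter itself: the function names `count_lines` and `check_line_alignment` are hypothetical, and UTF-8 encoding is an assumption.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a text file (UTF-8 is assumed)."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_line_alignment(reference_path, backtranslation_path):
    """Hypothetical helper: verify that the reference and its backtranslation
    have the same number of lines, and return that number.

    Raises ValueError when the counts differ, since the two files could not
    then be aligned line by line.
    """
    n_reference = count_lines(reference_path)
    n_backtranslation = count_lines(backtranslation_path)
    if n_reference != n_backtranslation:
        raise ValueError(
            f"{reference_path} has {n_reference} lines, but "
            f"{backtranslation_path} has {n_backtranslation}"
        )
    return n_reference
```

For example, `check_line_alignment("audio_1_ref.it", "mt_audio_1_ref.en")` would return the shared line count, or raise an error if the files are out of sync. An equal line count is only a necessary condition: it does not prove that each line is a backtranslation of the corresponding reference line.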

Once these three files are available, the tool can be run from the command line as:

source_resegmenter --source-texts asr_audio_1.en --reference-texts audio_1_ref.it \
    --backtranslation-texts mt_audio_1_ref.en --output resegm_audio_1.en
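
When several recordings have to be processed, the same command can be driven from a Python script. The sketch below only wraps the command line shown above with the standard `subprocess` module; `build_command` and `resegment` are hypothetical helper names, not functions provided by the library, and it is an assumption that the tool exits with a non-zero status on failure.

```python
import subprocess


def build_command(source, reference, backtranslation, output):
    """Build the argument list for the source_resegmenter command line.

    Only the four options shown in the usage example are used.
    """
    return [
        "source_resegmenter",
        "--source-texts", str(source),
        "--reference-texts", str(reference),
        "--backtranslation-texts", str(backtranslation),
        "--output", str(output),
    ]


def resegment(source, reference, backtranslation, output):
    """Hypothetical wrapper: run the command-line tool once.

    check=True makes subprocess raise CalledProcessError if the command
    exits with a non-zero status (assumed here to signal failure).
    """
    command = build_command(source, reference, backtranslation, output)
    subprocess.run(command, check=True)
```

Calling `resegment("asr_audio_1.en", "audio_1_ref.it", "mt_audio_1_ref.en", "resegm_audio_1.en")` is then equivalent to the shell command above. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.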

Contributing

Contributions from interested researchers and developers are greatly appreciated.

You can create an issue in case of problems with the code, questions, or feature requests. You are also more than welcome to create a pull request that addresses any issue.

Licence

source_resegmenter is licensed under the Apache License, Version 2.0.

Credits

If you use this library, please cite:

@inproceedings{cettolo-et-al-2025-xlr-segmenter,
    title={{How to Evaluate Speech Translation with Source-Aware Neural MT Metrics}},
    author={Cettolo, Mauro and Gaido, Marco and Negri, Matteo and Papi, Sara and Bentivogli, Luisa},
    booktitle = "",
    address = "",
    year={2025}
}
