Library to re-segment source sentences to match the segmentation of a reference
Source Resegmenter
source_resegmenter is a Python library for re-segmenting a text into lines so that
its segmentation matches that of a reference text in another language.
The repository is tested with Python 3.11. It may also work with other Python versions, but compatibility with them is not guaranteed. See the Usage section for instructions on how to use the library and the Installation section for how to install it.
Installation
You can install the latest stable version from PyPI:
pip install source_resegmenter
Or, to install from source:
git clone https://github.com/hlt-mt/source_resegmenter.git
cd source_resegmenter
pip install .
For development (with docs and testing tools):
pip install -e ".[dev]"
Usage
This library assumes that three txt files are available:
- The source text to be re-segmented, whose segmentation into lines has to be refined to match that of a reference file;
- The reference text, with which the source text is to be aligned at the line level;
- A backtranslation of the reference text into the source language, aligned at the line level with the reference text.
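Since the backtranslation must be aligned at the line level with the reference, a quick check of the two line counts before running the tool can catch a mismatched pair early. The following is a minimal sketch of such a check, not part of source_resegmenter itself; the helper names `count_lines` and `check_line_alignment` are hypothetical, and UTF-8 encoding is an assumption.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a text file (UTF-8 assumed)."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_line_alignment(reference_path, backtranslation_path):
    """Raise ValueError if the two files have a different number of lines.

    Hypothetical helper: it only compares line counts, which is a necessary
    (not sufficient) condition for line-level alignment.
    """
    n_ref = count_lines(reference_path)
    n_bt = count_lines(backtranslation_path)
    if n_ref != n_bt:
        raise ValueError(
            f"reference has {n_ref} lines, backtranslation has {n_bt}"
        )
    return n_ref
```

For example, `check_line_alignment("audio_1_ref.it", "mt_audio_1_ref.en")` returns the shared line count, or raises if the files differ in length.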
Once these three txt files are available, the tool can be run from the command line as:
source_resegmenter --source-texts asr_audio_1.en --reference-texts audio_1_ref.it \
--backtranslation-texts mt_audio_1_ref.en --output resegm_audio_1.en
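When many files have to be processed, the same command can be launched from a Python script. The sketch below simply wraps the command-line call shown above with the standard `subprocess` module; it uses only the four options from that example and does not rely on any Python API of the library. The function names `build_command` and `resegment` are hypothetical.

```python
import subprocess


def build_command(source, reference, backtranslation, output):
    """Assemble the source_resegmenter command line as a list of arguments."""
    return [
        "source_resegmenter",
        "--source-texts", str(source),
        "--reference-texts", str(reference),
        "--backtranslation-texts", str(backtranslation),
        "--output", str(output),
    ]


def resegment(source, reference, backtranslation, output):
    """Run the CLI; raises CalledProcessError on a non-zero exit code.

    Assumes the source_resegmenter executable is installed and on PATH.
    """
    command = build_command(source, reference, backtranslation, output)
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.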
Contributing
Contributions from interested researchers and developers are greatly appreciated.
You can create an issue in case of problems with the code, questions, or feature requests. You are also more than welcome to create a pull request that addresses any issue.
Licence
source_resegmenter is licensed under the Apache License, Version 2.0.
Credits
If you use this library, please cite:
@inproceedings{cettolo-et-al-2025-xlr-segmenter,
title={{How to Evaluate Speech Translation with Source-Aware Neural MT Metrics}},
author={Cettolo, Mauro and Gaido, Marco and Negri, Matteo and Papi, Sara and Bentivogli, Luisa},
booktitle = "",
address = "",
year={2025}
}