MUCH: A lightweight claim segmenter for hallucination detection.


MUCH-segmenter: A fast claim segmentation algorithm

This package implements much_segmenter, a fast, deterministic, and compute-efficient claim segmentation algorithm designed for English, French, Spanish, and German. This algorithm was introduced in our paper:

MUCH: A Multilingual Claim Hallucination Benchmark

Jérémie Dentan¹, Alexi Canesse¹, Davide Buscaldi¹‚², Aymen Shabou³, Sonia Vanier¹

¹LIX (École Polytechnique, IP Paris, CNRS), ²LIPN (Université Sorbonne Paris Nord), ³Crédit Agricole SA

paper_link

Usage and example

The main function of this package is much_segmentation, which segments an LLM generation into token chunks.

Example

In this example, the LLM generation contains 12 tokens. Our claim segmentation algorithm splits this generation into 3 claims: the first contains tokens 0-3 ("No, Xining"), the second tokens 4-7 (" is the largest city"), and the last claim contains tokens 8-11 (" in Qinghai.").

# Imports
from much_segmenter import much_segmentation, get_repr_string
from transformers import AutoTokenizer

# Defining the generation and the tokenizer
generation = "No, Xining is the largest city in Qinghai."
llm_tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")

# Segmentation
token_chunks = much_segmentation(generation, llm_tokenizer)
print(token_chunks) # Should be [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

# Display of the result
print(get_repr_string(generation, token_chunks, tokenizer=llm_tokenizer))

# Output should be:
"""
<Segmentation>
# 0 : No, Xining
# 1 :  is the largest city
# 2 :  in Qinghai.
"""

Pre-computed tokens

Modern tokenizers are not round-trip safe: decoding a token sequence and re-encoding the resulting text does not always reproduce the original tokens. For example, the LLM can generate a sequence of output_tokens that are decoded into generation = tokenizer.decode(output_tokens), yet tokenizer.encode(generation) != output_tokens. This happens because the same text can be encoded in several ways, and the path chosen by the tokenizer may differ from the one taken during LLM generation.
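A self-contained toy tokenizer (not the Hugging Face API) can illustrate this round-trip mismatch: the model emits two single-character tokens, but greedy re-encoding of the decoded text picks the merged token instead.

```python
# Toy vocabulary: a merged token "ab" plus the single characters.
VOCAB = {"ab": 0, "a": 1, "b": 2}
INV_VOCAB = {i: s for s, i in VOCAB.items()}

def decode(tokens):
    """Concatenate the string pieces of each token id."""
    return "".join(INV_VOCAB[t] for t in tokens)

def encode(text):
    """Greedy longest-match encoding, as many BPE-style tokenizers do."""
    out, i = [], 0
    while i < len(text):
        for length in (2, 1):
            piece = text[i:i + length]
            if piece in VOCAB:
                out.append(VOCAB[piece])
                i += length
                break
    return out

output_tokens = [1, 2]              # the model emitted "a" then "b"
generation = decode(output_tokens)  # "ab"
print(encode(generation))           # [0] -- not the original [1, 2]
```

The re-encoded sequence is shorter than the generated one, so token indices computed on the re-encoded text would point at the wrong positions.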

This behavior can be problematic because much_segmenter outputs token indices, so any mismatch between the tokens it computes and the tokens actually generated by the LLM can lead to computation errors. Consequently, much_segmentation accepts an optional precomputed_tokens argument, which should contain the output tokens exactly as generated by the LLM.

⚠️ This optional parameter should ALWAYS be used when the output tokens are known, to avoid any token mismatch during segmentation ⚠️

Pseudo-code and algorithmic details

Our segmentation algorithm is fully rule-based and does not require external models or internet access, making it suitable for offline or computation-limited use cases. It is designed for English, French, Spanish, and German. We retain only these four European languages because their stopword and punctuation systems are similar. We expect our segmenter to be easily adaptable to languages with similar punctuation and stopwords, although we have not tested it beyond the four languages mentioned.

Our algorithm includes two main steps. First, we split the LLM generation into words using an external word tokenizer, and we use these words to identify the character indices of claim starts. Second, we map these character indices to the tokens of the LLM generation. For a detailed presentation of this algorithm and a discussion of its pseudo-code, please refer to our research paper available on arXiv: paper_link.
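The two-step structure can be sketched with a deliberately simplified example. The claim-start rule below (starting a new claim before "is"/"in") and the whitespace word tokenizer are purely illustrative stand-ins for the multilingual stopword and punctuation rules of the actual algorithm, so the boundaries differ from what much_segmentation produces.

```python
import re

def claim_start_chars(text):
    """Step 1: character indices where a new claim starts.

    Illustrative rule only: split before the words "is" / "in".
    """
    starts = [0]
    for m in re.finditer(r"\s(?:is|in)\b", text):
        starts.append(m.start())
    return starts

def chars_to_token_chunks(text, token_offsets, starts):
    """Step 2: assign each token to the claim whose start precedes it."""
    chunks = [[] for _ in starts]
    for i, (begin, _end) in enumerate(token_offsets):
        claim = max(j for j, s in enumerate(starts) if s <= begin)
        chunks[claim].append(i)
    return chunks

text = "No, Xining is the largest city in Qinghai."
# Whitespace "tokens" with character offsets, standing in for LLM tokens.
token_offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
starts = claim_start_chars(text)
print(chars_to_token_chunks(text, token_offsets, starts))
# -> [[0, 1], [2, 3, 4, 5], [6, 7]]
```

The key design point survives the simplification: claim boundaries are decided on characters, then projected onto whatever tokenization the LLM used.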

Runtime

This claim segmentation algorithm was designed to be extremely fast. Segmenting the entire MUCH dataset took 6s. This dataset includes 4,873 samples containing a total of 392,022 characters, representing 101,917 output tokens that were segmented into 25,624 claims (20,751 claims after removing the final claims containing only the EOS token). For reference, the LLM generation runtime for these samples was 2,758s, meaning that segmentation represents only a 0.2% overhead.

These runtimes are single-process and single-thread measurements; segmentation can be further accelerated with parallel computing.

Related artifacts

This package is released alongside the MUCH benchmark, whose accompanying resources illustrate applications of our claim segmentation algorithm:

Acknowledgement

This work received financial support from the research chair Trustworthy and Responsible AI at École Polytechnique.

This work was granted access to the HPC resources of IDRIS under the allocation AD011014843R1, made by GENCI.

Copyright and License

Copyright 2025–present Laboratoire d’Informatique de l’École Polytechnique.

This repository is released under the Apache-2.0 license.

Please cite this dataset as follows:

@misc{dentan_much_2025,
  title = {MUCH: A Multilingual Claim Hallucination Benchmark},
  author = {Dentan, Jérémie and Canesse, Alexi and Buscaldi, Davide and Shabou, Aymen and Vanier, Sonia},
  year = {2025},
  note = {To appear},
}
