Granularity scoring for natural language
Granuscore
Granuscore is a Python library for measuring the semantic granularity of natural language text.
It provides an end-to-end pipeline that:
- splits text into referential units,
- assigns continuous granularity scores to each unit,
- aggregates these scores into document-level estimates.
Granuscore is designed for analyzing how fine-grained or coarse-grained textual expressions are in applications such as question answering, educational dialogue, summarization, and scientific writing.
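The three stages listed above can be pictured with a small, self-contained toy. This is only an illustration of the split, score, aggregate shape: the sentence splitter, the digit-based `toy_score`, and the mean aggregation are all assumptions made for this sketch and are not Granuscore's actual splitter, predictor, or aggregation rule.

```python
import re
from statistics import mean
from typing import Callable


def split_units(text: str) -> list[str]:
    """Naive stand-in for the splitting step: one unit per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def toy_score(unit: str) -> float:
    """Placeholder scorer (NOT Granuscore's model).

    Units with more digit-bearing tokens get a lower score, so that
    higher values loosely mean "coarser".
    """
    specific = sum(1 for tok in unit.split() if any(c.isdigit() for c in tok))
    return 1.0 / (1 + specific)


def document_score(text: str, score_unit: Callable[[str], float] = toy_score) -> float:
    """Aggregate unit scores into one document-level value (mean is an assumption)."""
    units = split_units(text)
    if not units:
        raise ValueError("text contains no units to score")
    return mean(score_unit(u) for u in units)


doc = "Tony Hawk was born in San Diego. He was born in 1968."
print(split_units(doc))
print(document_score(doc))
```

In the real library, `GranuScore()` wraps all three steps behind a single call, as shown in the Quick Start.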
Installation
Install from PyPI:
pip install granuscore
Or install the latest development version locally:
git clone https://github.com/lukasellinger/granuscore.git
cd granuscore
pip install -e .
Optional development dependencies:
pip install -e ".[dev]"
Quick Start
from granuscore import GranuScore
scorer = GranuScore()
text = """
Tony Hawk was born in San Diego.
"""
score = scorer(text)
print(score)
By default, Granuscore returns percentile scores, where higher values correspond to coarser-grained expressions.
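To make "percentile score" concrete, here is a minimal sketch of how a raw value can be ranked against a reference distribution. The function name `percentile_of` and the reference values are made up for illustration; how Granuscore computes its percentiles internally is not specified here.

```python
from bisect import bisect_right


def percentile_of(raw_score: float, reference: list[float]) -> float:
    """Percentage of reference values that are <= raw_score.

    `reference` must be sorted in ascending order.
    """
    if not reference:
        raise ValueError("reference distribution is empty")
    return 100.0 * bisect_right(reference, raw_score) / len(reference)


# Made-up reference values, for illustration only.
reference = sorted([0.1, 0.2, 0.4, 0.4, 0.7, 0.9, 1.3, 1.8, 2.5, 3.0])

print(percentile_of(0.4, reference))  # 40.0
```

A percentile is easier to compare across texts than a raw model output, because it is always read relative to the same reference population.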
Default Configuration
The default configuration reproduces the setup used in the paper.
scorer = GranuScore()
Equivalent to:
scorer = GranuScore(
predictor_type="hit",
)
Default settings:
- predictor_type="hit"
- model_name="Hierarchy-Transformers/HiT-MiniLM-L12-WordNetNoun"
- search_method="random_anchors"
- random_anchors_k=999
Required artifacts such as:
- FAISS indices,
- anchor vectors,
- LightGBM models,
- and reference percentile distributions
are automatically downloaded and cached on first use.
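The "download once, reuse afterwards" behavior follows a common caching pattern, sketched below. This is a generic illustration, not Granuscore's implementation: `get_artifact`, the `fetch` callable, the file name `anchors.bin`, and the cache directory name are all hypothetical, and the fake fetcher stands in for a real download so the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable


def get_artifact(name: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return the cached file for `name`, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        # Write to a temporary name first so an interrupted download
        # never leaves a half-written file under the final name.
        tmp = target.with_name(target.name + ".part")
        tmp.write_bytes(fetch(name))
        tmp.replace(target)
    return target


calls: list[str] = []


def fake_fetch(name: str) -> bytes:
    """Stand-in for a real download; records each call."""
    calls.append(name)
    return b"placeholder"


with tempfile.TemporaryDirectory() as d:
    cache = Path(d) / "granuscore-cache"  # hypothetical location
    first = get_artifact("anchors.bin", cache, fake_fetch)
    second = get_artifact("anchors.bin", cache, fake_fetch)
    print(first == second, len(calls))  # True 1
```

The second call finds the file already in place and skips the fetch, which is why only the first use of a configuration incurs a download.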
Important Compatibility Note
The default configuration works out of the box and is the recommended setup.
If you customize components such as:
- the embedding model,
- search method,
- FAISS index,
- anchor vectors,
- or LightGBM model,
you must ensure that all resources are compatible with each other.
For example, a LightGBM model trained using:
search_method="random_anchors"
should not be combined with:
search_method="nearest_neighbor"
Similarly, FAISS indices, anchor vectors, percentile reference distributions, and LightGBM models must originate from the same embedding space and training configuration.
Compatibility between custom resources is not validated automatically.
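Because nothing checks this for you, one option is to keep your own record of how each custom resource was built and compare those records before use. The sketch below assumes you maintain such metadata yourself as plain dictionaries; the helper `find_mismatches` and the metadata layout are hypothetical and not part of Granuscore.

```python
def find_mismatches(
    resources: dict[str, dict[str, str]],
    keys: tuple[str, ...] = ("model_name", "search_method"),
) -> list[str]:
    """Report every key whose value differs between resources.

    A resource that lacks a key is treated as a mismatch (its value is None).
    """
    problems = []
    for key in keys:
        values = {name: meta.get(key) for name, meta in resources.items()}
        if len(set(values.values())) > 1:
            problems.append(f"{key}: {values}")
    return problems


# User-maintained metadata (hypothetical layout).
resources = {
    "faiss_index": {
        "model_name": "Hierarchy-Transformers/HiT-MiniLM-L12-WordNetNoun",
        "search_method": "random_anchors",
    },
    "lightgbm_model": {
        "model_name": "Hierarchy-Transformers/HiT-MiniLM-L12-WordNetNoun",
        "search_method": "nearest_neighbor",
    },
}

for problem in find_mismatches(resources):
    print("Incompatible:", problem)
```

Here the check flags `search_method`, mirroring the mismatch described above.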
Notebook Tutorial
An interactive introduction is available in:
notebooks/getting_started.ipynb
Repository Structure
granuscore/
├── src/
│   └── granuscore/
│       ├── pipeline.py
│       ├── granularity_predictor.py
│       ├── claim_splitter.py
│       ├── bucket_output.py
│       ├── cache.py
│       └── artifacts.py
├── notebooks/
│   ├── build_granola_dataset.ipynb
│   └── getting_started.ipynb
├── training_scripts/
├── evaluation/
├── assets/
├── data/ (needs to be externally downloaded)
├── pyproject.toml
├── LICENSE
└── README.md
Reproducing Paper Experiments
The datasets and precomputed resources required to reproduce the experiments from the paper are available here:
https://drive.google.com/drive/folders/1mJdUENOxHEiuYn-_f1KRQ1PZggXJDnb4?usp=sharing
Download the archive and extract it into the repository root:
unzip data.zip
This will create the expected directory structure used by the training and evaluation scripts.
GRANOLA Dataset Construction
We use a processed version of the GRANOLA-EQ dataset introduced by Yona et al. (2024). Because of licensing restrictions on upstream resources, this processed version is not redistributed, either in this repository or on external hosting platforms.
Instead, the dataset can be reconstructed locally using the notebook:
notebooks/build_granola_dataset.ipynb
Training Pipeline
Training uses precomputed .pkl feature files.
- Generate the precomputed datasets using the scripts in:
training_scripts/build_precalc_data/
- Train LightGBM models:
python training_scripts/train_lgb_models.py
Citation
@misc{ellinger2026granuscorereferencefreemeasuregranularity,
title={Granuscore: A Reference-Free Measure of Granularity for Text Analysis and Question Answering},
author={Lukas Ellinger and Alexander Fichtl and Miriam Anschütz and Georg Groh},
year={2026},
eprint={2605.26620},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.26620},
}
Download files
File details
Details for the file granuscore-1.0.1.tar.gz.
File metadata
- Download URL: granuscore-1.0.1.tar.gz
- Upload date:
- Size: 26.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c4f4e5e24b06a29d714256e265bced4e685820c9744da62165cd9474deceb2ab |
| MD5 | ed85b4e079e19c3254e3eeb255d74cae |
| BLAKE2b-256 | 941142adffb709757e2f17372d375c0f7aa5c0eeb0463d36c9559043ce68f6f6 |
File details
Details for the file granuscore-1.0.1-py3-none-any.whl.
File metadata
- Download URL: granuscore-1.0.1-py3-none-any.whl
- Upload date:
- Size: 26.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8d0325512243ef5e1fb962ccbdd65e61110f5fd165de699274932cb18d747d0c |
| MD5 | c3ea0a0cb73f9f05123f959e2a02e1db |
| BLAKE2b-256 | f2a6777dcba572530f1f2c2442954eccecefc093a49a053406f4d3e17dda70a9 |