NLP toolkit for annotator polarization research: synthetic datasets, polarized trees, and disagreement metrics
polartox
NLP toolkit for annotator polarization research. Provides tools for synthetic dataset generation and polarization detection in human annotation studies.
Install
pip install polartox
# with nDFU support (Pavlopoulos & Likas, 2024 -- github.com/ipavlopoulos/ndfu)
pip install "polartox[ndfu]"
Tools
| Module | Description | Status |
|---|---|---|
| polartox.datagen | Synthetic annotator pool with injected, ground-truth polarization | Stable |
| polartox.trees | Polarized Trees detection algorithm | Coming soon |
polartox.datagen
Builds a pool of annotators with explicit demographic identities and generates annotation datasets where every text independently gets k active dimensions (0–4) that drive its disagreement:
- k = 0: no dimension explains anything, a true unimodal negative control
- k ≥ 1: a random subset of dimensions is active, each with its own random toxic/civil lean split and a continuous intensity (alpha) controlling how strongly it pulls toward its pole
Each identity's rating distribution is the elementwise product of the shapes of its active dimensions. Signal therefore composes rather than averaging away, so the generated texts can reach the full nDFU range instead of collapsing toward the middle.
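The difference between composing by product and by averaging can be illustrated with a minimal sketch in plain Python. The linear ramp shapes, the renormalization step, and the names normalize, compose, toxic_lean and civil_lean are assumptions made for illustration; they are not taken from polartox, and the intensity parameter (alpha) is left out.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def compose(shapes):
    """Elementwise product of several shapes, renormalized to a distribution."""
    combined = [1.0] * len(shapes[0])
    for shape in shapes:
        combined = [c * s for c, s in zip(combined, shape)]
    return normalize(combined)


# Hypothetical shapes on a 5-point scale (index 0 = rating 1, index 4 = rating 5).
toxic_lean = normalize([1, 2, 3, 4, 5])  # mass grows toward the toxic pole
civil_lean = normalize([5, 4, 3, 2, 1])  # mass grows toward the civil pole

# Two dimensions pulling the same way reinforce each other under a product...
both_toxic = compose([toxic_lean, toxic_lean])
# ...whereas averaging the same two shapes changes nothing.
averaged = [(a + b) / 2 for a, b in zip(toxic_lean, toxic_lean)]

# Two dimensions pulling in opposite directions give a symmetric, centered shape.
mixed = compose([toxic_lean, civil_lean])

print(both_toxic[-1] > averaged[-1])  # True: the product is sharper at the pole
```

In this toy setup, an identity with two toxic-leaning dimensions ends up more concentrated at the toxic end than either shape alone, which is the sense in which the signal composes.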
The generative config is returned alongside the dataset as ground truth, enabling direct validation of detection algorithms.
from polartox.datagen import AnnotatorPool, DEFAULT_DIMENSIONS, DEFAULT_DEPTH_WEIGHTS, DEFAULT_INTENSITY_RANGE
pool = AnnotatorPool(
    dimensions=DEFAULT_DIMENSIONS,
    scale=5,
    intensity_range=DEFAULT_INTENSITY_RANGE,
    depth_weights=DEFAULT_DEPTH_WEIGHTS,
    annotators_per_identity=10,
)
result = pool.generate_dataset(
    n_texts=100,
    n_annotators_per_text=150,
    noise=0.05,
    seed=42,
)
dataset, ground_truth = result
# dataset columns: text_id, annotator_id, <dimensions>, rating
# ground_truth: per-text active_dims, lean (toxic/civil split), and alpha (intensity)
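Because the ground truth records which dimensions were active for each text, a detector's output can be scored against it. The following is a minimal sketch only: it assumes both the ground truth and the detector's predictions have already been reduced to a mapping from text_id to a set of dimension names. That representation, the function score_detection, and the dimension names are hypothetical; the actual structure of ground_truth may differ.

```python
def score_detection(true_dims, predicted_dims):
    """Micro-averaged precision and recall over (text, dimension) pairs."""
    tp = fp = fn = 0
    for text_id, truth in true_dims.items():
        predicted = predicted_dims.get(text_id, set())
        tp += len(truth & predicted)   # active dimensions correctly recovered
        fp += len(predicted - truth)   # dimensions flagged but not active
        fn += len(truth - predicted)   # active dimensions that were missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Made-up example: text 0 is a k = 0 control, texts 1 and 2 have active dimensions.
true_dims = {0: set(), 1: {"gender"}, 2: {"age", "politics"}}
predicted_dims = {0: set(), 1: {"gender", "age"}, 2: {"age"}}

precision, recall = score_detection(true_dims, predicted_dims)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A detector that flags any dimension on a k = 0 text is penalized in precision, which is what makes those texts useful as negative controls.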
nDFU scoring is provided by the collaborative ndfu package (Pavlopoulos & Likas, 2024) rather than reimplemented here:
from ndfu import dfu, pdf
text_data = dataset[dataset["text_id"] == 0]
hist = pdf(text_data["rating"].tolist(), range(1, pool.scale + 1))
score = dfu(hist)
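For readers who want to see what the score measures, the idea can be sketched in plain Python. This is an illustrative reading of the published definition (the largest violation of unimodality on either side of the histogram's mode, divided by the mode's height), not the ndfu package's code. The names rating_histogram and ndfu_sketch are hypothetical, and results may differ from the package in edge cases such as tied modes.

```python
from collections import Counter


def rating_histogram(ratings, scale):
    """Relative frequency of each rating 1..scale."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    counts = Counter(ratings)
    return [counts.get(r, 0) / len(ratings) for r in range(1, scale + 1)]


def ndfu_sketch(hist):
    """Largest rise when walking away from the mode, relative to the mode's height."""
    peak = max(hist)
    if peak == 0:
        return 0.0
    mode = hist.index(peak)  # first maximum is taken as the mode
    worst = 0.0
    # A unimodal histogram never rises again to the right of its mode...
    for i in range(mode, len(hist) - 1):
        worst = max(worst, hist[i + 1] - hist[i])
    # ...nor to the left of it.
    for i in range(mode, 0, -1):
        worst = max(worst, hist[i - 1] - hist[i])
    return worst / peak


polarized = rating_histogram([1] * 5 + [5] * 5, scale=5)
unimodal = rating_histogram([1, 2, 2, 3, 3, 3, 3, 4, 4, 5], scale=5)

print(ndfu_sketch(polarized))  # 1.0
print(ndfu_sketch(unimodal))   # 0.0
```

A histogram split evenly between the two poles scores 1, while a single-peaked histogram scores 0, matching the two ends of the range the generator aims to cover.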
Full API documentation is available on GitHub.
Changelog
See CHANGELOG.md for release history.