Extract per-field confidence scores from LLM structured JSON outputs using token-level log-probabilities.

Maintainers

rodolfonobrega

llm-structured-confidence

The Problem • Installation • Quick Start • Features • API Reference • Supported Providers

Designed for Structured Outputs — available in OpenAI, Gemini, and other providers. Works with any JSON schema, but ideal for ENUM-based classification where the model picks from a fixed set of values.

We recommend litellm as a unified interface for calling any provider with structured output and logprobs.

The Problem

When an LLM returns structured JSON with logprobs, token boundaries don't align with field values. A single token like '":"' can merge a closing quote, a colon, and an opening quote (and sometimes part of the value), all under one logprob.

Token          Logprob     What it contains
──────────────────────────────────────────────
'{"'           -0.006      { and opening "
'category'      0.000      the key
'":"'          -0.200      closing ", colon, opening "  ← structural, NOT the value
'health'       -0.168      ← actual value content
' and'          0.000      ← actual value content
' wellness'     0.000      ← actual value content
'"}'            0.000      closing " and }

Naively summing every token that overlaps the quoted value, including the structural '":"' token, gives exp(-0.368) ≈ 69% instead of the correct exp(-0.168) ≈ 84.5%.

This library parses the JSON precisely, strips structural tokens, and computes confidence using only the tokens that carry actual value content.
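The two percentages can be reproduced by hand. The sketch below hard-codes the illustrative token table from this section (not live API output) and compares the naive sum with the value-only sum; the two boolean flags are assumptions made for the illustration, not fields the library exposes.

```python
import math

# (token, logprob, overlaps_quoted_value, is_value_content), copied from the table above.
tokens = [
    ('{"',        -0.006, False, False),
    ("category",   0.000, False, False),
    ('":"',       -0.200, True,  False),  # contains the opening quote of the value
    ("health",    -0.168, True,  True),
    (" and",       0.000, True,  True),
    (" wellness",  0.000, True,  True),
    ('"}',         0.000, True,  False),  # contains the closing quote of the value
]

naive_logprob = sum(lp for _, lp, overlaps, _ in tokens if overlaps)
value_logprob = sum(lp for _, lp, _, is_value in tokens if is_value)

naive = math.exp(naive_logprob)    # the structural '":"' token drags the score down
correct = math.exp(value_logprob)  # only value-bearing tokens contribute

print(f"naive:   {naive:.1%}")    # naive:   69.2%
print(f"correct: {correct:.1%}")  # correct: 84.5%
```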

Installation

pip install llm-structured-confidence

For DataFrame helpers:

pip install "llm-structured-confidence[pandas]"

Import path:

from llm_structured_confidence import extract_field_logprobs

Or from source:

git clone https://github.com/rodolfonobrega/llm-structured-confidence.git
cd llm-structured-confidence
pip install -e ".[dev]"

Quick Start

import litellm
from llm_structured_confidence import extract_field_logprobs

response = litellm.completion(
    model="gpt-4.1-mini",  # or any provider: "vertex_ai/gemini-2.5-flash", etc.
    messages=[
        {"role": "system", "content": "Classify this text."},
        {"role": "user", "content": "Morning yoga and meditation session"},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "classification",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "category": {
                        "type": "string",
                        "enum": ["sports", "health and wellness", "technology"],
                    }
                },
                "required": ["category"],
                "additionalProperties": False,
            },
        },
    },
    logprobs=True,
    top_logprobs=5,
)

result = extract_field_logprobs(response, field="category")

for value, fl in result.items():
    print(f"{value}: {fl.mean_nonzero_probability:.2%}")  # health and wellness: 84.51%

Features

Three confidence metrics

Metric	Formula	Best for
`joint_probability`	`exp(sum(logprobs))`	Strictest — literal sequence probability
`mean_probability`	`exp(mean(logprobs))`	General-purpose — fair across token counts
`mean_nonzero_probability`	`exp(mean(logprobs ≠ 0))`	ENUM classification — ignores deterministic tokens

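[!TIP] With ENUMs, usually only the first token (more precisely, the first token that distinguishes one allowed value from the others) carries real uncertainty; the remaining tokens are forced by the constraint. mean_nonzero_probability filters those out, so the score reflects the model's confidence in its choice regardless of how long the category name is.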
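As a rough illustration of the three formulas in the table, the following standalone sketch computes them from a plain list of value-token logprobs. The function name `confidence_metrics` is hypothetical, and the exact `!= 0.0` comparison is an assumption; the library may treat near-zero logprobs differently.

```python
import math


def confidence_metrics(logprobs: list[float]) -> dict[str, float]:
    """Compute the three metrics from the logprobs of the value tokens."""
    if not logprobs:
        raise ValueError("need at least one token logprob")

    joint = sum(logprobs)
    mean = joint / len(logprobs)

    # Assumption: a token counts as "deterministic" only when its logprob is exactly 0.
    nonzero = [lp for lp in logprobs if lp != 0.0]
    mean_nonzero = sum(nonzero) / len(nonzero) if nonzero else 0.0

    return {
        "joint_probability": math.exp(joint),
        "mean_probability": math.exp(mean),
        "mean_nonzero_probability": math.exp(mean_nonzero),
    }


# 'health', ' and', ' wellness' from the example in "The Problem"
metrics = confidence_metrics([-0.168, 0.0, 0.0])
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

For this input the joint and non-zero mean coincide because only one token is uncertain, while the plain mean is higher because the two forced tokens pull the average toward zero.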

Scalar fields

result = extract_field_logprobs(response, field="category")
fl = result["health and wellness"]

Array fields (batch classification)

# {"categories": ["health and wellness", "sports", "technology"]}
result = extract_field_logprobs(response, field="categories")
for value, fl in result.items():
    print(f"{value}: {fl.joint_probability:.2%}")

Batch API raw dicts

Raw OpenAI / Vertex AI batch payloads are supported directly.

from llm_structured_confidence import extract_field_logprobs

# OpenAI batch output line -> use response["body"]
scores = extract_field_logprobs(batch_row["response"]["body"], field="category")

# Vertex AI batch output line -> response dict itself
scores = extract_field_logprobs(batch_row["response"], field="category")
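To see why the OpenAI case needs `response["body"]`, it can help to look at the shape of one batch output line. The sketch below uses a hand-written stand-in for such a line (the token values and the `custom_id` are made up) and reads the per-token logprobs with the standard library only; it does not call this package.

```python
import json

# Hand-written stand-in for one line of an OpenAI batch output file.
line = json.dumps({
    "custom_id": "row-1",
    "response": {
        "status_code": 200,
        "body": {
            "choices": [{
                "message": {"content": '{"category":"sports"}'},
                "logprobs": {"content": [
                    {"token": '{"', "logprob": -0.01, "top_logprobs": []},
                    {"token": "category", "logprob": 0.0, "top_logprobs": []},
                    {"token": '":"', "logprob": -0.02, "top_logprobs": []},
                    {"token": "sports", "logprob": -0.30, "top_logprobs": []},
                    {"token": '"}', "logprob": 0.0, "top_logprobs": []},
                ]},
            }],
        },
    },
})

batch_row = json.loads(line)
body = batch_row["response"]["body"]  # the chat-completion payload lives under "body"
choice = body["choices"][0]
content = choice["message"]["content"]
token_logprobs = [(t["token"], t["logprob"]) for t in choice["logprobs"]["content"]]

# The tokens concatenate back to the JSON string the model produced.
print("".join(tok for tok, _ in token_logprobs) == content)  # True
```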

Pydantic auto-detection

Pass response_schema= with either the Pydantic model or the JSON Schema you used for structured output. The library finds enum-valued fields automatically.

Internally, both inputs are normalized to JSON Schema before field detection and enum resolution.

from enum import Enum
from pydantic import BaseModel

class CategoryEnum(str, Enum):
    health_and_wellness = "health and wellness"
    sports = "sports"

class Classification(BaseModel):
    category: CategoryEnum

result = extract_field_logprobs(response, response_schema=Classification)

JSON Schema auto-detection

schema = {
    "type": "object",
    "properties": {
        "category": {
            "type": "string",
            "enum": ["sports", "health and wellness", "technology"],
        }
    },
    "required": ["category"],
    "additionalProperties": False,
}

result = extract_field_logprobs(response, response_schema=schema)

google-genai native support

Pass a google.genai.GenerateContentResponse directly — converted internally using the same logic as litellm's Vertex AI adapter.

from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="global")
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[...],
    config=types.GenerateContentConfig(response_logprobs=True, logprobs=5),
)

result = extract_field_logprobs(response, field="category")  # same interface

Pandas integration

For batch output files loaded into a DataFrame, use add_confidence_columns.

import pandas as pd
from llm_structured_confidence import add_confidence_columns

# Vertex AI batch output
df = pd.read_json("vertex_batch_output.jsonl", lines=True)
df = add_confidence_columns(df, response_column="response", field="category")

# OpenAI batch output
df = pd.read_json("openai_batch_output.jsonl", lines=True)
df["body"] = df["response"].apply(lambda r: r["body"])
df = add_confidence_columns(df, response_column="body", field="category")

When response_schema= is provided, add_confidence_columns() also adds {prefix}_top_alt_resolved with the full enum/literal value for the best alternative when the first token uniquely identifies it.

The same resolution is available in extract_confidence(...) via the top_alternative_resolved key.

Token inspection

for token in fl.tokens:
    print(f"  {token.token!r:20s}  logprob={token.logprob:.4f}  prob={token.probability:.2%}")
# 'health'             logprob=-0.1683  prob=84.51%
# ' and'               logprob= 0.0000  prob=100.00%
# ' wellness'          logprob= 0.0000  prob=100.00%

Structural tokens ('{"', '":"', '"}') are never included.

Top alternatives

for alt in fl.top_logprobs:
    print(f"  {alt.token!r:20s}  prob={alt.probability:.2%}")
# 'health'             prob=84.51%
# 'tech'               prob=15.47%    ← "technology"
# 'sport'              prob=0.01%     ← "sports"

If you also pass response_schema=..., each alternative preserves the raw token and exposes the resolved full value when the token is enough to identify a unique enum/literal choice.

result = extract_field_logprobs(response, response_schema=Classification)
fl = result["health and wellness"]

for alt in fl.top_logprobs:
    print(alt.token, "->", alt.resolved_value)
# health -> health and wellness
# tech -> technology
# sport -> sports

If a token prefix is ambiguous across allowed values, resolved_value stays None.
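The resolution rule described above can be sketched as a simple unique-prefix lookup. This is a minimal illustration under the assumption that resolution is a plain `str.startswith` match; `resolve_prefix` is a hypothetical helper, not a function exported by the library.

```python
from typing import Optional, Sequence


def resolve_prefix(token: str, choices: Sequence[str]) -> Optional[str]:
    """Return the single allowed value that starts with `token`, else None."""
    if not token:
        return None  # an empty token cannot identify anything
    matches = [choice for choice in choices if choice.startswith(token)]
    return matches[0] if len(matches) == 1 else None


choices = ["sports", "health and wellness", "technology"]
print(resolve_prefix("tech", choices))                 # technology
print(resolve_prefix("sport", choices))                # sports
print(resolve_prefix("s", ["sports", "science"]))      # None (ambiguous)
```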

Understanding the Metrics

Why log-probabilities? (numerical stability)

Multiplying many small probabilities causes underflow. Logarithms convert multiplication to addition:

log(A × B) = log(A) + log(B)

So instead of P("health") × P(" and") × P(" wellness") = 0.845, we compute sum(logprobs) = -0.168 and convert back: exp(-0.168) = 0.845.
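A small demonstration of the underflow point, using only the standard library. The sequence of 100 probabilities of 1e-5 is a made-up extreme case chosen to force underflow in 64-bit floats.

```python
import math

# Hypothetical long sequence: 100 tokens, each with probability 1e-5.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p  # the true value, 1e-500, is far below the float64 range

log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # a finite number, roughly -1151.3

# The README example in log space: 'health', ' and', ' wellness'
print(f"{math.exp(sum([-0.168, 0.0, 0.0])):.3f}")  # 0.845
```

The product collapses to zero, while the sum of logs keeps the information and can still be compared across sequences.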

Why mean_nonzero matters for ENUMs

With the enum ["health and wellness", "sports", "technology"], once the model generates "health", the remaining tokens are forced (logprob = 0). The regular mean gets diluted by those zeros:

"health and wellness" (3 tokens): mean = (-0.168 + 0 + 0) / 3 → 94.6%  ← inflated
"technology" (2 tokens):          mean = (-0.088 + 0) / 2     → 95.7%  ← inflated differently

Longer names get more dilution. mean_nonzero fixes this by averaging only tokens where the model had a choice:

"health and wellness": mean_nonzero = -0.168 / 1 → 84.5%  ← real confidence
"technology":          mean_nonzero = -0.088 / 1 → 91.6%  ← real confidence

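The dilution effect can be made visible by holding the first-token logprob fixed and varying only the number of forced tokens that follow. This is a standalone sketch using the -0.168 figure from the example above; the token counts are arbitrary.

```python
import math

first_logprob = -0.168  # the only token where the model had a real choice

rows = []
for forced in range(4):
    logprobs = [first_logprob] + [0.0] * forced
    mean_p = math.exp(sum(logprobs) / len(logprobs))
    nonzero = [lp for lp in logprobs if lp != 0.0]
    mean_nonzero_p = math.exp(sum(nonzero) / len(nonzero))
    rows.append((len(logprobs), mean_p, mean_nonzero_p))

for n_tokens, mean_p, mean_nonzero_p in rows:
    print(f"{n_tokens} token(s): mean={mean_p:.1%}  mean_nonzero={mean_nonzero_p:.1%}")
```

The plain mean climbs as the value gets longer even though the model's decision is identical in every row, while the non-zero mean stays put.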
API Reference

`extract_field_logprobs(response, *, field=None, response_schema=None)`

Parameter	Type	Description
`response`	`Any`	`litellm.ModelResponse`, `openai.ChatCompletion`, or `google.genai.GenerateContentResponse` with logprobs
`field`	`str \| None`	JSON field name (e.g. `"category"`). Takes precedence over `response_schema`.
`response_schema`	`type \| dict[str, Any] \| None`	Pydantic model or JSON Schema — auto-detects enum-valued fields

Returns dict[str, FieldLogprob] — maps each value (as string) to its metrics.

Precedence: field > response_schema > all top-level fields.

`FieldLogprob`

Attribute	Type	Description
`value`	`Any`	The parsed value
`tokens`	`list[TokenInfo]`	Tokens included in the calculation
`joint_logprob`	`float`	Sum of all token logprobs
`joint_probability`	`float`	`exp(joint_logprob)`
`mean_logprob`	`float`	Mean of all token logprobs
`mean_probability`	`float`	`exp(mean_logprob)`
`mean_nonzero_logprob`	`float \| None`	Mean of logprobs where logprob ≠ 0 (or 0.0 if all zero)
`mean_nonzero_probability`	`float \| None`	`exp(mean_nonzero_logprob)` (or 1.0 if all zero)
`top_logprobs`	`list[TopAlternative]`	Alternatives from the first uncertain token

`TokenInfo`

Attribute	Type	Description
`token`	`str`	Token text
`logprob`	`float`	Log-probability
`probability`	`float`	`exp(logprob)` — property
`char_start` / `char_end`	`int`	Position in the JSON string

`TopAlternative`

Attribute	Type	Description
`token`	`str`	Alternative token text
`logprob`	`float`	Its log-probability
`resolved_value`	`Any \| None`	Full enum/literal value when `response_schema=` provides choices and the token prefix matches exactly one
`probability`	`float`	`exp(logprob)` — property

How It Works

1. Normalize: detect litellm/OpenAI or google-genai format and convert to a common (content, tokens) representation
2. Parse: feed the JSON to a Lark LALR parser with position tracking
3. Strip quotes: shrink string ranges by 1 on each side to exclude the surrounding "
4. Overlap: include only tokens whose character span overlaps the value range
5. Metrics: compute the three logprob metrics from the included tokens

JSON:   {"category":"health and wellness"}
                     ^^^^^^^^^^^^^^^^^^^
                     value range [13, 32) ← quotes stripped

Tokens:  '{"'  'category'  '":"'  'health'  ' and'  ' wellness'  '"}'
          ↑                  ↑                                      ↑
       excluded           excluded                               excluded

Included: 'health' + ' and' + ' wellness'  ✓
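The strip-quotes and overlap steps can be sketched in a few lines. This is an illustration only, assuming tokens concatenate exactly to the JSON string and that positions count characters; the helper names are hypothetical and are not the library's internal functions.

```python
def token_char_ranges(tokens: list[str]) -> list[tuple[int, int]]:
    """Half-open [start, end) character span of each token in the joined string."""
    ranges, pos = [], 0
    for tok in tokens:
        ranges.append((pos, pos + len(tok)))
        pos += len(tok)
    return ranges


def tokens_overlapping(tokens: list[str], start: int, end: int) -> list[str]:
    """Tokens whose span intersects the half-open range [start, end)."""
    return [
        tok
        for tok, (s, e) in zip(tokens, token_char_ranges(tokens))
        if s < end and e > start
    ]


content = '{"category":"health and wellness"}'
tokens = ['{"', "category", '":"', "health", " and", " wellness", '"}']

# Span of the quoted string, then shrink by 1 on each side to drop the quotes.
quoted_start = content.index('"health')
quoted_end = content.rindex('"') + 1
value_start, value_end = quoted_start + 1, quoted_end - 1

print(tokens_overlapping(tokens, quoted_start, quoted_end))
# ['":"', 'health', ' and', ' wellness', '"}']   <- naive: structural tokens leak in
print(tokens_overlapping(tokens, value_start, value_end))
# ['health', ' and', ' wellness']
```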

Supported Providers

Provider	Response type	Logprobs	Structured output docs
litellm (recommended)	`ModelResponse`	`logprobs=True, top_logprobs=5`	JSON mode
OpenAI	`ChatCompletion`	`logprobs=True, top_logprobs=5`	Structured Outputs
OpenAI batch	raw `dict` body with `choices`	from batch output file	Batch API
google-genai	`GenerateContentResponse`	`response_logprobs=True, logprobs=5`	Structured output
Vertex AI batch	raw `dict` response with `candidates`	from batch output file	Batch predictions

[!TIP] For classification tasks, consider disabling thinking/reasoning to get cleaner logprobs (no reasoning tokens mixed in). This applies to any model that supports it, but depends on your use case — reasoning may improve accuracy for complex classifications.

litellm: reasoning_effort="none"

google-genai: thinking_config=types.ThinkingConfig(thinking_budget=0)

Lower-level API

For custom workflows, internal modules are available:

from llm_structured_confidence._parser import parse_json_spans, build_token_char_ranges, tokens_for_span
from llm_structured_confidence._converter import normalize_response

parsed = parse_json_spans('{"category": "sports", "count": 2}')
# parsed["category"] → _ValueSpan(value="sports", char_start=14, char_end=20)

norm = normalize_response(response)
# norm.content → JSON string, norm.tokens → list of NormalizedToken

[!NOTE] These are underscore-prefixed internal APIs that may change in minor releases. Prefer extract_field_logprobs when possible.

Using with AI Agents

The AGENTS.md file contains a compact API reference designed for LLM-based coding agents (Cursor, Copilot, etc.).

Running Tests

# Unit tests (55 tests, no API calls)
pytest llm_structured_confidence/tests/test_unit.py -v

# E2E tests (6 tests, calls Vertex AI)
pytest llm_structured_confidence/tests/test_e2e.py -v -s

# All tests
pytest -v

Publishing

Release automation for PyPI via GitHub Actions is documented in RELEASING.md.

For the common release flow, you can also use:

./scripts/release.sh X.Y.Z

The script refuses to run outside main, with a dirty Git tree, without gh auth, or if the release tag already exists.

Release history

0.4.5

Mar 13, 2026

0.4.4

Mar 13, 2026

0.4.3

Mar 13, 2026

0.4.2

Mar 13, 2026

0.4.1

Mar 13, 2026

0.4.0

Mar 13, 2026

0.3.0

Mar 13, 2026

This version

0.2.0

Mar 13, 2026

0.1.1

Mar 12, 2026

0.1.0

Mar 12, 2026

Download files

Source Distribution

llm_structured_confidence-0.2.0.tar.gz (31.9 kB)

Uploaded Mar 13, 2026 Source

Built Distribution

llm_structured_confidence-0.2.0-py3-none-any.whl (28.7 kB)

Uploaded Mar 13, 2026 Python 3

File details

Details for the file llm_structured_confidence-0.2.0.tar.gz.

File metadata

Download URL: llm_structured_confidence-0.2.0.tar.gz
Upload date: Mar 13, 2026
Size: 31.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llm_structured_confidence-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`58c1af0d886fb1232f51fa60c08babad2a903e0b0ccf025266294e04bed56994`
MD5	`9b4a763a7520d66f728d403e99c9c8c5`
BLAKE2b-256	`34103c459a69650b81cfc847f1e0aa68e90737b7f3a78a5ae9c2463719de20fc`

Provenance

The following attestation bundles were made for llm_structured_confidence-0.2.0.tar.gz:

Publisher: release.yml on rodolfonobrega/llm-structured-confidence

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_structured_confidence-0.2.0.tar.gz
- Subject digest: 58c1af0d886fb1232f51fa60c08babad2a903e0b0ccf025266294e04bed56994
- Sigstore transparency entry: 1096825693
- Sigstore integration time: Mar 13, 2026
Source repository:
- Permalink: rodolfonobrega/llm-structured-confidence@fe7e9f2f387e0d88787387ddc197edcd95c491a4
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/rodolfonobrega
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@fe7e9f2f387e0d88787387ddc197edcd95c491a4
- Trigger Event: release

File details

Details for the file llm_structured_confidence-0.2.0-py3-none-any.whl.

File metadata

Download URL: llm_structured_confidence-0.2.0-py3-none-any.whl
Upload date: Mar 13, 2026
Size: 28.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for llm_structured_confidence-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e95e5c51d79a163cb7b4ee928dce1a6e1fe37cdd7e257cd59859f09e8b9d0214`
MD5	`529650c4c1b646b907d30dcccd0d22d5`
BLAKE2b-256	`51d465cbd543d0a4bccb1c9653967b05b8f2a92e87d50741a533c14ff0cfc490`

Provenance

The following attestation bundles were made for llm_structured_confidence-0.2.0-py3-none-any.whl:

Publisher: release.yml on rodolfonobrega/llm-structured-confidence

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_structured_confidence-0.2.0-py3-none-any.whl
- Subject digest: e95e5c51d79a163cb7b4ee928dce1a6e1fe37cdd7e257cd59859f09e8b9d0214
- Sigstore transparency entry: 1096825702
- Sigstore integration time: Mar 13, 2026
Source repository:
- Permalink: rodolfonobrega/llm-structured-confidence@fe7e9f2f387e0d88787387ddc197edcd95c491a4
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/rodolfonobrega
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@fe7e9f2f387e0d88787387ddc197edcd95c491a4
- Trigger Event: release

llm-structured-confidence 0.2.0
