Extract per-field confidence scores from LLM structured JSON outputs using token-level log-probabilities.

llm-structured-confidence

Designed for Structured Outputs — available in OpenAI, Gemini, and other providers. Works with any JSON schema, but ideal for ENUM-based classification where the model picks from a fixed set of values.

We recommend litellm as a unified interface for calling any provider with structured output and logprobs.

For a compact end-to-end guide to the full public API, see docs/USAGE.md.

The Problem

When an LLM returns structured JSON with logprobs, token boundaries don't align with field values. A single token like '":"' can merge a closing quote, a colon, and an opening quote, and in other tokenizations part of the value as well, all under one logprob.

Token          Logprob     What it contains
──────────────────────────────────────────────
'{"'           -0.006      { and opening "
'category'      0.000      the key
'":"'          -0.200      closing ", colon, opening "  ← structural, NOT the value
'health'       -0.168      ← actual value content
' and'          0.000      ← actual value content
' wellness'     0.000      ← actual value content
'"}'            0.000      closing " and }

Naively combining every token that overlaps the quoted value, including the structural '":"' token, gives exp(-0.200 - 0.168) ≈ 69% instead of the correct exp(-0.168) ≈ 84.5%.

This library parses the JSON precisely, strips structural tokens, and computes confidence using only the tokens that carry actual value content.
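
The gap between the two numbers can be reproduced by hand from the table above. The following is a minimal sketch using only the standard library; the token list is copied from the example, and the boolean flag is a hand-written annotation for illustration, not something the library exposes.

```python
import math

# (token, logprob, carries_value_content), copied from the table above.
# The boolean flag is a manual annotation for this illustration only.
tokens = [
    ('{"', -0.006, False),
    ("category", 0.000, False),
    ('":"', -0.200, False),      # structural, but touches the quoted value
    ("health", -0.168, True),
    (" and", 0.000, True),
    (" wellness", 0.000, True),
    ('"}', 0.000, False),
]

# Naive: every token that overlaps the quoted string "health and wellness",
# which drags in the structural '":"' and '"}' tokens.
naive = math.exp(sum(lp for _, lp, _ in tokens[2:]))

# Value-only: just the tokens that carry actual value content.
value_only = math.exp(sum(lp for _, lp, is_value in tokens if is_value))

print(f"naive:      {naive:.1%}")       # naive:      69.2%
print(f"value only: {value_only:.1%}")  # value only: 84.5%
```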

Installation

pip install llm-structured-confidence

For DataFrame helpers:

pip install "llm-structured-confidence[pandas]"

Import path:

from llm_structured_confidence import extract_field_logprobs, extract_path_logprobs

Or from source:

git clone https://github.com/rodolfonobrega/llm-structured-confidence.git
cd llm-structured-confidence
pip install -e ".[dev]"

Quick Start

import litellm
from llm_structured_confidence import extract_field_logprobs

response = litellm.completion(
    model="gpt-4.1-mini",  # or any provider: "vertex_ai/gemini-2.5-flash", etc.
    messages=[
        {"role": "system", "content": "Classify this text."},
        {"role": "user", "content": "Morning yoga and meditation session"},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "classification",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "category": {
                        "type": "string",
                        "enum": ["sports", "health and wellness", "technology"],
                    }
                },
                "required": ["category"],
                "additionalProperties": False,
            },
        },
    },
    logprobs=True,
    top_logprobs=5,
)

result = extract_field_logprobs(response, field="category")

for value, fl in result.items():
    print(f"{value}: {fl.mean_nonzero_probability:.2%}")  # health and wellness: 84.51%

Highlights

Three confidence metrics

Metric                     Formula                   Best for
──────────────────────────────────────────────────────────────────────────────────────────────────────
joint_probability          exp(sum(logprobs))        Strictest: literal sequence probability
mean_probability           exp(mean(logprobs))       General-purpose: fair across token counts
mean_nonzero_probability   exp(mean(logprobs ≠ 0))   ENUM classification: ignores deterministic tokens

[!TIP] With ENUMs, usually only the first token of a value carries real uncertainty; the remaining tokens are forced by the constraint and have a logprob of 0. mean_nonzero_probability filters those out, so the score reflects the model's actual choice regardless of category name length.
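
As a rough illustration of the three formulas in the table, here is a sketch that computes them from a plain list of value-token logprobs. The function name `confidence_metrics` is hypothetical, and the handling of the all-zero case is an assumption; the library's own edge-case behavior may differ.

```python
import math


def confidence_metrics(logprobs):
    """Compute the three metrics from a list of value-token logprobs.

    Hypothetical helper for illustration; not part of the library's API.
    """
    if not logprobs:
        raise ValueError("need at least one logprob")
    nonzero = [lp for lp in logprobs if lp != 0.0]
    return {
        "joint_probability": math.exp(sum(logprobs)),
        "mean_probability": math.exp(sum(logprobs) / len(logprobs)),
        # Assumption: if every token is deterministic, report full confidence.
        "mean_nonzero_probability": (
            math.exp(sum(nonzero) / len(nonzero)) if nonzero else 1.0
        ),
    }


# 'health', ' and', ' wellness' from the example in The Problem
metrics = confidence_metrics([-0.168, 0.0, 0.0])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# joint_probability: 0.845
# mean_probability: 0.946
# mean_nonzero_probability: 0.845
```

Note how mean_probability is pulled upward by the two forced tokens, while mean_nonzero_probability stays at the probability of the one token where the model actually made a choice.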

Works with explicit fields, Pydantic, and JSON Schema

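# Explicit field name
result = extract_field_logprobs(response, field="category")
# Pydantic model class (here, a model named Classification)
result = extract_field_logprobs(response, response_schema=Classification)
# JSON Schema dict
result = extract_field_logprobs(response, response_schema=schema)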

Arrays and batch classification

# {"categories": ["health and wellness", "sports", "technology"]}
result = extract_field_logprobs(response, field="categories")

Simple arrays of atomic values are also supported directly:

# {"classifications": ["Positive", "Negative", "Neutral"]}
result = extract_field_logprobs(response, field="classifications")

for value, fl in result.items():
    print(value, fl.mean_nonzero_probability)

If you need item positions, use the path-aware API:

results = extract_path_logprobs(response, field_path="classifications[]")
print(results[0].path)   # classifications[0]
print(results[0].value)  # Positive

Nested arrays of objects

Use the path-aware API when values live inside arrays or nested objects.

results = extract_path_logprobs(response, field_path="classifications[].name")

for entry in results:
    print(entry.path, entry.value, entry.field_logprob.mean_nonzero_probability)
    # classifications[0].name Positive 0.96
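
To make the path syntax concrete, here is a sketch of how a pattern such as classifications[].name could expand into concrete paths over already-parsed JSON. The helper `expand_path` is hypothetical and only mirrors the path strings shown above; it is not the library's implementation and does not touch logprobs.

```python
def expand_path(data, field_path):
    """Expand a pattern like 'items[].name' into (concrete_path, value) pairs.

    Hypothetical helper for illustration; '[]' means "every array element".
    """
    results = [("", data)]
    for segment in field_path.split("."):
        is_array = segment.endswith("[]")
        key = segment[:-2] if is_array else segment
        expanded = []
        for prefix, node in results:
            path = f"{prefix}.{key}" if prefix else key
            child = node[key]
            if is_array:
                # One concrete path per array element, indexed from 0.
                for i, item in enumerate(child):
                    expanded.append((f"{path}[{i}]", item))
            else:
                expanded.append((path, child))
        results = expanded
    return results


payload = {"classifications": [{"name": "Positive"}, {"name": "Negative"}]}
for path, value in expand_path(payload, "classifications[].name"):
    print(path, value)
# classifications[0].name Positive
# classifications[1].name Negative
```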

Raw batch payloads are supported

Lines from OpenAI and Vertex AI batch output files can be passed in as plain dicts.

# OpenAI batch output line -> use response["body"]
scores = extract_field_logprobs(batch_row["response"]["body"], field="category")

# Vertex AI batch output line -> response dict itself
scores = extract_field_logprobs(batch_row["response"], field="category")

Pandas helpers

For batch output files loaded into a DataFrame, use add_confidence_columns.

import pandas as pd
from llm_structured_confidence import add_confidence_columns

# Vertex AI batch output
df = pd.read_json("vertex_batch_output.jsonl", lines=True)
df = add_confidence_columns(df, response_column="response", field="category")

# OpenAI batch output
df = pd.read_json("openai_batch_output.jsonl", lines=True)
df["body"] = df["response"].apply(lambda r: r["body"])
df = add_confidence_columns(df, response_column="body", field="category")

Resolved top alternatives

result = extract_field_logprobs(response, response_schema=Classification)
fl = result["health and wellness"]

for alt in fl.top_logprobs:
    print(alt.token, "->", alt.resolved_value)
# health -> health and wellness
# tech -> technology
# sport -> sports

If a token prefix is ambiguous across allowed values, resolved_value stays None.
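
One way to picture this resolution rule is as a unique-prefix match against the allowed values. The sketch below assumes exactly that; the function name `resolve_prefix` is hypothetical, and the library may apply extra normalization (for example to whitespace or casing) that is not shown here.

```python
def resolve_prefix(token, allowed_values):
    """Return the single allowed value that starts with `token`, else None.

    Hypothetical helper for illustration; not part of the library's API.
    """
    matches = [value for value in allowed_values if value.startswith(token)]
    # Zero matches or several matches both leave the token unresolved.
    return matches[0] if len(matches) == 1 else None


allowed = ["sports", "health and wellness", "technology"]
print(resolve_prefix("health", allowed))  # health and wellness
print(resolve_prefix("tech", allowed))    # technology

# Ambiguous: "sp" is a prefix of two allowed values, so nothing is resolved.
print(resolve_prefix("sp", ["sports", "spirituality"]))  # None
```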

Documentation

Detailed docs live in docs/USAGE.md.

The public API covered in the guide:

  • extract_field_logprobs(...)
  • extract_path_logprobs(...)
  • extract_confidence(...)
  • add_confidence_columns(...)
  • FieldLogprob
  • PathFieldLogprob
  • TokenInfo
  • TopAlternative

How It Works

  1. Normalize: detect litellm/OpenAI or google-genai format and convert to a common (content, tokens) representation
  2. Parse: feed the JSON to a Lark LALR parser with position tracking
  3. Strip quotes: shrink string ranges by 1 on each side to exclude the " characters
  4. Overlap: include only tokens whose character span overlaps the value range
  5. Metrics: compute the three logprob metrics from the included tokens

JSON:   {"category":"health and wellness"}
                     ^^^^^^^^^^^^^^^^^^^
                     value range [13, 32) ← quotes stripped

Tokens:  '{"'  'category'  '":"'  'health'  ' and'  ' wellness'  '"}'
          ↑                  ↑                                      ↑
       excluded           excluded                               excluded

Included: 'health' + ' and' + ' wellness'  ✓
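
The quote-stripping and overlap steps can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the library's code: the quoted string is located with str.index rather than a Lark parser, and the tokens are assumed to concatenate exactly to the JSON content.

```python
tokens = ['{"', "category", '":"', "health", " and", " wellness", '"}']
content = "".join(tokens)  # {"category":"health and wellness"}

# Stand-in for the parse step: find the quoted string by search.
# (Assumption for this sketch; the library uses a position-tracking parser.)
value = "health and wellness"
quoted_start = content.index(f'"{value}"')
quoted_end = quoted_start + len(value) + 2  # end-exclusive, includes both quotes

# Strip quotes: shrink the range by 1 on each side.
start, end = quoted_start + 1, quoted_end - 1

# Overlap: keep tokens whose half-open character span intersects [start, end).
included = []
offset = 0
for tok in tokens:
    tok_start, tok_end = offset, offset + len(tok)
    offset = tok_end
    if tok_start < end and tok_end > start:
        included.append(tok)

print(included)  # ['health', ' and', ' wellness']
```

Because the range is shrunk before the overlap test, the '":"' and '"}' tokens, which only touch the quotes, fall outside it and are excluded.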

Supported Providers

Provider                Response type                       Logprobs                             Structured output docs
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
litellm (recommended)   ModelResponse                       logprobs=True, top_logprobs=5        JSON mode
OpenAI                  ChatCompletion                      logprobs=True, top_logprobs=5        Structured Outputs
OpenAI batch            raw dict body with choices          from batch output file               Batch API
google-genai            GenerateContentResponse             response_logprobs=True, logprobs=5   Structured output
Vertex AI batch         raw dict response with candidates   from batch output file               Batch predictions

[!TIP] For classification tasks, consider disabling thinking/reasoning to get cleaner logprobs (no reasoning tokens mixed in). This applies to any model that supports it, but depends on your use case — reasoning may improve accuracy for complex classifications.

  • litellm: reasoning_effort="none"
  • google-genai: thinking_config=types.ThinkingConfig(thinking_budget=0)

Lower-level API

For custom workflows, internal modules are available:

from llm_structured_confidence._parser import parse_json_spans, build_token_char_ranges, tokens_for_span
from llm_structured_confidence._converter import normalize_response

parsed = parse_json_spans('{"category": "sports", "count": 2}')
# parsed["category"] → _ValueSpan(value="sports", char_start=15, char_end=21)

norm = normalize_response(response)
# norm.content → JSON string, norm.tokens → list of NormalizedToken

[!NOTE] These are underscore-prefixed internal APIs that may change in minor releases. Prefer extract_field_logprobs when possible.

Running Tests

# Unit tests
pytest llm_structured_confidence/tests/test_unit.py -v

# E2E tests
pytest llm_structured_confidence/tests/test_e2e.py -v -s

# All tests
pytest -v

Publishing

Release automation for PyPI via GitHub Actions is documented in RELEASING.md.

For the common release flow, you can also use:

./scripts/release.sh X.Y.Z

The script refuses to run on any branch other than main, with a dirty Git tree, without gh authentication, or if the release tag already exists.
