
LLM Validation Framework

A comprehensive Python framework for evaluating LLM-extracted structured data against ground truth labels. Supports binary classification, scalar values, and list fields with detailed performance metrics, confidence-based evaluation, and statistical uncertainty quantification via non-parametric bootstrap confidence intervals.

✨ Key Features

  • Multi-field validation - Binary (True/False), scalar (single values), and list (multiple values) data types
  • Partial labeling support - Handle datasets where different cases have labels for different subsets of fields
  • Dual usage modes - Validate pre-computed results OR run live LLM inference with validation
  • Comprehensive metrics - Precision, recall, F1/F2, accuracy, specificity with both micro and macro aggregation
  • Confidence analysis - Automatic performance breakdown by confidence levels
  • Statistical uncertainty - Non-parametric bootstrap confidence intervals for all performance metrics
  • Production ready - Parallel processing, intelligent caching, detailed progress tracking

🚀 Quick Start

Prerequisites

# Install from PyPI
pip install llmvalidate

# OR install from source
pip install -r requirements.txt  # Python 3.11+ required

Demo

python runme.py

Processes the included samples.csv (14 test cases covering all validation scenarios) and outputs timestamped results to validation_results/samples/:

  • Results CSV - Row-by-row comparison with confusion matrix counts and item-level details
  • Metrics CSV - Aggregated performance statistics with confidence breakdowns
  • CI Metrics CSV - Confidence intervals for metrics

Rows    Field Type                              Test Scenarios
1-4     Binary (Has metastasis)                 True Positive, True Negative, False Positive, False Negative
5-9     Scalar (Diagnosis, Histology)           Correct, incorrect, missing, spurious, and empty extractions
10-14   List (Treatment Drugs, Test Results)    Perfect match, spurious items, missing items, correct empty, mixed results

📊 Usage Modes

Mode 1: Validate Existing Results

When you have LLM predictions in Res: {Field Name} columns:

import pandas as pd
from src.validation import validate

df = pd.read_csv("data.csv", index_col="Patient ID")
# df must contain: "Field Name" and "Res: Field Name" columns

results_df, metrics_df = validate(
    source_df=df,
    fields=["Diagnosis", "Treatment"],  # or None for auto-detection
    structure_callback=None,
    output_folder="validation_results"
)

Mode 2: Live LLM Inference + Validation

from src.structured import StructuredResult, StructuredGroup, StructuredField
from src.utils import flatten_structured_result

def llm_callback(row, i, raw_text_column_name):
    raw_text = row[raw_text_column_name]
    # Your LLM inference logic here
    result = StructuredResult(
        groups=[StructuredGroup(
            group_name="medical",
            fields=[
                StructuredField(name="Diagnosis", value="Cancer", confidence="High"),
                StructuredField(name="Treatment", value=["Drug A"], confidence="Medium")
            ]
        )]
    )
    return flatten_structured_result(result), {}

results_df, metrics_df = validate(
    source_df=df,
    fields=["Diagnosis", "Treatment"],
    structure_callback=llm_callback,
    raw_text_column_name="medical_report",
    output_folder="validation_results",
    max_workers=4
)

📋 Input Data Requirements

DataFrame Format

  • Unique index - Each row must have a unique identifier (e.g., "Patient ID")
  • Label columns - Ground truth values for each field you want to validate
  • Result columns (Mode 1 only) - LLM predictions as Res: {Field Name} columns
  • Raw text column (Mode 2 only) - Source text for LLM inference (e.g., "medical_report")
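
For reference, here is a minimal sketch of a valid Mode 1 input DataFrame (field names and values are illustrative, not taken from the package):

import pandas as pd

# Minimal Mode 1 input: a unique "Patient ID" index, ground-truth label
# columns, and matching "Res: {Field Name}" prediction columns.
df = pd.DataFrame(
    {
        "Diagnosis": ["Lung Cancer", "Breast Cancer"],      # labels
        "Res: Diagnosis": ["Lung Cancer", "Colon Cancer"],  # LLM predictions
        "Treatment": [["Drug A", "Drug B"], ["Drug C"]],
        "Res: Treatment": [["Drug A"], ["Drug C"]],
    },
    index=pd.Index(["P001", "P002"], name="Patient ID"),
)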

Supported Field Types

Type     Description                  Label Examples                                Result Examples
Binary   True/False detection         True, False                                   True, False
Scalar   Single text/numeric value    "Lung Cancer", 42                             "Breast Cancer", 38
List     Multiple values              ["Drug A", "Drug B"], "['Item1', 'Item2']"    ["Drug A"], []

Special Value Handling

  • "-" = Labeled as "No information is available in the source document"
  • null/empty/NaN = Field not labeled/evaluated (supports partial labeling where different cases may have labels for different field subsets)
  • Lists - Can be Python lists ["a", "b"] or stringified "['a', 'b']" (auto-converted; see the sketch below)
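
The snippet below mimics the stringified-list auto-conversion using ast.literal_eval; the framework's own helper lives in src/utils.py, so treat this standalone version as an approximation of its behavior:

import ast

def to_list(value):
    # Pass real lists through, parse stringified lists, and leave other
    # values (e.g., "-" or NaN) untouched.
    if isinstance(value, list):
        return value
    if isinstance(value, str) and value.strip().startswith("["):
        return ast.literal_eval(value)
    return value

print(to_list("['Item1', 'Item2']"))  # ['Item1', 'Item2']
print(to_list(["a", "b"]))            # ['a', 'b'] (already a list)
print(to_list("-"))                   # '-' (explicit "no information")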

Partial Labeling Support

The framework supports partial labeling scenarios where:

  • Not every case needs labels for every field
  • Different cases can have labels for different subsets of fields
  • Missing labels (null/NaN) are handled gracefully in all metrics calculations
  • Use "-" when the document explicitly lacks information about a field
  • Use null/NaN when the field simply wasn't labeled for that case

📈 Output Files

A validate() run generates two timestamped CSV files (the demo additionally writes the CI metrics CSV produced by bootstrap_CI, described below):

1. Results CSV (YYYY-MM-DD HH-MM-SS results.csv)

Row-level analysis with detailed per-case metrics:

Original Data:

  • All input columns (labels, raw text, etc.)
  • Res: {Field} columns with LLM predictions
  • Res: {Field} confidence and Res: {Field} justification (if available)

Binary Fields:

  • TP/FP/FN/TN: {Field} - Confusion matrix counts (1 or 0 per row)

Non-Binary Fields:

  • Cor/Inc/Mis/Spu: {Field} - Item counts per row
  • Cor/Inc/Mis/Spu: {Field} items - Actual item lists
  • Precision/Recall/F1/F2: {Field} - Per-row metrics (list fields only)

System Columns:

  • Sys: from cache - Whether the result was served from cache (identical raw text is processed only once)
  • Sys: exception - Error information if processing failed
  • Sys: time taken - Processing time per row in seconds
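
As a quick sanity check, you can load a results file and inspect these system columns; the path below is a placeholder for the timestamped file your run produced:

import pandas as pd

results_path = "validation_results/samples/2025-01-01 00-00-00 results.csv"  # placeholder
res = pd.read_csv(results_path)

print(res["Sys: time taken"].describe())                         # per-row timing
print(res.loc[res["Sys: exception"].notna(), "Sys: exception"])  # failed rows, if any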

2. Metrics CSV (YYYY-MM-DD HH-MM-SS metrics.csv)

Aggregated statistics with confidence breakdowns:

Core Information:

  • field - Field name being evaluated
  • confidence - Confidence level ("Overall", "High", "Medium", "Low", etc.)
  • labeled cases - Total rows with ground truth labels
  • field-present cases - Rows where document has information about the field (label is not '-')

Binary Metrics: TP, TN, FP, FN, precision, recall, F1/F2, accuracy, specificity

Non-Binary Metrics: cor, inc, mis, spu, precision/recall/F1/F2 (micro), precision/recall/F1/F2 (macro)

⚡ Performance Metrics Explained

Binary Classification Metrics

For fields with True/False values (e.g., "Has metastasis"):

Confusion Matrix Counts

Count                 Definition                        Example
TP (True Positive)    Correctly predicted positive      Label: True, Prediction: True → TP=1
TN (True Negative)    Correctly predicted negative      Label: False, Prediction: False → TN=1
FP (False Positive)   Incorrectly predicted positive    Label: False, Prediction: True → FP=1
FN (False Negative)   Incorrectly predicted negative    Label: True, Prediction: False → FN=1

Binary Classification Formulas

Metric        Formula                            Meaning
Precision     TP / (TP + FP)                     Of all positive predictions, how many were correct?
Recall        TP / (TP + FN)                     Of all actual positives, how many were found?
Accuracy      (TP + TN) / (TP + TN + FP + FN)    Overall percentage of correct predictions
Specificity   TN / (TN + FP)                     Of all actual negatives, how many were correctly identified?
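
The sketch below implements these formulas directly; it is a standalone illustration, not the framework's internal code:

def binary_metrics(tp, tn, fp, fn):
    # Returns None for a metric whose denominator is zero
    # (e.g., precision when there are no positive predictions).
    def safe_div(num, den):
        return num / den if den else None

    precision   = safe_div(tp, tp + fp)
    recall      = safe_div(tp, tp + fn)
    accuracy    = safe_div(tp + tn, tp + tn + fp + fn)
    specificity = safe_div(tn, tn + fp)
    return precision, recall, accuracy, specificity

# Example: 8 TPs, 5 TNs, 1 FP, 2 FNs
print(binary_metrics(8, 5, 1, 2))  # (0.888..., 0.8, 0.8125, 0.833...)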

Structured Extraction Metrics

For scalar and list fields (e.g., "Diagnosis", "Treatment Drugs"):

Core Counts (Per Case Analysis)

Count             Definition                                Example
Correct (Cor)     Items extracted correctly                 Label: ["DrugA", "DrugB"], Prediction: ["DrugA"] → Cor=1
Missing (Mis)     Items present in label but not extracted  (Same example) → Mis=1 (DrugB missing)
Spurious (Spu)    Items extracted but not in label          Label: ["DrugA"], Prediction: ["DrugA", "DrugC"] → Spu=1
Incorrect (Inc)   Wrong values for scalar fields            Label: "Cancer", Prediction: "Diabetes" → Inc=1

Structured Extraction Formulas

Metric      Formula                    Meaning
Precision   Cor / (Cor + Spu + Inc)    Of all extracted items, how many were correct?
Recall      Cor / (Cor + Mis + Inc)    Of all labeled items, how many were correctly extracted?

Note: For scalar fields, Inc (incorrect) is used; for list fields, Inc is typically 0 since items are either correct, missing, or spurious.

The following formulas apply to both binary classification and structured extraction metrics:

Metric     Formula                  Meaning
F1 Score   2 × (P × R) / (P + R)    Balanced harmonic mean of precision and recall
F2 Score   5 × (P × R) / (4P + R)   Recall-weighted F-score (emphasizes recall over precision)

Where P = Precision and R = Recall (calculated differently for each metric type).
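
Putting the structured extraction counts and F-scores together, here is a small standalone sketch (again illustrative, not the framework's implementation):

def extraction_metrics(cor, inc, mis, spu):
    # Precision/recall from the Cor/Inc/Mis/Spu counts defined above,
    # then F1/F2 via the shared F-score formulas.
    denom_p = cor + spu + inc
    denom_r = cor + mis + inc
    precision = cor / denom_p if denom_p else None
    recall = cor / denom_r if denom_r else None
    if not precision or not recall:
        return precision, recall, None, None
    f1 = 2 * (precision * recall) / (precision + recall)
    f2 = 5 * (precision * recall) / (4 * precision + recall)
    return precision, recall, f1, f2

# Label: ["DrugA", "DrugB"], Prediction: ["DrugA", "DrugC"]
# -> Cor=1, Mis=1 (DrugB), Spu=1 (DrugC), Inc=0
print(extraction_metrics(cor=1, inc=0, mis=1, spu=1))  # (0.5, 0.5, 0.5, 0.5)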

Bootstrap Confidence Intervals

The framework includes statistical confidence interval estimation using non-parametric bootstrap resampling at the case level. This provides uncertainty quantification for all validation metrics.

Usage

from src.validation import bootstrap_CI

# After running validation to get results_df
ci_results = bootstrap_CI(
    res_df=results_df,           # Results from validate() function
    fields=["diagnosis", "treatment"],  # Fields to analyze (or None for auto-detect)
    n_bootstrap=5000,            # Number of bootstrap samples (default: 5000)
    ci=0.95,                     # Confidence level (default: 0.95 for 95% CI)
    random_state=42              # For reproducible results
)

Bootstrap Method

  • Resampling unit: Individual cases (not individual predictions)
  • Resampling strategy: Sample with replacement to preserve original dataset size
  • CI calculation: Percentile method using bootstrap distribution
  • Partial labeling: Handles missing labels gracefully - cases with missing labels for specific fields are excluded from calculations for those fields only
  • Metrics included: All validation metrics (precision, recall, F1, accuracy, etc.)
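
For intuition, a case-level percentile bootstrap for a single metric (micro precision here) can be sketched as follows; the framework's own implementation may differ in details:

import numpy as np

def percentile_ci(case_tp, case_fp, n_bootstrap=5000, ci=0.95, random_state=42):
    # Resample whole cases with replacement (keeping the original dataset
    # size), recompute the metric on each resample, then take percentiles.
    rng = np.random.default_rng(random_state)
    tp = np.asarray(case_tp, dtype=float)
    fp = np.asarray(case_fp, dtype=float)
    n = len(tp)
    stats = []
    for _ in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)
        denom = tp[idx].sum() + fp[idx].sum()
        if denom:
            stats.append(tp[idx].sum() / denom)
    alpha = 1 - ci
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(np.mean(stats)), float(lower), float(upper)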

Output Format

The bootstrap_CI() function returns a DataFrame with confidence intervals for each field:

Column            Description
field             Field name (including 'exceptions' for system metrics and 'N={n}; CI={level}%' for parameters)
labeled cases     Number of labeled cases in the dataset
{metric}: mean    Bootstrap mean estimate
{metric}: lower   Lower bound of the confidence interval
{metric}: upper   Upper bound of the confidence interval

Example output:

            field  labeled cases  precision (micro): mean  precision (micro): lower  precision (micro): upper
0      exceptions           1000                      NaN                       NaN                       NaN
1       diagnosis           1000                     0.82                      0.79                      0.85
2       treatment           1000                     0.91                      0.88                      0.94
3  N=5000; CI=95%            NaN                      NaN                       NaN                       NaN

The final row contains bootstrap parameters for reference: sample size (N) and confidence interval level (CI).

Use Cases

  • Performance assessment: Quantify uncertainty in reported metrics
  • Model comparison: Determine if performance differences are statistically significant
  • Sample size planning: Understand precision of estimates with current dataset size
  • Publication: Report confidence intervals alongside point estimates

🛠️ Advanced Configuration

Parallel Processing

validate(
    source_df=df,
    fields=["diagnosis", "treatment"], 
    structure_callback=callback,
    max_workers=None,      # Auto-detect CPU count (or specify number)
    use_threads=True       # True for I/O-bound (LLM API calls), False for CPU-bound
)

Performance Features

  • Automatic caching - Identical raw text inputs are deduplicated and cached
  • Progress tracking - Real-time progress bar for long-running validations
  • Cache statistics - Check Sys: from cache column in results to monitor cache hits
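
For example, assuming the Sys: from cache column holds booleans, the cache hit rate can be read straight off the results DataFrame:

cache_hit_rate = results_df["Sys: from cache"].mean()
print(f"Cache hit rate: {cache_hit_rate:.1%}")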

Confidence Analysis

When LLM inference returns both extracted fields and their associated confidence levels, the framework automatically detects Res: {Field} confidence columns and generates:

  • Separate metrics for each unique confidence level found in your data
  • Overall metrics aggregating across all confidence levels
  • Useful for setting confidence thresholds and analyzing prediction reliability
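
For example, to compare a field's metrics across confidence levels (using the field and confidence columns described under Metrics CSV above):

diag = metrics_df[metrics_df["field"] == "Diagnosis"]
print(diag[diag["confidence"].isin(["Overall", "High"])])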

🧪 Development & Testing

# Install development dependencies
pip install -r requirements.txt

# Run all tests
pytest  

# Run with coverage reporting
pytest --cov=src

# Run specific test modules
pytest tests/validate_test.py              # Core validation logic
pytest tests/compare_results_test.py       # Comparison algorithms  
pytest tests/compare_results_all_test.py   # End-to-end comparisons

๐Ÿ“ Project Structure

llm-validation-framework/
├── src/
│   └── llmvalidate/
│       ├── validation.py    # Main validation pipeline and metrics calculation
│       ├── structured.py    # Pydantic data models for LLM results
│       ├── utils.py         # Utility functions (list conversion, flattening)
│       └── standardize.py   # Data standardization helpers
├── tests/                   # Comprehensive test suite
├── validation_results/      # Output directory (auto-created)
├── samples.csv              # Demo dataset with all validation scenarios
├── runme.py                 # Demo script
└── requirements.txt         # Dependencies (pandas, pydantic, tqdm, etc.)

🔧 Troubleshooting

Error                      Solution
"Cannot infer fields"      Ensure the DataFrame has both {Field} and Res: {Field} columns when structure_callback=None
"Missing fields"           Verify that the fields parameter contains column names that exist in your DataFrame
"Duplicate index"          Use df.reset_index(drop=True) or ensure your DataFrame index has unique values
Import/dependency errors   Run pip install -r requirements.txt and verify Python 3.11+
Slow performance           Enable parallel processing with max_workers=None and use_threads=True for LLM API calls
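
Some of these checks can be automated before calling validate(); the snippet below is an illustrative pre-flight check, with data.csv and the field list standing in for your own inputs:

import pandas as pd

df = pd.read_csv("data.csv", index_col="Patient ID")  # your own data
fields = ["Diagnosis", "Treatment"]                   # your own fields

# "Duplicate index": fall back to a fresh unique index.
if not df.index.is_unique:
    df = df.reset_index(drop=True)

# "Missing fields": fail early with the offending names.
missing = [f for f in fields if f not in df.columns]
if missing:
    raise ValueError(f"Missing fields: {missing}")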

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
