# LLM Validation Framework
A comprehensive Python framework for evaluating LLM-extracted structured data against ground truth labels. Supports binary classification, scalar values, and list fields with detailed performance metrics, confidence-based evaluation, and statistical uncertainty quantification via non-parametric bootstrap confidence intervals.
## ✨ Key Features

- **Multi-field validation** - Binary (True/False), scalar (single values), and list (multiple values) data types
- **Partial labeling support** - Handle datasets where different cases have labels for different subsets of fields
- **Dual usage modes** - Validate pre-computed results OR run live LLM inference with validation
- **Comprehensive metrics** - Precision, recall, F1/F2, accuracy, specificity with both micro and macro aggregation
- **Confidence analysis** - Automatic performance breakdown by confidence levels
- **Statistical uncertainty** - Non-parametric bootstrap confidence intervals for all performance metrics
- **Production ready** - Parallel processing, intelligent caching, detailed progress tracking
## 🚀 Quick Start

### Prerequisites

```bash
# Install from PyPI
pip install llmvalidate

# OR install from source
pip install -r requirements.txt  # Python 3.11+ required
```
### Demo

```bash
python runme.py
```
Processes the included `samples.csv` (14 test cases covering all validation scenarios) and outputs timestamped results to `validation_results/samples/`:
- **Results CSV** - Row-by-row comparison with confusion matrix counts and item-level details
- **Metrics CSV** - Aggregated performance statistics with confidence breakdowns
- **CI Metrics CSV** - Confidence intervals for the metrics
| Rows | Field Type | Test Scenarios |
|---|---|---|
| 1-4 | Binary (Has metastasis) | True Positive, True Negative, False Positive, False Negative |
| 5-9 | Scalar (Diagnosis, Histology) | Correct, incorrect, missing, spurious, and empty extractions |
| 10-14 | List (Treatment Drugs, Test Results) | Perfect match, spurious items, missing items, correct empty, mixed results |
## 📋 Usage Modes

### Mode 1: Validate Existing Results

When you have LLM predictions in `Res: {Field Name}` columns:
```python
import pandas as pd
from src.validation import validate

df = pd.read_csv("data.csv", index_col="Patient ID")
# df must contain "{Field Name}" and "Res: {Field Name}" columns

results_df, metrics_df = validate(
    source_df=df,
    fields=["Diagnosis", "Treatment"],  # or None for auto-detection
    structure_callback=None,
    output_folder="validation_results",
)
```
### Mode 2: Live LLM Inference + Validation
```python
from src.structured import StructuredResult, StructuredGroup, StructuredField
from src.utils import flatten_structured_result

def llm_callback(row, i, raw_text_column_name):
    raw_text = row[raw_text_column_name]
    # Your LLM inference logic here
    result = StructuredResult(
        groups=[StructuredGroup(
            group_name="medical",
            fields=[
                StructuredField(name="Diagnosis", value="Cancer", confidence="High"),
                StructuredField(name="Treatment", value=["Drug A"], confidence="Medium"),
            ],
        )]
    )
    return flatten_structured_result(result), {}

results_df, metrics_df = validate(
    source_df=df,
    fields=["Diagnosis", "Treatment"],
    structure_callback=llm_callback,
    raw_text_column_name="medical_report",
    output_folder="validation_results",
    max_workers=4,
)
```
## 📊 Input Data Requirements

### DataFrame Format
- **Unique index** - Each row must have a unique identifier (e.g., "Patient ID")
- **Label columns** - Ground truth values for each field you want to validate
- **Result columns** (Mode 1 only) - LLM predictions as `Res: {Field Name}` columns
- **Raw text column** (Mode 2 only) - Source text for LLM inference (e.g., "medical_report")
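Putting those requirements together, a minimal Mode 1 DataFrame might look like the following sketch (field names and values are illustrative, not part of the framework):

```python
import pandas as pd

# Illustrative Mode 1 input: each validated field has a ground-truth label
# column plus a matching "Res: {Field Name}" prediction column, and the
# index ("Patient ID") is unique.
df = pd.DataFrame(
    {
        "Diagnosis": ["Lung Cancer", "Breast Cancer"],      # ground-truth labels
        "Res: Diagnosis": ["Lung Cancer", "Colon Cancer"],  # LLM predictions
        "Has metastasis": [True, False],
        "Res: Has metastasis": [True, True],
    },
    index=pd.Index(["P001", "P002"], name="Patient ID"),
)
```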
### Supported Field Types
| Type | Description | Label Examples | Result Examples |
|---|---|---|---|
| Binary | True/False detection | `True`, `False` | `True`, `False` |
| Scalar | Single text/numeric value | `"Lung Cancer"`, `42` | `"Breast Cancer"`, `38` |
| List | Multiple values | `["Drug A", "Drug B"]`, `"['Item1', 'Item2']"` | `["Drug A"]`, `[]` |
### Special Value Handling

- `"-"` = Labeled as "No information is available in the source document"
- `null`/empty/`NaN` = Field not labeled/evaluated (supports partial labeling where different cases may have labels for different field subsets)
- **Lists** - Can be Python lists `["a", "b"]` or stringified `"['a', 'b']"` (auto-converted)
### Partial Labeling Support
The framework supports partial labeling scenarios where:
- Not every case needs labels for every field
- Different cases can have labels for different subsets of fields
- Missing labels (`null`/`NaN`) are handled gracefully in all metrics calculations
- Use `"-"` when the document explicitly lacks information about a field
- Use `null`/`NaN` when the field simply wasn't labeled for that case
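As a sketch, the three label states might appear in a dataset like this (values are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative partial labeling: "-" means the document explicitly lacks the
# information; NaN means the field was never labeled for that case; lists may
# be Python lists or stringified lists (auto-converted by the framework).
labels = pd.DataFrame(
    {
        "Diagnosis": ["Lung Cancer", "-", np.nan],
        "Treatment Drugs": [["Drug A", "Drug B"], np.nan, "['Drug C']"],
    },
    index=pd.Index(["P001", "P002", "P003"], name="Patient ID"),
)
```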
## 📁 Output Files

The `validate()` function generates two timestamped CSV files for each validation run (the demo additionally writes a CI metrics file via `bootstrap_CI()`):

### 1. Results CSV (`YYYY-MM-DD HH-MM-SS results.csv`)
Row-level analysis with detailed per-case metrics:
**Original Data:**
- All input columns (labels, raw text, etc.)
- `Res: {Field}` columns with LLM predictions
- `Res: {Field} confidence` and `Res: {Field} justification` (if available)

**Binary Fields:**
- `TP/FP/FN/TN: {Field}` - Confusion matrix counts (1 or 0 per row)

**Non-Binary Fields:**
- `Cor/Inc/Mis/Spu: {Field}` - Item counts per row
- `Cor/Inc/Mis/Spu: {Field} items` - Actual item lists
- `Precision/Recall/F1/F2: {Field}` - Per-row metrics (list fields only)

**System Columns:**
- `Sys: from cache` - Whether the result was served from cache (speeds up duplicate text)
- `Sys: exception` - Error information if processing failed
- `Sys: time taken` - Processing time per row in seconds
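Because `validate()` returns the results as a DataFrame, the system columns can be inspected directly; for example, a quick post-run health check might look like this (column names as documented above):

```python
# Quick post-run health check using the documented system columns.
failed = results_df[results_df["Sys: exception"].notna()]
print(f"{len(failed)} row(s) raised exceptions")
print(f"Mean processing time: {results_df['Sys: time taken'].mean():.2f}s per row")
```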
### 2. Metrics CSV (`YYYY-MM-DD HH-MM-SS metrics.csv`)
Aggregated statistics with confidence breakdowns:
**Core Information:**
- `field` - Field name being evaluated
- `confidence` - Confidence level ("Overall", "High", "Medium", "Low", etc.)
- `labeled cases` - Total rows with ground truth labels
- `field-present cases` - Rows where the document has information about the field (label is not `"-"`)

**Binary Metrics:** TP, TN, FP, FN, precision, recall, F1/F2, accuracy, specificity

**Non-Binary Metrics:** cor, inc, mis, spu, precision/recall/F1/F2 (micro), precision/recall/F1/F2 (macro)
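Micro aggregation pools the item counts across all rows before computing a metric, while macro averages the per-row metrics; the two can differ noticeably when item counts are uneven across cases. A rough sketch of the distinction (illustrative numbers, not the library's internal code):

```python
# Per-case (Cor, Spu) counts for one hypothetical list field.
rows = [(9, 1), (0, 1)]

# Micro: pool counts across cases, then compute once.
total_cor = sum(c for c, _ in rows)
total_spu = sum(s for _, s in rows)
micro_precision = total_cor / (total_cor + total_spu)   # 9 / 11 ≈ 0.82

# Macro: compute per case, then average.
per_case = [c / (c + s) for c, s in rows if (c + s) > 0]
macro_precision = sum(per_case) / len(per_case)         # (0.9 + 0.0) / 2 = 0.45
```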
## ⚡ Performance Metrics Explained

### Binary Classification Metrics

For fields with True/False values (e.g., "Has metastasis"):

#### Confusion Matrix Counts
| Count | Definition | Example |
|---|---|---|
| TP (True Positive) | Correctly predicted positive | Label: True, Prediction: True → TP=1 |
| TN (True Negative) | Correctly predicted negative | Label: False, Prediction: False → TN=1 |
| FP (False Positive) | Incorrectly predicted positive | Label: False, Prediction: True → FP=1 |
| FN (False Negative) | Incorrectly predicted negative | Label: True, Prediction: False → FN=1 |
#### Binary Classification Formulas
| Metric | Formula | Meaning |
|---|---|---|
| Precision | `TP / (TP + FP)` | Of all positive predictions, how many were correct? |
| Recall | `TP / (TP + FN)` | Of all actual positives, how many were found? |
| Accuracy | `(TP + TN) / (TP + TN + FP + FN)` | Overall percentage of correct predictions |
| Specificity | `TN / (TN + FP)` | Of all actual negatives, how many were correctly identified? |
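Transcribed into a small helper (a sketch of the formulas above, not the framework's internal code):

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute binary classification metrics from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }

binary_metrics(tp=8, tn=5, fp=1, fn=2)
# ≈ {'precision': 0.889, 'recall': 0.8, 'accuracy': 0.8125, 'specificity': 0.833}
```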
### Structured Extraction Metrics

For scalar and list fields (e.g., "Diagnosis", "Treatment Drugs"):

#### Core Counts (Per-Case Analysis)
| Count | Definition | Example |
|---|---|---|
| Correct (Cor) | Items extracted correctly | Label: ["DrugA", "DrugB"], Prediction: ["DrugA"] → Cor=1 |
| Missing (Mis) | Items present in label but not extracted | (Same example) → Mis=1 (DrugB missing) |
| Spurious (Spu) | Items extracted but not in label | Label: ["DrugA"], Prediction: ["DrugA", "DrugC"] → Spu=1 |
| Incorrect (Inc) | Wrong values for scalar fields | Label: "Cancer", Prediction: "Diabetes" → Inc=1 |
#### Structured Extraction Formulas
| Metric | Formula | Meaning |
|---|---|---|
| Precision | `Cor / (Cor + Spu + Inc)` | Of all extracted items, how many were correct? |
| Recall | `Cor / (Cor + Mis + Inc)` | Of all labeled items, how many were correctly extracted? |
Note: For scalar fields, Inc (incorrect) is used; for list fields, Inc is typically 0 since items are either correct, missing, or spurious.
The following formulas apply to both binary classification and structured extraction metrics:
| Metric | Formula | Meaning |
|---|---|---|
| F1 Score | `2 × (P × R) / (P + R)` | Balanced harmonic mean of precision and recall |
| F2 Score | `5 × (P × R) / (4P + R)` | Recall-weighted F-score (emphasizes recall over precision) |
Where P = Precision and R = Recall (calculated differently for each metric type).
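For concreteness, both F-scores as a small helper (a sketch; the framework computes these internally):

```python
def f_scores(precision: float, recall: float) -> tuple[float, float]:
    """F1 and F2 from precision and recall, per the formulas above."""
    if precision + recall == 0:
        return 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    f2 = 5 * precision * recall / (4 * precision + recall)
    return f1, f2

f_scores(0.8, 0.5)  # ≈ (0.615, 0.541); note F2 sits closer to recall than F1 does
```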
### Bootstrap Confidence Intervals
The framework includes statistical confidence interval estimation using non-parametric bootstrap resampling at the case level. This provides uncertainty quantification for all validation metrics.
#### Usage
```python
from src.validation import bootstrap_CI

# After running validation to get results_df
ci_results = bootstrap_CI(
    res_df=results_df,                  # Results from the validate() function
    fields=["diagnosis", "treatment"],  # Fields to analyze (or None for auto-detect)
    n_bootstrap=5000,                   # Number of bootstrap samples (default: 5000)
    ci=0.95,                            # Confidence level (default: 0.95 for 95% CI)
    random_state=42,                    # For reproducible results
)
```
#### Bootstrap Method

- **Resampling unit:** Individual cases (not individual predictions)
- **Resampling strategy:** Sample with replacement to preserve the original dataset size
- **CI calculation:** Percentile method using the bootstrap distribution
- **Partial labeling:** Handles missing labels gracefully; cases with missing labels for specific fields are excluded from calculations for those fields only
- **Metrics included:** All validation metrics (precision, recall, F1, accuracy, etc.)
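The procedure has the following general shape (an illustrative sketch of a case-level percentile bootstrap, not the library's actual implementation):

```python
import numpy as np

def percentile_bootstrap_ci(per_case_values, n_bootstrap=5000, ci=0.95, random_state=42):
    """CI for the mean of a per-case metric via percentile bootstrap (sketch)."""
    rng = np.random.default_rng(random_state)
    values = np.asarray(per_case_values, dtype=float)
    stats = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # Resample whole cases with replacement, preserving the dataset size.
        sample = rng.choice(values, size=len(values), replace=True)
        stats[b] = np.nanmean(sample)  # recompute the metric on the resample
    alpha = (1.0 - ci) / 2.0
    lower, upper = np.quantile(stats, [alpha, 1.0 - alpha])
    return float(stats.mean()), float(lower), float(upper)
```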
#### Output Format

The `bootstrap_CI()` function returns a DataFrame with confidence intervals for each field:
| Column | Description |
|---|---|
| `field` | Field name (including `exceptions` for system metrics and `N={n}; CI={level}%` for parameters) |
| `labeled cases` | Number of labeled cases in the dataset |
| `{metric}: mean` | Bootstrap mean estimate |
| `{metric}: lower` | Lower bound of the confidence interval |
| `{metric}: upper` | Upper bound of the confidence interval |
Example output:
```
            field  labeled cases  precision (micro): mean  precision (micro): lower  precision (micro): upper
0      exceptions           1000                      NaN                       NaN                       NaN
1       diagnosis           1000                     0.82                      0.79                      0.85
2       treatment           1000                     0.91                      0.88                      0.94
3  N=5000; CI=95%            NaN                      NaN                       NaN                       NaN
```
The final row contains bootstrap parameters for reference: sample size (N) and confidence interval level (CI).
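A single interval can be pulled out programmatically with standard pandas indexing (a sketch using the documented column names):

```python
# 95% CI for diagnosis micro-precision, read from the documented columns.
row = ci_results[ci_results["field"] == "diagnosis"].iloc[0]
print(f"precision (micro): {row['precision (micro): mean']:.2f} "
      f"[{row['precision (micro): lower']:.2f}, {row['precision (micro): upper']:.2f}]")
```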
#### Use Cases
- Performance assessment: Quantify uncertainty in reported metrics
- Model comparison: Determine if performance differences are statistically significant
- Sample size planning: Understand precision of estimates with current dataset size
- Publication: Report confidence intervals alongside point estimates
## 🛠️ Advanced Configuration

### Parallel Processing
```python
validate(
    source_df=df,
    fields=["diagnosis", "treatment"],
    structure_callback=callback,
    max_workers=None,   # Auto-detect CPU count (or specify a number)
    use_threads=True,   # True for I/O-bound (LLM API calls), False for CPU-bound
)
```
### Performance Features

- **Automatic caching** - Identical raw text inputs are deduplicated and cached
- **Progress tracking** - Real-time progress bar for long-running validations
- **Cache statistics** - Check the `Sys: from cache` column in the results to monitor cache hits
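For example, the cache hit rate for a run can be read straight off the results (assuming the documented `Sys: from cache` column holds booleans):

```python
# Fraction of rows served from the cache rather than recomputed.
hit_rate = results_df["Sys: from cache"].mean()
print(f"Cache hit rate: {hit_rate:.1%}")
```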
### Confidence Analysis

When LLM inference returns both extracted fields and their associated confidence levels, the framework automatically detects `Res: {Field} confidence` columns and generates:
- Separate metrics for each unique confidence level found in your data
- Overall metrics aggregating across all confidence levels

This breakdown is useful for setting confidence thresholds and analyzing prediction reliability.
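A per-field confidence breakdown can then be pulled from the metrics DataFrame, e.g. (a sketch; the exact metric columns depend on the field type, such as `precision` vs. `precision (micro)`):

```python
# Compare rows of the metrics table across confidence levels for one field.
diagnosis = metrics_df[metrics_df["field"] == "Diagnosis"]
print(diagnosis[["confidence", "labeled cases"]])
```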
## 🧪 Development & Testing

```bash
# Install development dependencies
pip install -r requirements.txt

# Run all tests
pytest

# Run with coverage reporting
pytest --cov=src

# Run specific test modules
pytest tests/validate_test.py             # Core validation logic
pytest tests/compare_results_test.py      # Comparison algorithms
pytest tests/compare_results_all_test.py  # End-to-end comparisons
```
## 📂 Project Structure

```
llm-validation-framework/
├── src/
│   └── llmvalidate/
│       ├── validation.py    # Main validation pipeline and metrics calculation
│       ├── structured.py    # Pydantic data models for LLM results
│       ├── utils.py         # Utility functions (list conversion, flattening)
│       └── standardize.py   # Data standardization helpers
├── tests/                   # Comprehensive test suite
├── validation_results/      # Output directory (auto-created)
├── samples.csv              # Demo dataset with all validation scenarios
├── runme.py                 # Demo script
└── requirements.txt         # Dependencies (pandas, pydantic, tqdm, etc.)
```
## 🔧 Troubleshooting
| Error | Solution |
|---|---|
| "Cannot infer fields" | Ensure DataFrame has both {Field} and Res: {Field} columns when structure_callback=None |
| "Missing fields" | Verify fields parameter contains column names that exist in your DataFrame |
| "Duplicate index" | Use df.reset_index(drop=True) or ensure your DataFrame index has unique values |
| Import/dependency errors | Run pip install -r requirements.txt and verify Python 3.11+ |
| Slow performance | Enable parallel processing with max_workers=None and use_threads=True for LLM API calls |
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.