Multi-turn LLM Conversation Consistency Metric
Project description
TRACE Score
Multi-turn LLM Conversation Consistency Metric
The first unified, deterministic, reference-free evaluation metric for multi-turn conversational consistency in Large Language Models.
The Problem
Existing metrics (BLEU, ROUGE, BERTScore, RAGAS) evaluate each conversation turn in isolation. They cannot detect failures that only become visible across multiple turns:
| Failure Type | Example | BLEU | ROUGE | BERTScore | TRACE |
|---|---|---|---|---|---|
| Fact forgotten | User says "I am diabetic" → model recommends sugar-rich food 5 turns later | Miss | Miss | Miss | Catch |
| Correction ignored | User corrects model → model reverts to old behavior | Miss | Miss | Miss | Catch |
| Self-contradiction | Model says X at turn 2, contradicts X at turn 7 | Miss | Miss | Miss | Catch |
| Topic drift | Conversation gradually drifts off-topic | Miss | Miss | Miss | Catch |
| Confidence drift | Model says "definitely" then "perhaps" about same claim | Miss | Miss | Miss | Catch |
Formula
TRACE(C) = Σ(wᵢ · Sᵢ) − λ·P − δ·V + α·(T·C) + β·(A·R)
Each component uses time-decay aggregation — recent turns weighted more:
Sᵢ = (1/Z) · Σ γ^(N-t) · Sᵢ,ₜ
Z = Σ γ^(N-t)
| Symbol | Component | Measures |
|---|---|---|
| T | Temporal Retention | Did assistant remember user-stated facts? |
| R | Reliability Consistency | Did assistant contradict itself? |
| A | Adaptive Correction | Did assistant retain user corrections? |
| C | Context Coherence | Did conversation stay on topic? |
| E | Epistemic Stability | Did confidence stay calibrated? |
| P | Contradiction penalty | Global contradiction rate |
| V | Variance penalty | Confidence variance |
| γ | Time decay factor | Default: 0.80 |
| λ | Contradiction weight | Default: 0.15 |
| δ | Variance weight | Default: 0.10 |
| α | T·C interaction | Default: 0.05 |
| β | A·R interaction | Default: 0.05 |
Install
pip install trace-score
Quick Start
from trace_score import compute_TRACE
conversation = [
("user", "I am diabetic and hate spicy food"),
("assistant", "I will suggest low sugar mild options."),
("user", "Actually I eat fish too. I am pescatarian."),
("assistant", "Spicy chicken with cashews!"), # failure turn
]
result = compute_TRACE(conversation, verbose=True)
print(result["trace_score"]) # 0.41 — catches failures
print(result["T"]) # 0.50 — forgot user facts
print(result["A"]) # 0.00 — ignored correction
print(result["formula_breakdown"]) # full formula with values
print(result["interpretation"]) # "Poor consistency"
Batch Evaluation
from trace_score import TRACEEvaluator
# Models loaded once, reused across all calls — much faster
evaluator = TRACEEvaluator()
results = [evaluator.evaluate(conv) for conv in conversations]
Adaptive Weights
# Equal weights (default)
result = compute_TRACE(conv, preset="equal")
# Medical chatbot — memory and reliability weighted more
result = compute_TRACE(conv, preset="medical_chatbot")
# Custom weights — must sum to 1.0
result = compute_TRACE(conv, weights={
"w_T": 0.35, "w_R": 0.25,
"w_A": 0.20, "w_C": 0.10, "w_E": 0.10
})
Available presets: equal, customer_service, technical_qa,
medical_chatbot, education_tutor
Benchmark Results
Evaluated on 30 multi-turn conversations across 3 categories (Fact Memory, Correction Retention, Contradiction Detection). Conversations generated by Llama-3.1-8B via Groq API.
Overall Metric Comparison
| Metric | Overall | Fact Memory | Correction | Contradiction |
|---|---|---|---|---|
| TRACE | 0.699 | 0.703 | 0.550 | 0.843 |
| BLEU | 0.102 | 0.046 | 0.149 | 0.110 |
| ROUGE-L | 0.239 | 0.177 | 0.301 | 0.239 |
| BERTScore | 0.822 | 0.800 | 0.842 | 0.823 |
Key finding: BLEU and ROUGE-L show similar low scores across all categories — they cannot distinguish between different types of consistency failures. BERTScore appears high but provides no diagnostic breakdown. TRACE clearly separates Correction (0.550) from Contradiction (0.843), revealing that Llama-3.1-8B struggles most with retaining user corrections.
TRACE Component Breakdown by Category
| Category | T | R | A | C | E |
|---|---|---|---|---|---|
| Fact Memory | 0.137 | 0.955 | 1.000 | 0.503 | 0.697 |
| Correction | 0.491 | 0.927 | 0.144 | 0.465 | 0.712 |
| Contradiction | 0.973 | 0.875 | 0.900 | 0.510 | 0.696 |
Diagnostic insight:
- Fact Memory: T=0.137 — model forgets user-stated facts (A=1.0 means no corrections were needed, so A is vacuously true here)
- Correction: A=0.144 — model ignores user corrections (critical failure)
- Contradiction: T=0.973, A=0.900 — model handles these well
No existing metric (BLEU, ROUGE, BERTScore) can produce this breakdown.
The Gap TRACE Reveals — BERTScore vs TRACE
Conversations where BERTScore is high but TRACE is low (failures invisible to BERTScore, caught by TRACE):
| Conversation | Category | TRACE | BERTScore | Gap |
|---|---|---|---|---|
| CR_006 | Correction | 0.314 | 0.876 | +0.562 |
| CR_009 | Correction | 0.381 | 0.861 | +0.480 |
| CR_004 | Correction | 0.535 | 0.884 | +0.349 |
| CR_003 | Correction | 0.494 | 0.864 | +0.370 |
| CR_002 | Correction | 0.442 | 0.822 | +0.380 |
In all 5 cases: BERTScore ≥ 0.82 (looks good), TRACE < 0.55 (failures detected). The A component reveals why — user corrections completely ignored (A=0.00). This is invisible to any per-turn metric.
Why TRACE?
| Metric | Multi-turn | Reference-free | Deterministic | Time-decay | Diagnostic |
|---|---|---|---|---|---|
| BLEU | No | No | Yes | No | No |
| ROUGE | No | No | Yes | No | No |
| BERTScore | No | No | Yes | No | No |
| RAGAS | No | Yes | No | No | Partial |
| TRACE | Yes | Yes | Yes | Yes | Yes |
Models Used
| Model | Purpose | Size |
|---|---|---|
all-MiniLM-L6-v2 |
Semantic similarity (T, A, C, E) | 80MB |
cross-encoder/nli-deberta-v3-small |
Contradiction detection (R, A) | 184MB |
Models downloaded automatically on first use (~264MB total). CPU-friendly — no GPU required.
Citation
@article{girinathv2026trace,
title = {TRACE: A Unified Deterministic Metric for Multi-turn
Conversational Consistency in Large Language Models},
author = {Girinath, V},
year = {2026}
}
Author: Girinath V GitHub: https://github.com/Giri530/trace-score
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file trace_score-0.1.0.tar.gz.
File metadata
- Download URL: trace_score-0.1.0.tar.gz
- Upload date:
- Size: 15.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c8ff969e4f38e0784ba980ba9d1dd2a3f1b6cb840958919371c402aa9c7c7c52
|
|
| MD5 |
0e15e3b0b704320c112250e3f6ce15bc
|
|
| BLAKE2b-256 |
d7b08e04d158afed782a363bbf3009cbd0973a6747ba80403c1d5da9bf9195d0
|
File details
Details for the file trace_score-0.1.0-py3-none-any.whl.
File metadata
- Download URL: trace_score-0.1.0-py3-none-any.whl
- Upload date:
- Size: 15.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
defb7db98193571c79a550cf64e7343d4fcf189d386f50647e4741beab775d57
|
|
| MD5 |
1dc24ec35710b3710fa04350daf16cdf
|
|
| BLAKE2b-256 |
0c8ac0c2fc272df29e30a408563d4e7ab4a18d3b8d7feccace2b10af9eab79c3
|