Multi-turn LLM Conversation Consistency Metric
TRACE Score
A unified, deterministic, reference-free evaluation metric for multi-turn conversational consistency in large language models (LLMs).
The Problem
BLEU, ROUGE, BERTScore, and RAGAS evaluate each conversation turn in isolation. They cannot detect failures that span multiple turns:
| Failure Type | Example | BLEU | ROUGE | BERTScore | TRACE |
|---|---|---|---|---|---|
| Fact forgotten | User says "I am diabetic" at turn 1 → model recommends sugary food at turn 6 | Miss | Miss | Miss | Catch |
| Correction ignored | User corrects model → model reverts to old behavior next turn | Miss | Miss | Miss | Catch |
| Self-contradiction | Model says X at turn 2, contradicts X at turn 7 | Miss | Miss | Miss | Catch |
| Topic drift | Conversation drifts off topic over multiple turns | Miss | Miss | Miss | Catch |
Formula
TRACE(C) = Σ(wᵢ · Sᵢ) − λ·P − δ·V + α·(T·C) + β·(A·R)
Time-decay aggregation weights recent turns more heavily. For component i with per-turn scores Sᵢ,ₜ over turns t = 1…N:
Sᵢ = (1/Z) · Σₜ γ^(N−t) · Sᵢ,ₜ
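The time-decay aggregation can be sketched in a few lines of Python. This is a minimal illustration, not the package's implementation: it assumes turns are indexed t = 1…N and that Z is the sum of the decay weights, which the formula does not state explicitly. The function name `time_decay_aggregate` is hypothetical.

```python
def time_decay_aggregate(turn_scores, gamma=0.80):
    """Aggregate per-turn scores so that later turns weigh more.

    Assumption: Z is the sum of the weights gamma^(N - t), so the
    result stays in the same range as the per-turn scores.
    """
    n = len(turn_scores)
    if n == 0:
        raise ValueError("need at least one per-turn score")
    # Turn t (1-indexed) gets weight gamma^(N - t); the last turn gets weight 1.
    weights = [gamma ** (n - t) for t in range(1, n + 1)]
    z = sum(weights)  # assumed normalizer Z
    return sum(w * s for w, s in zip(weights, turn_scores)) / z


# A failure on the last turn costs more than the same failure on the first turn.
late_failure = time_decay_aggregate([1.0, 0.0])
early_failure = time_decay_aggregate([0.0, 1.0])
print(late_failure < early_failure)  # True
```

With γ=0.80 and two turns, the weights are 0.8 and 1.0, so a failure on the final turn pulls the aggregate below 0.5 while a failure on the first turn leaves it above 0.5.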
| Component | Measures |
|---|---|
| T — Temporal Retention | Did the model remember user-stated facts? |
| R — Reliability Consistency | Did the model contradict itself? |
| A — Adaptive Correction | Did the model retain user corrections? |
| C — Context Coherence | Did the conversation stay on topic? |
| E — Epistemic Stability | Did the model's confidence stay calibrated? |
Defaults: γ=0.80, λ=0.15, δ=0.10, α=0.05, β=0.05, and wᵢ=0.20 for each of the five components (so the weights sum to 1).
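To make the arithmetic concrete, the sketch below plugs component scores into the formula with the default coefficients. It is an illustration under assumptions: the sum runs over the five components T, R, A, C, E; P and V are passed in as plain numbers because their definitions are not given in this section; and the helper name `combine_trace` is hypothetical, not part of the `trace_score` API. The real package may also clip or rescale the result, which this sketch does not do.

```python
def combine_trace(T, R, A, C, E, P=0.0, V=0.0,
                  w=0.20, lam=0.15, delta=0.10, alpha=0.05, beta=0.05):
    """Evaluate sum(w_i * S_i) - lam*P - delta*V + alpha*(T*C) + beta*(A*R)."""
    weighted_sum = w * (T + R + A + C + E)           # equal weights w_i = 0.20
    penalties = lam * P + delta * V                  # P and V supplied by the caller
    interactions = alpha * (T * C) + beta * (A * R)  # pairwise bonus terms
    return weighted_sum - penalties + interactions


# Perfect component scores and no penalties: the interaction terms add 0.10.
print(round(combine_trace(1, 1, 1, 1, 1), 2))
# The same scores with P = 1 lose lam = 0.15.
print(round(combine_trace(1, 1, 1, 1, 1, P=1.0), 2))
```

Keeping P and V as explicit arguments makes it easy to see how much each coefficient moves the final score.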
Install
pip install trace-score
Quick Start
from trace_score import compute_TRACE

# Each turn is a (role, text) tuple.
conversation = [
    ("user", "I am diabetic and allergic to nuts."),
    ("assistant", "I will suggest safe low-sugar options."),
    ("user", "Actually I eat fish too. I am pescatarian."),
    ("assistant", "Spicy chicken with cashews would be great!"),
]

result = compute_TRACE(conversation, verbose=True)
print(result["trace_score"])        # 0.41
print(result["A"])                  # 0.00 (correction ignored)
print(result["interpretation"])     # Poor consistency
print(result["formula_breakdown"])
Batch Evaluation
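from trace_score import TRACEEvaluator

# `conversations` is a list of conversations, each in the (role, text) format shown in Quick Start.
evaluator = TRACEEvaluator()  # models loaded once
results = [evaluator.evaluate(conv) for conv in conversations]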
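Once a batch has been scored, the per-conversation results usually need to be summarized. The sketch below averages selected fields across result dictionaries. It assumes that `evaluator.evaluate` returns a dict with the same keys as `compute_TRACE` (such as `"trace_score"` and `"A"`), which this page does not confirm; the helper `summarize` is hypothetical, and the stand-in data lets the example run without the package installed.

```python
from statistics import mean


def summarize(results, keys=("trace_score", "A")):
    """Average the selected fields across per-conversation result dicts."""
    return {k: mean(r[k] for r in results) for k in keys}


# Stand-in results; with the package installed these would come from
# evaluator.evaluate(conv), assuming it returns these keys.
fake_results = [
    {"trace_score": 0.40, "A": 0.0},
    {"trace_score": 0.80, "A": 1.0},
]
summary = summarize(fake_results)
print({k: round(v, 2) for k, v in summary.items()})
```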
Benchmark Results
Evaluated on 102 multi-turn conversations (34 templates × 3 runs) generated by Llama-3.1-8B via the Groq API.
Overall Metric Comparison
| Category | n | TRACE | BLEU | ROUGE-L | BERTScore |
|---|---|---|---|---|---|
| Fact Memory | 36 | 0.688 | 0.048 | 0.172 | 0.796 |
| Correction | 36 | 0.632 | 0.183 | 0.321 | 0.840 |
| Contradiction | 30 | 0.871 | 0.124 | 0.255 | 0.837 |
| Overall | 102 | 0.721 | 0.108 | 0.236 | 0.822 |
Category separation range (highest minus lowest category mean): TRACE 0.239, BERTScore 0.044. TRACE separates the categories about 5.4x more widely than BERTScore.
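The separation figures can be reproduced from the category rows of the table above. The sketch below takes the range as the highest category mean minus the lowest, which matches the reported numbers; the helper name `separation_range` is hypothetical.

```python
# Category means copied from the Overall Metric Comparison table.
category_means = {
    "TRACE": {"Fact Memory": 0.688, "Correction": 0.632, "Contradiction": 0.871},
    "BERTScore": {"Fact Memory": 0.796, "Correction": 0.840, "Contradiction": 0.837},
}


def separation_range(means):
    """Spread between the best- and worst-scoring category."""
    values = list(means.values())
    return max(values) - min(values)


trace_range = separation_range(category_means["TRACE"])
bert_range = separation_range(category_means["BERTScore"])
ratio = trace_range / bert_range
print(f"{trace_range:.3f} {bert_range:.3f} {ratio:.1f}x")  # 0.239 0.044 5.4x
```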
TRACE Component Breakdown by Category
| Category | T | R | A | C | E |
|---|---|---|---|---|---|
| Fact Memory | 0.137 | 0.955 | 1.000 | 0.503 | 0.697 |
| Correction | 0.491 | 0.927 | 0.144 | 0.465 | 0.712 |
| Contradiction | 0.973 | 0.875 | 0.900 | 0.510 | 0.696 |
The A component (Adaptive Correction) drops to 0.144 for Correction conversations, indicating that Llama-3.1-8B fails to retain user corrections in roughly 85.6% of cases (1 − 0.144). BERTScore scores the same conversations at 0.840, so this failure is invisible to the per-turn metrics compared here.
The Gap TRACE Reveals
Conversations where BERTScore is high but TRACE is low:
| Category | TRACE | BERTScore | Gap |
|---|---|---|---|
| Correction | 0.314 | 0.876 | 0.562 |
| Correction | 0.381 | 0.861 | 0.480 |
| Correction | 0.442 | 0.822 | 0.380 |
| Correction | 0.494 | 0.864 | 0.370 |
| Correction | 0.535 | 0.884 | 0.349 |
In all five cases, A=0.00: the model acknowledged the correction but failed to retain it. BERTScore sees fluent per-turn outputs and reports high scores. TRACE sees the cross-turn failure.
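A divergence check of this kind is easy to script. The sketch below recomputes the Gap column from the rows above and flags conversations where BERTScore exceeds TRACE by more than a threshold. The function names and the 0.30 threshold are hypothetical choices for illustration, not values used by the author.

```python
# (TRACE, BERTScore) pairs copied from the table above.
rows = [
    (0.314, 0.876),
    (0.381, 0.861),
    (0.442, 0.822),
    (0.494, 0.864),
    (0.535, 0.884),
]


def score_gaps(pairs):
    """BERTScore minus TRACE for each conversation, rounded to 3 decimals."""
    return [round(bert - trace, 3) for trace, bert in pairs]


def flag_divergent(pairs, min_gap=0.30):
    """Indices of conversations whose gap exceeds min_gap (hypothetical threshold)."""
    return [i for i, gap in enumerate(score_gaps(pairs)) if gap > min_gap]


gaps = score_gaps(rows)
print(gaps)
print(flag_divergent(rows))
```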
Human Evaluation
102 conversations were rated by 3 annotators (34 conversations each) on 5 consistency dimensions, each on a 1-5 scale: Q1 Memory, Q2 No-Contradiction, Q3 Correction, Q4 Coherence, Q5 Overall.
| Annotator | n | Q1 | Q2 | Q3 | Q4 | Q5 |
|---|---|---|---|---|---|---|
| Girinath V | 34 | 4.50 | 4.41 | 4.35 | 4.47 | 4.35 |
| Hari V | 34 | 4.09 | 4.29 | 4.26 | 4.26 | 4.21 |
| Kaarthic VR | 34 | 4.85 | 4.88 | 4.94 | 4.97 | 4.85 |
| Combined | 102 | 4.48 | 4.53 | 4.52 | 4.57 | 4.47 |
Human overall mean (Q5): 4.47/5, or 0.868 when normalized to [0,1] as (4.47 − 1)/4.
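The combined and normalized figures can be checked with a short script. The sketch assumes a min-max mapping of the 1-5 scale onto [0,1], which is consistent with the reported 0.868 but is not stated explicitly in the source; `normalize_likert` is a hypothetical helper name.

```python
def normalize_likert(score, low=1, high=5):
    """Map a rating on [low, high] onto [0, 1] (assumed min-max scaling)."""
    return (score - low) / (high - low)


# Q5 means per annotator; each rated 34 conversations, so a plain mean suffices.
q5_means = [4.35, 4.21, 4.85]
combined_q5 = sum(q5_means) / len(q5_means)

print(round(combined_q5, 2))   # 4.47
print(normalize_likert(4.47))  # about 0.8675, reported as 0.868
```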
Why TRACE?
| Metric | Multi-turn | Reference-free | Deterministic | Time-decay | Diagnostic |
|---|---|---|---|---|---|
| BLEU | No | No | Yes | No | No |
| ROUGE | No | No | Yes | No | No |
| BERTScore | No | No | Yes | No | No |
| RAGAS | No | Yes | No | No | Partial |
| TRACE | Yes | Yes | Yes | Yes | Yes |
Models Used
| Model | Purpose | Size |
|---|---|---|
| all-MiniLM-L6-v2 | Semantic similarity (T, A, C, E) | 80MB |
| cross-encoder/nli-deberta-v3-small | Contradiction detection (R, A) | 184MB |
Models download automatically on first use. CPU-friendly, no GPU required.
Citation
@article{girinathv2026trace,
  title  = {TRACE: A Unified Deterministic Metric for Multi-turn Conversational Consistency in Large Language Models},
  author = {Girinath V},
  year   = {2026}
}
Author: Girinath V
GitHub: https://github.com/Giri530/trace-score