Multi-turn LLM Conversation Consistency Metric

TRACE Score

A unified, deterministic, reference-free metric for evaluating multi-turn conversational consistency in Large Language Models.

License: MIT | Python 3.8+


The Problem

BLEU, ROUGE, BERTScore, and RAGAS evaluate each conversation turn in isolation. They cannot detect failures that span multiple turns:

| Failure Type | Example | BLEU | ROUGE | BERTScore | TRACE |
| --- | --- | --- | --- | --- | --- |
| Fact forgotten | User says "I am diabetic" at turn 1 → model recommends sugary food at turn 6 | Miss | Miss | Miss | Catch |
| Correction ignored | User corrects model → model reverts to old behavior next turn | Miss | Miss | Miss | Catch |
| Self-contradiction | Model says X at turn 2, contradicts X at turn 7 | Miss | Miss | Miss | Catch |
| Topic drift | Conversation drifts off topic over multiple turns | Miss | Miss | Miss | Catch |

Formula

TRACE(C) = Σ(wᵢ · Sᵢ) − λ·P − δ·V + α·(T·C) + β·(A·R)

Time-decay aggregation weights recent turns more heavily:

Sᵢ = (1/Z) · Σ γ^(N-t) · Sᵢ,ₜ

Here i ranges over the five components listed below, t indexes turns 1 to N, and Z is a normalizing constant.

| Component | Measures |
| --- | --- |
| T: Temporal Retention | Did the model remember user-stated facts? |
| R: Reliability Consistency | Did the model contradict itself? |
| A: Adaptive Correction | Did the model retain user corrections? |
| C: Context Coherence | Did the conversation stay on topic? |
| E: Epistemic Stability | Did the model's confidence stay calibrated? |

Default: γ=0.80, λ=0.15, δ=0.10, α=0.05, β=0.05, all wᵢ=0.20
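
The formula and the time-decay aggregation can be sketched in a few lines of Python. This is an illustrative reading of the equations above, not the package's implementation: the function names are hypothetical, Z is assumed to be the sum of the decay weights, and P and V are taken as already-computed inputs because this section does not define them.

```python
from typing import Dict, Sequence


def time_decay_mean(turn_scores: Sequence[float], gamma: float = 0.80) -> float:
    """S_i = (1/Z) * sum_t gamma^(N-t) * S_{i,t} for turns t = 1..N.

    Assumption: Z is the sum of the decay weights, so the result is a
    weighted mean in which the most recent turn has weight 1.
    """
    n = len(turn_scores)
    if n == 0:
        raise ValueError("need at least one per-turn score")
    weights = [gamma ** (n - t) for t in range(1, n + 1)]
    z = sum(weights)
    return sum(w * s for w, s in zip(weights, turn_scores)) / z


def trace_formula(
    S: Dict[str, float],
    P: float = 0.0,   # penalty term; its definition is not given in this section
    V: float = 0.0,   # penalty term; its definition is not given in this section
    *,
    w: float = 0.20,
    lam: float = 0.15,
    delta: float = 0.10,
    alpha: float = 0.05,
    beta: float = 0.05,
) -> float:
    """TRACE = sum(w_i * S_i) - lam*P - delta*V + alpha*(T*C) + beta*(A*R)."""
    base = sum(w * S[k] for k in ("T", "R", "A", "C", "E"))
    return (
        base
        - lam * P
        - delta * V
        + alpha * (S["T"] * S["C"])
        + beta * (S["A"] * S["R"])
    )


# With gamma = 0.8, a late high score outweighs an early low one.
print(time_decay_mean([0.0, 1.0]))
```

The sketch applies no clamping or rounding; whether the package post-processes the raw value is not stated here.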


Install

pip install trace-score

Quick Start

from trace_score import compute_TRACE

conversation = [
    ("user",      "I am diabetic and allergic to nuts."),
    ("assistant", "I will suggest safe low-sugar options."),
    ("user",      "Actually I eat fish too. I am pescatarian."),
    ("assistant", "Spicy chicken with cashews would be great!"),
]

result = compute_TRACE(conversation, verbose=True)

print(result["trace_score"])        # 0.41
print(result["A"])                  # 0.00 — correction ignored
print(result["interpretation"])     # Poor consistency
print(result["formula_breakdown"])

Batch Evaluation

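from trace_score import TRACEEvaluator

# conversations: a list of conversations, each a list of (role, text)
# tuples in the same format as the Quick Start example
evaluator = TRACEEvaluator()   # models loaded once
results   = [evaluator.evaluate(conv) for conv in conversations]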
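
Once a batch has been scored, the per-conversation results can be summarized with plain Python. The helper below is a hypothetical sketch, not part of the package: it assumes each result is a dict with a "trace_score" key, as in the Quick Start output, and the 0.5 cut-off is an arbitrary illustrative threshold.

```python
from statistics import mean
from typing import Dict, List


def summarize(results: List[Dict], threshold: float = 0.5) -> Dict:
    """Aggregate a non-empty list of result dicts.

    Assumption: every dict carries a float under "trace_score".
    """
    scores = [r["trace_score"] for r in results]
    return {
        "n": len(scores),
        "mean_trace": mean(scores),
        "min_trace": min(scores),
        # indices of conversations scoring below the illustrative threshold
        "below_threshold": [i for i, s in enumerate(scores) if s < threshold],
    }


# Stand-in results so the sketch runs without loading any models.
demo = [{"trace_score": 0.41}, {"trace_score": 0.80}]
print(summarize(demo))
```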

Benchmark Results

Evaluated on 102 multi-turn conversations (34 templates × 3 runs) generated by Llama-3.1-8B via Groq API.

Overall Metric Comparison

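| Category | n | TRACE | BLEU | ROUGE-L | BERTScore |
| --- | --- | --- | --- | --- | --- |
| Fact Memory | 36 | 0.688 | 0.048 | 0.172 | 0.796 |
| Correction | 36 | 0.632 | 0.183 | 0.321 | 0.840 |
| Contradiction | 30 | 0.871 | 0.124 | 0.255 | 0.837 |
| Overall | 102 | 0.721 | 0.108 | 0.236 | 0.822 |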

TRACE category separation range (highest minus lowest category mean): 0.239
BERTScore category separation range: 0.044
TRACE separates the categories 5.4x more widely than BERTScore.
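
The separation figures can be reproduced from the per-category means in the table. The snippet below is a small arithmetic check; the helper name is hypothetical, and it assumes "separation range" means the highest category mean minus the lowest.

```python
# Per-category means copied from the Overall Metric Comparison table.
trace = {"Fact Memory": 0.688, "Correction": 0.632, "Contradiction": 0.871}
bert = {"Fact Memory": 0.796, "Correction": 0.840, "Contradiction": 0.837}


def separation_range(scores: dict) -> float:
    """Spread between the best- and worst-scoring category."""
    return max(scores.values()) - min(scores.values())


trace_range = separation_range(trace)
bert_range = separation_range(bert)
ratio = trace_range / bert_range

print(round(trace_range, 3), round(bert_range, 3), round(ratio, 1))
# prints: 0.239 0.044 5.4
```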


TRACE Component Breakdown by Category

| Category | T | R | A | C | E |
| --- | --- | --- | --- | --- | --- |
| Fact Memory | 0.137 | 0.955 | 1.000 | 0.503 | 0.697 |
| Correction | 0.491 | 0.927 | 0.144 | 0.465 | 0.712 |
| Contradiction | 0.973 | 0.875 | 0.900 | 0.510 | 0.696 |

The A component (Adaptive Correction) drops to 0.144 for Correction conversations, indicating that Llama-3.1-8B ignores user corrections 85.6% of the time (1 − 0.144). BERTScore scores the same conversations at 0.840, so this failure is invisible to per-turn metrics.
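
The component breakdown can also be read programmatically to find where each category fails. This is a minimal sketch over the table values, with a hypothetical helper name; it picks the lowest-scoring component per category.

```python
# Component means copied from the breakdown table.
breakdown = {
    "Fact Memory":   {"T": 0.137, "R": 0.955, "A": 1.000, "C": 0.503, "E": 0.697},
    "Correction":    {"T": 0.491, "R": 0.927, "A": 0.144, "C": 0.465, "E": 0.712},
    "Contradiction": {"T": 0.973, "R": 0.875, "A": 0.900, "C": 0.510, "E": 0.696},
}


def weakest_component(components: dict) -> str:
    """Return the key of the lowest-scoring component."""
    return min(components, key=components.get)


weakest = {cat: weakest_component(c) for cat, c in breakdown.items()}
print(weakest)

# Share of the A score lost on Correction conversations (1 - A).
correction_miss = 1 - breakdown["Correction"]["A"]
print(f"{correction_miss:.1%}")
```

On these numbers the weakest component is T for Fact Memory and A for Correction, which matches the failure type each category is designed to probe.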


The Gap TRACE Reveals

Conversations where BERTScore is high but TRACE is low:

| Category | TRACE | BERTScore | Gap |
| --- | --- | --- | --- |
| Correction | 0.314 | 0.876 | 0.562 |
| Correction | 0.381 | 0.861 | 0.480 |
| Correction | 0.442 | 0.822 | 0.380 |
| Correction | 0.494 | 0.864 | 0.370 |
| Correction | 0.535 | 0.884 | 0.349 |

In all five cases, A=0.00: the model acknowledged the correction but failed to retain it. BERTScore sees fluent per-turn outputs and reports high scores, while TRACE registers the cross-turn failure.
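
The Gap column is BERTScore minus TRACE, and the same comparison can be used to flag conversations that look fine per turn but fail across turns. The sketch below uses the five rows from the table; the function names and the two thresholds are illustrative assumptions, not values defined by the package.

```python
from typing import List, Tuple

# (TRACE, BERTScore) pairs copied from the table above.
rows: List[Tuple[float, float]] = [
    (0.314, 0.876),
    (0.381, 0.861),
    (0.442, 0.822),
    (0.494, 0.864),
    (0.535, 0.884),
]


def gaps(pairs: List[Tuple[float, float]]) -> List[float]:
    """BERTScore minus TRACE for each conversation, to 3 decimals."""
    return [round(bert - trace, 3) for trace, bert in pairs]


def flag_hidden_failures(
    pairs: List[Tuple[float, float]],
    trace_max: float = 0.55,   # illustrative threshold
    bert_min: float = 0.80,    # illustrative threshold
) -> List[int]:
    """Indices where BERTScore is high but TRACE is low."""
    return [
        i for i, (trace, bert) in enumerate(pairs)
        if trace < trace_max and bert >= bert_min
    ]


print(gaps(rows))
print(flag_hidden_failures(rows))
```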


Human Evaluation

102 conversations were rated by 3 annotators (34 conversations each) on 5 consistency dimensions, each on a 1-5 scale: Q1 Memory, Q2 No-Contradiction, Q3 Correction, Q4 Coherence, Q5 Overall.

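| Annotator | n | Q1 | Q2 | Q3 | Q4 | Q5 |
| --- | --- | --- | --- | --- | --- | --- |
| Girinath V | 34 | 4.50 | 4.41 | 4.35 | 4.47 | 4.35 |
| Hari V | 34 | 4.09 | 4.29 | 4.26 | 4.26 | 4.21 |
| Kaarthic VR | 34 | 4.85 | 4.88 | 4.94 | 4.97 | 4.85 |
| Combined | 102 | 4.48 | 4.53 | 4.52 | 4.57 | 4.47 |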

Human overall mean: 4.47/5 (0.868 normalized to [0,1]).
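
The normalized figure is consistent with a linear rescaling of the 1-5 scale to [0,1]. The sketch below assumes that mapping, (x − 1) / 4, which the text does not state explicitly; the function name is hypothetical. Because each annotator rated the same number of conversations, the unweighted mean of the three Q5 values equals the combined mean.

```python
from statistics import mean


def normalize_likert(x: float, lo: float = 1.0, hi: float = 5.0) -> float:
    """Map a rating on [lo, hi] linearly onto [0, 1] (assumed mapping)."""
    if not lo <= x <= hi:
        raise ValueError(f"rating {x} is outside [{lo}, {hi}]")
    return (x - lo) / (hi - lo)


# Q5 (Overall) means per annotator, copied from the table above.
q5_by_annotator = [4.35, 4.21, 4.85]

overall = mean(q5_by_annotator)
normalized = normalize_likert(overall)

# Agrees with the reported 4.47 and 0.868 up to rounding.
print(f"{overall:.2f}", normalized)
```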


Why TRACE?

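| Metric | Multi-turn | Reference-free | Deterministic | Time-decay | Diagnostic |
| --- | --- | --- | --- | --- | --- |
| BLEU | No | No | Yes | No | No |
| ROUGE | No | No | Yes | No | No |
| BERTScore | No | No | Yes | No | No |
| RAGAS | No | Yes | No | No | Partial |
| TRACE | Yes | Yes | Yes | Yes | Yes |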

Models Used

| Model | Purpose | Size |
| --- | --- | --- |
| all-MiniLM-L6-v2 | Semantic similarity (T, A, C, E) | 80 MB |
| cross-encoder/nli-deberta-v3-small | Contradiction detection (R, A) | 184 MB |

Models download automatically on first use. CPU-friendly, no GPU required.


Citation

@article{girinathv2026trace,
  title   = {TRACE: A Unified Deterministic Metric for Multi-turn
             Conversational Consistency in Large Language Models},
  author  = {Girinath V},
  year    = {2026}
}

Author: Girinath V
GitHub: https://github.com/Giri530/trace-score
