Multi-turn LLM Conversation Consistency Metric

TRACE Score

The first unified, deterministic, reference-free evaluation metric for multi-turn conversational consistency in Large Language Models.

License: MIT | Python 3.8+


The Problem

Existing metrics (BLEU, ROUGE, BERTScore, RAGAS) evaluate each conversation turn in isolation. They cannot detect failures that only become visible across multiple turns:

| Failure Type | Example | BLEU | ROUGE | BERTScore | TRACE |
|---|---|---|---|---|---|
| Fact forgotten | User says "I am diabetic" → model recommends sugar-rich food 5 turns later | Miss | Miss | Miss | Catch |
| Correction ignored | User corrects model → model reverts to old behavior | Miss | Miss | Miss | Catch |
| Self-contradiction | Model says X at turn 2, contradicts X at turn 7 | Miss | Miss | Miss | Catch |
| Topic drift | Conversation gradually drifts off-topic | Miss | Miss | Miss | Catch |
| Confidence drift | Model says "definitely" then "perhaps" about same claim | Miss | Miss | Miss | Catch |

Formula

TRACE(C) = Σ(wᵢ · Sᵢ) − λ·P − δ·V + α·(T·C) + β·(A·R)

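Each component Sᵢ (i ∈ {T, R, A, C, E}) is a time-decayed average of its per-turn scores Sᵢ,ₜ, where t indexes turns and N is the final turn, so recent turns are weighted more: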

Sᵢ = (1/Z) · Σ γ^(N-t) · Sᵢ,ₜ
Z  = Σ γ^(N-t)

| Symbol | Component | Measures |
|---|---|---|
| T | Temporal Retention | Did assistant remember user-stated facts? |
| R | Reliability Consistency | Did assistant contradict itself? |
| A | Adaptive Correction | Did assistant retain user corrections? |
| C | Context Coherence | Did conversation stay on topic? |
| E | Epistemic Stability | Did confidence stay calibrated? |
| P | Contradiction penalty | Global contradiction rate |
| V | Variance penalty | Confidence variance |
| γ | Time decay factor | Default: 0.80 |
| λ | Contradiction weight | Default: 0.15 |
| δ | Variance weight | Default: 0.10 |
| α | T·C interaction | Default: 0.05 |
| β | A·R interaction | Default: 0.05 |
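
To make the aggregation concrete, here is a minimal Python sketch of the two expressions above. It is an illustration under stated assumptions, not the package's internal code: the function names `time_decay_aggregate` and `trace_formula` are hypothetical, turns are assumed to be numbered t = 1..N, and no clipping or rescaling of the final value is applied because none is described here.

```python
from typing import Dict, Sequence

COMPONENTS = ("T", "R", "A", "C", "E")


def time_decay_aggregate(per_turn_scores: Sequence[float], gamma: float = 0.80) -> float:
    """S_i = (1/Z) * sum_t gamma^(N-t) * S_{i,t}, with Z = sum_t gamma^(N-t)."""
    n = len(per_turn_scores)
    if n == 0:
        raise ValueError("need at least one per-turn score")
    # Assumption: turns are numbered t = 1..N, so the latest turn gets weight gamma^0 = 1.
    decay = [gamma ** (n - t) for t in range(1, n + 1)]
    z = sum(decay)
    return sum(w * s for w, s in zip(decay, per_turn_scores)) / z


def trace_formula(
    components: Dict[str, float],
    weights: Dict[str, float],
    P: float,
    V: float,
    lam: float = 0.15,
    delta: float = 0.10,
    alpha: float = 0.05,
    beta: float = 0.05,
) -> float:
    """Weighted sum of components, minus penalties, plus the two interaction terms."""
    weighted = sum(weights[f"w_{name}"] * components[name] for name in COMPONENTS)
    interactions = (
        alpha * components["T"] * components["C"]
        + beta * components["A"] * components["R"]
    )
    # Assumption: the raw value is returned as-is (no clipping to [0, 1]).
    return weighted - lam * P - delta * V + interactions
```

With γ = 0.80, per-turn scores [1.0, 1.0, 0.0] aggregate to 1.44 / 2.44 ≈ 0.59 rather than the plain mean of 0.67, because the failing final turn carries the largest weight. With the default α and β, the raw expression evaluates to 1.10 when every component is 1.0, the weights sum to 1.0, and both penalties are 0; whether the package clips or rescales that value is not stated here.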

Install

pip install trace-score

Quick Start

from trace_score import compute_TRACE

conversation = [
    ("user",      "I am diabetic and hate spicy food"),
    ("assistant", "I will suggest low sugar mild options."),
    ("user",      "Actually I eat fish too. I am pescatarian."),
    ("assistant", "Spicy chicken with cashews!"),   # failure turn
]

result = compute_TRACE(conversation, verbose=True)

print(result["trace_score"])        # 0.41 — catches failures
print(result["T"])                  # 0.50 — forgot user facts
print(result["A"])                  # 0.00 — ignored correction
print(result["formula_breakdown"])  # full formula with values
print(result["interpretation"])     # "Poor consistency"
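
The Quick Start passes a conversation as a list of (role, text) tuples. A small pre-flight check like the sketch below can catch malformed input before any model is loaded. The helper `validate_conversation` is hypothetical and not part of `trace_score`; restricting roles to "user" and "assistant" is an assumption based on the example above.

```python
from typing import List, Sequence, Tuple

# Assumption: only the two roles shown in the Quick Start example are valid.
ALLOWED_ROLES = ("user", "assistant")


def validate_conversation(conversation: Sequence[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the conversation as a list of (role, text) tuples, or raise ValueError."""
    checked = []
    for index, turn in enumerate(conversation):
        if not isinstance(turn, (tuple, list)) or len(turn) != 2:
            raise ValueError(f"turn {index}: expected a (role, text) pair, got {turn!r}")
        role, text = turn
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            raise ValueError(f"turn {index}: unknown role {role!r}")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"turn {index}: text must be a non-empty string")
        checked.append((role, text))
    return checked
```

Calling `validate_conversation(conversation)` before `compute_TRACE(conversation)` keeps input errors separate from scoring errors.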

Batch Evaluation

from trace_score import TRACEEvaluator

# conversations: a list of conversations, each a list of (role, text) tuples
# Models are loaded once and reused across all calls, which is much faster
evaluator = TRACEEvaluator()
results   = [evaluator.evaluate(conv) for conv in conversations]
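
Once a batch has been scored, the per-conversation results usually need to be summarized. The sketch below averages selected keys and lists the lowest-scoring conversations. It assumes `evaluator.evaluate` returns the same dictionary shape as `compute_TRACE` in the Quick Start (keys such as "trace_score", "T", "A"); the helpers `summarize` and `lowest_scoring` are hypothetical and not part of the package.

```python
from statistics import mean
from typing import Dict, Iterable, List


def summarize(
    results: List[Dict[str, float]],
    keys: Iterable[str] = ("trace_score", "T", "A"),
) -> Dict[str, float]:
    """Average the selected keys over all result dictionaries."""
    if not results:
        raise ValueError("no results to summarize")
    return {key: mean(r[key] for r in results) for key in keys}


def lowest_scoring(results: List[Dict[str, float]], n: int = 3) -> List[int]:
    """Return the indices of the n conversations with the lowest trace_score."""
    ranked = sorted(enumerate(results), key=lambda pair: pair[1]["trace_score"])
    return [index for index, _ in ranked[:n]]
```

The indices returned by `lowest_scoring` line up with the order of `conversations`, so the weakest conversations can be pulled out for manual review.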

Adaptive Weights

# conv: a single conversation in the Quick Start format

# Equal weights (default)
result = compute_TRACE(conv, preset="equal")

# Medical chatbot — memory and reliability weighted more
result = compute_TRACE(conv, preset="medical_chatbot")

# Custom weights — must sum to 1.0
result = compute_TRACE(conv, weights={
    "w_T": 0.35, "w_R": 0.25,
    "w_A": 0.20, "w_C": 0.10, "w_E": 0.10
})

Available presets: equal, customer_service, technical_qa, medical_chatbot, education_tutor
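
Because custom weights must sum to 1.0, it can help to check them before calling `compute_TRACE`. The following is a minimal sketch; `check_weights` is a hypothetical helper, and how the package itself reacts to invalid weights is not described here.

```python
import math
from typing import Dict

# The five weight keys used in the custom-weights example above.
REQUIRED_KEYS = ("w_T", "w_R", "w_A", "w_C", "w_E")


def check_weights(weights: Dict[str, float], tol: float = 1e-9) -> Dict[str, float]:
    """Raise ValueError if a key is missing or the weights do not sum to 1.0."""
    missing = [key for key in REQUIRED_KEYS if key not in weights]
    if missing:
        raise ValueError(f"missing weight keys: {missing}")
    total = sum(weights[key] for key in REQUIRED_KEYS)
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"weights sum to {total:.6f}, expected 1.0")
    return weights
```

For example, `check_weights({"w_T": 0.35, "w_R": 0.25, "w_A": 0.20, "w_C": 0.10, "w_E": 0.10})` passes, since 0.35 + 0.25 + 0.20 + 0.10 + 0.10 = 1.00.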


Benchmark Results

Evaluated on 30 multi-turn conversations across 3 categories (Fact Memory, Correction Retention, Contradiction Detection). Conversations generated by Llama-3.1-8B via Groq API.

Overall Metric Comparison

| Metric | Overall | Fact Memory | Correction | Contradiction |
|---|---|---|---|---|
| TRACE | 0.699 | 0.703 | 0.550 | 0.843 |
| BLEU | 0.102 | 0.046 | 0.149 | 0.110 |
| ROUGE-L | 0.239 | 0.177 | 0.301 | 0.239 |
| BERTScore | 0.822 | 0.800 | 0.842 | 0.823 |

Key finding: BLEU and ROUGE-L show similar low scores across all categories — they cannot distinguish between different types of consistency failures. BERTScore appears high but provides no diagnostic breakdown. TRACE clearly separates Correction (0.550) from Contradiction (0.843), revealing that Llama-3.1-8B struggles most with retaining user corrections.


TRACE Component Breakdown by Category

| Category | T | R | A | C | E |
|---|---|---|---|---|---|
| Fact Memory | 0.137 | 0.955 | 1.000 | 0.503 | 0.697 |
| Correction | 0.491 | 0.927 | 0.144 | 0.465 | 0.712 |
| Contradiction | 0.973 | 0.875 | 0.900 | 0.510 | 0.696 |

Diagnostic insight:

  • Fact Memory: T=0.137 — model forgets user-stated facts (A=1.0 means no corrections were needed, so A is vacuously true here)
  • Correction: A=0.144 — model ignores user corrections (critical failure)
  • Contradiction: T=0.973, A=0.900 — model handles these well

No existing metric (BLEU, ROUGE, BERTScore) can produce this breakdown.
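
The diagnostic reading above can be automated by picking the lowest component in each category. This sketch uses the numbers from the component breakdown table; `weakest_component` is a hypothetical helper, not a function of the package.

```python
from typing import Dict, Tuple

# Component means copied from the breakdown table above.
BREAKDOWN: Dict[str, Dict[str, float]] = {
    "Fact Memory":   {"T": 0.137, "R": 0.955, "A": 1.000, "C": 0.503, "E": 0.697},
    "Correction":    {"T": 0.491, "R": 0.927, "A": 0.144, "C": 0.465, "E": 0.712},
    "Contradiction": {"T": 0.973, "R": 0.875, "A": 0.900, "C": 0.510, "E": 0.696},
}


def weakest_component(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the (name, value) of the lowest-scoring component."""
    name = min(scores, key=scores.get)
    return name, scores[name]


weakest = {category: weakest_component(scores) for category, scores in BREAKDOWN.items()}
```

On this data the lowest components are T (0.137) for Fact Memory and A (0.144) for Correction, matching the bullets above. For Contradiction the lowest value is C (0.510), a component that sits near 0.5 in all three categories.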


The Gap TRACE Reveals — BERTScore vs TRACE

Conversations where BERTScore is high but TRACE is low (failures invisible to BERTScore, caught by TRACE):

| Conversation | Category | TRACE | BERTScore | Gap |
|---|---|---|---|---|
| CR_006 | Correction | 0.314 | 0.876 | +0.562 |
| CR_009 | Correction | 0.381 | 0.861 | +0.480 |
| CR_004 | Correction | 0.535 | 0.884 | +0.349 |
| CR_003 | Correction | 0.494 | 0.864 | +0.370 |
| CR_002 | Correction | 0.442 | 0.822 | +0.380 |

In all 5 cases: BERTScore ≥ 0.82 (looks good), TRACE < 0.55 (failures detected). The A component reveals why — user corrections completely ignored (A=0.00). This is invisible to any per-turn metric.
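
The gap column is simply BERTScore minus TRACE. As a sketch of how such cases could be flagged in a larger result set, the snippet below applies the two thresholds quoted above (BERTScore ≥ 0.82, TRACE < 0.55) to the five rows from the table. The helper `hidden_failures` and the tuple layout are hypothetical.

```python
from typing import List, Tuple

# (conversation id, TRACE, BERTScore), copied from the table above.
ROWS: List[Tuple[str, float, float]] = [
    ("CR_006", 0.314, 0.876),
    ("CR_009", 0.381, 0.861),
    ("CR_004", 0.535, 0.884),
    ("CR_003", 0.494, 0.864),
    ("CR_002", 0.442, 0.822),
]


def hidden_failures(
    rows: List[Tuple[str, float, float]],
    bert_min: float = 0.82,
    trace_max: float = 0.55,
) -> List[Tuple[str, float]]:
    """Return (id, gap) for rows with high BERTScore but low TRACE, largest gap first."""
    flagged = []
    for conv_id, trace, bert in rows:
        if bert >= bert_min and trace < trace_max:
            flagged.append((conv_id, round(bert - trace, 3)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


flagged = hidden_failures(ROWS)
```

All five rows pass both thresholds, and sorting by gap puts CR_006 (0.562) first and CR_004 (0.349) last.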


Why TRACE?

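| Metric | Multi-turn | Reference-free | Deterministic | Time-decay | Diagnostic |
|---|---|---|---|---|---|
| BLEU | No | No | Yes | No | No |
| ROUGE | No | No | Yes | No | No |
| BERTScore | No | No | Yes | No | No |
| RAGAS | No | Yes | No | No | Partial |
| TRACE | Yes | Yes | Yes | Yes | Yes |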

Models Used

| Model | Purpose | Size |
|---|---|---|
| all-MiniLM-L6-v2 | Semantic similarity (T, A, C, E) | 80MB |
| cross-encoder/nli-deberta-v3-small | Contradiction detection (R, A) | 184MB |

Models are downloaded automatically on first use (~264MB total). Both run on CPU; no GPU is required.


Citation

@article{girinathv2026trace,
  title   = {TRACE: A Unified Deterministic Metric for Multi-turn
             Conversational Consistency in Large Language Models},
  author  = {Girinath, V},
  year    = {2026}
}

Author: Girinath V
GitHub: https://github.com/Giri530/trace-score
