Multi-turn LLM Conversation Consistency Metric

These details have not been verified by PyPI

Project links

Project description

TRACE Score

Multi-turn LLM Conversation Consistency Metric

The first unified, deterministic, reference-free evaluation metric for multi-turn conversational consistency in Large Language Models.

The Problem

Existing metrics (BLEU, ROUGE, BERTScore, RAGAS) evaluate each conversation turn in isolation. They cannot detect failures that only become visible across multiple turns:

Failure Type	Example	BLEU	ROUGE	BERTScore	TRACE
Fact forgotten	User says "I am diabetic" → model recommends sugar-rich food 5 turns later	Miss	Miss	Miss	Catch
Correction ignored	User corrects model → model reverts to old behavior	Miss	Miss	Miss	Catch
Self-contradiction	Model says X at turn 2, contradicts X at turn 7	Miss	Miss	Miss	Catch
Topic drift	Conversation gradually drifts off-topic	Miss	Miss	Miss	Catch
Confidence drift	Model says "definitely" then "perhaps" about same claim	Miss	Miss	Miss	Catch

Formula

TRACE(C) = Σ(wᵢ · Sᵢ) − λ·P − δ·V + α·(T·C) + β·(A·R)

Each component uses time-decay aggregation — recent turns weighted more:

Sᵢ = (1/Z) · Σ γ^(N-t) · Sᵢ,ₜ
Z  = Σ γ^(N-t)

Symbol	Component	Measures
T	Temporal Retention	Did assistant remember user-stated facts?
R	Reliability Consistency	Did assistant contradict itself?
A	Adaptive Correction	Did assistant retain user corrections?
C	Context Coherence	Did conversation stay on topic?
E	Epistemic Stability	Did confidence stay calibrated?
P	Contradiction penalty	Global contradiction rate
V	Variance penalty	Confidence variance
γ	Time decay factor	Default: 0.80
λ	Contradiction weight	Default: 0.15
δ	Variance weight	Default: 0.10
α	T·C interaction	Default: 0.05
β	A·R interaction	Default: 0.05

Install

pip install trace-score

Quick Start

from trace_score import compute_TRACE

conversation = [
    ("user",      "I am diabetic and hate spicy food"),
    ("assistant", "I will suggest low sugar mild options."),
    ("user",      "Actually I eat fish too. I am pescatarian."),
    ("assistant", "Spicy chicken with cashews!"),   # failure turn
]

result = compute_TRACE(conversation, verbose=True)

print(result["trace_score"])        # 0.41 — catches failures
print(result["T"])                  # 0.50 — forgot user facts
print(result["A"])                  # 0.00 — ignored correction
print(result["formula_breakdown"])  # full formula with values
print(result["interpretation"])     # "Poor consistency"

Batch Evaluation

from trace_score import TRACEEvaluator

# Models loaded once, reused across all calls — much faster
evaluator = TRACEEvaluator()
results   = [evaluator.evaluate(conv) for conv in conversations]

Adaptive Weights

# Equal weights (default)
result = compute_TRACE(conv, preset="equal")

# Medical chatbot — memory and reliability weighted more
result = compute_TRACE(conv, preset="medical_chatbot")

# Custom weights — must sum to 1.0
result = compute_TRACE(conv, weights={
    "w_T": 0.35, "w_R": 0.25,
    "w_A": 0.20, "w_C": 0.10, "w_E": 0.10
})

Available presets: equal, customer_service, technical_qa, medical_chatbot, education_tutor

Benchmark Results

Evaluated on 30 multi-turn conversations across 3 categories (Fact Memory, Correction Retention, Contradiction Detection). Conversations generated by Llama-3.1-8B via Groq API.

Overall Metric Comparison

Metric	Overall	Fact Memory	Correction	Contradiction
TRACE	0.699	0.703	0.550	0.843
BLEU	0.102	0.046	0.149	0.110
ROUGE-L	0.239	0.177	0.301	0.239
BERTScore	0.822	0.800	0.842	0.823

Key finding: BLEU and ROUGE-L show similar low scores across all categories — they cannot distinguish between different types of consistency failures. BERTScore appears high but provides no diagnostic breakdown. TRACE clearly separates Correction (0.550) from Contradiction (0.843), revealing that Llama-3.1-8B struggles most with retaining user corrections.

TRACE Component Breakdown by Category

Category	T	R	A	C	E
Fact Memory	0.137	0.955	1.000	0.503	0.697
Correction	0.491	0.927	0.144	0.465	0.712
Contradiction	0.973	0.875	0.900	0.510	0.696

Diagnostic insight:

Fact Memory: T=0.137 — model forgets user-stated facts (A=1.0 means no corrections were needed, so A is vacuously true here)
Correction: A=0.144 — model ignores user corrections (critical failure)
Contradiction: T=0.973, A=0.900 — model handles these well

No existing metric (BLEU, ROUGE, BERTScore) can produce this breakdown.

The Gap TRACE Reveals — BERTScore vs TRACE

Conversations where BERTScore is high but TRACE is low (failures invisible to BERTScore, caught by TRACE):

Conversation	Category	TRACE	BERTScore	Gap
CR_006	Correction	0.314	0.876	+0.562
CR_009	Correction	0.381	0.861	+0.480
CR_004	Correction	0.535	0.884	+0.349
CR_003	Correction	0.494	0.864	+0.370
CR_002	Correction	0.442	0.822	+0.380

In all 5 cases: BERTScore ≥ 0.82 (looks good), TRACE < 0.55 (failures detected). The A component reveals why — user corrections completely ignored (A=0.00). This is invisible to any per-turn metric.

Why TRACE?

Metric	Multi-turn	Reference-free	Deterministic	Time-decay	Diagnostic
BLEU	No	No	Yes	No	No
ROUGE	No	No	Yes	No	No
BERTScore	No	No	Yes	No	No
RAGAS	No	Yes	No	No	Partial
TRACE	Yes	Yes	Yes	Yes	Yes

Models Used

Model	Purpose	Size
`all-MiniLM-L6-v2`	Semantic similarity (T, A, C, E)	80MB
`cross-encoder/nli-deberta-v3-small`	Contradiction detection (R, A)	184MB

Models downloaded automatically on first use (~264MB total). CPU-friendly — no GPU required.

Citation

@article{girinathv2026trace,
  title   = {TRACE: A Unified Deterministic Metric for Multi-turn
             Conversational Consistency in Large Language Models},
  author  = {Girinath, V},
  year    = {2026}
}

Author: Girinath V GitHub: https://github.com/Giri530/trace-score

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.1

Apr 15, 2026

This version

0.1.0

Apr 7, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

trace_score-0.1.0.tar.gz (15.7 kB view details)

Uploaded Apr 7, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

trace_score-0.1.0-py3-none-any.whl (15.4 kB view details)

Uploaded Apr 7, 2026 Python 3

File details

Details for the file trace_score-0.1.0.tar.gz.

File metadata

Download URL: trace_score-0.1.0.tar.gz
Upload date: Apr 7, 2026
Size: 15.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for trace_score-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c8ff969e4f38e0784ba980ba9d1dd2a3f1b6cb840958919371c402aa9c7c7c52`
MD5	`0e15e3b0b704320c112250e3f6ce15bc`
BLAKE2b-256	`d7b08e04d158afed782a363bbf3009cbd0973a6747ba80403c1d5da9bf9195d0`

See more details on using hashes here.

File details

Details for the file trace_score-0.1.0-py3-none-any.whl.

File metadata

Download URL: trace_score-0.1.0-py3-none-any.whl
Upload date: Apr 7, 2026
Size: 15.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for trace_score-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`defb7db98193571c79a550cf64e7343d4fcf189d386f50647e4741beab775d57`
MD5	`1dc24ec35710b3710fa04350daf16cdf`
BLAKE2b-256	`0c8ac0c2fc272df29e30a408563d4e7ab4a18d3b8d7feccace2b10af9eab79c3`

See more details on using hashes here.

trace-score 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

TRACE Score

The Problem

Formula

Install

Quick Start

Batch Evaluation

Adaptive Weights

Benchmark Results

Overall Metric Comparison

TRACE Component Breakdown by Category

The Gap TRACE Reveals — BERTScore vs TRACE

Why TRACE?

Models Used

Citation

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes