Multi-turn LLM Conversation Consistency Metric

TRACE Score

A unified, deterministic, reference-free evaluation metric for multi-turn conversational consistency in large language models.

The Problem

BLEU, ROUGE, BERTScore, and RAGAS evaluate each conversation turn in isolation. They cannot detect failures that span multiple turns:

Failure Type	Example	BLEU	ROUGE	BERTScore	TRACE
Fact forgotten	User says "I am diabetic" at turn 1 → model recommends sugary food at turn 6	Miss	Miss	Miss	Catch
Correction ignored	User corrects model → model reverts to old behavior next turn	Miss	Miss	Miss	Catch
Self-contradiction	Model says X at turn 2, contradicts X at turn 7	Miss	Miss	Miss	Catch
Topic drift	Conversation drifts off topic over multiple turns	Miss	Miss	Miss	Catch

Formula

TRACE(C) = Σ(wᵢ · Sᵢ) − λ·P − δ·V + α·(T·C) + β·(A·R)

Time-decay aggregation weights recent turns more heavily:

Sᵢ = (1/Z) · Σ γ^(N-t) · Sᵢ,ₜ

Component	Measures
T — Temporal Retention	Did the model remember user-stated facts?
R — Reliability Consistency	Did the model contradict itself?
A — Adaptive Correction	Did the model retain user corrections?
C — Context Coherence	Did the conversation stay on topic?
E — Epistemic Stability	Did the model's confidence stay calibrated?

Default: γ=0.80, λ=0.15, δ=0.10, α=0.05, β=0.05, all wᵢ=0.20
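
The formula and defaults above can be sketched in a few lines of Python. This is an illustrative reading, not the package's implementation. It assumes that Z is the sum of the decay weights γ^(N−t), that turns are indexed t = 1..N, and that the penalty terms P and V are supplied as plain numbers, since this description does not define them. The function names `time_decay_aggregate` and `combine_trace` are hypothetical and are not part of `trace_score`.

```python
def time_decay_aggregate(turn_scores, gamma=0.80):
    """Weighted mean of per-turn scores; later turns get larger weights.

    Assumption: Z is the sum of the weights gamma ** (N - t), t = 1..N.
    """
    n = len(turn_scores)
    if n == 0:
        raise ValueError("turn_scores must not be empty")
    weights = [gamma ** (n - t) for t in range(1, n + 1)]
    z = sum(weights)
    return sum(w * s for w, s in zip(weights, turn_scores)) / z


def combine_trace(components, p=0.0, v=0.0, weights=None,
                  lam=0.15, delta=0.10, alpha=0.05, beta=0.05):
    """Combine aggregated component scores using the formula above.

    `components` maps "T", "R", "A", "C", "E" to scores in [0, 1].
    `p` and `v` stand in for the P and V penalty terms (hypothetical inputs).
    No clipping is applied; whether the package clips is not stated here.
    """
    keys = ("T", "R", "A", "C", "E")
    w = weights or {k: 0.20 for k in keys}
    base = sum(w[k] * components[k] for k in keys)
    return (base
            - lam * p
            - delta * v
            + alpha * components["T"] * components["C"]
            + beta * components["A"] * components["R"])


# A component that fails only on the last turn is pulled down sharply,
# because the most recent turn carries the largest weight.
late_failure = time_decay_aggregate([1.0, 1.0, 0.0])
early_failure = time_decay_aggregate([0.0, 1.0, 1.0])
print(late_failure < early_failure)
```

Under these assumptions, a failure on the final turn lowers the aggregate more than the same failure on the first turn, which is the behavior the time-decay description calls for.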

Install

pip install trace-score

Quick Start

from trace_score import compute_TRACE

conversation = [
    ("user",      "I am diabetic and allergic to nuts."),
    ("assistant", "I will suggest safe low-sugar options."),
    ("user",      "Actually I eat fish too. I am pescatarian."),
    ("assistant", "Spicy chicken with cashews would be great!"),
]

result = compute_TRACE(conversation, verbose=True)

print(result["trace_score"])        # 0.41
print(result["A"])                  # 0.00 — correction ignored
print(result["interpretation"])     # Poor consistency
print(result["formula_breakdown"])

Batch Evaluation

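from trace_score import TRACEEvaluator

# Each conversation is assumed to use the Quick Start format: a list of (role, text) tuples.
conversations = [[("user", "I am diabetic."), ("assistant", "I will suggest low-sugar options.")]]

evaluator = TRACEEvaluator()   # models loaded once
results   = [evaluator.evaluate(conv) for conv in conversations]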

Benchmark Results

Evaluated on 102 multi-turn conversations (34 templates × 3 runs) generated by Llama-3.1-8B via Groq API.

Overall Metric Comparison

Category	n	TRACE	BLEU	ROUGE-L	BERTScore
Fact Memory	36	0.688	0.048	0.172	0.796
Correction	36	0.632	0.183	0.321	0.840
Contradiction	30	0.871	0.124	0.255	0.837
Overall	102	0.721	0.108	0.236	0.822

TRACE category separation range (highest minus lowest category mean): 0.239

BERTScore category separation range: 0.044

TRACE separates the categories 5.4x more than BERTScore.
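
The separation figures can be reproduced from the category means in the table above. The following is a small arithmetic check, with a hypothetical helper name `separation_range`; it uses only the numbers already reported.

```python
# Category means copied from the Overall Metric Comparison table.
trace_means = {"Fact Memory": 0.688, "Correction": 0.632, "Contradiction": 0.871}
bert_means = {"Fact Memory": 0.796, "Correction": 0.840, "Contradiction": 0.837}


def separation_range(category_means):
    """Spread between the highest and lowest category mean."""
    values = category_means.values()
    return max(values) - min(values)


trace_range = round(separation_range(trace_means), 3)
bert_range = round(separation_range(bert_means), 3)
ratio = round(trace_range / bert_range, 1)

print(trace_range, bert_range, ratio)
```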

TRACE Component Breakdown by Category

Category	T	R	A	C	E
Fact Memory	0.137	0.955	1.000	0.503	0.697
Correction	0.491	0.927	0.144	0.465	0.712
Contradiction	0.973	0.875	0.900	0.510	0.696

The A component (Adaptive Correction) drops to 0.144 for Correction conversations, meaning Llama-3.1-8B fails to retain user corrections in roughly 85.6% of cases (1 − 0.144). BERTScore scores the same conversations at 0.840, so this failure is invisible to the per-turn metrics.

The Gap TRACE Reveals

Conversations where BERTScore is high but TRACE is low:

Category	TRACE	BERTScore	Gap
Correction	0.314	0.876	0.562
Correction	0.381	0.861	0.480
Correction	0.442	0.822	0.380
Correction	0.494	0.864	0.370
Correction	0.535	0.884	0.349

In all cases, A=0.00 — the model acknowledged the correction but failed to retain it. BERTScore sees fluent per-turn outputs and reports high scores. TRACE sees the cross-turn failure.
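
The Gap column is simply BERTScore minus TRACE. As a rough sketch, the same subtraction can be used to flag conversations that look fine per turn but fail across turns. The 0.3 threshold and the name `flag_gap` are illustrative assumptions, not part of the package.

```python
# (TRACE, BERTScore) pairs copied from the table above.
rows = [
    (0.314, 0.876),
    (0.381, 0.861),
    (0.442, 0.822),
    (0.494, 0.864),
    (0.535, 0.884),
]


def flag_gap(trace, bertscore, threshold=0.3):
    """True when BERTScore exceeds TRACE by more than `threshold` (assumed cutoff)."""
    return (bertscore - trace) > threshold


gaps = [round(bert - trace, 3) for trace, bert in rows]
flagged = [flag_gap(trace, bert) for trace, bert in rows]

print(gaps)
print(all(flagged))
```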

Human Evaluation

102 conversations rated by 3 annotators (34 conversations each) on 5 consistency dimensions (Q1 Memory, Q2 No-Contradiction, Q3 Correction, Q4 Coherence, Q5 Overall), each on a 1-5 scale.

Annotator	n	Q1	Q2	Q3	Q4	Q5
Girinath V	34	4.50	4.41	4.35	4.47	4.35
Hari V	34	4.09	4.29	4.26	4.26	4.21
Kaarthic VR	34	4.85	4.88	4.94	4.97	4.85
Combined	102	4.48	4.53	4.52	4.57	4.47

Human overall mean (Q5): 4.47/5, or 0.868 when normalized to [0,1] as (4.47 − 1) / 4.

Why TRACE?

Metric	Multi-turn	Reference-free	Deterministic	Time-decay	Diagnostic
BLEU	No	No	Yes	No	No
ROUGE	No	No	Yes	No	No
BERTScore	No	No	Yes	No	No
RAGAS	No	Yes	No	No	Partial
TRACE	Yes	Yes	Yes	Yes	Yes

Models Used

Model	Purpose	Size
all-MiniLM-L6-v2	Semantic similarity (T, A, C, E)	80MB
cross-encoder/nli-deberta-v3-small	Contradiction detection (R, A)	184MB

Models download automatically on first use. CPU-friendly, no GPU required.

Citation

@article{girinathv2026trace,
  title   = {TRACE: A Unified Deterministic Metric for Multi-turn
             Conversational Consistency in Large Language Models},
  author  = {Girinath V},
  year    = {2026}
}

Author: Girinath V

GitHub: https://github.com/Giri530/trace-score

Release history

0.1.1 (this version), Apr 15, 2026

0.1.0, Apr 7, 2026

