TSCFEval - A model-agnostic framework for systematic evaluation of counterfactual explanations in time series classification

TSCFEval

Python 3.10-3.13 | License: MIT

TSCFEval is a model-agnostic Python framework for systematic evaluation of counterfactual explanations in Time Series Classification (TSC). Unlike existing libraries that focus on counterfactual generation, TSCFEval is specifically designed for counterfactual evaluation, consolidating fragmented evaluation practices from the TSC counterfactual literature into a unified, extensible toolkit.

Given a time series classifier and a set of counterfactual explanations, TSCFEval provides:

  • 11 evaluation metrics organized into six quality dimensions (core quality, distribution alignment, structural properties, model behavior, stability, and computational performance)
  • Weighted scalarization for aggregating metrics into composite scores, enabling customizable method ranking
  • Confidence-stratified instance selection for benchmarking across the decision boundary
  • Three benchmarking scenarios: single dataset with multiple CF methods, single dataset with multiple classifiers, and multiple datasets with a fixed classifier
  • 7 built-in CF methods for generating counterfactuals
  • Pareto and Friedman analysis for principled multi-criteria comparison

Installation

pip install tscf-eval

With optional dependencies:

pip install tscf-eval[dtw]   # DTW distance support
pip install tscf-eval[full]  # All features

From source:

git clone https://github.com/bzamith/tscf-eval.git
cd tscf-eval && pip install -e .

Available Methods and Metrics

Counterfactual Methods

| Method | Strategy | Description | Reference |
|---|---|---|---|
| CELS | Saliency map | Learned saliency map blending with nearest unlike neighbor | Li et al., 2023 |
| NativeGuide | Instance-based | Nearest unlike neighbor guidance with four variants (blend, ng, dtw_dba, cam) | Delaney et al., 2021 |
| COMTE | Instance-based | Greedy channel substitution for multivariate TS | Ates et al., 2021 |
| SETS | Shapelet-based | Class-specific shapelet manipulation with contiguous perturbations | Bahri et al., 2022 |
| TSEvo | Evolutionary | Multi-objective optimization via NSGA-II with three mutation operators | Hollig et al., 2022 |
| Glacier | Gradient-based | Gradient optimization with importance-weighted proximity constraints | Wang et al., 2024 |
| LatentCF | Gradient-based | Latent space optimization with local/global importance weighting | Wang et al., 2021 |

Evaluation Metrics

TSCFEval implements 11 metrics organized into six quality dimensions:

| Dimension | Metric | Description | Direction | Reference |
|---|---|---|---|---|
| Core Quality | Validity | Fraction of CFs that flip the prediction (hard or soft mode) | maximize | Li et al., 2023 |
| Core Quality | Proximity | Closeness to original instance (L1, L2, L-inf, DTW) | maximize | Delaney et al., 2021; Bahri et al., 2022 |
| Core Quality | Sparsity | Fraction of changed features | minimize | Mothilal et al., 2020 |
| Distribution | Plausibility | Whether CFs lie within data distribution (LOF, IF, MP-OCSVM, DTW-LOF) | maximize | Breunig et al., 2000; Liu et al., 2008 |
| Distribution | Diversity | Variety among multiple CFs via DPP (Euclidean or DTW) | maximize | Mothilal et al., 2020 |
| Structure | Contiguity | How contiguous the edits are | maximize | Delaney et al., 2021; Ates et al., 2021 |
| Structure | Composition | Number and length of edit segments | minimize | Delaney et al., 2021; Ates et al., 2021 |
| Model Behavior | Confidence | Model confidence on original and CF predictions | maximize | Le et al., 2023 |
| Model Behavior | Controllability | Ease of reverting CF changes via single-feature edits | maximize | Verma et al., 2024 |
| Stability | Robustness | Local Lipschitz-like stability to input perturbations (Euclidean or DTW) | minimize | Ates et al., 2021 |
| Performance | Efficiency | Generation time per instance | minimize | Li et al., 2023 |
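
To make a few of these rows concrete, here is a minimal sketch in plain Python of hard validity, a raw L2 distance, sparsity, and edit segments for a univariate series. It is not TSCFEval's implementation: the function names and the tolerance `tol` are assumptions for illustration, and since the table lists Proximity as "maximize", the library presumably reports a transformed closeness score rather than the raw distance computed here.

```python
import math

def hard_validity(y_orig, y_cf):
    """Fraction of counterfactuals whose label differs from the original label."""
    flips = sum(1 for a, b in zip(y_orig, y_cf) if a != b)
    return flips / len(y_orig)

def l2_distance(x, x_cf):
    """Raw Euclidean distance between an instance and its counterfactual."""
    return math.dist(x, x_cf)

def changed_mask(x, x_cf, tol=1e-8):
    """True at every time step where the counterfactual differs from x."""
    return [abs(a - b) > tol for a, b in zip(x, x_cf)]

def sparsity(x, x_cf, tol=1e-8):
    """Fraction of time steps that were changed."""
    mask = changed_mask(x, x_cf, tol)
    return sum(mask) / len(mask)

def edit_segments(x, x_cf, tol=1e-8):
    """Lengths of the maximal runs of consecutive changed time steps."""
    lengths, run = [], 0
    for changed in changed_mask(x, x_cf, tol):
        if changed:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths

# Toy series (made-up values, not benchmark data)
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x_cf = [0.0, 1.5, 2.5, 3.0, 4.0, 6.0]

print(sparsity(x, x_cf))       # 0.5: three of six time steps changed
print(edit_segments(x, x_cf))  # [2, 1]: one run of two edits, one isolated edit
print(l2_distance(x, x_cf))
```

Fewer and longer segments correspond to more contiguous edits, which is the property the Structure metrics are described as capturing.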

Evaluating Counterfactuals

Simple Evaluation

from sklearn.neighbors import KNeighborsClassifier
from tscf_eval import (
    Evaluator, Validity, Proximity, Sparsity,
    UCRLoader, NativeGuide,
)

# Load data
loader = UCRLoader("ItalyPowerDemand")
train, test = loader.load("train"), loader.load("test")

# Train classifier
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(train.X, train.y)

# Generate counterfactuals using NativeGuide
explainer = NativeGuide(clf, (train.X, train.y), method="blend")
X, X_cf, y, y_cf = [], [], [], []
for x in test.X[:10]:
    cf, cf_label, _ = explainer.explain(x)
    X.append(x)
    X_cf.append(cf)
    y.append(clf.predict(x.reshape(1, -1))[0])
    y_cf.append(cf_label)

# Evaluate counterfactual quality
evaluator = Evaluator([
    Validity(),
    Proximity(p=2, distance="lp"),
    Proximity(distance="dtw"),
    Sparsity(),
])
results = evaluator.evaluate(X, X_cf, y=y, y_cf=y_cf)
print(f"Validity: {results['validity_soft']:.2f}")
print(f"Proximity (L2): {results['proximity_l2']:.2f}")
print(f"Proximity (DTW): {results['proximity_dtw']:.2f}")
print(f"Sparsity: {results['sparsity']:.2f}")
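
The DTW variant of proximity compares two series under an elastic alignment instead of point by point. The following is a textbook dynamic-programming sketch with an absolute-difference local cost and no window constraint. It only illustrates the idea; the function name is hypothetical, and TSCFEval's DTW backend may differ in local cost, windowing, or normalization.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# A peak shifted by one time step: point-wise L1 distance is 2,
# but the warped alignment matches the peaks exactly.
original = [0.0, 1.0, 0.0, 0.0]
shifted = [0.0, 0.0, 1.0, 0.0]
print(sum(abs(p - q) for p, q in zip(original, shifted)))  # 2.0
print(dtw_distance(original, shifted))                     # 0.0
```

This is why a counterfactual that merely shifts a pattern in time can look far from the original under L2 yet close under DTW.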

Using Built-in CF Methods

from tscf_eval import COMTE
from tscf_eval.data_loader import UCRLoader
from sklearn.neighbors import KNeighborsClassifier

# Load dataset
loader = UCRLoader("ItalyPowerDemand")
train = loader.load("train")
test = loader.load("test")

# Train classifier
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(train.X, train.y)

# Generate counterfactual with CoMTE
comte = COMTE(
    model=clf,
    data=(train.X, train.y),
    distance="dtw",
)

cf, cf_label, meta = comte.explain(test.X[0])
print(f"CF label: {cf_label}, Edits: {meta['edits_variables']}")

Benchmarking Multiple Methods

TSCFEval supports three benchmarking scenarios for systematic evaluation:

  1. Single dataset, multiple CF methods -- compare explainer algorithms on a fixed dataset and model
  2. Single dataset, multiple classifiers -- study how the classifier affects CF quality
  3. Multiple datasets, fixed classifier -- assess generalization across datasets

from aeon.classification.convolution_based import RocketClassifier

from tscf_eval import Evaluator, Validity, Proximity, Sparsity, COMTE, NativeGuide, TSEvo
from tscf_eval.benchmark import BenchmarkRunner, DatasetConfig, ModelConfig, ExplainerConfig
from tscf_eval.data_loader import UCRLoader

# Load data
loader = UCRLoader("ArrowHead")
train, test = loader.load("train"), loader.load("test")

# Train a classifier
clf = RocketClassifier(n_kernels=500, random_state=42)
clf.fit(train.X, train.y)

# Configure benchmark
runner = BenchmarkRunner(
    datasets=[DatasetConfig("ArrowHead", train.X, train.y, test.X, test.y)],
    models=[ModelConfig("rocket", clf)],
    explainers=[
        ExplainerConfig("comte_dtw", COMTE, {"distance": "dtw"}),
        ExplainerConfig("ng_blend", NativeGuide, {"method": "blend"}),
        ExplainerConfig("tsevo", TSEvo, {"n_generations": 50}),
    ],
    evaluator=Evaluator([Validity(), Proximity(p=2, distance="lp"), Proximity(distance="dtw"), Sparsity()]),
    n_instances=12,
    instance_selection="stratified_confidence",  # Confidence-stratified selection
)

results = runner.run()

# Aggregate results by explainer
df = results.aggregate(by="explainer")
print(df)
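
The `instance_selection="stratified_confidence"` option picks test instances across the range of classifier confidence, so the benchmark covers points both near and far from the decision boundary. One plausible way to do this is sketched below: sort instances by confidence, cut them into equal-sized bins, and sample the same number from each. The function name, the number of bins, and the equal-quota rule are assumptions for illustration, not the library's actual procedure.

```python
import random

def stratified_by_confidence(confidences, n_instances, n_bins=3, seed=0):
    """Pick indices spread evenly over low-, mid-, and high-confidence bins."""
    rng = random.Random(seed)
    # Indices ordered from least to most confident
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    # Contiguous, near-equal slices of the sorted order
    bins = [
        order[(b * len(order)) // n_bins : ((b + 1) * len(order)) // n_bins]
        for b in range(n_bins)
    ]
    per_bin = n_instances // n_bins  # any remainder is dropped in this sketch
    chosen = []
    for group in bins:
        chosen.extend(rng.sample(group, min(per_bin, len(group))))
    return sorted(chosen)

# Made-up confidences (e.g. the maximum predicted class probability per instance)
confidences = [0.5 + 0.5 * i / 29 for i in range(30)]
selected = stratified_by_confidence(confidences, n_instances=12)
print(selected)
```

With 12 instances and three bins, each confidence band contributes four instances.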

Pareto Analysis

Find Pareto-optimal methods balancing multiple objectives:

from tscf_eval.benchmark import ParetoAnalyzer

analyzer = ParetoAnalyzer(metrics=[
    "validity_soft", "proximity_dtw", "sparsity",
])

# Dominance ranking
ranking = analyzer.dominance_ranking(results)
print(ranking)

# Visualize trade-offs
analyzer.plot_front(results, x_metric="validity_soft", y_metric="proximity_dtw")
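
For readers unfamiliar with Pareto dominance, the sketch below shows the underlying comparison in plain Python: one method dominates another if it is at least as good on every metric and strictly better on at least one. The helper names and the scores are made up for illustration and say nothing about how `ParetoAnalyzer` is implemented; the metric directions follow the Evaluation Metrics table.

```python
def to_maximize(value, direction):
    """Flip the sign of 'minimize' metrics so that larger is always better."""
    return value if direction == "maximize" else -value

def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores, directions):
    """Names of the methods that no other method dominates."""
    oriented = {
        name: [to_maximize(v, d) for v, d in zip(values, directions)]
        for name, values in scores.items()
    }
    return [
        name
        for name, vec in oriented.items()
        if not any(dominates(other, vec) for o, other in oriented.items() if o != name)
    ]

# Hypothetical scores in the order (validity_soft, proximity_dtw, sparsity)
directions = ["maximize", "maximize", "minimize"]
scores = {
    "method_a": [0.9, 0.4, 0.3],
    "method_b": [0.7, 0.6, 0.2],
    "method_c": [0.6, 0.3, 0.5],
}
print(pareto_front(scores, directions))  # ['method_a', 'method_b']
```

Here `method_c` is worse than `method_a` on all three metrics, while `method_a` and `method_b` trade validity against proximity and sparsity, so neither dominates the other.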

Weighted Scalarization

Since evaluation metrics often conflict (e.g., maximizing validity may reduce proximity), TSCFEval implements weighted sum scalarization for aggregating metrics into a composite score. Each metric is min-max normalized to [0, 1] with direction awareness, then combined via weighted sum:

from tscf_eval.benchmark import WeightedScalarizer

# Equal-weight composite across all metrics
scalarizer = WeightedScalarizer(metrics=[
    "validity_soft", "proximity_dtw", "sparsity",
])
scores = scalarizer.score(results)
print(scores)  # Ranked by composite score

# Custom weights emphasizing validity
scalarizer = WeightedScalarizer(
    metrics=["validity_soft", "proximity_dtw", "sparsity"],
    weights={"validity_soft": 3.0, "proximity_dtw": 1.0, "sparsity": 1.0},
)

# Sensitivity analysis: how ranking changes as one metric's weight varies
sens_df = scalarizer.sensitivity(results, vary_metric="validity_soft", n_steps=11)
scalarizer.plot_sensitivity(sens_df)
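
The arithmetic behind a direction-aware weighted sum can be sketched in a few lines of plain Python. Each metric is min-max scaled to [0, 1], "minimize" metrics are inverted so that 1 is always best, and the scaled values are combined with weights normalized to sum to 1. The function names, the tie-handling rule, and the scores are assumptions for illustration; `WeightedScalarizer` may handle details such as ties or weight normalization differently.

```python
def minmax_oriented(values, direction):
    """Scale to [0, 1] so that 1 is always the best value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)  # assumption: all-equal metrics get full credit
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if direction == "maximize" else [1.0 - s for s in scaled]

def composite_scores(table, directions, weights):
    """Weighted sum of normalized metrics per method."""
    methods = list(table)
    total_weight = sum(weights.values())
    scores = dict.fromkeys(methods, 0.0)
    for metric, direction in directions.items():
        column = minmax_oriented([table[m][metric] for m in methods], direction)
        for method, value in zip(methods, column):
            scores[method] += weights[metric] / total_weight * value
    return scores

# Hypothetical aggregated metrics for two methods
directions = {"validity_soft": "maximize", "proximity_dtw": "maximize", "sparsity": "minimize"}
table = {
    "method_a": {"validity_soft": 1.0, "proximity_dtw": 0.2, "sparsity": 0.5},
    "method_b": {"validity_soft": 0.5, "proximity_dtw": 0.8, "sparsity": 0.1},
}

equal = composite_scores(table, directions, {m: 1.0 for m in directions})
valid_heavy = composite_scores(
    table, directions, {"validity_soft": 3.0, "proximity_dtw": 1.0, "sparsity": 1.0}
)
print(equal)
print(valid_heavy)
```

With equal weights `method_b` ranks first (2/3 versus 1/3); tripling the validity weight reverses the order (0.6 for `method_a` versus 0.4). This kind of rank flip is what a sensitivity analysis over one metric's weight is meant to expose.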

Adding Your Own CF Method

To integrate your own counterfactual method with TSCFEval, inherit from the Counterfactual base class and implement the explain method:

from tscf_eval.counterfactuals.base import Counterfactual
import numpy as np

class MyCFMethod(Counterfactual):
    """Custom counterfactual generator."""

    def __init__(self, model, data, my_param=1.0):
        self.model = model
        self.X_ref, self.y_ref = data
        self.my_param = my_param

    def explain(self, x, y_pred=None):
        """Generate counterfactual for instance x.

        Parameters
        ----------
        x : np.ndarray
            Single instance, shape (T,) or (C, T).
        y_pred : int, optional
            Predicted label for x (computed if None).

        Returns
        -------
        cf : np.ndarray
            Counterfactual instance.
        cf_label : int
            Predicted label for the counterfactual.
        meta : dict
            Metadata about the generation process.
        """
        # Your CF generation logic here
        cf = x.copy()
        # ... modify cf ...

        cf_label = int(self.model.predict(cf[None, ...])[0])

        meta = {
            "method": "my_cf_method",
            "param_used": self.my_param,
        }

        return cf, cf_label, meta

Use your method with the evaluator and benchmarking tools:

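# Assumes clf, train, and test from the earlier examples
from tscf_eval import COMTE
from tscf_eval.benchmark import ExplainerConfig

# Direct evaluation
my_method = MyCFMethod(model=clf, data=(train.X, train.y))
cf, label, meta = my_method.explain(test.X[0])

# In benchmarks
configs = [
    ExplainerConfig("my_method", MyCFMethod, {"my_param": 2.0}),
    ExplainerConfig("comte", COMTE, {}),
]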
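
As one possible way to fill in the "Your CF generation logic here" placeholder, the sketch below returns the nearest reference instance whose label differs from the prediction (a nearest unlike neighbor). It is deliberately standalone so that it runs without TSCFEval installed: it does not inherit from `Counterfactual`, and `TinyOneNN` is a hypothetical stand-in for a real classifier. Only the `explain` contract (returning `cf`, `cf_label`, `meta`) mirrors the template above.

```python
import math

class TinyOneNN:
    """Hypothetical stand-in classifier: 1-nearest-neighbour, Euclidean distance."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        return [
            self.y[min(range(len(self.X)), key=lambda i: math.dist(x, self.X[i]))]
            for x in X
        ]

class NearestUnlikeNeighborCF:
    """Returns the closest reference instance with a different label."""

    def __init__(self, model, data):
        self.model = model
        self.X_ref, self.y_ref = data

    def explain(self, x, y_pred=None):
        if y_pred is None:
            y_pred = self.model.predict([x])[0]
        # (distance, index) for every reference instance of another class
        candidates = [
            (math.dist(x, ref), i)
            for i, (ref, label) in enumerate(zip(self.X_ref, self.y_ref))
            if label != y_pred
        ]
        if not candidates:
            raise ValueError("reference data has no instance with a different label")
        distance, index = min(candidates)
        cf = list(self.X_ref[index])
        cf_label = self.model.predict([cf])[0]
        meta = {"method": "nearest_unlike_neighbor", "nun_index": index, "distance": distance}
        return cf, cf_label, meta

# Toy data (made up): two short series per class
X_train = [[0, 0, 0], [0, 1, 0], [5, 5, 5], [6, 5, 6]]
y_train = [0, 0, 1, 1]
model = TinyOneNN().fit(X_train, y_train)

cf, cf_label, meta = NearestUnlikeNeighborCF(model, (X_train, y_train)).explain([1, 1, 1])
print(cf, cf_label, meta["nun_index"])  # [5, 5, 5] 1 2
```

A counterfactual produced this way is valid by construction whenever the classifier labels the reference instance correctly, but it is usually neither sparse nor close to the original, which is the trade-off the blending variants of instance-based methods try to improve on.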

Citation

If you use TSCFEval in your research, please cite our paper:

Zamith Santos, B., Andrade Lira, M. F., Cerri, R., & Cavalcante Prudêncio, R. B. (2026). TSCFEval: A Model-Agnostic Framework for Evaluating Time Series Classification Counterfactuals. In Explainable Artificial Intelligence. xAI 2026. Communications in Computer and Information Science. Springer, Cham.

@inproceedings{santos2026tscfeval,
  title     = {{TSCFEval}: A Model-Agnostic Framework for Evaluating Time Series Classification Counterfactuals},
  author    = {Zamith Santos, Bruna and Andrade Lira, Maira Farias and Cerri, Ricardo and Cavalcante Prud{\^{e}}ncio, Ricardo Bastos},
  booktitle = {Explainable Artificial Intelligence. xAI 2026},
  series    = {Communications in Computer and Information Science},
  publisher = {Springer, Cham},
  year      = {2026}
}

References

Counterfactual Methods

  • CELS: Li, P., Bahri, O., Boubrahimi, S. F., & Hamdi, S. M. (2023). CELS: Counterfactual Explanations for Time Series Data via Learned Saliency Maps. IEEE International Conference on Big Data 2023, pp. 1952-1957. [Paper] [Code]

  • CoMTE: Ates, E., Aksar, B., Leung, V. J., & Coskun, A. K. (2021). Counterfactual Explanations for Multivariate Time Series. ICAPAI 2021. [Paper] [Code]

  • NativeGuide: Delaney, E., Greene, D., & Keane, M. T. (2021). Instance-Based Counterfactual Explanations for Time Series Classification. ICCBR 2021. [Paper] [Code]

  • SETS: Bahri, O., Filali Boubrahimi, S., & Hamdi, S. M. (2022). Shapelet-Based Counterfactual Explanations for Multivariate Time Series. KDD-MiLeTS 2022. [Paper] [Code]

  • TSEvo: Hollig, J., Kulbach, C., & Thoma, S. (2022). TSEvo: Evolutionary Counterfactual Explanations for Time Series Classification. ICMLA 2022. [Paper] [Code]

  • Glacier: Wang, Z., Samsten, I., Miliou, I., Mochaourab, R., & Papapetrou, P. (2024). Glacier: Guided Locally Constrained Counterfactual Explanations for Time Series Classification. Machine Learning, 113(3). [Paper] [Code]

  • LatentCF++: Wang, Z., Samsten, I., Mochaourab, R., & Papapetrou, P. (2021). Learning Time Series Counterfactuals via Latent Space Representations. DS 2021. [Paper] [Code]

Evaluation Metrics

  • Validity: Li, P., Bahri, O., Boubrahimi, S. F., & Hamdi, S. M. (2023). CELS: Counterfactual Explanations for Time Series Data via Learned Saliency Maps. IEEE BigData 2023. [Paper]

  • Proximity: Delaney, E., Greene, D., & Keane, M. T. (2021). ICCBR 2021; Bahri, O., Boubrahimi, S. F., & Hamdi, S. M. (2022). Shapelet-Based Counterfactual Explanations for Multivariate Time Series.

  • Sparsity: Mothilal, R. K., Sharma, A., & Tan, C. (2020). Explaining Machine Learning Classifiers through Diverse Counterfactual Explanations. FAT* 2020. [Paper]

  • Plausibility (LOF): Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying Density-Based Local Outliers. ACM SIGMOD 2000. [Paper]

  • Plausibility (Isolation Forest): Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation Forest. ICDM 2008. [Paper]

  • Diversity: Mothilal, R. K., Sharma, A., & Tan, C. (2020). FAT* 2020; Kulesza, A., & Taskar, B. (2012). Determinantal Point Processes for Machine Learning. [Paper]

  • Composition, Contiguity: Delaney et al. (2021). ICCBR 2021; Ates et al. (2021). ICAPAI 2021.

  • Confidence: Le, T., Miller, T., Singh, R., & Sonenberg, L. (2023). Explaining model confidence using counterfactuals. AAAI 2023. [Paper]

  • Controllability: Verma, S., Boonsanong, V., Hoang, M., Hines, K. E., Dickerson, J. P., & Shah, C. (2024). Counterfactual Explanations for Machine Learning: Challenges Revisited. ACM Computing Surveys, 56(12), Article 304. [Paper]

  • Robustness: Ates et al. (2021). ICAPAI 2021.

  • Efficiency: Li, P., Bahri, O., Boubrahimi, S. F., & Hamdi, S. M. (2023). Attention-Based Counterfactual Explanation for Multivariate Time Series. DaWaK 2023. [Paper]

Data and Tools

  • UCR Archive: Dau, H. A., et al. (2019). The UCR Time Series Archive. IEEE/CAA Journal of Automatica Sinica, 6(6), 1293-1305. [Archive]

  • aeon: Middlehurst, M., et al. (2024). aeon: a Python toolkit for learning from time series. JMLR. [Code]


License

MIT License - see LICENSE.
