A Python library for spectral-zone-level explanations of machine learning models trained on spectral data (XRF, GRS, Raman, etc.)
SMX
This is the official repository for the spectral-model-explainer (SMX) library, an eXplainable AI tool designed to provide explanations for machine learning models trained on spectral data (e.g., XRF, GRS, Raman, and related modalities).
SMX is a post-hoc, global, model-agnostic framework that explains spectral-based ML classifiers directly in terms of expert-informed spectral zones. It aggregates each zone via PCA, formulates quantile-based logical predicates, estimates their relevance through perturbation experiments within stochastic subsamples, and integrates the results into a directed weighted graph whose global structure is summarized by Local Reaching Centrality. A distinctive feature is threshold spectrum reconstruction, which back-projects each predicate's decision boundary into the original spectral domain in natural measurement units, enabling practitioners to visually compare their spectra against the model-related boundaries.
Method Overview in the Library
The high-level workflow is implemented in the SMX pipeline class and can also be executed component-by-component through the public API:
- spectral zone extraction
- zone aggregation (typically PCA-based)
- predicate generation from quantiles
- bagging-based robustness evaluation
- predicate relevance scoring
- directed graph construction
- centrality-based ranking and optional mapping back to natural scale
This implementation allows both:
- end-to-end execution through a single pipeline object
- advanced control through direct use of dedicated classes/functions
Spectral Zone Construction
The method starts by partitioning the spectral axis into zones using extract_spectral_zones. Input spectra are expected as a DataFrame in which columns represent numeric spectral positions (energies, wavelengths, channels, etc.).
How zones must be provided
The cuts argument accepts multiple valid formats:
- (start, end)
- (name, start, end)
- (name, start, end, group)
- {name, start, end}
- {name, start, end, group}
Important behavior:
- boundaries are interpreted numerically and inclusively
- if start > end, the library automatically reorders them
- grouped cuts (same group) are concatenated into one merged zone
- non-grouped cuts are kept as independent zones
This flexibility enables both physically meaningful elemental regions and composite regions such as aggregated background segments.
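The inclusive slicing and group-merging semantics described above can be sketched with plain pandas. This is an illustration of the behavior, not the library's `extract_spectral_zones` implementation; `slice_zones` is a hypothetical helper:

```python
import numpy as np
import pandas as pd

def slice_zones(spectra: pd.DataFrame, cuts):
    """Select columns whose numeric position falls inside [start, end]
    (inclusive), merging cuts that share a group into one zone."""
    positions = spectra.columns.astype(float)
    merged = {}
    for name, start, end, *rest in cuts:
        if start > end:                      # reorder inverted boundaries
            start, end = end, start
        key = rest[0] if rest else name      # cuts with a group share a key
        mask = (positions >= start) & (positions <= end)
        merged.setdefault(key, []).extend(spectra.columns[mask])
    return {k: spectra[v] for k, v in merged.items()}

# Toy spectra: 4 samples x 6 channels at positions 10..60
X = pd.DataFrame(np.arange(24).reshape(4, 6), columns=[10, 20, 30, 40, 50, 60])
zones = slice_zones(X, [("F1", 10, 30), ("bg1", 40, 40, "bg"), ("bg2", 60, 50, "bg")])
print(sorted(zones))      # ['F1', 'bg']
print(zones["bg"].shape)  # (4, 3): two grouped cuts merged into one zone
```

Note how the two "bg" cuts collapse into a single composite zone, and the inverted (60, 50) boundaries are silently reordered.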
Predicate Construction from Zone Scores
After extraction, each zone is transformed into one scalar score per sample (default strategy: PC1 score via ZoneAggregator(method="pca")). These zone-level summaries are the basis for predicate generation.
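The default PC1 aggregation can be sketched with a plain NumPy SVD. This is an illustrative stand-in for `ZoneAggregator(method="pca")`, not the library's code; `pc1_scores` is a hypothetical helper:

```python
import numpy as np

def pc1_scores(zone: np.ndarray) -> np.ndarray:
    """Project a (n_samples, n_channels) zone onto its first principal
    component, yielding one scalar score per sample."""
    centered = zone - zone.mean(axis=0)
    # SVD of the centered data: the first right-singular vector is the PC1 loading
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

rng = np.random.default_rng(0)
zone = rng.normal(size=(50, 8))   # 50 samples, 8 channels in one zone
scores = pc1_scores(zone)
print(scores.shape)  # (50,)
```

Because the projection is taken on centered data, the resulting scores have zero mean by construction.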
PredicateGenerator creates binary threshold predicates from a user-defined set of quantiles. For each zone and each quantile value q, two complementary predicates are produced:
- zone <= threshold(q)
- zone > threshold(q)
Therefore, if k quantiles are provided, the initial candidate set is 2k predicates per zone (before duplicate removal). Duplicate rules are automatically removed when quantiles collapse to identical threshold values.
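The 2k-candidates-per-zone rule and the duplicate collapse can be sketched as follows. This is an illustration of the logic, not the `PredicateGenerator` API; `quantile_predicates` is a hypothetical helper:

```python
import numpy as np

def quantile_predicates(scores: np.ndarray, quantiles):
    """Build complementary threshold predicates from a quantile grid,
    dropping the pair when a quantile collapses to an existing threshold."""
    preds, seen = [], set()
    for q in quantiles:
        t = float(np.quantile(scores, q))
        if t in seen:        # identical threshold -> duplicate rule pair
            continue
        seen.add(t)
        preds.append(("<=", t))
        preds.append((">", t))
    return preds

# With 3 quantiles the candidate set would be 6 predicates, but here the
# 0.25 and 0.50 quantiles both give threshold 1.0, so one pair is dropped.
scores = np.array([1.0, 1.0, 1.0, 5.0])
preds = quantile_predicates(scores, [0.25, 0.50, 0.75])
print(len(preds))  # 4
```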
Bagging and Robustness Hyperparameters
SMX estimates predicate robustness through repeated bagging cycles. In the high-level pipeline, this is controlled primarily by:
- n_bags: number of bags generated per repetition (seed)
- n_repetitions: number of independent repetitions (seed loop)
- n_samples_fraction: fraction of samples drawn in each bag
- quantiles: quantile grid that defines predicate thresholds
Operationally:
- each repetition creates a new random context for bag generation
- each bag evaluates which predicates are sufficiently supported by sampled data
- predicates with very low support in a bag are discarded for that bag
- final rankings are aggregated across valid repetitions to reduce seed sensitivity
This design makes the explanation less dependent on a single random split and more representative of stable decision behavior.
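The repetition/bag sampling loop above can be sketched with NumPy. This is an illustrative mock of the sampling scheme, not the pipeline's internals; `bagged_indices` is a hypothetical helper:

```python
import numpy as np

def bagged_indices(n_samples, n_bags, n_repetitions, fraction, replace=False):
    """Yield (repetition, bag, indices): each repetition opens a new random
    context, and each bag draws a fraction of the samples from it."""
    size = int(round(fraction * n_samples))
    for rep in range(n_repetitions):
        rng = np.random.default_rng(rep)   # fresh random context per repetition
        for bag in range(n_bags):
            yield rep, bag, rng.choice(n_samples, size=size, replace=replace)

bags = list(bagged_indices(n_samples=100, n_bags=10, n_repetitions=4, fraction=0.8))
print(len(bags))         # 40 = 4 repetitions x 10 bags
print(bags[0][2].shape)  # (80,) = 0.8 * 100 samples per bag
```

Aggregating predicate scores across these 40 index sets is what decouples the final ranking from any single random split.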
Predicate Relevance and Graph Construction
Within each bag, predicates are ranked by an importance metric based on perturbation experiments:
- perturbation-based relevance (PerturbationMetric), using a fitted estimator
PredicateGraphBuilder then constructs a directed graph from ranked predicates:
- consecutive predicates in a ranking induce directed edges
- edge weights are accumulated across bags
- terminal class nodes are linked from last predicates in each path
- bidirectional conflicts are resolved by keeping the stronger direction (ties are randomized)
- edge weighting can incorporate zone-level explained variance from PCA (var_exp=True), which constrains the graph structure to reflect both predictive relevance and variance importance of zones
Finally, the graph is summarized through Local Reaching Centrality (LRC), producing a ranked list of influential predicates/zones. Accordingly, the final output is a DataFrame of predicates ranked by their LRC scores, together with their corresponding natural-scale thresholds and zone information. This lets practitioners identify which spectral zones and thresholds most influence the model's decisions, revealing the underlying spectral features that drive predictions. Beyond identifying relevant zones, the predicates' threshold values, which live in PCA space, are back-projected to the original domain as per-zone multivariate thresholds that can be overlaid on measured spectra, translating an abstract condition into a physically readable boundary. SMX thus goes beyond numerical importances by delivering condition-aware, subset-aware explanations that support validation, hypothesis generation, and more actionable domain conclusions.
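The unweighted form of Local Reaching Centrality, i.e. the fraction of the other nodes reachable from a node by directed paths, can be sketched in pure Python. This is an illustration of the ranking principle; the library's weighted variant over the accumulated edge weights may differ:

```python
from collections import deque

def local_reaching_centrality(graph, node):
    """Fraction of the other nodes reachable from `node` via directed
    edges (breadth-first search over an adjacency-dict graph)."""
    seen, queue = {node}, deque([node])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return (len(seen) - 1) / (len(graph) - 1)

# Toy predicate graph: two predicate chains ending in a terminal class node
g = {"a": ["b"], "b": ["class_1"], "c": ["class_1"], "class_1": []}
ranking = sorted(g, key=lambda n: local_reaching_centrality(g, n), reverse=True)
print(ranking[0])  # 'a': it reaches 2 of the 3 other nodes
```

Predicates that sit higher upstream in the decision structure reach more of the graph and therefore rank higher.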
Model Compatibility Note
At the current stage, SMX is primarily designed for use with scikit-learn-style estimators. In practical terms, this means that when the perturbation-based relevance strategy is employed, the estimator passed to the pipeline is expected to be already fitted and to expose the standard prediction interface required by the selected perturbation metric.
More specifically, the minimum requirement is a valid predict method. In addition, some perturbation metrics require richer interfaces: probability_shift requires predict_proba, while decision_function_shift requires decision_function. Consequently, any model class that follows this contract can be integrated in a technically consistent manner, independently of the specific learning algorithm (for example, SVMs, tree ensembles, linear models, and related scikit-learn-compatible estimators).
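The interface contract above can be verified with simple duck typing before running the pipeline. This is an illustrative check written against the requirements stated here, not an SMX function; `check_estimator_interface` and `Dummy` are hypothetical:

```python
def check_estimator_interface(estimator, perturbation_metric):
    """Verify the estimator exposes predict() plus any extra method the
    chosen perturbation metric requires."""
    required = {"probability_shift": "predict_proba",
                "decision_function_shift": "decision_function"}
    if not callable(getattr(estimator, "predict", None)):
        raise TypeError("estimator must implement predict()")
    extra = required.get(perturbation_metric)
    if extra and not callable(getattr(estimator, extra, None)):
        raise TypeError(f"{perturbation_metric} requires {extra}()")

class Dummy:
    """Minimal estimator satisfying the probability_shift contract."""
    def predict(self, X): return [0] * len(X)
    def predict_proba(self, X): return [[0.5, 0.5]] * len(X)

check_estimator_interface(Dummy(), "probability_shift")  # passes silently
```

The same check would raise a TypeError for "decision_function_shift", since Dummy does not expose decision_function.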
Ongoing development is focused on extending this compatibility layer beyond the current scikit-learn-centric workflow, with the objective of supporting additional model ecosystems and API styles in Python while preserving methodological consistency and interpretability guarantees.
Installation and Optional Plotting Dependency
SMX is intentionally distributed with a lightweight core dependency set, where visualization is treated as an optional capability rather than a mandatory runtime requirement. This design ensures that users interested exclusively in methodological analysis (zone extraction, predicate construction, bagging, graph construction, and centrality-based ranking) can install and execute the framework without incurring additional graphical dependencies.
Base installation:
pip install spectral-model-explainer
Installation with plotting support:
pip install "spectral-model-explainer[plotting]"
In practical terms, the plotting extra enables functions that generate interactive visual outputs (for example, threshold-spectrum overlays used to inspect reconstructed multivariate decision boundaries in the natural spectral domain). The analytical SMX pipeline remains fully functional without this extra.
If plotting routines are invoked in an environment where the plotting extra has not been installed, SMX raises an explicit import-related error with installation guidance. This behavior is intentional: it preserves minimal installation overhead for non-visual workflows while providing clear and immediate feedback when visualization features are requested.
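The optional-dependency behavior described above follows a common lazy-import pattern, sketched here for illustration (this mirrors the described behavior, not SMX's actual internals; `require_optional` is a hypothetical helper):

```python
import importlib

def require_optional(module_name, extra):
    """Import an optional module, or raise an ImportError that names the
    pip extra to install (the guidance-on-failure pattern)."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{module_name} is required for plotting. Install it with:\n"
            f'    pip install "spectral-model-explainer[{extra}]"'
        ) from err

math_mod = require_optional("math", "plotting")  # stdlib module: succeeds
```

When the requested module is absent, the raised error carries the exact install command, which is the "clear and immediate feedback" behavior the paragraph describes.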
Easy Usage
import pandas as pd
from sklearn.svm import SVC
from smx import SMX
# X_cal_prep: preprocessed calibration spectra (DataFrame)
# X_cal_natural: original calibration spectra before preprocessing (DataFrame)
# y_cal_labels: class labels for calibration samples (Series)
spectral_cuts = [
    ("F1", 1.0, 100.0),
    ("background", 100.0, 200.0, "background_group"),
    ("F2", 200.0, 300.0),
]
model = SVC(kernel="rbf", probability=True, random_state=42)
model.fit(X_cal_prep, y_cal_labels)
# Example: probability of the first class as continuous output
y_pred_cal = model.predict_proba(X_cal_prep)[:, 0]
smx = SMX(
    spectral_cuts=spectral_cuts,
    quantiles=[0.25, 0.50, 0.75],
    n_repetitions=4,
    n_bags=10,
    n_samples_fraction=0.8,
    replace=False,
    metric="perturbation",
    estimator=model,
    perturbation_mode="median",
    perturbation_metric="probability_shift",
)
smx.fit(X_cal_prep, y_pred_cal, X_cal_natural=X_cal_natural)
# Main result (ranked predicates with natural-scale thresholds)
results = smx.lrc_natural_
print(results.head())
For a complete, executable walkthrough with synthetic data and visualization outputs, see the quickstart notebook.
License
This project is licensed under the MIT License. See the LICENSE file for details.