Hierarchical Variance-Retaining Transformer (HVRT) — variance-aware sample transformation for tabular data

HVRT: Hierarchical Variance-Retaining Transformer

Variance-aware sample transformation for tabular data: reduce, expand, or augment.

Overview

HVRT partitions a dataset into variance-homogeneous regions via a decision tree fitted on a synthetic extremeness target, then applies a configurable per-partition operation (selection for reduction, sampling for expansion). The tree is fitted once; reduce(), expand(), and augment() all draw from the same fitted model.

Operation	Method	Description
Reduce	`model.reduce(ratio=0.3)`	Select a geometrically diverse representative subset
Expand	`model.expand(n=50000)`	Generate synthetic samples via per-partition KDE or other strategy
Augment	`model.augment(n=15000)`	Concatenate original data with synthetic samples

Algorithm

1. Z-score normalisation

X_z = (X - μ) / σ   per feature

Categorical features are integer-encoded then z-scored.
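The normalisation step can be sketched in plain NumPy. This is an illustration rather than the library's code: the helper name zscore_columns, the use of np.unique for integer encoding, and the choice to map constant columns to zero are all assumptions.

```python
import numpy as np

def zscore_columns(X, categorical_cols=()):
    """Z-score each column; categorical columns are integer-encoded first."""
    X = np.asarray(X, dtype=object)
    out = np.empty(X.shape, dtype=float)
    for j in range(X.shape[1]):
        col = X[:, j]
        if j in categorical_cols:
            # Integer-encode: each distinct category gets a code 0..k-1
            _, codes = np.unique(col.astype(str), return_inverse=True)
            col = codes
        col = col.astype(float)
        sigma = col.std()
        # Assumption: constant columns carry no signal and become zeros
        out[:, j] = (col - col.mean()) / sigma if sigma > 0 else 0.0
    return out

X = [[1.0, "a"], [2.0, "b"], [3.0, "a"], [4.0, "c"]]
X_z = zscore_columns(X, categorical_cols={1})
```

After this step every non-constant column has zero mean and unit standard deviation, which is what the target construction in step 2 relies on.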

2. Synthetic target construction

HVRT — sum of normalised pairwise feature interactions:

For all feature pairs (i, j):
  interaction = X_z[:,i] ⊙ X_z[:,j]
  normalised  = (interaction - mean) / std
target = sum of all normalised interaction columns        O(n · d²)

FastHVRT — sum of z-scores per sample:

target_i = Σ_j  X_z[i, j]                               O(n · d)
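Both targets are short to write down in NumPy. The sketch below follows the pseudocode above under two assumptions that the README does not spell out: pairs are taken as unordered (i < j), and an interaction column with zero standard deviation is skipped. The function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def hvrt_target(X_z):
    """Sum of z-scored pairwise interaction columns, O(n * d^2)."""
    n, d = X_z.shape
    target = np.zeros(n)
    for i, j in combinations(range(d), 2):   # assumption: unordered pairs, i < j
        inter = X_z[:, i] * X_z[:, j]        # element-wise product
        std = inter.std()
        if std > 0:                          # skip degenerate (constant) columns
            target += (inter - inter.mean()) / std
    return target

def fast_hvrt_target(X_z):
    """Row-wise sum of z-scores, O(n * d)."""
    return X_z.sum(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X_z = (X - X.mean(axis=0)) / X.std(axis=0)
t_full = hvrt_target(X_z)
t_fast = fast_hvrt_target(X_z)
```

The pairwise version visits d(d-1)/2 column pairs, which is where the quadratic cost in d comes from; the fast version is a single pass over the matrix.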

3. Partitioning

A DecisionTreeRegressor is fitted on the synthetic target. Leaves form variance-homogeneous partitions. Tree depth and leaf size are auto-tuned to dataset size.
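A minimal sketch of the partitioning step with scikit-learn follows. The leaf count and leaf size are illustrative constants, not the library's auto-tuned values, and mapping n_partitions onto max_leaf_nodes is an assumption based on its description as "max tree leaves".

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_z = rng.normal(size=(500, 5))
target = X_z.sum(axis=1)              # FastHVRT-style target from step 2

tree = DecisionTreeRegressor(
    max_leaf_nodes=18,                # illustrative; the library auto-tunes this
    min_samples_leaf=10,              # illustrative
    random_state=42,
).fit(X_z, target)

partition_ids = tree.apply(X_z)       # leaf index for every sample
unique_partitions, sizes = np.unique(partition_ids, return_counts=True)
```

Each distinct value in partition_ids identifies one leaf, so the per-partition operations in step 4 reduce to grouping rows by that array.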

4. Per-partition operations

Reduce: Select representatives within each partition using the chosen selection strategy. Budget is proportional to partition size (variance_weighted=False) or biased toward high-variance partitions (variance_weighted=True).

Expand: Draw synthetic samples within each partition using the chosen generation strategy. Budget allocation follows the same logic.
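The budget split can be illustrated as follows. Proportional allocation is as described above; the variance-weighted branch uses size × variance as the weight, which is an assumed rule chosen for illustration, since the README only says the budget is biased toward high-variance partitions. The helper name is hypothetical.

```python
import numpy as np

def allocate_budgets(sizes, total, variances=None):
    """Split `total` across partitions by size, optionally scaled by variance."""
    weights = np.asarray(sizes, dtype=float)
    if variances is not None:
        # Assumed weighting: size x variance. The library's exact rule may differ.
        weights = weights * np.asarray(variances, dtype=float)
    shares = weights / weights.sum() * total
    budgets = np.floor(shares).astype(int)
    # Hand the rounding remainder to the largest fractional parts
    leftover = total - budgets.sum()
    order = np.argsort(-(shares - budgets))
    budgets[order[:leftover]] += 1
    return budgets

sizes = [100, 50, 50]
proportional = allocate_budgets(sizes, total=60)
weighted = allocate_budgets(sizes, total=60, variances=[0.5, 1.0, 3.0])
```

With these inputs the third partition receives a larger share under the weighted rule than under the proportional one, while the total stays fixed at 60.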

Installation

pip install hvrt

git clone https://github.com/hotprotato/hvrt.git
cd hvrt
pip install -e .

Quick Start

from hvrt import HVRT, FastHVRT

# Fit once — reduce and expand from the same model
model = HVRT(random_state=42).fit(X_train, y_train)   # y optional
X_reduced, idx = model.reduce(ratio=0.3, return_indices=True)
X_synthetic    = model.expand(n=50000)
X_augmented    = model.augment(n=15000)

# FastHVRT — O(n·d) target; preferred for expansion
model = FastHVRT(random_state=42).fit(X_train)
X_synthetic = model.expand(n=50000)

API Reference

`HVRT`

from hvrt import HVRT

model = HVRT(
    n_partitions=None,           # Max tree leaves; auto-tuned if None
    min_samples_leaf=None,       # Min samples per leaf; auto-tuned if None
    y_weight=0.0,                # 0.0 = unsupervised; 1.0 = y drives splits
    bandwidth='auto',            # KDE bandwidth: 'auto' (default), float, 'scott', 'silverman'
    auto_tune=True,
    random_state=42,
    # Pipeline params (see Pipeline section)
    reduce_params=None,
    expand_params=None,
    augment_params=None,
)

Target: sum of normalised pairwise feature interactions. O(n · d²). Preferred for reduction.

`FastHVRT`

from hvrt import FastHVRT

model = FastHVRT(bandwidth='auto', random_state=42)

Target: sum of z-scores. O(n · d). Equivalent quality to HVRT for expansion. All constructor parameters identical to HVRT.

`HVRTOptimizer`

Requires: pip install hvrt[optimizer]

from hvrt import HVRTOptimizer

opt = HVRTOptimizer(
    n_trials=30,             # Optuna trials; use ≥50 in production
    n_jobs=1,                # Parallel trials (-1 = all cores)
    cv=3,                    # Cross-validation folds for the objective
    expansion_ratio=5.0,     # Synthetic-to-real ratio during evaluation
    task='auto',             # 'auto', 'regression', 'classification'
    timeout=None,            # Wall-clock time limit in seconds
    random_state=None,
    verbose=0,               # 0 = silent, 1 = Optuna trial progress
)
opt = opt.fit(X, y)          # y enables TSTR Δ objective; required for classification

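Performs TPE-based Bayesian optimisation over n_partitions, min_samples_leaf, y_weight, kernel / bandwidth, and variance_weighted. The objective is TSTR Δ: the train-on-synthetic, test-on-real score minus the train-on-real baseline. The HVRT defaults are always evaluated as trial 0 (warm start), so the best objective value found can only match or improve on the default configuration.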

Post-fit attributes:

Attribute	Type	Description
`best_score_`	float	Best mean TSTR Δ across CV folds
`best_params_`	dict	Best constructor kwargs (`n_partitions`, `min_samples_leaf`, `y_weight`, `bandwidth`)
`best_expand_params_`	dict	Best expand kwargs (`variance_weighted`, optionally `generation_strategy`)
`best_model_`	HVRT	Refitted on the full dataset using `best_params_`
`study_`	optuna.Study	Full Optuna study for visualisation and diagnostics

After fitting:

opt = HVRTOptimizer(n_trials=50, n_jobs=4, cv=3, random_state=42).fit(X, y)
print(f'Best TSTR Δ: {opt.best_score_:+.4f}')
print(f'Best params: {opt.best_params_}')

X_synth = opt.expand(n=50000)         # y column stripped automatically
X_aug   = opt.augment(n=len(X) * 5)   # originals + synthetic

expand() and augment() strip the appended y column, returning arrays with the same number of columns as the training X.

`fit`

model.fit(X, y=None, feature_types=None)
# feature_types: list of 'continuous' or 'categorical' per column

`reduce`

X_reduced = model.reduce(
    n=None,                  # Absolute target count
    ratio=None,              # Proportional (e.g. 0.3 = keep 30%)
    method='fps',            # Selection strategy; see Selection Strategies
    variance_weighted=True,  # Oversample high-variance partitions
    return_indices=False,
    n_partitions=None,       # Override tree granularity for this call only
)

`expand`

X_synth = model.expand(
    n=10000,
    variance_weighted=False,      # True = oversample tails
    bandwidth=None,               # Override instance bandwidth; accepts float, 'auto', 'scott'
    adaptive_bandwidth=False,     # Scale bandwidth with local expansion ratio
    generation_strategy=None,     # See Generation Strategies
    return_novelty_stats=False,
    n_partitions=None,
)

adaptive_bandwidth=True uses per-partition bandwidth bw_p = scott_p × max(1, budget_p/n_p)^(1/d).
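The formula is easy to evaluate by hand. The sketch below assumes scott_p is Scott's factor n_p^(-1/(d+4)), the quantity scipy.stats.gaussian_kde uses for bw_method='scott'; the README does not define it explicitly, so treat that as an assumption. The function name is hypothetical.

```python
def adaptive_bandwidth(n_p, budget_p, d):
    """bw_p = scott_p * max(1, budget_p / n_p) ** (1 / d)."""
    # Assumption: scott_p is Scott's factor n_p ** (-1 / (d + 4)),
    # the same quantity scipy.stats.gaussian_kde uses for bw_method='scott'.
    scott_p = n_p ** (-1.0 / (d + 4))
    return scott_p * max(1.0, budget_p / n_p) ** (1.0 / d)

d = 6
no_growth = adaptive_bandwidth(n_p=25, budget_p=25, d=d)    # ratio 1: plain Scott
ten_fold = adaptive_bandwidth(n_p=25, budget_p=250, d=d)    # ratio 10: widened
shrink = adaptive_bandwidth(n_p=25, budget_p=5, d=d)        # ratio < 1: clamped to 1
```

The max(1, ...) clamp means the bandwidth never drops below the Scott value when a partition is asked for fewer samples than it holds, and the 1/d exponent keeps the widening modest in higher dimensions.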

`augment`

X_aug = model.augment(n=15000, variance_weighted=False)
# n must exceed len(X); returns original X concatenated with (n - len(X)) synthetic samples

Utility methods

partitions = model.get_partitions()
# [{'id': 5, 'size': 120, 'mean_abs_z': 0.84, 'variance': 1.2}, ...]

novelty = model.compute_novelty(X_new)   # min z-space distance per point

params = HVRT.recommend_params(X)        # {'n_partitions': 180, ...}
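For intuition, a "min z-space distance per point" score can be sketched as a nearest-neighbour distance after z-scoring with the training statistics. Euclidean distance and the brute-force pairwise computation are assumptions made for illustration; the library may use a different metric or an index structure.

```python
import numpy as np

def min_z_distance(X_train, X_new):
    """Distance from each new point to its nearest training point, in z-space."""
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    Z_train = (X_train - mu) / sigma
    Z_new = (X_new - mu) / sigma          # reuse the training statistics
    # Pairwise Euclidean distances, shape (n_new, n_train)
    diffs = Z_new[:, None, :] - Z_train[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
X_new = np.vstack([X_train[:2], X_train[:1] + 5.0])
novelty = min_z_distance(X_train, X_new)
```

Exact copies of training rows score zero, while a point shifted far from the data scores well above zero, which is the behaviour a novelty check on synthetic samples would look for.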

sklearn Pipeline

Operation parameters are declared at construction time via ReduceParams, ExpandParams, or AugmentParams. The tree is fitted once during fit(); transform() calls the corresponding operation.

from hvrt import HVRT, FastHVRT, ReduceParams, ExpandParams, AugmentParams
from sklearn.pipeline import Pipeline

# Reduce
pipe = Pipeline([('hvrt', HVRT(reduce_params=ReduceParams(ratio=0.3)))])
X_red = pipe.fit_transform(X, y)

# Expand
pipe = Pipeline([('hvrt', FastHVRT(expand_params=ExpandParams(n=50000)))])
X_synth = pipe.fit_transform(X)

# Augment
pipe = Pipeline([('hvrt', HVRT(augment_params=AugmentParams(n=15000)))])
X_aug = pipe.fit_transform(X)

Alternatively, import from hvrt.pipeline to make the intent explicit:

from hvrt.pipeline import HVRT, ReduceParams

ReduceParams

ReduceParams(
    n=None,
    ratio=None,              # e.g. 0.3
    method='fps',
    variance_weighted=True,
    return_indices=False,
    n_partitions=None,
)

ExpandParams

ExpandParams(
    n=50000,                 # required
    variance_weighted=False,
    bandwidth=None,
    adaptive_bandwidth=False,
    generation_strategy=None,
    return_novelty_stats=False,
    n_partitions=None,
)

AugmentParams

AugmentParams(
    n=15000,                 # required; must exceed len(X)
    variance_weighted=False,
    n_partitions=None,
)

Generation Strategies

from hvrt import FastHVRT, epanechnikov, univariate_kde_copula

model = FastHVRT(random_state=42).fit(X)

# By name
X_synth = model.expand(n=10000, generation_strategy='epanechnikov')

# By reference
X_synth = model.expand(n=10000, generation_strategy=univariate_kde_copula)

# Custom callable
def my_strategy(X_z, partition_ids, unique_partitions, budgets, random_state):
    ...
    return X_synthetic   # shape (sum(budgets), n_features), z-score space

X_synth = model.expand(n=10000, generation_strategy=my_strategy)

Strategy	Behaviour	Notes
`'multivariate_kde'`	`scipy.stats.gaussian_kde` on all features jointly. Uses instance `bandwidth`.	Captures full joint covariance
`'epanechnikov'`	Product Epanechnikov kernel, Ahrens-Dieter sampling. Bounded support.	Recommended for classification; ≥5× ratios
`'univariate_kde_copula'`	Per-feature 1-D KDE marginals + Gaussian copula.	More flexible per-feature marginals
`'bootstrap_noise'`	Resample with replacement + Gaussian noise at 10% of per-feature std.	Fastest; no distributional assumptions
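As a concrete instance of the custom-callable signature, here is a sketch that mimics the 'bootstrap_noise' behaviour from the table. It assumes that budgets is aligned element-wise with unique_partitions, that random_state is an integer seed or None, and that the 10% noise scale is taken from the per-partition feature standard deviation. None of these details are confirmed by the README.

```python
import numpy as np

def bootstrap_noise_strategy(X_z, partition_ids, unique_partitions, budgets, random_state):
    """Per partition: resample rows with replacement, then add small Gaussian noise."""
    rng = np.random.default_rng(random_state)
    chunks = []
    for pid, budget in zip(unique_partitions, budgets):
        members = X_z[partition_ids == pid]
        picks = members[rng.integers(0, len(members), size=budget)]
        # Noise at 10% of the per-feature std (assumed: measured within the partition)
        noise = rng.normal(size=picks.shape) * 0.10 * members.std(axis=0)
        chunks.append(picks + noise)
    return np.vstack(chunks)              # shape (sum(budgets), n_features)

rng = np.random.default_rng(1)
X_z = rng.normal(size=(60, 3))
partition_ids = np.repeat([0, 1, 2], 20)
X_synthetic = bootstrap_noise_strategy(X_z, partition_ids, [0, 1, 2], [5, 10, 15], 42)
```

A function with this shape could be passed as generation_strategy, provided the assumptions about the argument types hold for the installed version.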

from hvrt import BUILTIN_GENERATION_STRATEGIES
list(BUILTIN_GENERATION_STRATEGIES)
# ['multivariate_kde', 'univariate_kde_copula', 'bootstrap_noise', 'epanechnikov']

Selection Strategies

from hvrt import HVRT

model = HVRT(random_state=42).fit(X, y)

X_red = model.reduce(ratio=0.2, method='fps')             # default
X_red = model.reduce(ratio=0.2, method='medoid_fps')
X_red = model.reduce(ratio=0.2, method='variance_ordered')
X_red = model.reduce(ratio=0.2, method='stratified')

# Custom callable
def my_selector(X_z, partition_ids, unique_partitions, budgets, random_state):
    ...
    return selected_indices   # global indices into X

X_red = model.reduce(ratio=0.2, method=my_selector)

Strategy	Behaviour
`'fps'` / `'centroid_fps'`	Greedy Furthest Point Sampling seeded at partition centroid. Default.
`'medoid_fps'`	FPS seeded at the partition medoid.
`'variance_ordered'`	Select samples with highest local k-NN variance (k=10).
`'stratified'`	Random sample within each partition.
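Greedy furthest point sampling inside a single partition can be sketched as below. Seeding with the real sample nearest to the centroid is an assumption about what "seeded at partition centroid" means, and the function name is hypothetical.

```python
import numpy as np

def centroid_fps(points, k):
    """Pick k row indices by greedy furthest point sampling, seeded near the centroid."""
    centroid = points.mean(axis=0)
    # Seed: the real sample closest to the centroid (assumed seeding rule)
    first = int(np.argmin(np.linalg.norm(points - centroid, axis=1)))
    selected = [first]
    # Distance from every point to its nearest already-selected point
    nearest = np.linalg.norm(points - points[first], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(nearest))     # furthest from the current selection
        selected.append(nxt)
        nearest = np.minimum(nearest, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(selected)

rng = np.random.default_rng(0)
partition = rng.normal(size=(50, 2))
idx = centroid_fps(partition, k=10)
```

Each step adds the point that is furthest from everything chosen so far, which is why the selected subset spreads across the partition instead of clustering around its mode. The sketch assumes k does not exceed the number of points.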

Recommendations

Findings from a systematic bandwidth and kernel benchmark across 6 datasets, 3 expansion ratios (2×/5×/10×), and 11 methods (see benchmarks/bandwidth_benchmark.py and findings.md).

`bandwidth='auto'` — the default

bandwidth='auto' is the default and requires no tuning for most datasets. At each expand() call it inspects the fitted partition structure and picks the kernel most likely to produce high-quality synthetic data:

model = HVRT().fit(X)          # bandwidth='auto' by default
X_synth = model.expand(n=50000)  # auto chooses at call-time

How it decides:

At call-time, 'auto' computes the mean number of samples per partition and compares it against a feature-scaled threshold: max(15, 2 × n_continuous_features).

Condition	Chosen kernel	Reason
mean partition size ≥ threshold	Narrow Gaussian `h=0.1`	Enough samples for stable multivariate covariance estimation; tight kernel stays within partition geometry
mean partition size < threshold	Epanechnikov product kernel	Too few samples for reliable covariance; product kernel requires no covariance matrix and bounded support keeps samples within the local region

The threshold scales with dimensionality because the minimum samples needed for a non-degenerate d-dimensional covariance matrix grows with d. At 5 features the threshold is 15; at 15 features it is 30.
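The decision rule above fits in a few lines. This sketch assumes the mean partition size is simply n_samples / n_partitions; the returned labels are illustrative strings, not library constants.

```python
def auto_kernel(n_samples, n_partitions, n_continuous_features):
    """Return the kernel the documented 'auto' rule would pick."""
    threshold = max(15, 2 * n_continuous_features)
    mean_partition_size = n_samples / n_partitions
    if mean_partition_size >= threshold:
        return "gaussian_h0.1"      # label is illustrative, not a library constant
    return "epanechnikov"

coarse = auto_kernel(n_samples=500, n_partitions=18, n_continuous_features=6)
fine = auto_kernel(n_samples=500, n_partitions=200, n_continuous_features=6)
wide = auto_kernel(n_samples=500, n_partitions=20, n_continuous_features=15)
```

With 500 samples and 6 features, 18 partitions average about 28 samples each and stay above the threshold of 15, whereas 200 partitions average 2.5 and fall below it. At 15 features the threshold rises to 30, so even 25 samples per partition is no longer enough for the Gaussian branch.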

Why not just always use one or the other:

Benchmarking across 4 regression datasets showed a clean crossover depending on partition size. With the default auto-tuned partition count (typically 15–20 partitions at n=500), partitions hold ~25 samples and narrow Gaussian wins on TSTR. But when partitions are finer — either because the dataset is large and the auto-tuner produces more leaves, or because n_partitions is manually increased — Gaussian KDE degrades as partitions become too small for stable covariance estimation, while Epanechnikov holds steady or improves. For example, on the housing dataset (d=6) at 10× expansion:

Partition count	Gaussian `h=0.1` TSTR	Epanechnikov TSTR
auto (~18)	+0.004	−0.014
50	−0.033	−0.008
100	−0.037	−0.011
200	−0.080	−0.008

The crossover point depends on dimensionality: higher-dimensional datasets shift it earlier. On multimodal (d=10), Epanechnikov wins from 30 partitions onward (mean partition size ~13 at n=500). On housing (d=6) and emergence_divergence (d=5), the crossover is ~50 partitions. This is because higher dimensionality makes a d×d covariance matrix harder to estimate stably from small samples, while Epanechnikov is always covariance-free.

'auto' captures this automatically: when you call expand(n_partitions=200), 'auto' sees the resulting small partition sizes and switches to Epanechnikov without any manual intervention.

When to override 'auto':

- Heterogeneous / high-skew classification task (mean |skew| ≳ 0.8): use generation_strategy='epanechnikov' directly. Epanechnikov wins consistently when within-partition data is non-Gaussian. On near-Gaussian classification data, bandwidth='auto' (h=0.10) or adaptive_bandwidth=True is competitive or better, particularly at 2×–5× expansion ratios.
- Small dataset, coarse partitions, regression: use bandwidth=0.1 or bandwidth=0.3, an explicit narrow Gaussian, if you know partition sizes are large and correlation structure matters.
- Diagnostic / ablation: pass explicit values (bandwidth=0.3, bandwidth='scott') to isolate the bandwidth effect.

Why Scott's rule underperforms

Scott's rule is AMISE-optimal for iid Gaussian data. HVRT partitions, while locally more homogeneous than the global distribution, are not Gaussian enough for this to hold (mean |skewness| 0.49–1.37 across benchmark datasets). More importantly, the decision tree already captures the primary variance structure of each partition, so the residual within-partition variance is narrower than Scott's formula assumes. The result is systematic over-smoothing: synthetic samples bleed across partition boundaries and dilute the local density structure. Scott's rule won 0 of 18 benchmark conditions.

Wide bandwidths (≥ 0.75) are actively harmful. They produce synthetic data that degrades downstream ML models (TSTR Δ as low as −0.75 R²). Discriminator accuracy can paradoxically improve with wide bandwidths on regression — a metric artifact where spreading matches marginals while destroying joint structure. Use TSTR as the primary quality signal, not disc_err.
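To put rough numbers on the over-smoothing argument, the snippet below evaluates Scott's factor, n^(-1/(d+4)), as scipy.stats.gaussian_kde defines it for bw_method='scott'. It assumes that a float bandwidth in HVRT sits on the same scale as that factor, which the README implies but does not state.

```python
def scott_factor(n, d):
    """Scott's rule factor as used by scipy.stats.gaussian_kde: n ** (-1 / (d + 4))."""
    return n ** (-1.0 / (d + 4))

# Partition sizes in the range discussed above, for d = 6 features
factors = {n: scott_factor(n, d=6) for n in (10, 25, 100)}
ratio_to_narrow = {n: f / 0.1 for n, f in factors.items()}   # versus h = 0.1
```

For partitions of 10 to 100 samples in six dimensions the factor lands between roughly 0.6 and 0.8, several times wider than the h=0.1 narrow Gaussian and close to the range flagged above as harmful.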

Partition granularity

If 'auto' is already in use, increasing n_partitions will automatically trigger the switch to Epanechnikov when partition sizes fall below the threshold. You can also set it explicitly:

# Finer partitions — 'auto' will pick Epanechnikov when sizes drop below threshold
model.expand(n=50000, n_partitions=150)

# Or fix at construction time
model = HVRT(n_partitions=150, min_samples_leaf=10).fit(X)

Benchmark evidence (regression datasets, 5×/10× expansion ratios):

Dataset (d)	At auto (~18 parts) best TSTR	At 150 parts Epan TSTR
housing (d=6)	h=0.30: −0.001	−0.013
multimodal (d=10)	h=0.30: +0.004	+0.001
emergence_divergence (d=5)	h=0.10: +0.007	+0.004
emergence_bifurcation (d=5)	h=0.10: −0.022	−0.118

Note: for the emergence_bifurcation dataset (where the same feature region maps to a bimodal target), all methods remain significantly negative at any partition count. This indicates a structural limit: if the same X values correspond to multiple distinct y outcomes, expansion without conditioning on y cannot reproduce that structure. In such cases consider conditioning expansion on y directly (e.g., expand class-conditional subsets separately).
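Class-conditional expansion can be sketched generically. Here expand_fn is a hypothetical callable that takes a feature subset and a sample count; the bootstrap stand-in exists only so the example runs without the library installed.

```python
import numpy as np

def expand_per_class(X, y, expand_fn, n_per_class):
    """Expand each class subset on its own and attach the class label to the output."""
    X_parts, y_parts = [], []
    for label in np.unique(y):
        X_c = X[y == label]
        X_s = expand_fn(X_c, n_per_class)     # e.g. fit a model on X_c, then expand
        X_parts.append(X_s)
        y_parts.append(np.full(len(X_s), label))
    return np.vstack(X_parts), np.concatenate(y_parts)

def bootstrap_expand(X_c, n, seed=0):
    """Stand-in generator (hypothetical): resample rows with replacement."""
    rng = np.random.default_rng(seed)
    return X_c[rng.integers(0, len(X_c), size=n)]

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.repeat([0, 1], 20)
X_syn, y_syn = expand_per_class(X, y, bootstrap_expand, n_per_class=50)
```

With the library, expand_fn could be something like lambda X_c, n: FastHVRT(random_state=42).fit(X_c).expand(n=n). For a continuous y, the same pattern applies after binning y into quantiles.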

Hyperparameter optimisation (HPO)

Dataset heterogeneity is the primary driver of how sensitive synthetic quality is to HVRT's parameters. A well-behaved, near-Gaussian dataset with few sub-populations produces good synthetic data at defaults with little room to improve. A dataset with distinct clusters, non-linear interactions, or regime-switching needs finer partitions to achieve local homogeneity within each leaf — and the optimal settings are dataset-specific.

Benchmark evidence: on near-Gaussian data (fraud, housing at auto partition count), TSTR varied by less than 0.01 across all bandwidth candidates. On heterogeneous datasets (emergence_divergence, emergence_bifurcation), TSTR varied by up to 0.20+ between the best and worst methods at the same partition count. If your data is heterogeneous, HPO pays; if it is well-behaved, defaults are sufficient.

When HPO is worth running:

- TSTR Δ is significantly negative on your downstream task (below −0.05 is a useful rule of thumb).
- Your dataset has known sub-populations, clusters, non-linear interactions, or regime changes (e.g., different dynamics at different feature values).
- You are generating at a high ratio (10×+), where compounding errors matter more.

Parameter search space:

Parameter	Default	Suggested search	Effect
`n_partitions`	auto	`None`, 20, 30, 50, 75, 100	Primary lever. More partitions → finer local homogeneity. Start here.
`min_samples_leaf`	auto	5, 10, 15, 20	Controls auto-tuner floor; lower allows finer splits when n is large.
`bandwidth`	`'auto'`	`'auto'`, 0.05, 0.10, 0.30, `epanechnikov`	`'auto'` is usually near-optimal once partition count is right.
`variance_weighted`	`False`	`True`, `False`	`True` oversamples high-variance partitions; useful for tail-heavy distributions.
`y_weight`	0.0	0.1, 0.3, 0.5	Weights target in synthetic target; helps when y governs sub-population identity.

Evaluation metric: Use TSTR Δ (train-on-synthetic, test-on-real minus train-on-real baseline) as the HPO objective. Discriminator accuracy (disc_err) is structurally insensitive — wide bandwidths can lower it by spreading marginals while destroying joint structure. TSTR directly measures what matters: can a model trained on synthetic data perform as well as one trained on real data?

Example HPO loop:

Use HVRTOptimizer for automated Bayesian optimisation with Optuna (install the optional extra first: pip install hvrt[optimizer]):

from hvrt import HVRTOptimizer

opt = HVRTOptimizer(n_trials=50, n_jobs=4, cv=3, random_state=42).fit(X, y)
print(f'Best TSTR Δ: {opt.best_score_:+.4f}')
print(f'Best params: {opt.best_params_}')

X_synth = opt.expand(n=50000)        # uses tuned kernel + params
X_aug   = opt.augment(n=len(X) * 5)  # originals + synthetic

HVRTOptimizer searches over n_partitions, min_samples_leaf, y_weight, kernel / bandwidth, and variance_weighted using TPE sampling. The train-on-real, test-on-real (TRTR) baseline is pre-computed once, which halves GBM fitting overhead. After tuning, best_model_ is refitted on the full dataset.

For a custom objective or manual grid search:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import numpy as np
from hvrt import HVRT

# Assumes X (n, d) and y (n,) are already loaded; y is a regression target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

def tstr_delta(n_partitions, bandwidth, variance_weighted=False, seed=42):
    # Append y as the last column so it is expanded jointly with the features.
    # column_stack accepts a 1-D y directly (NumPy array or pandas Series).
    XY_tr = np.column_stack([X_tr, y_tr])
    model = HVRT(n_partitions=n_partitions, bandwidth=bandwidth,
                 random_state=seed).fit(XY_tr)
    XY_s = model.expand(n=len(X_tr) * 5, variance_weighted=variance_weighted)
    X_s, y_s = XY_s[:, :-1], XY_s[:, -1]
    # Train-on-real baseline and train-on-synthetic score, both on the real test set
    trtr = r2_score(y_te, GradientBoostingRegressor(
                        random_state=seed).fit(X_tr, y_tr).predict(X_te))
    tstr = r2_score(y_te, GradientBoostingRegressor(
                        random_state=seed).fit(X_s, y_s).predict(X_te))
    return tstr - trtr

# Small grid over the two parameters with the largest effect
best_score, best_cfg = float('-inf'), {}
for n_parts in [None, 30, 50, 100]:   # None = let auto-tune decide
    for bw in ['auto', 0.10, 0.30]:
        score = tstr_delta(n_partitions=n_parts, bandwidth=bw)
        if score > best_score:
            best_score, best_cfg = score, {'n_partitions': n_parts, 'bandwidth': bw}

print(f'Best TSTR Δ={best_score:+.4f}  params={best_cfg}')

Recommended tuning sequence:

1. Run with defaults. Establish a baseline TSTR Δ. If it is close to zero, stop.
2. Sweep n_partitions. This has the largest effect on heterogeneous data. Try None (auto), 20, 30, 50, 75, 100. More partitions only help when n is large enough; a rule of thumb is at least 10–15 real samples per partition.
3. Check bandwidth. With 'auto', HVRT already picks the right kernel for the resulting partition size. If you have prior knowledge (classification → prefer 'epanechnikov'; regression with large partitions → prefer 0.10), override it.
4. Try variance_weighted=True if your dataset has a long tail or rare events you want the expansion to oversample.
5. If TSTR remains poor at any partition count, the dataset likely has inherently unpredictable local structure (e.g., the same feature region maps to multiple distinct outcomes). Consider conditioning: split by y quantile or class and expand each subset independently.

What not to try: Expanding synthetically and re-fitting HVRT on that output ("two-phase pipeline") to manufacture fine partitions does not improve TSTR. Phase 1 Gaussian smoothing introduces distribution drift that Phase 2 amplifies, and the net TSTR is worse than single-phase at the auto partition count. Finer partitions must come from more real data.

Benchmarks

Sample reduction

Metric: GBM ROC-AUC on reduced training set as % of full-training-set AUC. n=3 000 train / 2 000 test, seed=42.

Scenario	Retention	HVRT-fps	HVRT-yw	Random	Stratified
Well-behaved (Gaussian, no noise)	10%	97.1%	98.1%	96.9%	98.0%
Well-behaved (Gaussian, no noise)	20%	98.7%	98.9%	98.3%	99.0%
Noisy labels (20% random flip)	10%	96.1%	91.1%	93.3%	90.4%
Noisy labels (20% random flip)	20%	95.2%	95.9%	93.1%	93.1%
Heavy-tail + label noise + junk features	30%	98.2%	98.2%	94.3%	95.2%
Rare events (5% positive class)	10%	98.0%	99.4%	86.5%	94.1%
Rare events (5% positive class)	20%	98.0%	100.4%	97.9%	99.0%

HVRT-fps: method='fps', variance_weighted=True. HVRT-yw: same + y_weight=0.3.

Reproduce: python benchmarks/reduction_denoising_benchmark.py

Synthetic data expansion

Metric: discriminator accuracy (target 50% = indistinguishable), marginal KS fidelity, tail MSE. bandwidth=0.5, synthetic-to-real ratio 1×.

Method	Marginal Fidelity	Discriminator	Tail Error	Fit time
HVRT	0.974	49.6%	0.004	0.07 s
Gaussian Copula	0.998	49.4%	0.017	0.02 s
GMM (k=10)	0.989	49.2%	0.093	1.06 s
Bootstrap + Noise	0.994	49.7%	0.131	0.00 s
SMOTE	1.000	48.6%	0.000	0.00 s
CTGAN†	0.920	55.8%	0.500	45 s
TVAE†	0.940	53.5%	0.450	40 s
TabDDPM†	0.960	52.0%	0.300	120 s
MOSTLY AI†	0.975	51.0%	0.150	60 s

† Published numbers. Discriminator = 50% is ideal. Tail error = 0 is ideal.

Reproduce: python benchmarks/run_benchmarks.py --tasks expand

Benchmarking Scripts

python benchmarks/run_benchmarks.py
python benchmarks/run_benchmarks.py --tasks reduce --datasets adult housing
python benchmarks/run_benchmarks.py --tasks expand
python benchmarks/reduction_denoising_benchmark.py
python benchmarks/adaptive_kde_benchmark.py
python benchmarks/adaptive_full_benchmark.py
python benchmarks/heart_disease_benchmark.py      # requires: pip install ctgan
python benchmarks/bootstrap_failure_benchmark.py
python benchmarks/hpo_benchmark.py               # HPO vs defaults, nested CV (requires: pip install hvrt[optimizer])
python benchmarks/hpo_benchmark.py --quick       # 3 datasets, 10 trials, fast mode

Backward Compatibility

The v1 API is still importable:

from hvrt import HVRTSampleReducer, AdaptiveHVRTReducer

reducer = HVRTSampleReducer(reduction_ratio=0.2, random_state=42)
X_reduced, y_reduced = reducer.fit_transform(X, y)

The mode constructor parameter is deprecated. Replace with params objects:

# Deprecated
HVRT(mode='reduce')

# Replacement
HVRT(reduce_params=ReduceParams(ratio=0.3))

Testing

pytest
pytest --cov=hvrt --cov-report=term-missing

Citation

@software{hvrt2026,
  author = {Peace, Jake},
  title  = {HVRT: Hierarchical Variance-Retaining Transformer},
  year   = {2026},
  url    = {https://github.com/hotprotato/hvrt}
}

License

GNU Affero General Public License v3 or later (AGPL-3.0-or-later) — see LICENSE.

Acknowledgments

Development assisted by Claude (Anthropic).

hvrt 2.4.0
