CRDA - Causal Residual Data Augmentation
A novel data augmentation methodology that improves regression model performance by generating informed synthetic training examples through residual-guided feature perturbation.
Installation
From PyPI (Recommended)
pip install crda
From Source
git clone https://github.com/mhmohebbi/CRDA-package.git
cd CRDA-package
pip install -e .
Development Installation
pip install -e ".[dev]"
Quick Start
Basic Usage
from crda import CRDA, Config
from xgboost import XGBRegressor
# Configure the experiment
config = Config(
dataset="path/to/your/data.csv",
dataset_name="my_dataset",
random_seed=42,
verbose=True
)
# Create CRDA instance
crda = CRDA(config)
# Run the augmentation pipeline with your model
results = crda.run(XGBRegressor())
# View results
print(f"Original Score: {results['score'].values[0]:.4f}")
print(f"Augmented Score: {results['aug_score'].values[0]:.4f}")
print(f"Improvement: {results['delta_score'].values[0]:.2f}%")
Using a DataFrame
import pandas as pd
from crda import CRDA, Config
from sklearn.ensemble import RandomForestRegressor
# Load your data
df = pd.read_csv("data.csv")
# Configure with DataFrame directly
config = Config(
dataset=df,
dataset_name="my_data",
verbose=True
)
# Run with any sklearn-compatible regressor
crda = CRDA(config)
results = crda.run(RandomForestRegressor(n_estimators=100))
Usage Modes
CRDA supports multiple usage patterns to fit different workflows:
1. Standard Mode (Dataset from Config)
The simplest way to use CRDA is to provide the dataset in the config:
config = Config(dataset="data.csv", dataset_name="my_data")
crda = CRDA(config)
results = crda.run(XGBRegressor())
2. Pre-Split Data Mode
Provide your own train/test splits:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
config = Config(dataset_name="my_data", skip_preprocess=True)
crda = CRDA(config)
results = crda.run(XGBRegressor(), X_train, y_train, X_test, y_test)
3. Pre-Trained Model Mode
Use an already-trained model - CRDA will skip training and continue from residual calculation:
# Train your model externally with custom hyperparameters
my_model = XGBRegressor(n_estimators=200, max_depth=5, learning_rate=0.05)
my_model.fit(X_train, y_train)
# CRDA uses the pre-trained model (skips preprocessing and training)
config = Config(dataset_name="my_data")
crda = CRDA(config)
results = crda.run(my_model, X_train, y_train, X_test, y_test, pretrained=True)
4. Get Augmented Data Only
Just retrieve the augmented data without running the full evaluation pipeline:
config = Config(dataset="data.csv", dataset_name="my_data")
crda = CRDA(config)
# Get only the augmented samples
aug_X, aug_y = crda.get_augmented_data(XGBRegressor())
# Or get combined (original + augmented) data
combined_X, combined_y, aug_X, aug_y = crda.get_augmented_data(
XGBRegressor(), return_combined=True
)
5. With Categorical Feature Indices
When providing pre-split numpy arrays with one-hot encoded categorical features:
# Columns 0-2 are one-hot encoded "color", columns 3-5 are "size"
cat_indices = {
"color": [0, 1, 2],
"size": [3, 4, 5]
}
config = Config(dataset_name="my_data")
crda = CRDA(config)
results = crda.run(
XGBRegressor(),
X_train, y_train, X_test, y_test,
cat_indices=cat_indices
)
Note: If `cat_indices` is not provided when using pre-split data, all features are treated as continuous.
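When the one-hot columns come from pandas, the index mapping can be derived programmatically rather than written by hand. A minimal sketch (the DataFrame and column names here are illustrative examples, not part of the CRDA API):

```python
import pandas as pd

# Illustrative data; "color" and "size" are the original categorical columns.
df = pd.DataFrame({
    "color": ["red", "green", "blue"],
    "size": ["S", "M", "L"],
})
encoded = pd.get_dummies(df, columns=["color", "size"])

# Map each original column to the positions of its one-hot columns.
cat_indices = {
    col: [i for i, c in enumerate(encoded.columns) if c.startswith(col + "_")]
    for col in ["color", "size"]
}
# cat_indices == {"color": [0, 1, 2], "size": [3, 4, 5]}
```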
6. Reuse CRDA with Different Models
The same CRDA instance can be reused with different models:
config = Config(dataset="data.csv", dataset_name="my_data")
crda = CRDA(config)
# Try XGBoost
results_xgb = crda.run(XGBRegressor())
# Try RandomForest with same config
results_rf = crda.run(RandomForestRegressor())
7. With Hyperparameter Tuning
from sklearn.neural_network import MLPRegressor
config = Config(
dataset="data.csv",
dataset_name="my_data",
crda_param_tune=True, # Enable Optuna-based tuning
random_seed=42,
save_params=True,
save_models=True,
)
crda = CRDA(config)
results = crda.run(MLPRegressor(hidden_layer_sizes=(100, 50)))
Dataset Format
CRDA expects tabular data where:
- All columns except the last are features (numerical or categorical)
- The last column is the target variable (must be numerical/continuous)
- Supported formats: CSV, Excel (.xlsx), JSON, pickle (.pkl), or pandas DataFrame
feature1,feature2,feature3,target
1.2,3.4,5.6,10.5
2.1,4.3,6.5,12.3
...
The dataset is automatically preprocessed (unless skip_preprocess=True):
- Duplicate rows are removed
- Missing values are dropped
- Categorical features are one-hot encoded
- Continuous features are standardized (mean=0, std=1)
- Target variable is normalized to [0, 1]
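These preprocessing steps can be reproduced outside the package. A rough pandas equivalent (a sketch of the documented behavior, not the package's actual code; the toy data is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x1": [1.0, 2.0, 2.0, 4.0, np.nan],
    "cat": ["a", "b", "b", "a", "a"],
    "target": [10.0, 20.0, 20.0, 30.0, 40.0],
})

df = df.drop_duplicates().dropna()        # remove duplicates, drop missing rows
df = pd.get_dummies(df, columns=["cat"])  # one-hot encode categoricals
y = df.pop("target")                      # last column is the target

# Standardize continuous features (mean=0, std=1)
df["x1"] = (df["x1"] - df["x1"].mean()) / df["x1"].std(ddof=0)

# Normalize the target to [0, 1]
y = (y - y.min()) / (y.max() - y.min())
```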
Configuration Options
Core Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset` | str or DataFrame | None | Path to data file or pandas DataFrame. Optional if providing splits to `run()`. |
| `dataset_name` | str | "my_dataset" | Name identifier for the experiment. |
| `skip_preprocess` | bool | False | Skip preprocessing and use the data as-is. |
| `evaluation_metric` | str | "mse" | Metric for evaluation: "mse", "rmse", or "r2". |
| `random_seed` | int | 0 | Random seed for reproducibility. |
| `test_size` | float | 0.2 | Proportion of data for testing. |
Augmentation Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `aug_data_size_factor` | float | 1.0 | Multiplier for augmented data size. |
| `max_n_features_to_perturb` | int | 5 | Maximum number of features to perturb. |
| `max_perturb_percent` | float | 0.1 | Maximum perturbation (+10%). |
| `min_perturb_percent` | float | -0.1 | Minimum perturbation (-10%). |
| `crda_param_tune` | bool | False | Enable Optuna hyperparameter tuning for CRDA parameters. |
Statistical Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `indep_test_threshold` | float | 0.05 | P-value threshold for the independence test. |
| `p_wilcoxon_threshold` | float | 0.05 | Significance threshold for the Wilcoxon test. |
| `ignore_filter` | bool | False | Proceed even if CRDA appears to produce poor results with the augmented data. |
Output Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `results_dir` | str | "./runs" | Directory for experiment results. |
| `save_models` | bool | False | Save trained model artifacts. |
| `save_params` | bool | False | Save optimized parameters. |
| `verbose` | bool | False | Print logs to console. |
| `log_file` | str | None | Path for log file output. |
API Reference
CRDA
The main class for running causal residual data augmentation.
from crda import CRDA, Config
# Initialize with config only (model passed to methods)
crda = CRDA(config)
# Run full pipeline
results = crda.run(model)
# Or just get augmented data
aug_X, aug_y = crda.get_augmented_data(model)
Constructor:
CRDA(config: Config)
Methods:
run()
def run(
model, # sklearn-compatible regressor
X_train: np.ndarray = None, # Optional: training features
y_train: np.ndarray = None, # Optional: training targets
X_test: np.ndarray = None, # Optional: test features
y_test: np.ndarray = None, # Optional: test targets
pretrained: bool = False, # Is model already trained?
cat_indices: dict = None # Categorical column indices
) -> pd.DataFrame | None
Execute the full CRDA pipeline.
get_augmented_data()
def get_augmented_data(
model, # sklearn-compatible regressor
X_train: np.ndarray = None, # Optional: training features
y_train: np.ndarray = None, # Optional: training targets
X_test: np.ndarray = None, # Optional: test features
y_test: np.ndarray = None, # Optional: test targets
pretrained: bool = False, # Is model already trained?
cat_indices: dict = None, # Categorical column indices
return_combined: bool = False # Include original data in return?
) -> tuple  # (aug_X, aug_y), or (combined_X, combined_y, aug_X, aug_y) if return_combined=True
Generate augmented data without running full evaluation.
Config
Configuration management for experiments.
from crda import Config
# Full config
config = Config(
dataset="data.csv",
dataset_name="example",
evaluation_metric="mse",
random_seed=42
)
# Minimal config (for pre-split data)
config = Config(evaluation_metric="rmse", random_seed=42)
# dataset_name defaults to "my_dataset"
Methods:
- `to_dict()` - Convert config to dictionary
- `from_dict(d)` - Create config from dictionary
AbstractDataset
Dataset handling and preprocessing.
from crda import AbstractDataset
# From file or DataFrame
dataset = AbstractDataset("name", "data.csv", seed=42)
X, y = dataset.preprocess()
X_train, X_test, y_train, y_test = dataset.split()
# From pre-split numpy arrays
dataset = AbstractDataset.from_splits(
name="my_data",
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
cat_indices={"color": [0, 1, 2]} # optional
)
BaselineRegressor
Wrapper for sklearn-compatible regressors.
from crda import BaselineRegressor
from xgboost import XGBRegressor
regressor = BaselineRegressor(XGBRegressor())
regressor.train(X_train, y_train)
predictions = regressor.predict(X_test)
mse = regressor.evaluate(X_test, y_test, metric="mse")
Results
The run() method returns a pandas DataFrame with:
| Column | Description |
|---|---|
| `dataset` | Dataset name |
| `seed` | Random seed used |
| `score` | Baseline model test score (MSE, RMSE, or R², based on config) |
| `aug_score` | Augmented model test score |
| `delta_score` | Percent improvement (positive = better) |
| `p_wilcoxon` | Statistical significance p-value |
| `should_proceed` | Whether augmentation was beneficial |
| `features_perturbed` | Names of the perturbed features |
Method Overview
CRDA (Causal Residual Data Augmentation) improves regression models through:
- Residual Analysis: Compute prediction residuals from baseline model
- Causal Filtering: Identify features uncorrelated with residuals and conditionally independent of target
- Selective Perturbation: Perturb filtered features to create interventional data
- Counterfactual Targets: Generate targets using residual patterns
- Augmented Training: Train new model on combined original + augmented data
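As an illustration only, the five steps can be sketched end to end with a simplified correlation filter in place of CRDA's conditional-independence and Wilcoxon tests; this is not the package's implementation, and the data is synthetic:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# 1. Residual analysis: fit a baseline model, compute residuals
baseline = LinearRegression().fit(X, y)
residuals = y - baseline.predict(X)

# 2. Filtering (simplified): keep features whose correlation with the
#    residuals is not statistically significant
safe = [j for j in range(X.shape[1])
        if pearsonr(X[:, j], residuals).pvalue >= 0.05]

# 3. Selective perturbation: scale the filtered features by +/-10%
X_aug = X.copy()
for j in safe:
    X_aug[:, j] *= 1.0 + rng.uniform(-0.1, 0.1, size=len(X_aug))

# 4. Counterfactual targets: baseline prediction plus the original residual
y_aug = baseline.predict(X_aug) + residuals

# 5. Augmented training: retrain on original + augmented data
augmented = LinearRegression().fit(
    np.vstack([X, X_aug]), np.concatenate([y, y_aug]))
```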
Key Innovation
Unlike traditional augmentation, which blindly perturbs features, CRDA uses causal reasoning to select features that can be safely modified without corrupting the underlying data-generating process (keeping residuals invariant).
Supported Models
CRDA works with any sklearn-compatible regressor:
- Tree-based: XGBoost, LightGBM, CatBoost, RandomForest, GradientBoosting
- Neural Networks: MLPRegressor, PyTorch models (with sklearn wrapper)
- Linear Models: Ridge, Lasso, ElasticNet, LinearRegression
- Others: SVR, KNeighborsRegressor, etc.
Requirements
- Python >= 3.8
- numpy >= 1.24.0
- pandas >= 2.0.0
- scikit-learn >= 1.3.0
- scipy >= 1.10.0
- torch >= 2.0.0
- optuna >= 3.0.0
- joblib >= 1.3.0
- hyppo >= 0.4.0
License
MIT License - see LICENSE for details.
Citation
If you use CRDA in your research, please cite:
@software{crda2026,
author = {Mohebbi, Hossein},
title = {CRDA: Causal Residual Data Augmentation for Regression},
year = {2026},
url = {https://github.com/mhmohebbi/CRDA-package}
}