DeepVariance Python AutoML SDK — LLM-driven pipelines for tabular ML and image classification
DeepVariance SDK
DeepVariance is a Python AutoML SDK that combines LLM-driven code generation with AutoGluon to automatically cast, clean, sample, preprocess, and train ML models on any tabular dataset — with a single pipeline.run() call.
Table of Contents
- How it works
- Requirements
- Installation
- Configuration
- Quickstart
- Pipeline output
- PipelineConfig reference
- Progress callbacks
- Build
- Development
- Documentation
How it works
The MLPipeline executes 7 sequential layers against your DataFrame:
| # | Layer | Type | What it does |
|---|---|---|---|
| 1 | AutoCastLayer | LLM → code | Infers and applies column types, encodes categoricals |
| 2 | DataProfilingLayer | Deterministic | Computes feature + target statistics |
| 3 | CorrelationLayer | Deterministic | Pearson correlation matrix + mutual information scores |
| 4 | SamplingLayer | LLM → code | Produces a stratified, representative sample |
| 5 | PreprocessingLayer | LLM → code | Generates and applies pandas transforms (imputation, scaling, …) |
| 6 | ModelRecommendationLayer | LLM → recommendation | Selects the best AutoGluon model codes for your task |
| 7 | ModelTrainingLayer | Deterministic | Trains and evaluates a TabularPredictor, returns metrics |
LLM-driven layers use a retry loop — if the generated code raises an exception, the error is fed back to the LLM for self-correction.
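The retry loop described above can be sketched as follows. This is an illustrative reconstruction, not the SDK's actual internals — `run_with_retries`, `generate_code`, and `execute` are hypothetical names:

```python
def run_with_retries(generate_code, execute, max_retries=3):
    """Ask the LLM for code, run it, and feed the error back on failure.

    generate_code(feedback) -> str: produces code; receives the previous
        error message (or None on the first attempt) for self-correction.
    execute(code): applies the generated code to the data.
    """
    feedback = None
    for attempt in range(max_retries):
        code = generate_code(feedback)      # LLM call (stubbed in tests)
        try:
            return execute(code)            # apply the generated transform
        except Exception as exc:
            # Capture the failure so the next LLM call can correct itself
            feedback = f"attempt {attempt + 1} failed: {exc!r}"
    raise RuntimeError(f"all {max_retries} attempts failed; last error: {feedback}")
```

The key design point is that the exception text, not just a failure flag, is returned to the LLM, which is what makes self-correction possible.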
Requirements
- Python ≥ 3.12
- A DeepVariance API key — email founders@deepvariance.com or fill the contact form at deepvariance.com
- An OpenAI or Groq API key
Installation
pip install deepvariance-sdk
Dependencies installed automatically: pandas, numpy, scipy, scikit-learn, psutil, openai, groq, autogluon.tabular, torch, torchvision
Dev install (from source)
git clone https://github.com/deepvariance/deepvariance-sdk
cd deepvariance-sdk
uv venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
uv pip install -e ".[dev]" # installs all deps + pytest, ruff, cython
Configuration
The SDK reads credentials from environment variables. Set them in your shell before running:
export DV_API_KEY=your-deepvariance-api-key
export OPENAI_API_KEY=sk-...
export GROQ_API_KEY=gsk_... # fallback if OpenAI key is absent
The SDK resolves LLM providers in order: OpenAI → Groq. You only need one.
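The resolution order amounts to logic like the following sketch (the helper name `resolve_provider` is illustrative; only the env-var names come from this README):

```python
import os

def resolve_provider(env=None):
    """Return which LLM provider would be selected, OpenAI first, then Groq."""
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("GROQ_API_KEY"):
        return "groq"   # fallback when no OpenAI key is present
    raise RuntimeError("set OPENAI_API_KEY or GROQ_API_KEY")
```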
Optional: load from a .env file (local dev)
python-dotenv is not required by the SDK, but it is a convenient way to manage keys during local development.
pip install python-dotenv
Create a .env file at the project root (see .env.example):
# .env
DV_API_KEY=dv_...
OPENAI_API_KEY=sk-...
GROQ_API_KEY=gsk_...
Then load it at the top of your script, before constructing PipelineConfig:
from dotenv import load_dotenv
load_dotenv()  # reads .env into os.environ

import os
from deepvariance.pipelines.ml import MLPipeline
from deepvariance.typings import PipelineConfig

config = PipelineConfig(
    dv_api_key=os.getenv("DV_API_KEY"),
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)
Never commit your .env file — add it to .gitignore. The .env.example file in the repo root shows all available environment variables.
Quickstart
import os
import pandas as pd
from deepvariance.pipelines.ml import MLPipeline
from deepvariance.typings import PipelineConfig
# 1. Load your data
data = pd.read_csv("your_dataset.csv")
# 2. Configure
config = PipelineConfig(
    dv_api_key=os.getenv("DV_API_KEY"),
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    groq_api_key=os.getenv("GROQ_API_KEY"),
    sample_percentage=0.1,  # train on a 10% stratified sample
)
# 3. Run
pipeline = MLPipeline(config=config)
result = pipeline.run(data, target="your_target_column")
# 4. Inspect results
print(result["metrics"])
print(result["leaderboard"])
Run the bundled examples directly:
# Binary classification — Australia weather dataset
.venv/bin/python examples/ml_quickstart.py
# Regression — medical insurance dataset
.venv/bin/python examples/insurance_regression.py
Pipeline output
pipeline.run() returns a dict:
| Key | Type | Description |
|---|---|---|
| metrics | dict[str, float] | Accuracy, F1, ROC-AUC, RMSE, R², … (task-dependent) |
| model | TabularPredictor | Trained AutoGluon predictor |
| leaderboard | pd.DataFrame | All candidate models ranked by validation score |
| feature_importance | pd.DataFrame \| None | Feature importance scores from the best model |
| run_stats | dict | Wall-clock duration and peak memory per layer |
Classification metrics
accuracy, f1_macro, f1_weighted, precision_macro, precision_weighted, recall_macro, recall_weighted, cohen_kappa, mcc, roc_auc (binary) / roc_auc_ovr (multiclass), log_loss
Regression metrics
rmse, mae, r2, median_ae, max_error, explained_var, mape
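Because the keys of result["metrics"] are task-dependent, a consumer can tell the task apart by which metrics are present. A minimal sketch, using the key names listed above (the helper `task_type` is illustrative, not part of the SDK):

```python
def task_type(metrics: dict) -> str:
    """Infer the task from the metric keys returned by pipeline.run()."""
    # Regression runs report rmse/mae/r2; classification runs report
    # accuracy/f1_macro/... — rmse is a reliable discriminator.
    return "regression" if "rmse" in metrics else "classification"
```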
PipelineConfig reference
@dataclass
class PipelineConfig:
    dv_api_key: str | None = None           # DeepVariance API key (or set DV_API_KEY env var)
    openai_api_key: str | None = None       # OpenAI API key
    groq_api_key: str | None = None         # Groq API key (fallback)
    sample_percentage: float | None = None  # e.g. 0.1 → 10% sample fed to AutoGluon
    extra: dict[str, Any] = field(default_factory=dict)  # pipeline-specific overrides
sample_percentage controls the fraction of rows passed to AutoGluon after the LLM sampling stage. For large datasets (> 100k rows) a value of 0.1–0.2 keeps training fast while preserving distribution.
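To make the arithmetic concrete, here is the back-of-the-envelope effect of the setting on a large dataset (the numbers are hypothetical):

```python
rows = 500_000            # hypothetical dataset size
sample_percentage = 0.15  # within the 0.1-0.2 range suggested above
sampled = int(rows * sample_percentage)  # rows that reach AutoGluon training
print(sampled)
```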
Progress callbacks
Pass an on_progress callable to get real-time stage updates:
def on_progress(stage: str, status: str) -> None:
    # stage  — e.g. "AutoCastLayer", "ModelTrainingLayer"
    # status — "start" | "complete" | "error"
    icon = {"start": "▶", "complete": "✓", "error": "✗"}.get(status, "·")
    print(f"  {icon} {stage}: {status}")
result = pipeline.run(data, target="label", on_progress=on_progress)
Build
The release wheel compiles all source to native C extensions via Cython — no Python source is included in the distributed package.
# Install build dependencies (one-time)
uv pip install -e ".[dev]"
# Compile extensions in-place (for local dev / running tests against .so)
just build-ext
# Build a release wheel (compiled .so only, no .py source)
just build-wheel
# → dist/deepvariance_sdk-1.0.0-cp312-cp312-macosx_10_9_universal2.whl
For CI, build on each target platform (macOS arm64, Linux x86_64) and upload all wheels to PyPI so users get the right binary for their machine.
Documentation
The project includes Sphinx-based documentation under the docs/ directory. To build the HTML locally:
# install docs dependencies (optional group)
uv pip install -e ".[docs]" # or the equivalent: pip install -e ".[docs]"
cd docs
make html # requires make; or run `sphinx-build -b html . _build/html`
The generated site will appear in docs/_build/html/index.html.
See docs/quickstart.rst for a getting-started guide and docs/api.rst for an auto-generated API reference.
Development
# Run tests
.venv/bin/python -m pytest tests/ -q
# Lint
.venv/bin/ruff check src/ tests/
# Format
.venv/bin/ruff format src/ tests/
All lint rules are configured in pyproject.toml under [tool.ruff].