DeepVariance Python AutoML SDK — LLM-driven pipelines for tabular ML and image classification


DeepVariance SDK

DeepVariance is a Python AutoML SDK that combines LLM-driven code generation with AutoGluon to automatically cast, clean, sample, preprocess, and train ML models on any tabular dataset — with a single pipeline.run() call.



How it works

The MLPipeline executes 7 sequential layers against your DataFrame:

#   Layer                     Type                   What it does
1   AutoCastLayer             LLM → code             Infers and applies column types, encodes categoricals
2   DataProfilingLayer        Deterministic          Computes feature + target statistics
3   CorrelationLayer          Deterministic          Pearson correlation matrix + mutual information scores
4   SamplingLayer             LLM → code             Produces a stratified, representative sample
5   PreprocessingLayer        LLM → code             Generates and applies pandas transforms (imputation, scaling, …)
6   ModelRecommendationLayer  LLM → recommendation   Selects the best AutoGluon model codes for your task
7   ModelTrainingLayer        Deterministic          Trains and evaluates a TabularPredictor, returns metrics

LLM-driven layers use a retry loop — if the generated code raises an exception, the error is fed back to the LLM for self-correction.
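The retry loop can be pictured with a short sketch. This is an illustration of the pattern, not the SDK's actual internals; `generate_code` and `run_with_self_correction` are hypothetical names:

```python
# Illustrative sketch of an LLM self-correction retry loop.
# `generate_code` stands in for the SDK's LLM call: it takes the previous
# error (or None on the first attempt) and returns a code string.

def run_with_self_correction(generate_code, namespace, max_retries=3):
    """Execute LLM-generated code; on failure, feed the error back for a retry."""
    error_feedback = None
    for attempt in range(1, max_retries + 1):
        code = generate_code(error_feedback)   # None on the first attempt
        try:
            exec(code, namespace)              # apply the generated transform
            return namespace
        except Exception as exc:
            error_feedback = f"Attempt {attempt} failed: {exc!r}"
    raise RuntimeError(f"Code generation failed after {max_retries} attempts")
```

On each failure the exception text becomes the next prompt's feedback, which is what lets the LLM correct its own output.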


Requirements

Python 3.12 (the published wheel is built for CPython 3.12 only) and an OpenAI or Groq API key for the LLM-driven layers.

Installation

pip install deepvariance-sdk

Dependencies installed automatically: pandas, numpy, scipy, scikit-learn, psutil, openai, groq, autogluon.tabular, torch, torchvision

Dev install (from source)

git clone https://github.com/deepvariance/deepvariance-sdk
cd deepvariance-sdk
uv venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"             # installs all deps + pytest, ruff, cython

Configuration

The SDK reads credentials from environment variables. Set them in your shell before running:

export DV_API_KEY=your-deepvariance-api-key
export OPENAI_API_KEY=sk-...
export GROQ_API_KEY=gsk_...   # fallback if OpenAI key is absent

The SDK resolves LLM providers in order: OpenAI → Groq. You only need one.
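The resolution order amounts to a simple first-match rule. A hedged sketch (`resolve_provider` is not part of the SDK's public API):

```python
import os

def resolve_provider(openai_key, groq_key):
    """Pick the first available LLM provider: OpenAI, then Groq."""
    if openai_key:
        return "openai"
    if groq_key:
        return "groq"
    raise RuntimeError("No LLM API key found; set OPENAI_API_KEY or GROQ_API_KEY")

# Typical call site would pass os.getenv("OPENAI_API_KEY") and
# os.getenv("GROQ_API_KEY"); only one needs to be set.
```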

Optional: load from a .env file (local dev)

python-dotenv is not required by the SDK, but it is a convenient way to manage keys during local development.

pip install python-dotenv

Create a .env file at the project root (see .env.example):

# .env
DV_API_KEY=dv_...
OPENAI_API_KEY=sk-...
GROQ_API_KEY=gsk_...

Then load it at the top of your script, before constructing PipelineConfig:

from dotenv import load_dotenv
load_dotenv()          # reads .env into os.environ

import os
from deepvariance.pipelines.ml import MLPipeline
from deepvariance.typings import PipelineConfig

config = PipelineConfig(
    dv_api_key=os.getenv("DV_API_KEY"),
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

Never commit your .env file. Add it to .gitignore:

.env

The .env.example file in the repo root shows all available environment variables.


Quickstart

import os
import pandas as pd

from deepvariance.pipelines.ml import MLPipeline
from deepvariance.typings import PipelineConfig

# 1. Load your data
data = pd.read_csv("your_dataset.csv")

# 2. Configure
config = PipelineConfig(
    dv_api_key=os.getenv("DV_API_KEY"),
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    groq_api_key=os.getenv("GROQ_API_KEY"),
    sample_percentage=0.1,   # train on a 10% stratified sample
)

# 3. Run
pipeline = MLPipeline(config=config)
result = pipeline.run(data, target="your_target_column")

# 4. Inspect results
print(result["metrics"])
print(result["leaderboard"])

Run the bundled examples directly:

# Binary classification — Australia weather dataset
.venv/bin/python examples/ml_quickstart.py

# Regression — medical insurance dataset
.venv/bin/python examples/insurance_regression.py

Pipeline output

pipeline.run() returns a dict:

Key                 Type                  Description
metrics             dict[str, float]      Accuracy, F1, ROC-AUC, RMSE, R², … (task-dependent)
model               TabularPredictor      Trained AutoGluon predictor
leaderboard         pd.DataFrame          All candidate models ranked by validation score
feature_importance  pd.DataFrame | None   Feature importance scores from the best model
run_stats           dict                  Wall-clock duration and peak memory per layer
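A typical way to consume this dict, using only the keys documented above (the helper `summarize_run` is illustrative, not part of the SDK):

```python
import pandas as pd

def summarize_run(result):
    """Condense a pipeline result into a one-line summary string."""
    best = result["leaderboard"].iloc[0]["model"]   # top-ranked model name
    metrics = ", ".join(f"{k}={v:.3f}" for k, v in sorted(result["metrics"].items()))
    return f"best model: {best} ({metrics})"
```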

Classification metrics

accuracy, f1_macro, f1_weighted, precision_macro, precision_weighted, recall_macro, recall_weighted, cohen_kappa, mcc, roc_auc (binary) / roc_auc_ovr (multiclass), log_loss

Regression metrics

rmse, mae, r2, median_ae, max_error, explained_var, mape
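These names correspond to standard scikit-learn metrics (scikit-learn is already a dependency). As an illustration of the definitions rather than the SDK's exact computation:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true, y_pred):
    """Compute a subset of the regression metrics listed above."""
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }
```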


PipelineConfig reference

from dataclasses import dataclass, field
from typing import Any

@dataclass
class PipelineConfig:
    dv_api_key: str | None = None           # DeepVariance API key (or set DV_API_KEY env var)
    openai_api_key: str | None = None       # OpenAI API key
    groq_api_key: str | None = None         # Groq API key (fallback)
    sample_percentage: float | None = None  # e.g. 0.1 → 10% sample fed to AutoGluon
    extra: dict[str, Any] = field(default_factory=dict)  # pipeline-specific overrides

sample_percentage controls the fraction of rows passed to AutoGluon after the LLM sampling stage. For large datasets (> 100k rows) a value of 0.1–0.2 keeps training fast while preserving the target distribution.
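"Stratified" here means each target class keeps its original proportion in the sample. A pandas sketch of the idea, independent of the SDK's generated code:

```python
import pandas as pd

def stratified_sample(df, target, frac, seed=0):
    """Sample `frac` of the rows from each target class, preserving proportions."""
    return df.groupby(target, group_keys=False).sample(frac=frac, random_state=seed)
```

With frac=0.1 on a dataset that is 80% class 0 and 20% class 1, the sample stays 80/20.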


Progress callbacks

Pass an on_progress callable to get real-time stage updates:

def on_progress(stage: str, status: str) -> None:
    # stage  — e.g. "AutoCastLayer", "ModelTrainingLayer"
    # status — "start" | "complete" | "error"
    icon = {"start": "▶", "complete": "✓", "error": "✗"}.get(status, "·")
    print(f"  {icon}  {stage}: {status}")

result = pipeline.run(data, target="label", on_progress=on_progress)

Build

The release wheel compiles all source to native C extensions via Cython — no Python source is included in the distributed package.

# Install build dependencies (one-time)
uv pip install -e ".[dev]"

# Compile extensions in-place (for local dev / running tests against .so)
just build-ext

# Build a release wheel (compiled .so only, no .py source)
just build-wheel
# → dist/deepvariance_sdk-1.0.0-cp312-cp312-macosx_10_9_universal2.whl

For CI, build on each target platform (macOS arm64, Linux x86_64) and upload all wheels to PyPI so users get the right binary for their machine.


Documentation

The project includes Sphinx-based documentation under the docs/ directory. To build the HTML locally:

# install docs dependencies (optional group)
uv pip install -e ".[docs]"    # or use pip/poetry/uv manually
cd docs
make html             # requires make; or run `sphinx-build -b html . _build/html`

The generated site will appear in docs/_build/html/index.html.

See docs/quickstart.rst for a getting‑started guide and docs/api.rst for an auto‑generated API reference.

Development

# Run tests
.venv/bin/python -m pytest tests/ -q

# Lint
.venv/bin/ruff check src/ tests/

# Format
.venv/bin/ruff format src/ tests/

All lint rules are configured in pyproject.toml under [tool.ruff].
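For orientation, a representative [tool.ruff] section might look like the following. This is illustrative only; the project's actual rule selection lives in its pyproject.toml:

```toml
[tool.ruff]
line-length = 100
target-version = "py312"

[tool.ruff.lint]
select = ["E", "F", "I"]   # pycodestyle errors, pyflakes, import sorting
```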

Project details

Built distribution

deepvariance_sdk-1.0.1-cp312-cp312-macosx_10_9_universal2.whl (5.3 MB)
Uploaded for CPython 3.12, macOS 10.9+ universal2 (ARM64, x86-64). No source distribution is available for this release.

File hashes

Algorithm    Hash digest
SHA256       03a7b5c616a8cbe23f035767e62908909c32b9f2e709779d599d720e1791dbad
MD5          4bf874e075037699c0d5bc7edf0e91ea
BLAKE2b-256  e6d8b84022ac06d1e612395d653b9b67d419b67126172aa4db1f620ac8babedd
