A PyTorch-compatible API with Candle backend

🕯️ Torch-Candle: Vectorized Deep Learning Core with Drop-In PyTorch Compatibility

Torch-Candle is a high-performance deep learning library combining the mathematical simplicity and drop-in interface of PyTorch with the blazing-fast, memory-efficient Candle Rust backend.

It is engineered for production reliability, a small memory footprint, and support for recent training techniques from research.

🚀 Key Architectural Pillars

1. Drop-In PyTorch Compatibility

Replace PyTorch with two lines. Torch-Candle can register itself under the `torch` name at runtime so that subsequent `import torch` statements resolve to it, routing standard PyTorch model loads, functions, and operations to its vectorized C++/Rust backend:

```
import torch_candle as torch
torch.enable_torch_compat()

# Future standard PyTorch imports automatically redirect!
import torch
x = torch.Tensor([1.0, 2.0, 3.0])
```
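One way such a redirect can work in Python is by aliasing an entry in `sys.modules`, which the import system consults before searching the filesystem. The sketch below shows that mechanism with stand-in modules; `register_alias`, `toy_backend`, and `toy_frontend` are hypothetical names, and whether `enable_torch_compat()` uses this exact approach is an assumption.

```python
import sys
import types


def register_alias(alias: str, module: types.ModuleType) -> None:
    """Make `import <alias>` resolve to an already-loaded module."""
    sys.modules[alias] = module


# Hypothetical stand-in for the real backend package.
backend = types.ModuleType("toy_backend")
backend.Tensor = list  # placeholder "tensor" type for the demo

register_alias("toy_frontend", backend)

# The import system finds the alias in sys.modules and binds it directly.
import toy_frontend

x = toy_frontend.Tensor([1.0, 2.0, 3.0])
```

Because the alias points at the same module object, any code that imports the alias after registration sees the backend's attributes.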

2. Self-Healing Autograd (SHA) Engine

Gradient explosions (NaN/Inf) caused by numerical instability, such as division by zero or exponential overflow, can corrupt weights: once a non-finite value reaches a parameter, every later update propagates it. SHA intercepts these anomalies element by element during the backward pass and replaces them with a stable estimate taken from an Exponential Moving Average (EMA) of that parameter's gradient history:

$$g_{t} = \beta g_{t-1} + (1 - \beta) g_{curr}$$

Here $g_{t-1}$ is the running average, $g_{curr}$ is the current gradient, and $\beta$ is the decay factor.
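A minimal sketch of the element-wise replacement in plain Python. The function name `heal_gradient`, the default `beta=0.9`, and the choice to feed the healed value back into the EMA are assumptions for illustration, not Torch-Candle's implementation.

```python
import math


def heal_gradient(grad, ema, beta=0.9):
    """Replace non-finite gradient entries with the EMA, then update the EMA.

    Returns (healed_grad, new_ema) as new lists.
    """
    # Element-level interception: keep finite values, substitute the rest.
    healed = [g if math.isfinite(g) else m for g, m in zip(grad, ema)]
    # g_t = beta * g_{t-1} + (1 - beta) * g_curr, using the healed gradient.
    new_ema = [beta * m + (1.0 - beta) * g for g, m in zip(healed, ema)]
    return healed, new_ema


ema = [0.0, 0.0, 0.0]
_, ema = heal_gradient([1.0, 2.0, 3.0], ema)  # warm up the history
healed, ema = heal_gradient([float("nan"), float("inf"), 4.0], ema)
```

In the second call the NaN and Inf entries are replaced by the running averages from the first step, while the finite entry passes through unchanged.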

3. Auto-Device Alignment Discovery

Avoid `RuntimeError: Expected all tensors to be on the same device`. Arithmetic operators, logical operators, and matrix multiplications detect cross-device operands (e.g. CPU vs. CUDA) and move them to the primary execution device on the fly instead of raising.
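The idea can be illustrated with a toy tensor type that carries a device tag. `ToyTensor` is hypothetical, and treating the left operand's device as the "primary execution device" is an assumption; the source does not say how Torch-Candle picks it.

```python
from dataclasses import dataclass


@dataclass
class ToyTensor:
    data: list
    device: str = "cpu"

    def to(self, device: str) -> "ToyTensor":
        # A real backend would copy memory between devices here.
        if device == self.device:
            return self
        return ToyTensor(list(self.data), device)

    def __add__(self, other: "ToyTensor") -> "ToyTensor":
        # Align the right operand to the left operand's device (assumed rule).
        other = other.to(self.device)
        summed = [a + b for a, b in zip(self.data, other.data)]
        return ToyTensor(summed, self.device)


a = ToyTensor([1.0, 2.0], device="cuda")
b = ToyTensor([10.0, 20.0], device="cpu")
c = a + b  # no device error; b is moved before the add
```

Implicit alignment trades an explicit error for a hidden transfer, so a mismatch inside a training loop costs a copy on every step.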

4. Zero-Allocation In-Place AdamW Optimizer

Eliminate unnecessary memory allocation overhead. Parameters, momentum vectors, and velocity states are mutated directly in place, which reduces per-step allocation overhead and keeps peak memory low.
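The update rule itself is standard AdamW with decoupled weight decay. The sketch below writes it as an in-place loop over plain Python lists to show which buffers are mutated; `adamw_step_` and its defaults are illustrative assumptions, and Python lists do not reproduce the allocation behaviour of the native kernels.

```python
import math


def adamw_step_(params, grads, m, v, step, lr=1e-3, beta1=0.9,
                beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW step that mutates params, m, and v in place."""
    bias1 = 1.0 - beta1 ** step
    bias2 = 1.0 - beta2 ** step
    for i, g in enumerate(grads):
        params[i] *= 1.0 - lr * weight_decay          # decoupled weight decay
        m[i] = beta1 * m[i] + (1.0 - beta1) * g       # first moment (momentum)
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g   # second moment (velocity)
        m_hat = m[i] / bias1                          # bias correction
        v_hat = v[i] / bias2
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


params = [1.0, -2.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
params_before = params  # same list object; no new parameter buffer is created
adamw_step_(params, [0.5, -0.5], m, v, step=1)
```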

5. Dynamic Graph JIT Compiler (`torch.compile`)

`torch.compile` optimizes hot execution paths via lightweight tracing: it traces functional subgraphs, compiles vectorized execution paths, and caches them so that subsequent calls skip tracing and compilation.
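The caching half of that pipeline can be sketched as a decorator that "compiles" once per input signature and reuses the result afterwards. `toy_compile` is a hypothetical stand-in: it keys the cache on argument lengths and stores the original function, where a real compiler would store an optimized kernel.

```python
import functools


def toy_compile(fn):
    """Cache one 'compiled' callable per input-shape signature."""
    cache = {}

    @functools.wraps(fn)
    def wrapper(*args):
        key = tuple(len(a) for a in args)  # stand-in for a shape/dtype key
        if key not in cache:
            wrapper.trace_count += 1       # tracing happens on a cache miss only
            cache[key] = fn                # a real JIT would store compiled code
        return cache[key](*args)

    wrapper.trace_count = 0
    return wrapper


@toy_compile
def scaled_sum(xs, ys):
    return [2.0 * x + y for x, y in zip(xs, ys)]


first = scaled_sum([1.0, 2.0], [3.0, 4.0])   # miss: traces for length-2 inputs
second = scaled_sum([5.0, 6.0], [7.0, 8.0])  # hit: same signature
third = scaled_sum([1.0], [1.0])             # miss: new signature
```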

6. Causal Attention (SDPA) with Contiguous Layouts

Includes optimized Multi-Head Attention and Scaled Dot-Product Attention (SDPA) implementations that keep tensors in contiguous memory layouts for hardware-accelerated kernels, suited to Transformer and Large Language Model (LLM) fine-tuning pipelines.
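For reference, causal scaled dot-product attention for a single head can be written in a few lines of plain Python. This is the textbook computation, softmax(QKᵀ/√d)V with future positions masked out, and says nothing about how Torch-Candle lays out memory; `causal_sdpa` is an illustrative name.

```python
import math


def causal_sdpa(q, k, v):
    """Single-head causal attention over lists of T vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, q_i in enumerate(q):
        # Causal mask: position i attends only to positions 0..i.
        scores = [scale * sum(a * b for a, b in zip(q_i, k[j]))
                  for j in range(i + 1)]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights)) / total
            for c in range(len(v[0]))
        ])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out = causal_sdpa(q, k, v)
```

The first output row equals the first value vector, because position 0 can attend only to itself.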

7. Decoupled Local Analytical Solving (DLLT-AS)

A zero-backpropagation training framework. Instead of iterative gradient descent (Adam/SGD) over many epochs, DLLT-AS solves each layer's weight matrix analytically in a single closed-form pass using a ridge-regularized least-squares projection, which approaches the Moore-Penrose pseudo-inverse solution as $\lambda \to 0$:

$$W_k = (X_k^T X_k + \lambda I)^{-1} X_k^T Y$$

Here $X_k$ is the input to layer $k$, $Y$ is the target matrix, and $\lambda$ is the regularization strength. Combined with Swish activation gating and Dense Representation Reuse (DRR), DLLT-AS trains a multi-layer network in a single step. The project reports a fit time under 22 ms and 98.00% accuracy on its classification benchmarks, at low computational and energy cost.
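The closed-form solve for one layer can be reproduced in plain Python. The sketch below builds the normal equations and solves them by Gauss-Jordan elimination rather than forming an explicit inverse; `ridge_solve`, `matmul`, and the value `lam=1e-6` are assumptions for illustration, and the Swish gating and DRR steps are not modelled.

```python
def ridge_solve(X, Y, lam=1e-6):
    """Solve (X^T X + lam * I) W = X^T Y for W via Gauss-Jordan elimination."""
    n, d, c = len(X), len(X[0]), len(Y[0])
    # Augmented system [X^T X + lam * I | X^T Y], one row per feature.
    aug = []
    for i in range(d):
        gram_row = [
            sum(X[r][i] * X[r][j] for r in range(n)) + (lam if i == j else 0.0)
            for j in range(d)
        ]
        rhs_row = [sum(X[r][i] * Y[r][j] for r in range(n)) for j in range(c)]
        aug.append(gram_row + rhs_row)
    for col in range(d):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for r in range(d):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]  # W has shape (d, c)


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


X = [[1.2, -0.5, 0.8], [0.5, 1.1, -1.2], [-0.3, 0.4, 0.9]]
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # one-hot targets
W = ridge_solve(X, Y)
predictions = matmul(X, W)
```

With three linearly independent samples and a small $\lambda$, the solved weights reproduce the targets almost exactly; a larger $\lambda$ trades fit for smaller weights.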

🛠️ Installation

📦 Pre-compiled PyPI Packages (Recommended for End-Users)

If you want to run Torch-Candle without compiling it from source, install the pre-compiled binaries:

CPU / macOS Metal (Apple Silicon):
```
pip install torch-candle
```
(Note: Apple Silicon users automatically get Metal GPU acceleration out of the box.)

NVIDIA CUDA (Includes CPU + GPU acceleration):
```
pip install torch-candle-cuda
```

🛠️ Building from Source (Local Compilation)

If you are developing or compiling specifically for your machine's hardware architecture, follow the instructions below.

Prerequisite: Rust Toolchain

Since Torch-Candle compiles native C++/Rust kernels during installation, ensure the Rust toolchain is installed:

```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

⚡ Installing with `uv` (Recommended)

By default, standard installation builds the CPU-only version (optimized automatically for your system's instruction set, like AVX2 on x86 or NEON on ARM).

To compile the library with hardware acceleration tailored to your system, use the following commands:

1. CPU-Only (Default)

```
uv pip install torch-candle
```

2. NVIDIA CUDA Acceleration (NVIDIA GPUs)

Ensure the CUDA Toolkit is installed and nvcc is available in your PATH or standard location (e.g. /usr/local/cuda).

From a local git clone (Recommended): We provide an automated hardware detection and installation script that configures CUDA paths and queries your GPU's exact compute capability automatically:
```
# Standard installation
python install.py

# Editable installation (for active development)
python install.py -e
```
From PyPI (forcing a custom source build): If installing directly from PyPI without cloning, you can still compile with CUDA support manually:
```
CUDA_HOME=/usr/local/cuda CUDA_PATH=/usr/local/cuda CUDA_COMPUTE_CAP=75 MATURIN_PEP517_ARGS="--features pyo3/extension-module,cuda" uv pip install --force-reinstall --no-cache torch-candle --no-binary torch-candle
```
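(Note: Set CUDA_COMPUTE_CAP to match your GPU architecture, written as the compute capability without the dot, e.g. 89 for Ada Lovelace, 80 for Ampere data-center GPUs such as the A100, 86 for GeForce RTX 30-series, and 75 for Turing, including the GTX 1650).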
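If you are unsure of the value, recent NVIDIA drivers can report it through `nvidia-smi --query-gpu=compute_cap`. The helper below is a sketch that converts that output into the `CUDA_COMPUTE_CAP` format; the function names are hypothetical, and older drivers that lack the `compute_cap` field simply yield `None`.

```python
import shutil
import subprocess


def compute_cap_to_env(value: str) -> str:
    """Convert a dotted compute capability such as '7.5' to '75'."""
    major, minor = value.strip().split(".")
    return f"{int(major)}{int(minor)}"


def detect_cuda_compute_cap():
    """Return the first GPU's capability as a string like '75', or None."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        output = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        lines = [line for line in output.splitlines() if line.strip()]
        return compute_cap_to_env(lines[0]) if lines else None
    except (subprocess.CalledProcessError, OSError, ValueError):
        # Driver too old for this query, no GPU, or unexpected output.
        return None


detected = detect_cuda_compute_cap()
```

The result can then be exported as `CUDA_COMPUTE_CAP` before running the source build.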

3. Apple Silicon GPU / Metal (macOS)

From a local git clone (Recommended): The installer script automatically detects macOS and enables Metal and Accelerate features:
```
# Standard installation
python install.py

# Editable installation (for active development)
python install.py -e
```

From PyPI (forcing a custom source build):

```
MATURIN_PEP517_ARGS="--features pyo3/extension-module,metal,accelerate" uv pip install --force-reinstall --no-cache torch-candle --no-binary torch-candle
```

🐍 Standard Installation using `pip`

Standard pip supports the exact same build variables by forcing a source distribution build:

```
# NVIDIA CUDA
CUDA_HOME=/usr/local/cuda CUDA_PATH=/usr/local/cuda CUDA_COMPUTE_CAP=75 MATURIN_PEP517_ARGS="--features pyo3/extension-module,cuda" pip install --force-reinstall --no-cache-dir torch-candle --no-binary torch-candle

# Apple Silicon GPU
MATURIN_PEP517_ARGS="--features pyo3/extension-module,metal,accelerate" pip install --force-reinstall --no-cache-dir torch-candle --no-binary torch-candle
```

🛠️ Local Development Build

For active development in this repository, compile the Rust extension directly:

```
# Build and link the local editable module with CUDA
CUDA_HOME=/usr/local/cuda CUDA_COMPUTE_CAP=75 .venv/bin/maturin develop --features "pyo3/extension-module,cuda"
```

Important: When you run Python scripts locally with `uv run`, uv checks pyproject.toml and rebuilds the package with default features (CPU-only), overwriting your manual CUDA build. To keep your custom hardware build, always run with the `--no-sync` flag:
```
uv run --no-sync your_script.py
```

💡 Quickstart Example: LoRA Model Fine-Tuning

```
import torch_candle as torch
import torch_candle.nn as nn
import torch_candle.optim as optim
import torch_candle.nn.functional as F

# 1. Initialize a model
model = nn.Linear(128, 64)

# 2. Setup training criteria and zero-allocation optimizer
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# 3. Fine-tuning step with Auto-Device Alignment active
x = torch.Tensor([[1.0] * 128], device="cpu")
target = torch.Tensor([[0.0] * 64], device="cuda" if torch.cuda.is_available() else "cpu")

optimizer.zero_grad()
output = model(x)
loss = F.mse_loss(output, target)
loss.backward()
optimizer.step()

print(f"Fine-tuned Step Loss: {loss.item():.4f}")
```

Zero-Backpropagation Analytical Learning (DLLT-AS)

```
import torch_candle as torch
import torch_candle.nn as nn

# 1. Initialize input features and targets
x = torch.Tensor([[1.2, -0.5, 0.8], [0.5, 1.1, -1.2], [-0.3, 0.4, 0.9]])
target = torch.Tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]) # One-hot

# 2. Instantiate our zero-backprop DLLT-AS Model
# in_features=3, hidden_dim=16, out_classes=2
model = nn.DLLTASModel(in_features=3, hidden_dim=16, out_classes=2)

# 3. Train all deep decoupled layers analytically in a single mathematical step!
# Completes in under 22ms on standard CPU!
model.fit(x, target)

# 4. Predict instantly with solved weights
predictions = model(x)
print(f"Solved Predictions Output:\n{predictions.numpy()}")
```

🧪 Visual Verification Suites

Torch-Candle includes two dedicated CLI scripts to verify your hardware configuration and test training resilience:

Hardware Diagnostics & E2E LoRA SFT Pipeline:
```
python3 tests/diagnose_hardware.py
```

Self-Healing Autograd Comparative Test:

```
python3 tests/test_self_healing_demo.py
```

🔧 Memory Allocation Tuning (Linux)

To limit glibc memory arena fragmentation under high concurrency, Torch-Candle sets MALLOC_MMAP_THRESHOLD_=65536 on import, which makes glibc serve allocations of 64 KiB and larger with mmap instead of heap arenas. This reduces fragmentation-related OOM errors without requiring root privileges.

If launching from a shell script, you can also export the variable before the process starts, so that it is already in place when glibc initializes its allocator:

```
# Force glibc to use mmap for allocations ≥ 64KB (prevents arena fragmentation)
export MALLOC_MMAP_THRESHOLD_=65536
python train.py
```

Note: Do not use sysctl or modify /etc/sysctl.conf for this tuning. That route requires root privileges and changes kernel parameters, not the glibc allocator setting described here.
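The same pre-launch setup can be done from a Python launcher by passing the variable in the child's environment. This is a sketch using only the standard library; `run_with_mmap_threshold` is a hypothetical helper, and the child command here just echoes the variable to show that it arrived.

```python
import os
import subprocess
import sys


def run_with_mmap_threshold(args, threshold_bytes=65536):
    """Run a command with MALLOC_MMAP_THRESHOLD_ set in its environment."""
    env = dict(os.environ, MALLOC_MMAP_THRESHOLD_=str(threshold_bytes))
    return subprocess.run(args, env=env, capture_output=True,
                          text=True, check=True)


# Replace the inline command with e.g. [sys.executable, "train.py"].
result = run_with_mmap_threshold(
    [sys.executable, "-c",
     "import os; print(os.environ['MALLOC_MMAP_THRESHOLD_'])"]
)
child_value = result.stdout.strip()
```

The parent's own environment is left unchanged; only the launched process sees the setting.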

📄 License

Licensed under the MIT License.

Release history

Version	Release date
2026.6.13	Jun 14, 2026
2026.6.12	Jun 14, 2026
2026.6.11	Jun 14, 2026
2026.6.10	Jun 13, 2026
2026.6.9	Jun 13, 2026
2026.6.8	Jun 13, 2026
2026.6.7 (this version)	Jun 13, 2026
2026.6.6	Jun 13, 2026
2026.6.5	Jun 12, 2026
2026.6.4	Jun 12, 2026
2026.6.3	Jun 11, 2026
2026.6.2	Jun 10, 2026
2026.6.1	Jun 5, 2026
0.1.1	Jun 5, 2026
0.1.0	Jun 3, 2026

