Inference optimizer for decoder-only LLMs. One-line drop-in for HuggingFace models.
Agnitra
The inference optimizer for decoder-only LLMs. One Python keyword. No retraining. 2× memory ↓ · 1.5–2× throughput ↑ · cryptographically signed.
Quickstart · Why · Integrations · Quantization · Trust · CLI · Benchmarks · Roadmap · Contributing
⚡ Quickstart
pip install "agnitra[quantize]"
import torch
from agnitra.integrations.huggingface import AgnitraModel
model = AgnitraModel.from_pretrained(
"microsoft/Phi-3-mini-4k-instruct", # open weights — no HF token
torch_dtype=torch.float16,
agnitra_kwargs={"input_shape": (1, 512), "quantize": "auto"},
).cuda()
# Use `model` exactly like a HuggingFace model — tokenizer, .generate(), logits.
# Same outputs, lower memory, higher throughput.
quantize="auto" picks FP8 on H100/Blackwell and INT8 elsewhere. The full runnable script is at examples/quickstart.py.
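As a rough illustration of the selection rule above, the sketch below maps a CUDA compute capability to a quantization mode. The function name `pick_quant_mode` and the capability threshold (major version 9 for Hopper, 10 and up for Blackwell) are assumptions made for this example, not Agnitra's actual implementation.

```python
def pick_quant_mode(capability):
    """Hypothetical stand-in for quantize="auto".

    `capability` is a (major, minor) CUDA compute capability, for example
    the tuple returned by torch.cuda.get_device_capability().
    """
    major, _minor = capability
    # Assumption: Hopper reports major version 9 and Blackwell 10 or higher.
    if major >= 9:
        return "fp8_weight"
    return "int8_weight"


print(pick_quant_mode((9, 0)))  # H100
print(pick_quant_mode((8, 0)))  # A100
```

Keeping the decision in one small pure function makes the fallback easy to test without a GPU.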
🎯 Why Agnitra
torch.compile is now a no-op against HuggingFace defaults on Llama-3-8B in transformers 4.44+. We measured it. The wedge has narrowed; quantization is the lever that's left.
- One line, not a serving stack. vLLM and TensorRT-LLM are serving runtimes requiring Python-side rewrites. Agnitra is an SDK: drop it into your existing model.generate() code.
- Quantization, automatic. HuggingFace doesn't quantize by default. Agnitra picks the best mode for your GPU (FP8 / INT8 / INT4) and falls back gracefully when hardware can't run it.
- Honest scoping. Models outside the supported set get a passthrough RuntimeOptimizationResult with notes["passthrough"] = True. We never silently no-op.
📦 Install
pip install agnitra # base SDK — works without torch installed
pip install "agnitra[quantize]" # + INT8/INT4/FP8 via torchao (recommended)
pip install "agnitra[trust]" # + Ed25519 signed inference manifests
pip install "agnitra[quantize,trust]" # combined (most production deployments)
Other extras
pip install "agnitra[openai]" # + LLM-guided research path
pip install "agnitra[rl]" # + PPO-guided research path
pip install "agnitra[nvml]" # + GPU telemetry
pip install "agnitra[marketplace]" # (deprecated — kept for back-compat)
npm install agnitra # JS/TS HTTP client for agnitra-api
# NOT a port — calls a hosted server
🔌 Integrations
Five drop-in entry points — same wedge across every popular LLM framework.
HuggingFace transformers
import torch
from agnitra.integrations.huggingface import AgnitraModel
model = AgnitraModel.from_pretrained(
"meta-llama/Meta-Llama-3-8B-Instruct",
torch_dtype=torch.float16,
agnitra_kwargs={"input_shape": (1, 512), "quantize": "auto"},
).cuda()
Drop-in for AutoModelForCausalLM.from_pretrained. Pass model_class=AutoModelForSeq2SeqLM for non-CausalLM. Or swap inside an existing transformers.pipeline:
from agnitra.integrations.huggingface import optimize_pipeline
optimize_pipeline(pipe, agnitra_kwargs={"input_shape": (1, 512)})
LangChain
Agents call the LLM many times per task — model speedups compound into pipeline speedups.
from langchain_huggingface import HuggingFacePipeline
from agnitra.integrations.langchain import optimize_llm
llm = HuggingFacePipeline.from_model_id("...", task="text-generation")
optimize_llm(llm, agnitra_kwargs={"quantize": "auto"})
# Every chain / agent downstream inherits the speedup.
Auto-detects langchain_huggingface, langchain_community, and legacy paths.
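How much of a per-call speedup reaches the whole pipeline depends on the share of wall-clock time the agent spends inside the model. The sketch below applies Amdahl's law to that question; the function name and the example numbers are illustrative assumptions, not measurements.

```python
def pipeline_speedup(llm_fraction, model_speedup):
    """Overall speedup when only the LLM share of the runtime gets faster."""
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be between 0 and 1")
    if model_speedup <= 0:
        raise ValueError("model_speedup must be positive")
    return 1.0 / ((1.0 - llm_fraction) + llm_fraction / model_speedup)


# An agent that spends 90% of its time in generation, with a 1.5x faster model:
print(round(pipeline_speedup(0.9, 1.5), 2))
```

The closer `llm_fraction` is to 1, the more of the model-level gain the pipeline keeps; tool calls and retrieval latency are unaffected.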
LlamaIndex
from llama_index.llms.huggingface import HuggingFaceLLM
from agnitra.integrations.llama_index import optimize_llm
llm = HuggingFaceLLM(model_name="...", tokenizer_name="...")
optimize_llm(llm, agnitra_kwargs={"quantize": "auto"})
Same compounding pattern for RAG and agent flows.
accelerate
Run after accelerator.prepare() so device placement is already done:
from accelerate import Accelerator
from agnitra.integrations.accelerate_helpers import optimize_after_prepare
accelerator = Accelerator()
model = accelerator.prepare(model)
model = optimize_after_prepare(model, input_shape=(1, 512))
NVIDIA TensorRT-LLM
Wraps a pre-built TensorRT-LLM engine in a HuggingFace-shaped runtime:
import agnitra
result = agnitra.optimize(
model,
backend="tensorrt_llm",
backend_kwargs={"engine_dir": "./engine"},
)
See docs/guides/nvidia.mdx for engine build, NIM packaging, and the NVIDIA Inception path.
🔧 Quantization
The single biggest cost lever in modern inference. Pick one or use "auto":
| Mode | Memory | Throughput | Quality | When |
|---|---|---|---|---|
| "int8_weight" | 2× ↓ | ~1.5× ↑ | ~unchanged | Default safe choice; any CUDA GPU |
| "int4_weight" | 4× ↓ | ~1.8× ↑ | mild drop | Memory-bound decode; smaller GPUs (4090, A40, L4) |
| "fp8_weight" | 2× ↓ | ~2× ↑ | ~unchanged | Hopper / Blackwell tensor cores |
| "auto" | best for your GPU | n/a | n/a | Recommended portable default |
result = agnitra.optimize(model, input_shape=(1, 512), quantize="auto")
All four modes wrap torchao. Install via pip install "agnitra[quantize]".
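The memory column follows from bits per weight relative to an FP16 baseline. The arithmetic can be sketched as below, assuming a round 8B-parameter model and counting weights only (quantization scales, activations, and the KV cache are ignored); the helper name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 8e9  # assumption: a round 8B-parameter model
modes = [("fp16 baseline", 16), ("int8_weight", 8), ("fp8_weight", 8), ("int4_weight", 4)]
for mode, bits in modes:
    print(f"{mode}: {weight_memory_gb(n_params, bits):.1f} GB")
```

Halving the bits halves the weight footprint, which is where the 2× and 4× figures in the table come from.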
🔒 Trust & provenance
Every successful agnitra.optimize() produces a cryptographically signed inference manifest — a tamper-evident record of base model SHA-256, optimizations applied, drift verification metrics, runtime context, and signer identity. Required for regulated deployments (banking, healthcare, EU AI Act high-risk systems, FDA SaMD).
result = agnitra.optimize(model, input_shape=(1, 512), quantize="auto")
print(result.notes["trust_manifest"]["signature"]) # ed25519:...
print(result.notes["trust_manifest"]["base_model"]["sha256"]) # 9f2b...
agnitra trust verify --manifest manifest.json
# OK signed by key_id=8f3b1c2d4e5a6b7c
agnitra trust keys generate # writes ~/.agnitra/keys/signing.pem (mode 0600)
agnitra trust keys show # public key fingerprint only — never private
agnitra trust inspect --manifest m.json # pretty-print without verifying
Install with pip install "agnitra[trust]". The cryptography dep is fully optional — trust signing silently no-ops when missing. See docs/guides/trust.mdx for the manifest schema, key management, and the Layer 1–5 trust roadmap (Layer 1 ships now; per-inference provenance tags, certified quantization recipes, cross-runtime determinism, and ZK proofs of inference are the longer arc).
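A signature is only meaningful if the same manifest content always serializes to the same bytes. The sketch below shows one way to get canonical bytes and a SHA-256 digest using the standard library. It does not perform Ed25519 signing, and the field names and helper functions are hypothetical; the real schema lives in docs/guides/trust.mdx.

```python
import hashlib
import json


def canonical_bytes(manifest):
    """Serialize with sorted keys and fixed separators so equal content gives equal bytes."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def manifest_digest(manifest):
    """SHA-256 over the canonical bytes; this is what a signer would sign."""
    return hashlib.sha256(canonical_bytes(manifest)).hexdigest()


# Hypothetical manifest fields, with a made-up placeholder hash.
a = {"base_model": {"sha256": "00ff"}, "optimizations": ["int8_weight"]}
b = {"optimizations": ["int8_weight"], "base_model": {"sha256": "00ff"}}
tampered = {"base_model": {"sha256": "00ff"}, "optimizations": ["int4_weight"]}

print(manifest_digest(a) == manifest_digest(b))         # key order does not matter
print(manifest_digest(a) == manifest_digest(tampered))  # a content change alters the digest
```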
🤖 Supported architectures
13 decoder-LM model_type values cover ~80% of LLM inference spend. Every fine-tune of a supported architecture inherits the base model's optimization decisions via architecture fingerprinting — 13 architectures effectively means ~100K HuggingFace fine-tunes.
| Architecture | model_type | Status |
|---|---|---|
| Llama 1 / 2 / 3 / 3.1 / 3.2 | llama | ✅ tuned specialist |
| Mistral · Mixtral | mistral · mixtral | ✅ tuned specialist |
| Qwen 2 / 2.5 · Qwen-MoE | qwen2 · qwen2_moe | ✅ tuned specialist |
| Gemma 1 / 2 | gemma, gemma2 | ✅ tuned specialist |
| Phi · Phi-3 | phi, phi3 | 🟡 generic decoder-LM |
| DeepSeek V2 | deepseek_v2 | 🟡 generic decoder-LM |
| OLMo · Yi · Falcon | olmo, yi, falcon | 🟡 generic decoder-LM |
| Encoder transformers (BERT, RoBERTa, ViT) | n/a | ❌ pass-through |
| Image generation (SDXL, SD3, FLUX) | n/a | ❌ pass-through (ring 2) |
| Speech (Whisper) | n/a | ❌ pass-through (ring 3) |
Models outside the ring-1 set return unchanged with notes["passthrough"] = True. Honest scoping is a feature — silent no-ops destroy customer trust faster than honest refusal.
LoRA fine-tunes are supported via peft.merge_and_unload() first; hot-swappable adapters are not yet supported.
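The routing rule described above (supported model_type values are optimized, everything else is returned unchanged and flagged) can be sketched as follows. The set contents are copied from the table; the `route` function and the shape of its return value are assumptions for illustration, not the real registry code.

```python
# The 13 model_type values listed in the table above.
SUPPORTED_DECODER_LM_TYPES = frozenset({
    "llama", "mistral", "mixtral", "qwen2", "qwen2_moe", "gemma", "gemma2",
    "phi", "phi3", "deepseek_v2", "olmo", "yi", "falcon",
})


def route(model_type):
    """Hypothetical routing: return notes describing what would happen."""
    if model_type in SUPPORTED_DECODER_LM_TYPES:
        return {"passthrough": False, "model_type": model_type}
    # Unsupported models are flagged explicitly rather than skipped quietly.
    return {"passthrough": True, "model_type": model_type}


for name in ("llama", "bert", "whisper"):
    notes = route(name)
    print(name, "->", "pass-through" if notes["passthrough"] else "optimized")
```

Callers can branch on `notes["passthrough"]` to log or alert when a model falls outside ring 1.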
🛠️ CLI
agnitra --help # full command list (works without torch installed)
agnitra doctor # health check: torch / CUDA / NVML / Ollama / license
agnitra optimize --model my.pt --output optimized.pt
agnitra optimize-dir --models-dir /var/agnitra/fleet --quantize auto
agnitra package --model-dir /models/llama3 --output dist/llama3-nim --as nim
agnitra trust verify --manifest manifest.json
agnitra trust keys generate
agnitra heartbeat --interval 30 # background re-optimization daemon
The CLI loads without torch installed — agnitra --help and agnitra doctor work on a fresh machine before you've finished setting up CUDA.
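One common way to keep commands such as agnitra doctor usable before torch is installed is to probe for optional packages instead of importing them at module load. The sketch below shows that pattern with the standard library. The `probe` helper is hypothetical, and whether Agnitra implements its checks this way is an assumption.

```python
import importlib.util


def probe(modules):
    """Report which top-level modules are importable, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


# "json" is in the standard library; the second name is deliberately nonexistent.
status = probe(["json", "no_such_module_for_demo"])
for name, available in status.items():
    print(f"{name}: {'ok' if available else 'missing'}")
```

`find_spec` only locates the module, so a heavy dependency is never loaded just to report that it is present.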
Fine-tune farms (agnitra optimize-dir)
The killer feature for production fleets running 50+ Llama-3 fine-tunes per customer. The architecture-fingerprint cache reuses optimization decisions across same-architecture variants:
agnitra optimize-dir --models-dir /var/agnitra/fleet --quantize auto
# Optimizing customer-A-llama3 ... (8 minutes — real work)
# Optimizing customer-B-llama3 ... cache hit (same architecture as customer-A) — instant
# Optimizing customer-C-llama3 ... cache hit — instant
# ... 47 more fine-tunes ... all cache hits — instant
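The cache behavior shown in the log above can be sketched as a dictionary keyed by a hash of structural config fields. Which fields make up the real fingerprint is not stated here, so `FINGERPRINT_FIELDS`, `fingerprint`, and `plan_for` are assumptions chosen for illustration.

```python
import hashlib
import json

# Assumption: these config fields are enough to identify an architecture.
FINGERPRINT_FIELDS = ("model_type", "hidden_size", "num_hidden_layers", "num_attention_heads")


def fingerprint(config):
    """Hash only structural fields, so fine-tuned weights do not change the key."""
    structural = {k: config.get(k) for k in FINGERPRINT_FIELDS}
    blob = json.dumps(structural, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


_cache = {}


def plan_for(name, config):
    """Return (plan, cache_hit). The plan is a placeholder for the expensive search."""
    key = fingerprint(config)
    if key in _cache:
        return _cache[key], True
    plan = {"quantize": "auto", "decided_for": name}  # stands in for the slow path
    _cache[key] = plan
    return plan, False


llama3 = {"model_type": "llama", "hidden_size": 4096,
          "num_hidden_layers": 32, "num_attention_heads": 32}
for customer in ("customer-A-llama3", "customer-B-llama3", "customer-C-llama3"):
    _, hit = plan_for(customer, dict(llama3, name_or_path=customer))
    print(customer, "cache hit" if hit else "optimized from scratch")
```

Only the first model pays for the search; every later fine-tune with the same structural fields reuses the stored plan.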
🌐 API server (optional)
agnitra-api # binds to 127.0.0.1:8080 by default
Endpoints: POST /optimize · GET /jobs/{id} · GET /health · WebSocket /ws/jobs/{id}.
Override with AGNITRA_API_HOST / AGNITRA_API_PORT. Set AGNITRA_ALLOW_PUBLIC_BIND=1 if you intentionally bind publicly. For browser / Node.js access, use the npm agnitra HTTP client.
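The bind settings above can be resolved from the environment as in the sketch below. The defaults come from this section; the `resolve_bind` helper and the exact warning rule (warn on a non-loopback host unless the opt-in flag is set) are assumptions about behavior, not the server's actual code.

```python
import ipaddress


def resolve_bind(env):
    """Return (host, port, warn) from a mapping shaped like os.environ."""
    host = env.get("AGNITRA_API_HOST", "127.0.0.1")
    port = int(env.get("AGNITRA_API_PORT", "8080"))
    try:
        is_loopback = ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal; treat only "localhost" as loopback.
        is_loopback = host == "localhost"
    # Assumption: warn on a non-loopback bind unless the opt-in flag is set.
    warn = (not is_loopback) and env.get("AGNITRA_ALLOW_PUBLIC_BIND") != "1"
    return host, port, warn


print(resolve_bind({}))
print(resolve_bind({"AGNITRA_API_HOST": "0.0.0.0"}))
print(resolve_bind({"AGNITRA_API_HOST": "0.0.0.0", "AGNITRA_ALLOW_PUBLIC_BIND": "1"}))
```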
📊 Benchmarks
Reproducible H100 benchmark in benchmarks/llama3_h100/. One command on Modal:
HF_TOKEN=hf_xxx modal run benchmarks/llama3_h100/modal_runner.py
Llama-3-8B on H100, batch=1, 512→128 tokens
| Stack | Throughput | Memory | Speedup |
|---|---|---|---|
| HuggingFace transformers 4.44.2 | 53 tok/s | 16.4 GB | 1.00× |
| torch.compile(reduce-overhead) | 52 tok/s | 16.4 GB | 0.98× |
| Agnitra (quantize="int8_weight") | ~75–90 tok/s* | ~8 GB | ~1.4–1.7×* |
| Agnitra (quantize="fp8_weight") | ~95–105 tok/s* | ~8 GB | ~1.8–2.0×* |
*INT8/FP8 numbers are predictions based on torchao kernel benchmarks; the live measurement is pending publication. The HuggingFace and torch.compile rows are real, measured data. The headline finding is that torch.compile no longer wins against HF defaults in transformers 4.44+. See benchmarks/llama3_h100/RESULTS.md.
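The speedup column is each throughput divided by the HuggingFace baseline. A short sketch of that arithmetic, using the figures from the table (the helper name is hypothetical):

```python
BASELINE_TOK_S = 53.0  # HuggingFace transformers 4.44.2 row


def speedup(tok_per_s, baseline=BASELINE_TOK_S):
    """Throughput relative to the baseline row."""
    return tok_per_s / baseline


print(f"torch.compile: {speedup(52):.2f}x")
# The next two rows are predicted ranges, not measurements.
print(f"int8_weight:   {speedup(75):.2f}x to {speedup(90):.2f}x")
print(f"fp8_weight:    {speedup(95):.2f}x to {speedup(105):.2f}x")
```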
Five access paths documented (Docker, host venv, Modal, Lambda Labs / RunPod SSH, GitHub Actions self-hosted) — see benchmarks/llama3_h100/README.md.
🟢 NVIDIA ecosystem
Agnitra drives traffic into NVIDIA's stack rather than competing with it.
result = agnitra.optimize(model, backend="tensorrt_llm", backend_kwargs={"engine_dir": "./engine"})
agnitra package --model-dir /models/llama3 --output dist/llama3-nim --target h100
Output is a Triton model repository plus a Dockerfile based on nvcr.io/nvidia/tritonserver. See docs/guides/nvidia.mdx for engine build, NGC catalog publishing, and the NVIDIA Inception program path.
🚫 What Agnitra is not
Honest scope, so you don't waste a day:
- Not a serving runtime. No paged KV cache, continuous batching, or speculative decoding. Pair with vLLM / TGI / SGLang.
- Limited quantization (W8A16 / W4A16 / W8(FP8)A8(FP8)). AWQ / GPTQ are out of scope; Agnitra optimizes already-quantized models but won't re-quantize via those formats.
- Not a trainer. Inference only.
- Not a multi-GPU sharder. Single-GPU optimization. Use accelerate or vLLM for tensor parallelism.
- Not multimodal. Text decoder-LMs only. Image generation, speech, and vision-language models are explicitly ring 2 / 3.
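For readers unfamiliar with the W8A16 notation in the list above: weights are stored in 8 bits and converted back to 16-bit floats for the matmul, while activations stay in 16 bits. The sketch below is a generic textbook version of symmetric per-row INT8 weight quantization in pure Python. It is not torchao's kernel code, and the function names are made up for this example.

```python
def quantize_row_int8(weights):
    """Symmetric per-row quantization: one float scale plus int8 values."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float weights before the higher-precision matmul."""
    return [v * scale for v in q]


row = [0.12, -0.5, 0.33, 0.0, 0.49]
q, scale = quantize_row_int8(row)
restored = dequantize_row(q, scale)
worst = max(abs(a - b) for a, b in zip(row, restored))
print(q, f"scale={scale:.5f}", f"max error={worst:.5f}")
```

The rounding error per weight is bounded by half the scale, which is why quality is listed as roughly unchanged for INT8 and drops more at 4 bits.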
🗺️ Roadmap
- Ring 1 (now): decoder-only LLMs (Llama, Mistral, Qwen, Gemma, Phi, DeepSeek, OLMo, Yi, Falcon, Mixtral, Qwen-MoE, Phi-3, Gemma-2)
- Ring 1.5 (in flight): custom Triton kernel fusions (RMSNorm + RoPE), speculative decoding integration, INT4-AWQ
- Ring 2 (planned): image generation — SDXL, SD3, FLUX
- Ring 3 (planned): speech — Whisper, Wav2Vec2
- Trust roadmap: Layer 1 (signed manifests) ✅ shipped → Layer 2 (per-inference provenance tags) → Layer 3 (certified quantization recipes) → Layer 4 (cross-runtime determinism cert) → Layer 5 (ZK proof of inference, research)
- Out of scope: training, multi-GPU sharding, encoder transformers, multimodal pipelines
🔬 Configuration — environment variables
| Variable | Purpose |
|---|---|
| AGNITRA_API_HOST / AGNITRA_API_PORT | API server bind interface (defaults to 127.0.0.1:8080) |
| AGNITRA_ALLOW_PUBLIC_BIND | Set to 1 to silence the public-bind warning |
| AGNITRA_API_KEY | Required header for agnitra-api request authentication |
| AGNITRA_TRUST_KEY_PEM | PEM-encoded signing key, inline (for CI / containers) |
| AGNITRA_TRUST_KEY_PATH | Path to a PEM-encoded signing key file |
| OPENAI_API_KEY | Enables the LLM-guided research path |
| AGNITRA_OLLAMA_URL | Local LLM backend (default http://localhost:11434) |
| AGNITRA_LICENSE_PATH | License file when using enterprise features |
| AGNITRA_NOTIFY_WEBHOOK_URL | Slack / Discord / Telegram completion webhooks |
Full reference: docs/reference/configuration.mdx.
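As an example of how the two signing-key variables might interact, the sketch below resolves a key from the environment. The precedence (inline PEM first, then the path variable, then the default location written by agnitra trust keys generate) is an assumption, and `resolve_signing_key` is a hypothetical helper; check docs/reference/configuration.mdx for the actual rules.

```python
import os
from pathlib import Path

DEFAULT_KEY_PATH = Path.home() / ".agnitra" / "keys" / "signing.pem"


def resolve_signing_key(env):
    """Return PEM text or None. The precedence order here is an assumption."""
    inline = env.get("AGNITRA_TRUST_KEY_PEM")
    if inline:
        return inline  # inline PEM, convenient for CI and containers
    path = Path(env.get("AGNITRA_TRUST_KEY_PATH", DEFAULT_KEY_PATH))
    if path.is_file():
        return path.read_text()
    return None  # no key available; the caller decides how to report it


key = resolve_signing_key(os.environ)
print("signing key found" if key else "no signing key configured")
```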
🏗️ Repository layout
agnitra/
__init__.py lazy-loads sdk; `import agnitra` works without torch
sdk.py public optimize() entry point
cli.py Click CLI — optimize / optimize-dir / package / trust / doctor
optimizers/ architecture detection + ring-1 routing
detection.py model_type detection (config + structural fingerprint)
registry.py SUPPORTED_DECODER_LM_TYPES
decoder_lm/ llama / mistral / qwen2 / gemma specialists
_passes.py TF32 / SDPA / static cache / torch.compile
_quantization.py INT8 / INT4 / FP8 / auto via torchao
trust/ signed inference manifests (Layer 1)
manifest.py InferenceManifest schema + canonical bytes
digest.py deterministic model_sha256
keys.py Ed25519 keypair management
sign.py / verify.py Ed25519 signature lifecycle
integrations/ huggingface / accelerate / langchain / llama_index / tensorrt_llm
_sdk/ low-level optimizer (FX, kernels, RL — research path)
core/
runtime/ fingerprint, validation, cache, control plane
kernel/ Triton kernel generation
metering/, billing/ usage events, Stripe integration
licensing/ license validation
notifications/ webhook notifiers
api/ Starlette REST API server (POST /optimize, /jobs, /ws/jobs)
benchmarks/llama3_h100/ reproducible H100 benchmark (5 access paths)
examples/ minimal runnable scripts (HF, LangChain, LlamaIndex, CPU)
js/ TypeScript HTTP client (npm)
docs/ Mintlify documentation site
tests/ 111 tests; runs without GPU
🤝 Contributing
PRs welcome. Three things make a good PR:
- One concern per PR. Bug fixes fix one bug; features add one feature.
- Tests for new behavior. Use the monkeypatched-optimizer pattern in existing tests as your template — most run without GPU or torchao installed.
- CHANGELOG entry for user-visible changes.
The benchmark suite is meant to be adversarially reviewed — if you find Agnitra is handicapping a competitor, open an issue with a specific configuration change. We treat it as signal, not criticism.
Found a security issue? Email security@agnitra.ai (see SECURITY.md when present).
💬 Get involved
- ⭐ Star this repo if Agnitra saved you a Modal bill — it helps signal value to other developers.
- 💬 GitHub Discussions — the place for "how do I…" questions and design proposals.
- 🐛 GitHub Issues — bugs, feature requests, benchmark handicap reports.
- 📦 PyPI · npm · Docs
📄 License & acknowledgments
Apache 2.0 — see LICENSE.
Agnitra is built on torch, transformers, torchao, accelerate, and the broader PyTorch ecosystem. We drive traffic into TensorRT-LLM and vLLM where appropriate rather than competing with them.
The honest negative result we shipped ("torch.compile is now a no-op vs HuggingFace baseline on Llama-3-8B in transformers 4.44+") was made possible by HuggingFace's relentless improvements to transformers defaults. Real progress shows up as commoditization, and we're glad to see it.
The Layer 1 trust system leans on the cryptography project and the Ed25519 / EdDSA work originally by Bernstein, Duif, Lange, Schwabe, and Yang.