Inference optimizer for decoder-only LLMs. One-line drop-in for HuggingFace models.

Project description

Agnitra

The inference optimizer for decoder-only LLMs. One Python keyword. No retraining. 2× memory ↓ · 1.5–2× throughput ↑ · cryptographically signed.

Quickstart · Why · Integrations · Quantization · Trust · CLI · Benchmarks · Roadmap · Contributing

⚡ Quickstart

pip install "agnitra[quantize]"

import torch
from agnitra.integrations.huggingface import AgnitraModel

model = AgnitraModel.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",                         # open weights — no HF token
    torch_dtype=torch.float16,
    agnitra_kwargs={"input_shape": (1, 512), "quantize": "auto"},
).cuda()

# Use `model` exactly like a HuggingFace model — tokenizer, .generate(), logits.
# Same outputs, lower memory, higher throughput.

quantize="auto" picks FP8 on H100/Blackwell and INT8 elsewhere. The full runnable script is at examples/quickstart.py.
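The selection rule described above could look roughly like the sketch below. It is illustrative only, not Agnitra's implementation: the helper names `pick_quantize_mode` and `detect_capability` are hypothetical, and treating compute capability 9.0 and above as "Hopper or Blackwell" is an assumption.

```python
from typing import Optional, Tuple


def pick_quantize_mode(capability: Optional[Tuple[int, int]]) -> Optional[str]:
    """Map a CUDA compute capability to a quantization mode name.

    Hypothetical helper that mirrors the documented rule (FP8 on
    Hopper/Blackwell, INT8 elsewhere); not Agnitra's actual code.
    """
    if capability is None:      # no CUDA device visible
        return None
    major, _minor = capability
    if major >= 9:              # assumption: sm_90+ means Hopper or newer
        return "fp8_weight"
    return "int8_weight"


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the current GPU's compute capability, or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


print(pick_quantize_mode(detect_capability()))
```

Keeping the decision in a pure function of the capability tuple makes the rule easy to unit-test on a machine without a GPU.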

🎯 Why Agnitra

torch.compile no longer beats HuggingFace defaults on Llama-3-8B in transformers 4.44+. We measured it: 52 tok/s compiled versus 53 tok/s for the stock baseline on an H100. With that gap closed, quantization is the lever that's left.

One line, not a serving stack. vLLM and TensorRT-LLM are serving runtimes requiring Python-side rewrites. Agnitra is an SDK — drop it into your existing model.generate() code.
Quantization, automatic. HuggingFace doesn't quantize by default. Agnitra picks the best mode for your GPU (FP8 / INT8 / INT4) and falls back gracefully when hardware can't run it.
Honest scoping. Models outside the supported set get a passthrough RuntimeOptimizationResult with notes["passthrough"] = True. We never silently no-op.

📦 Install

pip install agnitra                    # base SDK — works without torch installed
pip install "agnitra[quantize]"        # + INT8/INT4/FP8 via torchao  (recommended)
pip install "agnitra[trust]"           # + Ed25519 signed inference manifests
pip install "agnitra[quantize,trust]"  # combined (most production deployments)

Other extras

pip install "agnitra[openai]"          # + LLM-guided research path
pip install "agnitra[rl]"              # + PPO-guided research path
pip install "agnitra[nvml]"            # + GPU telemetry
pip install "agnitra[marketplace]"     # (deprecated — kept for back-compat)

npm install agnitra                    # JS/TS HTTP client for agnitra-api
                                       # NOT a port — calls a hosted server

🔌 Integrations

Five drop-in entry points, one per framework: HuggingFace transformers, LangChain, LlamaIndex, accelerate, and NVIDIA TensorRT-LLM.

HuggingFace `transformers`

import torch
from agnitra.integrations.huggingface import AgnitraModel

model = AgnitraModel.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.float16,
    agnitra_kwargs={"input_shape": (1, 512), "quantize": "auto"},
).cuda()

Drop-in for AutoModelForCausalLM.from_pretrained. Pass model_class=AutoModelForSeq2SeqLM for non-CausalLM. Or swap inside an existing transformers.pipeline:

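from transformers import pipeline
from agnitra.integrations.huggingface import optimize_pipeline

pipe = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct")  # any existing pipeline
optimize_pipeline(pipe, agnitra_kwargs={"input_shape": (1, 512)})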

LangChain

Agents call the LLM many times per task — model speedups compound into pipeline speedups.

from langchain_huggingface import HuggingFacePipeline
from agnitra.integrations.langchain import optimize_llm

llm = HuggingFacePipeline.from_model_id("...", task="text-generation")
optimize_llm(llm, agnitra_kwargs={"quantize": "auto"})
# Every chain / agent downstream inherits the speedup.

Auto-detects langchain_huggingface, langchain_community, and legacy paths.
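Detection of this kind is commonly done by probing for installed packages in preference order. The sketch below shows one way to do that with the standard library; it is an assumption about the general technique, not Agnitra's code. The function name `first_available` is hypothetical, and the bare `langchain` entry is a stand-in for the unspecified legacy path.

```python
import importlib.util
from typing import Optional, Sequence

# Candidate packages in preference order. The first two names come from the
# text above; "langchain" is a placeholder for the unspecified legacy path.
CANDIDATES = ("langchain_huggingface", "langchain_community", "langchain")


def first_available(names: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Return the first importable top-level package name, or None."""
    for name in names:
        # find_spec returns None when a top-level package is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


print(first_available())
```

Probing with `find_spec` avoids importing the package itself, so the check stays cheap even when several candidates are installed.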

LlamaIndex

from llama_index.llms.huggingface import HuggingFaceLLM
from agnitra.integrations.llama_index import optimize_llm

llm = HuggingFaceLLM(model_name="...", tokenizer_name="...")
optimize_llm(llm, agnitra_kwargs={"quantize": "auto"})

Same compounding pattern for RAG and agent flows.

`accelerate`

Run after accelerator.prepare() so device placement is already done:

from accelerate import Accelerator
from agnitra.integrations.accelerate_helpers import optimize_after_prepare

accelerator = Accelerator()
model = accelerator.prepare(model)  # `model` is your already-loaded model
model = optimize_after_prepare(model, input_shape=(1, 512))

NVIDIA TensorRT-LLM

Wraps a pre-built TensorRT-LLM engine in a HuggingFace-shaped runtime:

import agnitra

result = agnitra.optimize(
    model,
    backend="tensorrt_llm",
    backend_kwargs={"engine_dir": "./engine"},
)

See docs/guides/nvidia.mdx for engine build, NIM packaging, and the NVIDIA Inception path.

🔧 Quantization

Quantization is the single biggest cost lever in modern inference. Pick one mode or use "auto":

| Mode | Memory | Throughput | Quality | When |
| --- | --- | --- | --- | --- |
| `"int8_weight"` | 2× ↓ | ~1.5× ↑ | ~unchanged | Default safe choice; any CUDA GPU |
| `"int4_weight"` | 4× ↓ | ~1.8× ↑ | mild drop | Memory-bound decode; smaller GPUs (4090, A40, L4) |
| `"fp8_weight"` | 2× ↓ | ~2× ↑ | ~unchanged | Hopper / Blackwell tensor cores |
| `"auto"` | best for your GPU | n/a | n/a | Recommended portable default |

result = agnitra.optimize(model, input_shape=(1, 512), quantize="auto")

All four modes wrap torchao. Install via pip install "agnitra[quantize]".
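The Memory column follows from simple arithmetic on bits per weight. The sketch below is a back-of-the-envelope estimate, assuming a round 8-billion-parameter model and counting weights only; real footprints also include quantization scales, activations, and the KV cache, so measured numbers will be somewhat higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overheads."""
    return num_params * bits_per_weight / 8 / 1e9


params = 8.0e9  # assumption: a round 8B-parameter model
modes = [("fp16", 16), ("int8_weight", 8), ("fp8_weight", 8), ("int4_weight", 4)]
for label, bits in modes:
    print(f"{label:12s} ~{weight_memory_gb(params, bits):.1f} GB")
```

Halving the bits per weight halves weight storage, which is where the 2× and 4× reductions relative to FP16 come from.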

🔒 Trust & provenance

Every successful agnitra.optimize() call produces a cryptographically signed inference manifest when the trust extra is installed. The manifest is a tamper-evident record of the base model SHA-256, optimizations applied, drift verification metrics, runtime context, and signer identity. Required for regulated deployments (banking, healthcare, EU AI Act high-risk systems, FDA SaMD).

result = agnitra.optimize(model, input_shape=(1, 512), quantize="auto")

print(result.notes["trust_manifest"]["signature"])              # ed25519:...
print(result.notes["trust_manifest"]["base_model"]["sha256"])   # 9f2b...

agnitra trust verify --manifest manifest.json
# OK  signed by key_id=8f3b1c2d4e5a6b7c

agnitra trust keys generate              # writes ~/.agnitra/keys/signing.pem (mode 0600)
agnitra trust keys show                  # public key fingerprint only — never private
agnitra trust inspect --manifest m.json  # pretty-print without verifying

Install with pip install "agnitra[trust]". The cryptography dependency is fully optional: when it is missing, trust signing is skipped without raising an error. See docs/guides/trust.mdx for the manifest schema, key management, and the Layer 1–5 trust roadmap (Layer 1 ships now; per-inference provenance tags, certified quantization recipes, cross-runtime determinism, and ZK proofs of inference are the longer arc).
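Tamper evidence depends on serializing the manifest to the same bytes every time before anything is signed. The sketch below illustrates that idea with the standard library only. It is not the Agnitra manifest schema: the field names are hypothetical, and the Ed25519 signing step (which would be computed over these canonical bytes) is left out because it needs the optional cryptography package.

```python
import hashlib
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Serialize deterministically: sorted keys, no extra whitespace, UTF-8."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def digest(manifest: dict) -> str:
    """SHA-256 hex digest of the canonical bytes, used here to compare manifests."""
    return hashlib.sha256(canonical_bytes(manifest)).hexdigest()


# Hypothetical manifest fields, for illustration only.
manifest = {"base_model": {"sha256": "9f2b..."}, "optimizations": ["int8_weight"]}
reordered = {"optimizations": ["int8_weight"], "base_model": {"sha256": "9f2b..."}}
tampered = {"base_model": {"sha256": "9f2b..."}, "optimizations": ["int4_weight"]}

print(digest(manifest) == digest(reordered))  # True: key order does not matter
print(digest(manifest) == digest(tampered))   # False: any content change shows up
```

Because the bytes are canonical, a verifier can re-serialize a received manifest and check it against the signature without worrying about key order or whitespace.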

🤖 Supported architectures

13 decoder-LM model_type values cover ~80% of LLM inference spend. Every fine-tune of a supported architecture inherits the base model's optimization decisions via architecture fingerprinting — 13 architectures effectively means ~100K HuggingFace fine-tunes.

| Architecture | `model_type` | Status |
| --- | --- | --- |
| Llama 1 / 2 / 3 / 3.1 / 3.2 | `llama` | ✅ tuned specialist |
| Mistral · Mixtral | `mistral` · `mixtral` | ✅ tuned specialist |
| Qwen 2 / 2.5 · Qwen-MoE | `qwen2` · `qwen2_moe` | ✅ tuned specialist |
| Gemma 1 / 2 | `gemma`, `gemma2` | ✅ tuned specialist |
| Phi · Phi-3 | `phi`, `phi3` | 🟡 generic decoder-LM |
| DeepSeek V2 | `deepseek_v2` | 🟡 generic decoder-LM |
| OLMo · Yi · Falcon | `olmo`, `yi`, `falcon` | 🟡 generic decoder-LM |
| Encoder transformers (BERT, RoBERTa, ViT) | n/a | ❌ pass-through |
| Image generation (SDXL, SD3, FLUX) | n/a | ❌ pass-through (ring 2) |
| Speech (Whisper) | n/a | ❌ pass-through (ring 3) |

Models outside the ring-1 set return unchanged with notes["passthrough"] = True. Honest scoping is a feature — silent no-ops destroy customer trust faster than honest refusal.
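Callers can branch on the passthrough flag rather than assuming every model was optimized. The sketch below shows one way to do that. Only the `notes["passthrough"]` key comes from the text above; the helper name `was_optimized` is hypothetical, and `SimpleNamespace` objects stand in for real results so the example runs without Agnitra installed.

```python
from types import SimpleNamespace
from typing import Any


def was_optimized(result: Any) -> bool:
    """True unless the result is flagged as a passthrough."""
    notes = getattr(result, "notes", None) or {}
    return not notes.get("passthrough", False)


# Stand-ins for real results so the sketch runs without Agnitra installed.
optimized = SimpleNamespace(notes={})
skipped = SimpleNamespace(notes={"passthrough": True})

for name, res in [("llama", optimized), ("bert", skipped)]:
    status = "optimized" if was_optimized(res) else "passthrough (unchanged model)"
    print(f"{name}: {status}")
```

A check like this is a natural place to log or alert when a deployment expected a speedup but the model fell outside the supported set.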

LoRA fine-tunes are supported via peft.merge_and_unload() first; hot-swappable adapters are not yet supported.
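Merging an adapter before optimization might look like the sketch below, which uses the public `peft` and `transformers` APIs. The function name `load_merged_model` and both arguments are placeholders, and the imports are deferred so the snippet can be loaded on a machine without either library.

```python
def load_merged_model(base_id: str, adapter_dir: str):
    """Load a base causal LM, apply a LoRA adapter, and fold it into the weights."""
    # Imported lazily so this file can be imported without transformers/peft.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(base_id)
    peft_model = PeftModel.from_pretrained(base, adapter_dir)
    # merge_and_unload() returns the base model with the LoRA deltas merged in.
    return peft_model.merge_and_unload()
```

The returned object is a plain transformers model with no adapter layers, which is the form the text above says is supported.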

🛠️ CLI

agnitra --help                    # full command list (works without torch installed)
agnitra doctor                    # health check: torch / CUDA / NVML / Ollama / license
agnitra optimize --model my.pt --output optimized.pt
agnitra optimize-dir --models-dir /var/agnitra/fleet --quantize auto
agnitra package --model-dir /models/llama3 --output dist/llama3-nim --as nim
agnitra trust verify --manifest manifest.json
agnitra trust keys generate
agnitra heartbeat --interval 30   # background re-optimization daemon

The CLI loads without torch installed — agnitra --help and agnitra doctor work on a fresh machine before you've finished setting up CUDA.
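A common way to get this behavior is to defer heavy imports until a command actually needs them. The sketch below shows the general pattern, not Agnitra's code: the names `lazy_import`, `show_help`, and `optimize` are hypothetical, and `json` stands in for a heavy dependency such as torch.

```python
import importlib
from functools import lru_cache
from types import ModuleType


@lru_cache(maxsize=None)
def lazy_import(name: str) -> ModuleType:
    """Import a module on first use and cache it for later calls."""
    return importlib.import_module(name)


def show_help() -> str:
    """Needs no heavy dependency, so it works on a bare install."""
    return "usage: demo [optimize|doctor]"


def optimize() -> str:
    """Only this command triggers the heavy import."""
    heavy = lazy_import("json")  # stand-in for a heavy dependency such as torch
    return heavy.dumps({"status": "ok"})


print(show_help())
print(optimize())
```

With this layout, an import failure surfaces only when the command that needs the dependency runs, so help and diagnostics stay available.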

Fine-tune farms (`agnitra optimize-dir`)

The killer feature for production fleets running 50+ Llama-3 fine-tunes per customer. The architecture-fingerprint cache reuses optimization decisions across same-architecture variants:

agnitra optimize-dir --models-dir /var/agnitra/fleet --quantize auto
# Optimizing customer-A-llama3 ...    (8 minutes — real work)
# Optimizing customer-B-llama3 ...    cache hit (same architecture as customer-A) — instant
# Optimizing customer-C-llama3 ...    cache hit — instant
# ... 47 more fine-tunes ...          all cache hits — instant
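The cache behavior in the log above can be sketched as a dictionary keyed by a hash of structural config fields. This is a minimal illustration under stated assumptions, not the actual fingerprinting scheme: the choice of fields in `FINGERPRINT_KEYS`, the helper names, and the shape of the cached plan are all hypothetical.

```python
import hashlib
import json
from typing import Callable, Dict, List

# Assumption: these config fields define "same architecture" for caching.
FINGERPRINT_KEYS = ("model_type", "hidden_size", "num_hidden_layers", "num_attention_heads")


def fingerprint(config: dict) -> str:
    """Hash only the structural fields, so fine-tunes of one base model collide."""
    subset = {k: config.get(k) for k in FINGERPRINT_KEYS}
    return hashlib.sha256(json.dumps(subset, sort_keys=True).encode()).hexdigest()


def plan_for(config: dict, cache: Dict[str, dict], search: Callable[[dict], dict]) -> dict:
    """Return a cached plan on a fingerprint hit; otherwise run the slow search once."""
    key = fingerprint(config)
    if key not in cache:
        cache[key] = search(config)
    return cache[key]


calls: List[str] = []


def slow_search(config: dict) -> dict:
    calls.append(config["name"])  # record how often the expensive path runs
    return {"quantize": "int8_weight"}


cache: Dict[str, dict] = {}
base = {"model_type": "llama", "hidden_size": 4096,
        "num_hidden_layers": 32, "num_attention_heads": 32}
for name in ["customer-A-llama3", "customer-B-llama3", "customer-C-llama3"]:
    plan_for({**base, "name": name}, cache, slow_search)

print(calls)  # ['customer-A-llama3']
```

Only the first model pays for the search; the other two differ in name alone, so they reuse the cached plan.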

🌐 API server (optional)

agnitra-api    # binds to 127.0.0.1:8080 by default

Endpoints: POST /optimize · GET /jobs/{id} · GET /health · WebSocket /ws/jobs/{id}. Override with AGNITRA_API_HOST / AGNITRA_API_PORT. Set AGNITRA_ALLOW_PUBLIC_BIND=1 if you intentionally bind publicly. For browser / Node.js access, use the npm agnitra HTTP client.

📊 Benchmarks

Reproducible H100 benchmark in benchmarks/llama3_h100/. One command on Modal:

HF_TOKEN=hf_xxx modal run benchmarks/llama3_h100/modal_runner.py

Llama-3-8B on H100, batch=1, 512→128 tokens

| Stack | Throughput | Memory | Speedup |
| --- | --- | --- | --- |
| HuggingFace `transformers` 4.44.2 | 53 tok/s | 16.4 GB | 1.00× |
| `torch.compile(reduce-overhead)` | 52 tok/s | 16.4 GB | 0.98× |
| Agnitra (`quantize="int8_weight"`) | ~75–90 tok/s* | ~8 GB | ~1.4–1.7×* |
| Agnitra (`quantize="fp8_weight"`) | ~95–105 tok/s* | ~8 GB | ~1.8–2.0×* |

*INT8/FP8 numbers are predictions based on torchao kernel benchmarks; the live measurements are pending publication. The HuggingFace and torch.compile rows are real, measured data. The headline finding is that torch.compile no longer wins against HF defaults in transformers 4.44+. See benchmarks/llama3_h100/RESULTS.md.
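The Speedup column is each row's throughput divided by the HuggingFace baseline. The short sketch below reproduces that arithmetic from the table's own figures; the helper name `speedup` is hypothetical, and the INT8/FP8 inputs are the predicted ranges, not measurements.

```python
BASELINE_TOK_S = 53.0  # HuggingFace transformers 4.44.2 row


def speedup(tok_per_s: float, baseline: float = BASELINE_TOK_S) -> float:
    """Throughput relative to the baseline row."""
    return tok_per_s / baseline


rows = {
    "torch.compile (measured)": (52.0, 52.0),
    "int8_weight (predicted)": (75.0, 90.0),
    "fp8_weight (predicted)": (95.0, 105.0),
}
for label, (low, high) in rows.items():
    print(f"{label:26s} {speedup(low):.2f}x to {speedup(high):.2f}x")
```

The results round to the table's 0.98×, roughly 1.4–1.7×, and roughly 1.8–2.0×.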

Five access paths documented (Docker, host venv, Modal, Lambda Labs / RunPod SSH, GitHub Actions self-hosted) — see benchmarks/llama3_h100/README.md.

🟢 NVIDIA ecosystem

Agnitra drives traffic into NVIDIA's stack rather than competing with it.

result = agnitra.optimize(model, backend="tensorrt_llm", backend_kwargs={"engine_dir": "./engine"})

agnitra package --model-dir /models/llama3 --output dist/llama3-nim --target h100

Output is a Triton model repository plus a Dockerfile based on nvcr.io/nvidia/tritonserver. See docs/guides/nvidia.mdx for engine build, NGC catalog publishing, and the NVIDIA Inception program path.

🚫 What Agnitra is not

Honest scope, so you don't waste a day:

Not a serving runtime. No paged KV cache, continuous batching, or speculative decoding. Pair with vLLM / TGI / SGLang.
Limited quantization (W8A16 / W4A16 / W8(FP8)A8(FP8)). AWQ / GPTQ are not supported today (INT4-AWQ is in flight for Ring 1.5); Agnitra optimizes already-quantized models but won't re-quantize via those formats.
Not a trainer. Inference only.
Not a multi-GPU sharder. Single-GPU optimization. Use accelerate or vLLM for tensor parallelism.
Not multimodal. Text decoder-LMs only. Image generation, speech, and vision-language models are explicitly ring 2 / 3.

🗺️ Roadmap

Ring 1 (now): decoder-only LLMs (Llama, Mistral, Qwen, Gemma, Phi, DeepSeek, OLMo, Yi, Falcon, Mixtral, Qwen-MoE, Phi-3, Gemma-2)
Ring 1.5 (in flight): custom Triton kernel fusions (RMSNorm + RoPE), speculative decoding integration, INT4-AWQ
Ring 2 (planned): image generation — SDXL, SD3, FLUX
Ring 3 (planned): speech — Whisper, Wav2Vec2
Trust roadmap: Layer 1 (signed manifests) ✅ shipped → Layer 2 (per-inference provenance tags) → Layer 3 (certified quantization recipes) → Layer 4 (cross-runtime determinism cert) → Layer 5 (ZK proof of inference, research)
Out of scope: training, multi-GPU sharding, encoder transformers, multimodal pipelines

🔬 Configuration — environment variables

| Variable | Purpose |
| --- | --- |
| `AGNITRA_API_HOST` / `AGNITRA_API_PORT` | API server bind interface (defaults to `127.0.0.1:8080`) |
| `AGNITRA_ALLOW_PUBLIC_BIND` | Set to `1` to silence the public-bind warning |
| `AGNITRA_API_KEY` | Required header for `agnitra-api` request authentication |
| `AGNITRA_TRUST_KEY_PEM` | PEM-encoded signing key, inline (for CI / containers) |
| `AGNITRA_TRUST_KEY_PATH` | Path to a PEM-encoded signing key file |
| `OPENAI_API_KEY` | Enables the LLM-guided research path |
| `AGNITRA_OLLAMA_URL` | Local LLM backend (default `http://localhost:11434`) |
| `AGNITRA_LICENSE_PATH` | License file when using enterprise features |
| `AGNITRA_NOTIFY_WEBHOOK_URL` | Slack / Discord / Telegram completion webhooks |

Full reference: docs/reference/configuration.mdx.
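How the bind-related variables might interact can be sketched as below. The variable names and defaults come from the table; the resolution logic, the loopback check, and the function name `resolve_bind` are assumptions made for illustration and are not taken from the server's source.

```python
import ipaddress
from typing import Mapping, Tuple


def resolve_bind(env: Mapping[str, str]) -> Tuple[str, int, bool]:
    """Return (host, port, warn) from environment-style settings.

    `warn` is True when the host is not loopback and the opt-in flag is unset.
    """
    host = env.get("AGNITRA_API_HOST", "127.0.0.1")
    port = int(env.get("AGNITRA_API_PORT", "8080"))
    try:
        is_loopback = ipaddress.ip_address(host).is_loopback
    except ValueError:          # a hostname rather than an IP literal
        is_loopback = host == "localhost"
    allowed = env.get("AGNITRA_ALLOW_PUBLIC_BIND") == "1"
    return host, port, (not is_loopback and not allowed)


print(resolve_bind({}))
print(resolve_bind({"AGNITRA_API_HOST": "0.0.0.0"}))
print(resolve_bind({"AGNITRA_API_HOST": "0.0.0.0", "AGNITRA_ALLOW_PUBLIC_BIND": "1"}))
```

Passing `os.environ` as `env` would apply the same logic to the real process environment.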

🏗️ Repository layout

agnitra/
  __init__.py             lazy-loads sdk; `import agnitra` works without torch
  sdk.py                  public optimize() entry point
  cli.py                  Click CLI — optimize / optimize-dir / package / trust / doctor
  optimizers/             architecture detection + ring-1 routing
    detection.py          model_type detection (config + structural fingerprint)
    registry.py           SUPPORTED_DECODER_LM_TYPES
    decoder_lm/           llama / mistral / qwen2 / gemma specialists
      _passes.py          TF32 / SDPA / static cache / torch.compile
      _quantization.py    INT8 / INT4 / FP8 / auto via torchao
  trust/                  signed inference manifests (Layer 1)
    manifest.py           InferenceManifest schema + canonical bytes
    digest.py             deterministic model_sha256
    keys.py               Ed25519 keypair management
    sign.py / verify.py   Ed25519 signature lifecycle
  integrations/           huggingface / accelerate / langchain / llama_index / tensorrt_llm
  _sdk/                   low-level optimizer (FX, kernels, RL — research path)
  core/
    runtime/              fingerprint, validation, cache, control plane
    kernel/               Triton kernel generation
    metering/, billing/   usage events, Stripe integration
    licensing/            license validation
    notifications/        webhook notifiers
  api/                    Starlette REST API server (POST /optimize, /jobs, /ws/jobs)
benchmarks/llama3_h100/   reproducible H100 benchmark (5 access paths)
examples/                 minimal runnable scripts (HF, LangChain, LlamaIndex, CPU)
js/                       TypeScript HTTP client (npm)
docs/                     Mintlify documentation site
tests/                    111 tests; runs without GPU

🤝 Contributing

PRs welcome. Three things make a good PR:

One concern per PR. Bug fixes fix one bug; features add one feature.
Tests for new behavior. Use the monkeypatched-optimizer pattern in existing tests as your template — most run without GPU or torchao installed.
CHANGELOG entry for user-visible changes.

The benchmark suite is meant to be adversarially reviewed — if you find Agnitra is handicapping a competitor, open an issue with a specific configuration change. We treat it as signal, not criticism.

Found a security issue? Email security@agnitra.ai (see SECURITY.md when present).

💬 Get involved

⭐ Star this repo if Agnitra saved you a Modal bill — it helps signal value to other developers.
💬 GitHub Discussions — the place for "how do I…" questions and design proposals.
🐛 GitHub Issues — bugs, feature requests, benchmark handicap reports.
📦 PyPI · npm · Docs

📄 License & acknowledgments

Apache 2.0 — see LICENSE.

Agnitra is built on torch, transformers, torchao, accelerate, and the broader PyTorch ecosystem. We drive traffic into TensorRT-LLM and vLLM where appropriate rather than competing with them.

The honest negative result we shipped ("torch.compile is now a no-op vs HuggingFace baseline on Llama-3-8B in transformers 4.44+") was made possible by Hugging Face's relentless improvements to transformers defaults. Real progress shows up as commoditization, and we're glad to see it.

The Layer 1 trust system leans on the cryptography project and the Ed25519 / EdDSA work originally by Bernstein, Duif, Lange, Schwabe, and Yang.

Project details

Release history

This version

0.2.4

May 6, 2026

0.2.3

May 6, 2026

0.2.2

May 6, 2026

0.2.1

May 6, 2026

0.2.0

May 6, 2026

0.1.0

Oct 23, 2025

Download files

Source Distribution

agnitra-0.2.4.tar.gz (205.4 kB)

Uploaded May 6, 2026 Source

Built Distribution

agnitra-0.2.4-py3-none-any.whl (199.6 kB)

Uploaded May 6, 2026 Python 3

File details

Details for the file agnitra-0.2.4.tar.gz.

File metadata

Download URL: agnitra-0.2.4.tar.gz
Upload date: May 6, 2026
Size: 205.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for agnitra-0.2.4.tar.gz
Algorithm	Hash digest
SHA256	`37ec92d481a78b7841d3b67f4cd40673a670a056322f050af9785e7446f54db5`
MD5	`21ec21c6e16cefb6769db975e3ad9b4a`
BLAKE2b-256	`429a0e1157d28ead2037180d76fa61620e34d86f7ebcff74f31212c9f78b52ae`

File details

Details for the file agnitra-0.2.4-py3-none-any.whl.

File metadata

Download URL: agnitra-0.2.4-py3-none-any.whl
Upload date: May 6, 2026
Size: 199.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for agnitra-0.2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`51dca132948d3368a440e18324407fc1ef84fe9f862f761886a9943d02ee343e`
MD5	`710d2d5dac051bb20e01a46c8e3634c4`
BLAKE2b-256	`5a873926ebd8589fd044e30b7d1de5dc2867ee501d7bc1438a4684a3b807c329`
