Inference optimizer for decoder-only LLMs. One-line drop-in for HuggingFace models.

Agnitra

The inference optimizer for decoder-only LLMs. One line, no retraining, designed to beat torch.compile on the architectures you actually run in production (public benchmark numbers are still pending; see Status).

import agnitra

result = agnitra.optimize(model, input_shape=(1, 512))
fast_model = result.optimized_model

That's it. No retraining. No graph rewrites by hand. No serving stack to adopt.

Supported architectures

Agnitra is intentionally narrow. The wedge is decoder-only LLMs — Llama-class models that account for ~80% of LLM inference spend in production. Every fine-tune of every supported architecture inherits the optimization decisions of its base model via architecture fingerprinting, so "13 architectures supported" effectively means "the ~100K decoder-LM fine-tunes on HuggingFace."

| Architecture | model_type | Reference model | Status |
| --- | --- | --- | --- |
| Llama 1/2/3 | llama | meta-llama/Meta-Llama-3-8B-Instruct | ✅ tuned specialist |
| Mistral | mistral | mistralai/Mistral-7B-Instruct-v0.3 | ✅ tuned specialist |
| Mixtral | mixtral | mistralai/Mixtral-8x7B-Instruct-v0.1 | ✅ tuned specialist |
| Qwen 2 / 2.5 | qwen2 | Qwen/Qwen2.5-7B-Instruct | ✅ tuned specialist |
| Qwen 2 MoE | qwen2_moe | Qwen/Qwen2.5-MoE | ✅ tuned specialist |
| Gemma 1 / 2 | gemma / gemma2 | google/gemma-2-9b-it | ✅ tuned specialist |
| Phi / Phi-3 | phi / phi3 | microsoft/Phi-3-mini-4k-instruct | 🟡 generic decoder-LM |
| DeepSeek V2 | deepseek_v2 | deepseek-ai/DeepSeek-V2-Lite | 🟡 generic decoder-LM |
| OLMo, Yi, Falcon | olmo / yi / falcon | allenai/OLMo-7B | 🟡 generic decoder-LM |
| Encoder transformers (BERT, RoBERTa, ViT) | | | ❌ pass-through |
| Image generation (SDXL, FLUX) | | | ❌ pass-through (ring 2) |
| Speech (Whisper) | | | ❌ pass-through (ring 3) |

When a model is outside the ring-1 set, agnitra.optimize returns the input model unchanged with result.notes["passthrough"] = True and the detected architecture string. Honest scoping is a feature — a silent 5% no-op speedup destroys customer trust faster than honest refusal.
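
A caller can branch on the pass-through flag before swapping models in production. The sketch below is illustrative only: it uses a SimpleNamespace as a stand-in for the object returned by agnitra.optimize, and it assumes that notes behaves like a dict and that a missing "passthrough" key means the model was actually optimized. The helper name was_optimized is hypothetical.

```python
from types import SimpleNamespace


def was_optimized(result) -> bool:
    """Return False when the result reports a pass-through (model returned unchanged)."""
    # Assumption: `notes` is dict-like and omits "passthrough" for optimized models.
    notes = getattr(result, "notes", None) or {}
    return not notes.get("passthrough", False)


# Stand-ins for real results; the actual result object carries more fields.
bert_like = SimpleNamespace(optimized_model=None, notes={"passthrough": True})
llama_like = SimpleNamespace(optimized_model=None, notes={})

for name, result in [("bert", bert_like), ("llama", llama_like)]:
    if was_optimized(result):
        print(f"{name}: optimized")
    else:
        print(f"{name}: pass-through, keep serving the original model")
```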

LoRA fine-tunes are supported once the adapter has been merged into the base weights with peft's PeftModel.merge_and_unload(); hot-swappable adapters are not yet supported.
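
The merge step might look like the following sketch. It relies on the public peft and transformers APIs (PeftModel.from_pretrained and merge_and_unload); the helper name load_merged_lora and the adapter id in the usage comment are placeholders, not part of Agnitra. Imports are deferred into the function so the helper can be defined even where the heavy dependencies are not installed.

```python
def load_merged_lora(base_id: str, adapter_id: str):
    """Load a base model, attach a LoRA adapter, and fold the adapter into the weights."""
    # Lazy imports: torch, peft, and transformers are only needed when this is called.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
    lora = PeftModel.from_pretrained(base, adapter_id)
    # merge_and_unload() returns a plain transformers model with the LoRA deltas merged in.
    return lora.merge_and_unload()


# Usage (the adapter id is a placeholder):
# merged = load_merged_lora("meta-llama/Meta-Llama-3-8B-Instruct", "your-org/your-lora-adapter").cuda()
# result = agnitra.optimize(merged, input_shape=(1, 512))
```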

Roadmap rings

  • Ring 1 (now): decoder-only LLMs. Llama, Mistral, Qwen, Gemma, Phi, DeepSeek, etc.
  • Ring 2 (planned): image generation. SDXL, SD3, FLUX. Different optimization landscape (UNet attention, classifier-free guidance batching, VAE decode).
  • Ring 3 (planned): speech. Whisper, Wav2Vec2.
  • Out of scope: encoder transformers, multimodal pipelines, image classification, training-time optimization, multi-GPU sharding.

Status

Beta. The optimizer works end-to-end on real models. Public benchmark numbers vs. torch.compile / vLLM / TensorRT-LLM are pending the first H100 run — see benchmarks/llama3_h100/RESULTS.md. Until those numbers are published, treat any "2x faster" claim as unverified.

Install

pip install agnitra

Optional extras:

pip install "agnitra[openai]"     # LLM-guided kernel suggestions via OpenAI
pip install "agnitra[rl]"         # PPO-guided search (Stable-Baselines3)
pip install "agnitra[nvml]"       # GPU telemetry via pynvml
pip install "agnitra[quantize]"   # weight-only quantization via torchao

Quickstart

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import agnitra

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.float16
).cuda()

result = agnitra.optimize(model, input_shape=(1, 512), enable_rl=False)
fast = result.optimized_model

# Use `fast` everywhere you used `model` before.

A complete runnable script lives at examples/quickstart.py.

Want a real speedup? Quantize.

The TF32 + SDPA + torch.compile defaults match what HuggingFace does today; the optimization that actually beats the plain transformers baseline on Llama-class models is weight quantization. Four modes are available through the quantize= argument:

result = agnitra.optimize(
    model,
    input_shape=(1, 512),
    quantize="auto",   # picks FP8 on H100/Blackwell, INT8 elsewhere
)

| Mode | vs FP16 baseline | When to use |
| --- | --- | --- |
| "int8_weight" | ~2× less weight memory, ~1.3-1.7× throughput, near-zero quality drop | Default safe choice; any CUDA GPU |
| "int4_weight" | ~4× less weight memory, ~1.6-2.0× throughput, mild quality drop | Memory-bound decode on smaller GPUs (4090, A40, L4) |
| "fp8_weight" | ~2× throughput on H100/Blackwell tensor cores, near-zero quality drop | Highest-end NVIDIA hardware |
| "auto" | picks FP8 on Hopper+, INT8 elsewhere | Recommended for portable code |

All four require torchao — install via pip install "agnitra[quantize]".
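
To make the "auto" rule and the memory column concrete, here is a minimal sketch in plain Python. It is not Agnitra's selection code: pick_quantize_mode and weight_gb are hypothetical helpers, the rule is simply "compute capability major version 9 or higher means Hopper or newer", and the memory figure counts weight storage only (no scales, KV cache, or activations). On a real machine the capability tuple would come from torch.cuda.get_device_capability().

```python
from typing import Tuple

# Bytes stored per weight for each mode (weights only, quantization metadata ignored).
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8_weight": 1.0, "fp8_weight": 1.0, "int4_weight": 0.5}


def pick_quantize_mode(compute_capability: Tuple[int, int]) -> str:
    """Mirror the documented "auto" rule: FP8 on Hopper or newer, INT8 elsewhere."""
    major, _minor = compute_capability
    # Hopper (H100) reports compute capability 9.0; newer generations report higher majors.
    return "fp8_weight" if major >= 9 else "int8_weight"


def weight_gb(num_params: float, mode: str) -> float:
    """Approximate weight storage in GB for a model with `num_params` parameters."""
    return num_params * BYTES_PER_WEIGHT[mode] / 1e9


for name, capability in [("A100", (8, 0)), ("H100", (9, 0))]:
    print(name, "->", pick_quantize_mode(capability))

# Using 8e9 parameters as a round number for an 8B model.
for mode in ("fp16", "int8_weight", "int4_weight"):
    print(f"{mode}: {weight_gb(8e9, mode):.1f} GB of weights")
```

The halving from FP16 to INT8 and the quartering to INT4 are where the "~2×" and "~4×" memory figures in the table come from.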

Integrations

HuggingFace transformers

Replace AutoModelForCausalLM with AgnitraModel. Everything else stays identical:

import torch
from agnitra.integrations.huggingface import AgnitraModel

model = AgnitraModel.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.float16,
    agnitra_kwargs={"input_shape": (1, 512)},
).cuda()
# Use `model` like a normal transformers model — tokenizer, .generate(), logits.

Pass any transformers.AutoModelFor... class via model_class= for non-CausalLM workloads. See examples/quickstart_hf.py.

For an existing transformers.pipeline():

import transformers
from agnitra.integrations.huggingface import optimize_pipeline

pipe = transformers.pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")
optimize_pipeline(pipe, agnitra_kwargs={"input_shape": (1, 512)})

accelerate

For users who go through accelerate.Accelerator, run Agnitra after prepare():

from accelerate import Accelerator
from agnitra.integrations.accelerate_helpers import optimize_after_prepare

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
model = optimize_after_prepare(model, input_shape=(1, 512))

What Agnitra does

  1. Profiles the model on real input shapes via torch.profiler + NVML telemetry.
  2. Suggests kernel tuning parameters using either an LLM (OpenAI / Ollama) or a deterministic policy when no LLM is available.
  3. Applies safe-by-default optimizations: TF32 matmul, FlashAttention / SDPA, torch.compile with the right mode, optional fused Triton kernels for matmul / layer-norm.
  4. Verifies the optimized model produces the same outputs as the baseline.
  5. Returns the patched nn.Module plus a structured report (RuntimeOptimizationResult).
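
Step 4 can be pictured as a tolerance check between baseline and optimized outputs. The sketch below is an assumption about what "same outputs" means in practice, since FP16 and quantized kernels are rarely bit-identical; the function name outputs_match and the default tolerances are illustrative, not Agnitra's actual values. It works on plain Python floats; on tensors, torch.allclose plays the same role.

```python
import math
from typing import Sequence


def outputs_match(baseline: Sequence[float], optimized: Sequence[float],
                  rel_tol: float = 1e-3, abs_tol: float = 1e-5) -> bool:
    """Return True when every optimized value is within tolerance of its baseline value."""
    if len(baseline) != len(optimized):
        return False
    return all(
        math.isclose(b, o, rel_tol=rel_tol, abs_tol=abs_tol)
        for b, o in zip(baseline, optimized)
    )


baseline_logits = [0.12, -1.5, 3.0]
close_logits = [0.1200001, -1.5004, 3.0]   # small numeric noise
drifted_logits = [0.12, -1.2, 3.0]         # one value clearly off

print(outputs_match(baseline_logits, close_logits))
print(outputs_match(baseline_logits, drifted_logits))
```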

What Agnitra is not

Honest scope, so you don't waste a day:

  • Not a serving runtime. It does not implement paged KV cache, continuous batching, or speculative decoding. Pair Agnitra with vLLM / TGI / your own serving stack.
  • Limited quantization (weight-only). Agnitra supports weight-only quantization via quantize="int8_weight", "int4_weight", "fp8_weight", or "auto", delegating to torchao. This is the optimization that beats plain HuggingFace + torch.compile (HF doesn't quantize by default). Activation quantization, AWQ, and GPTQ are out of scope; if you have a model already quantized via those, Agnitra will optimize it but won't re-quantize.
  • Not a trainer. Inference only. Training-time optimization is out of scope.
  • Not a multi-GPU sharder. Single-GPU optimization. Tensor parallelism is a separate problem.

Benchmarks

Reproducible benchmarks live in benchmarks/. The headline suite is benchmarks/llama3_h100/ — a one-command repro comparing Agnitra to HuggingFace transformers, torch.compile, vLLM, and TensorRT-LLM on Llama-3-8B at batch sizes 1, 8, 32.

Five access paths are documented (Docker, host venv, Modal, Lambda Labs / RunPod SSH, GitHub Actions self-hosted). The cheapest is Modal:

pip install modal && modal token new
HF_TOKEN=hf_xxx modal run benchmarks/llama3_h100/modal_runner.py

The benchmark CI workflow regenerates RESULTS.md on every release tag and blocks merges that regress throughput by more than 5% against the previous baseline.
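
The 5% gate reduces to a one-line comparison. This is a sketch of the arithmetic only, with hypothetical function names; how the CI workflow actually reads RESULTS.md and which metric it compares are not specified here.

```python
def throughput_regression(baseline_tps: float, current_tps: float) -> float:
    """Fractional throughput drop against the baseline; positive means slower."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (baseline_tps - current_tps) / baseline_tps


def passes_gate(baseline_tps: float, current_tps: float, max_regression: float = 0.05) -> bool:
    """Allow the merge unless throughput dropped by more than `max_regression`."""
    return throughput_regression(baseline_tps, current_tps) <= max_regression


# Tokens per second from the previous release vs. the candidate build (made-up numbers).
print(passes_gate(1000.0, 960.0))   # 4% slower
print(passes_gate(1000.0, 940.0))   # 6% slower
```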

LangChain

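Agents call the LLM hundreds of times per task, so model latency tends to dominate wall-clock time. When it does, a 1.5x speedup on the model approaches a 1.5x reduction in the agent's wall-clock time; time spent in tools and I/O is unaffected.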

from langchain_huggingface import HuggingFacePipeline
from agnitra.integrations.langchain import optimize_llm

llm = HuggingFacePipeline.from_model_id("meta-llama/Meta-Llama-3-8B-Instruct", task="text-generation")
optimize_llm(llm, agnitra_kwargs={"input_shape": (1, 512), "quantize": "int8_weight"})
# Use `llm` exactly as before — all chains/agents downstream get the speedup.

See examples/quickstart_langchain.py.
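
How much of a model speedup reaches the agent depends on the share of wall-clock time spent inside the LLM. A quick Amdahl's-law estimate, with a hypothetical helper name and made-up fractions:

```python
def agent_speedup(llm_fraction: float, model_speedup: float) -> float:
    """End-to-end speedup when only the LLM share of wall-clock time gets faster."""
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be between 0 and 1")
    if model_speedup <= 0:
        raise ValueError("model_speedup must be positive")
    # Amdahl's law: the non-LLM share (tools, I/O, parsing) is unchanged.
    return 1.0 / ((1.0 - llm_fraction) + llm_fraction / model_speedup)


for fraction in (1.0, 0.9, 0.6):
    print(f"LLM share {fraction:.0%}: {agent_speedup(fraction, 1.5):.2f}x end-to-end")
```

With the LLM at 60% of wall-clock time, a 1.5x model speedup yields roughly 1.25x end-to-end, so it is worth measuring where the agent's time actually goes.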

LlamaIndex

Same pattern for RAG and agent flows.

from llama_index.llms.huggingface import HuggingFaceLLM
from agnitra.integrations.llama_index import optimize_llm

llm = HuggingFaceLLM(model_name="meta-llama/Meta-Llama-3-8B-Instruct", ...)
optimize_llm(llm, agnitra_kwargs={"input_shape": (1, 512), "quantize": "int8_weight"})

See examples/quickstart_llama_index.py.

CLI

agnitra optimize --model my_model.pt --input-shape 1,512 --output optimized.pt
agnitra optimize-dir --models-dir /var/agnitra/fleet  # batch-optimize a fine-tune farm
agnitra doctor                    # health check: torch / CUDA / NVML / Ollama / license
agnitra heartbeat --interval 30   # background re-optimization daemon
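
The --input-shape flag takes comma-separated dimensions, which correspond to the input_shape tuple in the Python API. A sketch of that mapping, using a hypothetical parse_input_shape helper (the CLI's own parsing and validation may differ):

```python
from typing import Tuple


def parse_input_shape(text: str) -> Tuple[int, ...]:
    """Turn a flag value such as "1,512" into a tuple of positive ints."""
    parts = [part.strip() for part in text.split(",")]
    if any(not part for part in parts):
        raise ValueError(f"empty dimension in input shape: {text!r}")
    dims = tuple(int(part) for part in parts)  # int() raises ValueError on non-numeric input
    if any(dim <= 0 for dim in dims):
        raise ValueError(f"dimensions must be positive: {text!r}")
    return dims


print(parse_input_shape("1,512"))  # (1, 512): batch size 1, sequence length 512
```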

Fine-tune farms (agnitra optimize-dir)

If you run 50+ fine-tuned variants of the same base model in production (one per customer, one per use-case), Agnitra's architecture-fingerprint cache means the second model with the same architecture inherits the first's optimization decisions. Optimizing 50 Llama-3-8B fine-tunes takes nearly the same wall time as optimizing one.

agnitra optimize-dir \
  --models-dir /var/agnitra/fleet \
  --quantize int8_weight \
  --input-shape 1,512

The directory layout is the standard HuggingFace shape — one subdirectory per model, each containing config.json plus weights.
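
To illustrate why a fleet of fine-tunes collapses to a handful of optimization jobs, the sketch below groups model directories by a hash of a few config.json fields. It is a guess at the general idea rather than Agnitra's fingerprinting code: the chosen fields (FINGERPRINT_KEYS) and the helper names are assumptions.

```python
import hashlib
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

# Hypothetical set of config.json fields treated as defining "the same architecture".
FINGERPRINT_KEYS = (
    "model_type", "hidden_size", "num_hidden_layers",
    "num_attention_heads", "num_key_value_heads", "vocab_size",
)


def fingerprint(config: dict) -> str:
    """Stable short hash over the architecture-defining config fields."""
    subset = {key: config.get(key) for key in FINGERPRINT_KEYS}
    blob = json.dumps(subset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


def group_fleet(models_dir: Path) -> Dict[str, List[str]]:
    """Map each fingerprint to the model subdirectories that share it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for config_path in sorted(models_dir.glob("*/config.json")):
        config = json.loads(config_path.read_text())
        groups[fingerprint(config)].append(config_path.parent.name)
    return dict(groups)


base = {"model_type": "llama", "hidden_size": 4096, "num_hidden_layers": 32,
        "num_attention_heads": 32, "num_key_value_heads": 8, "vocab_size": 128256}
fine_tune = dict(base, _name_or_path="acme/support-bot")  # extra metadata is ignored
print(fingerprint(base) == fingerprint(fine_tune))  # True
```

Each group needs its optimization decisions computed once; every other member of the group can reuse them.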

API server (optional)

If you want to call the optimizer remotely (CI workers, hosted inference):

agnitra-api    # binds to 127.0.0.1:8080 by default; AGNITRA_API_HOST overrides

The server exposes POST /optimize, GET /jobs/{id}, GET /health, and WebSocket /ws/jobs/{id} for live status. By default it listens on localhost; set AGNITRA_ALLOW_PUBLIC_BIND=1 if you intentionally bind to a public interface.
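
The public-bind safeguard can be thought of as a small pre-flight check. The sketch below is an assumption about its shape, not the server's actual code: check_bind and is_loopback are hypothetical names, and treating unresolved hostnames as public is a conservative choice made for this example.

```python
import ipaddress
import os
from typing import Mapping


def is_loopback(host: str) -> bool:
    """True for localhost and loopback IP addresses such as 127.0.0.1 or ::1."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal; treat unknown hostnames as public to stay on the safe side.
        return False


def check_bind(host: str, env: Mapping[str, str] = os.environ) -> None:
    """Raise unless the host is loopback or public binding was explicitly allowed."""
    if is_loopback(host) or env.get("AGNITRA_ALLOW_PUBLIC_BIND") == "1":
        return
    raise PermissionError(
        f"refusing to bind to {host}; set AGNITRA_ALLOW_PUBLIC_BIND=1 to allow it"
    )


check_bind("127.0.0.1", {})                                    # fine: loopback
check_bind("0.0.0.0", {"AGNITRA_ALLOW_PUBLIC_BIND": "1"})      # fine: explicitly allowed
```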

NVIDIA ecosystem

Agnitra drives traffic into NVIDIA's stack rather than competing with it. Many HuggingFace developers find TensorRT-LLM hard to adopt directly because of its multi-step engine build and heavier toolchain. Agnitra wraps that:

result = agnitra.optimize(
    model,
    backend="tensorrt_llm",
    backend_kwargs={"engine_dir": "./engine"},
)

For deployable containers, package an Agnitra-optimized model as a NIM-compatible Triton bundle:

agnitra package --model-dir /models/llama3 --output dist/llama3-nim --target h100

Output is a Triton model repository plus a Dockerfile based on nvcr.io/nvidia/tritonserver. See docs/guides/nvidia for engine build steps, NGC catalog publishing, and the NVIDIA Inception program path.

Configuration

| Environment variable | Purpose |
| --- | --- |
| OPENAI_API_KEY | Enables the LLM-guided suggestion path. |
| AGNITRA_OLLAMA_URL | Local LLM backend (default http://localhost:11434). |
| AGNITRA_API_HOST / AGNITRA_API_PORT | API server bind host and port (default 127.0.0.1:8080). |
| AGNITRA_ALLOW_PUBLIC_BIND | Set to 1 to allow the API server to bind to a public interface. |
| AGNITRA_LICENSE_PATH | Path to a license file when using enterprise features. |
| AGNITRA_NOTIFY_WEBHOOK_URL | POST optimization results to Slack / Discord / Telegram. |
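
For scripts that need the same settings, the variables above can be read with their documented defaults. This is a sketch, not Agnitra's configuration loader: the AgnitraEnv dataclass and load_env function are hypothetical, and the host and port defaults are taken from the API server section.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AgnitraEnv:
    openai_api_key: Optional[str]
    ollama_url: str
    api_host: str
    api_port: int
    license_path: Optional[str]
    notify_webhook_url: Optional[str]


def load_env(env: Mapping[str, str] = os.environ) -> AgnitraEnv:
    """Collect Agnitra-related environment variables, applying documented defaults."""
    return AgnitraEnv(
        openai_api_key=env.get("OPENAI_API_KEY"),
        ollama_url=env.get("AGNITRA_OLLAMA_URL", "http://localhost:11434"),
        api_host=env.get("AGNITRA_API_HOST", "127.0.0.1"),
        api_port=int(env.get("AGNITRA_API_PORT", "8080")),
        license_path=env.get("AGNITRA_LICENSE_PATH"),
        notify_webhook_url=env.get("AGNITRA_NOTIFY_WEBHOOK_URL"),
    )


settings = load_env({"AGNITRA_API_PORT": "9000"})
print(settings.api_host, settings.api_port)  # 127.0.0.1 9000
```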

Repository layout

agnitra/
  sdk.py                    public optimize() entry point
  cli.py                    Click CLI
  _sdk/                     low-level optimizer (FX, kernels, RL)
  core/
    optimizer/              LLM- and RL-guided optimization
    runtime/                runtime patching, telemetry, fingerprinting, cache
    kernel/                 Triton kernel generation
    metering/, billing/     usage events and Stripe integration
    licensing/              license validation
    notifications/          webhook notifiers
  api/                      Starlette REST API server
benchmarks/                 reproducible benchmark suites
examples/                   small focused examples
tests/                      pytest suite

Contributing

Bug reports, benchmark PRs, and "you're handicapping vLLM" issues are all welcome — the benchmark suite is meant to be adversarially reviewed. Open an issue or PR.

License

Apache 2.0.
