UARC - Unified Adaptive Runtime Core: Production AI inference engine with EADS Speculative Decoding

UARC — Unified Adaptive Runtime Core 🚀

UARC is a lightweight, production-ready AI inference engine for Python. It provides a single gateway for running Large Language Models locally across different backends such as Ollama and llama.cpp.

Stop rewriting your inference code every time you switch backends. With UARC, you get a zero-config CLI and an instant OpenAI-compatible server out of the box.


⚡ Quick Start

Install UARC with pip:

pip install uarc

1. Instant CLI Inference

Run models directly from your terminal. UARC auto-detects your backend (Ollama, local weights, etc.) and streams the response.

uarc run "Explain quantum computing in simple terms" --model llama3.2 --stream

2. Drop-in OpenAI Server

Need an API? Spin up an OpenAI-compatible server in one command. Point tools like AutoGen, LangChain, or your custom apps to localhost:8000.

uarc serve --port 8000 --model llama3.2

Test it immediately:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
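The same request can be sent from Python using only the standard library. This is a minimal sketch, not part of UARC: the helper names `build_chat_request` and `chat` are made up for illustration, and it assumes the server returns the usual OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request

# Assumption: a server started with `uarc serve --port 8000` is listening here.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt, model="llama3.2", base_url=BASE_URL):
    """Build the same POST request as the curl command, without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, model="llama3.2", timeout=60.0):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body; adjust the indexing below
    if the server's JSON differs.
    """
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should print the model's reply. Splitting request construction from sending keeps the payload easy to inspect without a live server.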

3. Built-in Benchmarking

Test your hardware or model quantizations instantly. Get P50/P99 latencies at millisecond resolution, tokens/sec, and hardware stats.

uarc bench --requests 100 --model llama3.2
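For readers unfamiliar with these metrics, here is a small sketch of how P50/P99 and tokens/sec can be derived from raw per-request measurements. It is an illustration only, not the code behind `uarc bench`: the function names are hypothetical, the percentile uses the nearest-rank method, and throughput assumes requests ran one after another.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_ms, tokens_generated):
    """Summarize a run from per-request latencies (ms) and token counts.

    Assumes sequential requests, so total wall time is the sum of latencies.
    """
    total_seconds = sum(latencies_ms) / 1000.0
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
        "tokens_per_second": sum(tokens_generated) / total_seconds,
    }
```

P99 is dominated by the slowest requests, so a single outlier in a small run will show up there while leaving P50 almost unchanged.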

4. Production-Grade EADS Speculative Decoding 🆕

UARC now includes an Entropy-Aware Dynamic Speculator (EADS). It drafts tokens with a smaller model and verifies them in parallel with the target model, for a reported average speedup of 3.2x.

from uarc import UARCRuntime, UARCConfig

cfg = UARCConfig()
cfg.backend = "hf" # HuggingFace backend
cfg.model_name = "gpt2"
cfg.draft_model_name = "distilgpt2" # Enable EADS automatically
cfg.enable_eads = True

rt = UARCRuntime(cfg)
rt.start()
# ... inference is now accelerated ...
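To make the draft-and-verify idea concrete, here is a toy greedy speculative decoder in plain Python. It is a sketch under stated assumptions, not the EADS implementation: models are plain functions returning a `{token: probability}` dict, the entropy-to-K rule in `choose_k` is invented for illustration, and verification runs in a loop where a real engine would score all drafted positions in one batched target pass.

```python
import math


def entropy(dist):
    """Shannon entropy (in bits) of a {token: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def argmax(dist):
    """Token with the highest probability."""
    return max(dist, key=dist.get)


def choose_k(dist, k_min=1, k_max=4):
    """Hypothetical rule: draft fewer tokens when the draft model is uncertain."""
    max_entropy = math.log2(len(dist)) if len(dist) > 1 else 1.0
    confidence = 1.0 - entropy(dist) / max_entropy
    return k_min + round(confidence * (k_max - k_min))


def speculative_step(draft, target, context):
    """One greedy speculative step; returns the tokens accepted this step."""
    k = choose_k(draft(context))
    # 1. Draft k tokens cheaply with the small model.
    proposed = []
    for _ in range(k):
        proposed.append(argmax(draft(context + proposed)))
    # 2. Verify: keep the prefix the target agrees with.
    accepted = []
    for token in proposed:
        expected = argmax(target(context + accepted))
        if token != expected:
            accepted.append(expected)  # the target's correction ends the step
            return accepted
        accepted.append(token)
    # 3. Every draft was accepted, so the target adds one bonus token.
    accepted.append(argmax(target(context + accepted)))
    return accepted


def generate(draft, target, prompt, max_new_tokens):
    """Repeat speculative steps until max_new_tokens have been produced."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens += speculative_step(draft, target, tokens)
    return tokens[: len(prompt) + max_new_tokens]
```

In this greedy variant the output is identical to decoding with the target model alone; the draft model only changes how many target tokens are confirmed per step. Any speedup depends on how often the draft agrees with the target.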

🧠 Why UARC? (The Core Architecture)

UARC is more than a wrapper. It includes a pipeline of adaptive modules, at varying levels of maturity, designed to maximize efficiency on consumer hardware:

| Module | Name | Purpose | Status |
| --- | --- | --- | --- |
| EADS | Entropy-Aware Dynamic Speculator | Dynamic speculative decoding with real-time K-adjustment. | Production |
| TDE | Token Difficulty Estimator | Predicts token difficulty to route between draft/full models. | Beta |
| AI-VM | Virtual Memory Manager | Intelligent 3-tier memory management (VRAM → RAM → NVMe). | Beta |
| DPE | Dynamic Precision Engine | Per-layer bit-width allocation for memory constraints. | Research |
| PLL | Predictive Layer Loader | Async layer loading from NVMe preventing pipeline stalls. | Research |
| NSC | Neural Semantic Cache | Embedding-based prompt deduplication. | Beta |

(Note: adaptive routing and caching are under active development for the uarc core package.)
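As an illustration of what embedding-based prompt deduplication means, here is a toy semantic cache. It is a sketch only and says nothing about how NSC is built: `SemanticCache`, `embed`, and the 0.8 threshold are hypothetical, and the bag-of-words "embedding" stands in for a real sentence-embedding model.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. A real cache would use a
    learned sentence-embedding model instead."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Return a stored response when a new prompt is close enough to an old one."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, prompt):
        query = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: the caller should run real inference

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))
```

A cache hit skips inference entirely, so the threshold trades saved compute against the risk of returning an answer to a subtly different question.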


💻 Python API

Integrate UARC directly into your Python applications for maximum control:

from uarc import UARCRuntime, UARCConfig, InferenceRequest

# 1. Configure
cfg = UARCConfig()
cfg.backend = "auto" # Auto-detects Ollama or llama_cpp
cfg.model_name = "llama3.2:1b"

# 2. Initialize
rt = UARCRuntime(cfg)
rt.start()

# 3. Infer
req = InferenceRequest(
    request_id="req-001",
    prompt="Write a python script to reverse a string.",
    max_new_tokens=256
)

response = rt.infer(req)

print(f"Output: {response.text}")
print(f"Speed:  {response.tokens_per_second:.1f} tok/s")

rt.stop()
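If inference raises, the `rt.stop()` call above is skipped. One way to guard against that is a small context manager that relies only on the `start()` and `stop()` methods shown in this example. The `running` helper is hypothetical and not part of UARC; whether `UARCRuntime` supports the `with` statement natively is not stated here.

```python
from contextlib import contextmanager


@contextmanager
def running(runtime):
    """Start a runtime, yield it, and always stop it, even if inference raises."""
    runtime.start()
    try:
        yield runtime
    finally:
        runtime.stop()


# Usage with the objects from the example above (not executed here):
#
#     with running(UARCRuntime(cfg)) as rt:
#         response = rt.infer(req)
```

The `try`/`finally` ensures the backend is released on both the success and failure paths.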

🛠️ Development & Contributing

Want to help build the ultimate unified inference engine? We'd love your contributions!

git clone https://github.com/Shivay00001/uarc.git
cd uarc
pip install -e ".[dev]"
pytest tests/ -v

📄 License

UARC is licensed under the MIT License.
