UARC - Unified Adaptive Runtime Core: Production AI inference engine
Project description
UARC — Unified Adaptive Runtime Core
Next-generation AI inference engine: 7 adaptive modules, portable, OpenAI-compatible.
pip install uarc
Modules
| # | Module | Purpose | Key Algorithm |
|---|---|---|---|
| M1 | TDE — Token Difficulty Estimator | Per-token compute routing | 4-layer MLP → perplexity → draft/partial/full |
| M2 | AI-VM — Virtual Memory Manager | 3-tier memory (VRAM→RAM→NVMe) | CLOCK-Pro eviction + KV page table |
| M3 | DPE — Dynamic Precision Engine | Per-layer bit-width allocation | Greedy knapsack (INT4→INT8→FP16→FP32) |
| M4 | PLL — Predictive Layer Loader | Lookahead prefetch scheduling | EMA timing + DAG deadline analysis |
| M5 | ACS — Adaptive Compute Scheduler | Priority batch formation + device routing | 3-heap priority queue + roofline routing |
| M6 | NSC — Neural Semantic Cache | Embedding-based prompt deduplication | HNSW ANN + bi-encoder + adaptive threshold |
| M7 | Router — Hybrid CPU/GPU | Arithmetic-intensity routing | Roofline model + pipeline parallelism |
Quick Start
1. Single Inference (CLI)
uarc run "Explain quantum computing in simple terms"
uarc run "Write Python code to sort a list" --max-tokens 256 --stream
uarc run "Quick question" --priority realtime --json
2. Inference Server (OpenAI-compatible)
uarc serve --port 8000 --vram 8 --ram 32
# Then call it like any OpenAI API:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "uarc-sim-7b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
3. Benchmark
uarc bench --requests 100 --vram 8 --seed-cache
4. Python API
from uarc import UARCRuntime, UARCConfig, InferenceRequest
cfg = UARCConfig.from_env() # reads UARC_VRAM_GB, UARC_RAM_GB, etc.
rt = UARCRuntime(cfg)
rt.start()
resp = rt.infer(InferenceRequest(
request_id="req-001",
prompt="What is machine learning?",
max_new_tokens=256,
))
print(resp.text)
print(f"Route: {resp.route_taken}") # draft / partial / full / cache
print(f"Latency: {resp.latency_ms:.1f}ms")
print(f"Compute saved: {resp.compute_saved_pct:.0f}%")
rt.stop()
5. Streaming
for chunk in rt.infer_stream(request):
print(chunk, end="", flush=True)
6. Batch
responses = rt.infer_batch([req1, req2, req3])
Configuration
from uarc import UARCConfig
# Presets
cfg = UARCConfig.for_edge() # 4GB RAM, no GPU
cfg = UARCConfig.for_gpu(vram_gb=24) # 24GB VRAM
# From environment variables
# UARC_VRAM_GB, UARC_RAM_GB, UARC_MAX_BATCH, UARC_PORT,
# UARC_NSC_THRESHOLD, UARC_DEVICE
cfg = UARCConfig.from_env()
# Manual
cfg = UARCConfig()
cfg.aivm.vram_mb = 8192 # 8 GB VRAM
cfg.tde.tau_easy = 2.5 # PPL threshold: easy tokens
cfg.tde.tau_hard = 8.0 # PPL threshold: hard tokens
cfg.nsc.similarity_threshold = 0.92
cfg.acs.max_batch_size = 32
Server Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions |
OpenAI chat (streaming supported) |
| POST | /v1/completions |
OpenAI completion (streaming supported) |
| GET | /v1/models |
Model list |
| GET | /health |
Health check |
| GET | /status |
Full module status JSON |
| GET | /metrics |
Prometheus metrics |
| POST | /admin/cache/clear |
Flush NSC cache |
| POST | /admin/evict |
Trigger memory eviction |
Architecture
InferenceRequest
│
▼
M6 NSC ──── hit ────────────────────────────► Response (cache)
│ miss
▼
M1 TDE ──── estimate PPL ──────► route: draft / partial / full
│
▼
M5 ACS ──── form batch ─────► prefix-sharing reorder
│
▼
M3 DPE ──── precision plan ──► INT4/INT8/FP16/FP32 per layer
│
▼
M4 PLL ──── prefetch ────────► async layer loading from NVMe
│
▼
M2 AI-VM ── VRAM/RAM/NVMe ──► CLOCK-Pro eviction
│
▼
M7 Router ─ CPU or GPU? ────► intensity < 4 → CPU, else GPU
│
▼
Model forward pass
│
▼
M6 NSC store
│
▼
Response
Development
git clone <repo>
cd uarc
pip install -e ".[dev]"
pytest tests/ -v
Performance Targets
| Scenario | Baseline | UARC Target |
|---|---|---|
| 70B on 8GB VRAM | OOM | 85 tok/s |
| 70B on 24GB GPU | 12 tok/s | 38 tok/s (+217%) |
| TTFT | 1.0× | 0.38× (−62%) |
| Concurrent requests | 4 | 18–32 |
| Compute per token | 1.0× | 0.30–0.70× |
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
uarc-0.2.0.tar.gz
(39.3 kB
view details)
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
uarc-0.2.0-py3-none-any.whl
(40.1 kB
view details)
File details
Details for the file uarc-0.2.0.tar.gz.
File metadata
- Download URL: uarc-0.2.0.tar.gz
- Upload date:
- Size: 39.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
dba91f8f3cada4f6f3ec5d82a3da4f66db2feb75fc81d0850c34c7ff54db3989
|
|
| MD5 |
94cc667fd4b251d988ca5a44402e7256
|
|
| BLAKE2b-256 |
955f335853d31ecee529e330a5111dbeb660a88c53168f07f8b3a5750ee74835
|
File details
Details for the file uarc-0.2.0-py3-none-any.whl.
File metadata
- Download URL: uarc-0.2.0-py3-none-any.whl
- Upload date:
- Size: 40.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3301945d36bdeb00c61b4e9b649641bfa70290825b71714933bd5f3bf390e95b
|
|
| MD5 |
7bb329579dfe9502585b0b221b053552
|
|
| BLAKE2b-256 |
ac2b2051936646528b9a1903bc25b8e0f783870f7fb999fe41b2f69ce14490b3
|