Oprel SDK
Production-ready local LLM inference that beats Ollama in performance
Oprel is a high-performance Python library for running large language models and multimodal AI locally. It provides a production-ready runtime with advanced memory management, hybrid offloading, and intelligent optimization.
🚀 Key Features
- Multi-Backend Architecture:
  - llama.cpp: Text generation & vision (GGUF models)
  - ComfyUI Integration: Image & video generation (diffusion models)
  - Hybrid GPU/CPU: Smart layer distribution for low-VRAM systems
- Smart Hardware Optimization:
  - Hybrid Offloading: Run 13B models on 4 GB GPUs by splitting layers between GPU and CPU
  - Auto-Quantization: Automatically selects the best-quality quantization that fits in available VRAM
  - CPU Acceleration: AVX2/AVX512 optimization (30-50% faster than Ollama's defaults)
  - KV-Cache Aware: Precise memory planning prevents OOM crashes
- Production Reliability:
  - Memory Pressure Monitor: Proactive warnings before crashes
  - Idle Cleanup: Automatically frees GPU/CPU resources after 15 minutes of inactivity
  - Zero-Latency: Server mode keeps models cached for instant responses
  - Robust Error Handling: Clear error messages, no silent failures
- Oprel Studio: Premium web UI for chat, model management, and real-time hardware monitoring
- Ollama Compatibility: Drop-in replacement for the Ollama API
📦 Installation
pip install oprel
# For server mode
pip install oprel[server]
⚡ Quick Start
CLI Usage
# Chat with a model (auto-downloaded)
oprel run qwencoder "Explain recursion in one sentence"
# Interactive chat mode
oprel run llama3.1
# Server mode for persistent caching
oprel serve
oprel run llama3.1 "Hello" # Instant response!
# Vision models
oprel vision qwen3-vl-7b "What's in this image?" --images photo.jpg
# Start Oprel Studio (Web UI)
oprel start
Python API
from oprel import Model
# Auto-optimized loading
model = Model("qwencoder")
print(model.generate("Write a binary search in Python"))
🌐 Oprel Studio: The Ultimate Local AI Workspace
Oprel Studio is a premium, browser-based command center for your local AI models. Designed for engineers and researchers, it provides a state-of-the-art interface that transforms raw inference into a productive workspace.
✨ Immersive Chat Experience
- Fluid Streaming: Ultra-fast Server-Sent Events (SSE) for instant, typewriter-style responses.
- Thinking Process Visualization: DeepSeek-R1 and other reasoning models show their internal "chain of thought" in a beautiful, expandable workspace.
- Rich Markdown & Code: Full GFM support with syntax highlighting for 50+ languages.
- Artifacts Canvas: Generate Mermaid diagrams or HTML/Tailwind previews and view them in a dedicated side-panel next to your chat.
- Multi-modal Support: Drag and drop images for vision-capable models (Qwen-VL, Llama-3.2 Vision).
🔌 Beyond Local: External Cloud Providers
Manage your local models alongside industry-leading cloud APIs in one unified interface:
- Google Gemini: Full support for 2.0 Flash/Pro with free-tier quota integration.
- NVIDIA NIM: High-performance inference via NVIDIA's accelerated cloud.
- Groq: Record-breaking inference speeds via LPU™ technology.
- OpenRouter: Access 200+ models from a single API key.
- Custom OpenAI: Connect any internal or third-party OpenAI-compatible server.
🏛️ Visual Model Registry
- One-Click Deployment: Pull, load, and switch between models without ever touching the terminal.
- Quantization Intelligence: See available quants (Q4_K, Q8_0, etc.) and their memory footprint before loading.
- Smart Status: Real-time indicators show which model is currently taking up VRAM/RAM.
📊 Real-time Hardware Analytics
Monitor your system's performance as the model generates:
- Tokens per Second (TPS): Live tracking of inference performance.
- VRAM & RAM: Precise graphs showing memory consumption across CPU and GPU.
- CPU/GPU Utilization: Monitor load to ensure your system is running optimally.
🚀 Usage
Start Oprel Studio and it will automatically open in your default browser:
oprel start
The interface is hosted at http://localhost:11435/gui/.
🎨 Image & Video Generation
ComfyUI is embedded: it installs itself and downloads models automatically!
Usage
# Specify model in command
oprel gen-image sdxl-turbo "a cyberpunk city at night"
# High quality with FLUX
oprel gen-image flux-1-schnell "a majestic dragon" --width 1024 --height 1024 --steps 30
# With negative prompt
oprel gen-image sdxl-turbo "a cute cat" --negative "blurry, low quality"
# First run downloads the model automatically
oprel gen-image flux-1-dev "stunning landscape" # Auto-downloads 23GB
Download Models
# List available image models
oprel list-models --category text-to-image
# Pre-download model
oprel pull flux-1-schnell
# Pull video model
oprel pull svd-xt
🔍 Text Embeddings
Generate embeddings for semantic search and RAG applications:
CLI Usage
# Single text embedding
oprel embed nomic-embed-text "Hello world"
# Process files (PDF, DOCX, TXT, JSON)
oprel embed nomic-embed-text --files document.pdf report.docx notes.txt
# Batch processing from file (one text per line)
oprel embed nomic-embed-text --batch texts.txt --output embeddings.json
# JSON output format
oprel embed nomic-embed-text "Machine learning" --format json
Python API
from oprel import embed
# Single embedding
vector = embed("Hello world", model="nomic-embed-text")
print(f"Dimensions: {len(vector)}")
# Batch embeddings
vectors = embed(
    ["Document 1", "Document 2", "Document 3"],
    model="nomic-embed-text"
)
# Semantic search
import math
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    return dot / (mag_a * mag_b)

query = embed("machine learning topic", model="nomic-embed-text")
docs = embed(["AI concepts", "cooking recipes", "ML algorithms"],
             model="nomic-embed-text")
similarities = [cosine_similarity(query, doc) for doc in docs]
best_match = similarities.index(max(similarities))
print(f"Best match: Document {best_match}")
Available Embedding Models
- nomic-embed-text: General-purpose (768 dims)
- bge-m3: Multilingual support (1024 dims)
- all-minilm-l6-v2: Lightweight & fast (384 dims)
- snowflake-arctic: Optimized for RAG (1024 dims)
# List all embedding models
oprel list-models --category embeddings
Available Models:
- sdxl-turbo: Fastest (1-4 steps, 7GB) ⚡
- flux-1-schnell: Fast + quality (4 steps, 23GB)
- flux-1-dev: Best quality (28 steps, 23GB)
- sd-1.5: Lightweight (4GB)
Vision Models
# Ask about an image
oprel vision qwen3-vl-7b "What's in this image?" --images photo.jpg
# Multi-image analysis
oprel vision qwen3-vl-14b "Compare these images" --images img1.jpg img2.jpg img3.jpg
🛠️ Advanced Features
Hybrid GPU/CPU Offloading
Run larger models on limited VRAM by intelligently splitting layers.
# Automatically calculated during load
# Example: "20/40 layers on GPU, 20 on CPU"
Smart Quantization
Auto-selects the best quantization that fits your hardware.
oprel run llama3.1 --quantization auto # Default
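Conceptually, auto-selection walks the quantization ladder from highest quality down and takes the first level that fits. A hypothetical sketch (the ladder and sizes below are illustrative figures for an 8B model, not Oprel's tables):

```python
# Hypothetical "auto" quantization selection: best quality that fits.

QUANTS = [  # (name, approx weight size in GiB for an 8B model), best first
    ("Q8_0", 8.5),
    ("Q6_K", 6.6),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.9),
    ("Q3_K_M", 4.0),
]

def pick_quant(free_vram_gib: float, overhead_gib: float = 1.0) -> str:
    """Best-quality quant whose weights plus runtime overhead fit in VRAM."""
    for name, size in QUANTS:
        if size + overhead_gib <= free_vram_gib:
            return name
    return QUANTS[-1][0]  # smallest quant as fallback (may still offload)

print(pick_quant(6.0))   # → Q4_K_M
print(pick_quant(12.0))  # → Q8_0
```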
OpenAI & Ollama Compatible Server
Production-ready API server with smart model management
Start the server:
oprel serve --host 127.0.0.1 --port 11435
The server provides:
- OpenAI API compatibility: /v1/chat/completions, /v1/completions, /v1/models
- Ollama API compatibility: /api/chat, /api/generate, /api/tags
- Smart Model Management:
  - Models stay loaded for 15 minutes after last use
  - The old model is unloaded automatically when you switch to a new one
  - Zero manual load/unload needed
- Fast SSE Streaming: Server-Sent Events for instant token delivery
- CORS Support: Use from web applications
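Streamed tokens arrive as Server-Sent Events. A minimal parsing sketch over a captured sample (the payload shape is assumed from the OpenAI streaming convention, not from Oprel's actual wire output):

```python
# Minimal SSE token extraction for OpenAI-style streaming chunks.
import json

sample = (
    'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\n'
    'data: {"choices":[{"delta":{"content":"lo!"}}]}\n\n'
    "data: [DONE]\n\n"
)

def iter_tokens(stream: str):
    for line in stream.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":  # end-of-stream sentinel
            return
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

print("".join(iter_tokens(sample)))  # → Hello!
```

In practice the OpenAI and Ollama SDKs (shown below) handle this parsing for you; the sketch only shows what travels over the wire.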
OpenAI API Examples
Python (using OpenAI SDK):
from openai import OpenAI
# Point to local Oprel server
client = OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="not-needed"  # Oprel doesn't require API keys
)
# Chat completion
response = client.chat.completions.create(
    model="qwen3-14b",
    messages=[
        {"role": "user", "content": "Write a Python function to reverse a string"}
    ],
    stream=True  # Enable streaming for fast responses
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
cURL:
# Chat completions (streaming)
curl http://localhost:11435/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-0.5b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Text Completions
curl http://localhost:11435/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-0.5b",
"prompt": "Once upon a time",
"max_tokens": 50
}'
# List Models
curl http://localhost:11435/v1/models
Ollama API Examples
Python (using Ollama SDK):
import ollama
# Works directly with Ollama SDK
client = ollama.Client(host='http://localhost:11435')
response = client.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True
)
for chunk in response:
    print(chunk['message']['content'], end='')
cURL:
# Ollama-style chat
curl http://localhost:11435/api/chat \
-d '{
"model": "qwen3-14b",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
# List models (Ollama format)
curl http://localhost:11435/api/tags
Model Management Behavior
The server automatically manages models with these rules:
- First Request: Model is loaded (takes ~5-30s depending on size)
- Subsequent Requests: Model is already loaded (instant response)
- Model Switch: Old model unloads, new model loads automatically
- Idle Timeout: After 15 minutes of no requests, model is unloaded to free memory
- No Manual Management: You never need to call load/unload - it's automatic!
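The rules above amount to a single-slot cache with an idle timer. A toy simulation of that behavior (this mirrors the documented rules, not Oprel's implementation):

```python
# Toy single-slot model manager: load on first use, swap on model
# switch, unload after 15 idle minutes.

IDLE_TIMEOUT = 15 * 60  # seconds

class ModelSlot:
    def __init__(self):
        self.current = None
        self.last_used = 0.0

    def request(self, model: str, now: float) -> str:
        if self.current and now - self.last_used > IDLE_TIMEOUT:
            self.current = None          # idle timeout: memory was freed
        if self.current != model:
            action = "swap" if self.current else "load"
            self.current = model         # old model (if any) is unloaded
        else:
            action = "cached"            # already resident: instant response
        self.last_used = now
        return action

slot = ModelSlot()
print(slot.request("qwen3-14b", now=0))            # → load
print(slot.request("qwen3-14b", now=10))           # → cached
print(slot.request("llama3.1", now=20))            # → swap
print(slot.request("llama3.1", now=20 + 16 * 60))  # → load (idle timeout hit)
```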
Example workflow:
# Start server
oprel serve
# In another terminal:
# First request - loads qwen3-14b (~10s load time)
curl http://localhost:11435/v1/chat/completions -d '{"model":"qwen3-14b","messages":[{"role":"user","content":"Hi"}]}'
# Second request - instant! Model already loaded
curl http://localhost:11435/v1/chat/completions -d '{"model":"qwen3-14b","messages":[{"role":"user","content":"Tell me a joke"}]}'
# Switch to different model - automatically unloads qwen3-14b and loads llama3.1
curl http://localhost:11435/v1/chat/completions -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hi"}]}'
# After 15 minutes of inactivity, llama3.1 is automatically unloaded
Health Check
curl http://localhost:11435/health
# Returns: {"status":"healthy","timestamp":1234567890,"current_model":"qwen3-14b"}
📊 Benchmarks vs Ollama
| Feature | Ollama | Oprel SDK |
|---|---|---|
| Model Discovery | 10-30s | Instant (<100ms) |
| Memory Planning | Basic | Precise (KV-Cache aware) |
| Low VRAM Support | Fails/Slow | Hybrid Offloading |
| CPU Speed | Standard | 30-50% Faster (AVX) |
| Vision Models | Partial | Full Support |
| Image/Video Gen | No | ComfyUI Integration |
| Crash Safety | Frequent OOM | Proactive Warnings |
| Auto-Optimization | Manual config | Fully Automatic |
🧩 Supported Models
Text Generation Models (GGUF - llama.cpp backend)
- Qwen 3 / 2.5: Best all-around models (32B, 14B, 8B, 3B)
- Qwen 3 Coder: SOTA for code generation (32B, 14B, 8B)
- DeepSeek R1: Advanced reasoning (14B, 8B, 7B, 1.5B)
- Llama 3.3 / 3.1: Meta's flagship (70B, 8B)
- Gemma 3 / 2: Google's efficient models (27B, 12B, 9B, 4B)
- Phi-4: Microsoft's compact powerhouse (14B)
Vision Models (VLMs) - GGUF + mmproj
- Qwen3-VL: Multi-image understanding (32B, 14B, 7B - supports up to 8 images)
- Qwen2.5-VL: Proven vision model (7B, 3B)
- Llama 3.2 Vision: Meta's VLM (11B)
- MiniCPM-V: Efficient mobile-ready VLM (2.6B)
- Moondream 2: Lightweight vision (1.8B)
Image Generation (Safetensors - ComfyUI backend)
Requires ComfyUI running:
- FLUX.1-dev: Best quality
- FLUX.1-schnell: Fast generation
- SDXL Turbo: Fastest (1-4 steps)
Video Generation (ComfyUI + AnimateDiff)
Requires ComfyUI with video nodes:
- AnimateDiff
- Stable Video Diffusion (SVD)
- Custom workflows
View all available GGUF models:
oprel list-models --category text-generation
oprel list-models --category vision
oprel list-models --category coding
oprel list-models --category reasoning
License
MIT License. Made with ❤️ for local AI developers.
File details
Details for the file oprel-0.4.3.tar.gz.
File metadata
- Download URL: oprel-0.4.3.tar.gz
- Upload date:
- Size: 3.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8ffce306f1f9345555ee59b527ed0945d2fd27de10571431c033fb8998a57686 |
| MD5 | b3f8dbb1eeb2a71cfd5116ac34ded0a6 |
| BLAKE2b-256 | 2a327b5eb9009a31dc317aa175bab92b6ae03c6bee0f2927680d0a3d1de2933f |
File details
Details for the file oprel-0.4.3-py3-none-any.whl.
File metadata
- Download URL: oprel-0.4.3-py3-none-any.whl
- Upload date:
- Size: 3.2 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6b0d849c0d35b7cdc5e3252e4321c29baf2d18b81ba63124d483b3ab1e298707 |
| MD5 | 3c8278f26d13605f44fdfb68a79007a8 |
| BLAKE2b-256 | 6fde3ef8fc198d94d71c67d5984fad7b2de6e1370c39f127987f8311a3f90228 |