Oprel SDK

Production-ready local LLM inference that beats Ollama in performance

Python 3.9+ · License: MIT

Oprel is a high-performance Python library for running large language models and multimodal AI locally. It provides a production-ready runtime with advanced memory management, hybrid offloading, and intelligent optimization.

🚀 Key Features

  • Multi-Backend Architecture:

    • llama.cpp: Text generation & vision (GGUF models)
    • ComfyUI Integration: Image & video generation (Diffusion models)
    • Hybrid GPU/CPU: Smart layer distribution for low VRAM
  • Smart Hardware Optimization:

    • Hybrid Offloading: Run 13B models on 4GB GPUs by splitting layers between GPU/CPU
    • Auto-Quantization: Automatically selects best quality quantization based on available VRAM
    • CPU Acceleration: AVX2/AVX512 optimization (30-50% faster than Ollama's defaults)
    • KV-Cache Aware: Precise memory planning prevents OOM crashes
  • Production Reliability:

    • Memory Pressure Monitor: Proactive warnings before crashes
    • Idle Cleanup: Automatically frees GPU/CPU resources when inactive (15min timeout)
    • Zero-Latency: Server mode keeps models cached for instant response
    • Robust Error Handling: Clear error messages, no silent failures
  • Oprel Studio: Premium Web UI for chat, model management, and real-time hardware monitoring

  • Ollama Compatibility: Drop-in replacement for Ollama API
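
The "KV-Cache Aware" planning above comes down to simple arithmetic: the cache holds one K and one V tensor per layer for every context position. A back-of-envelope sketch of the standard transformer formula (illustrative; this is the generic calculation, not Oprel's internal planner):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Size of the KV cache: 2 tensors (K and V) per layer, each of shape
    [n_kv_heads, ctx_len, head_dim], at the given element width (2 = fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# A Llama-3.1-8B-style config (32 layers, 8 KV heads via GQA, head_dim 128)
# at an 8192-token context in fp16:
size = kv_cache_bytes(32, 8, 128, 8192)
print(f"{size / 2**30:.1f} GiB")  # 1.0 GiB -- on top of the weights themselves
```

This is why a model that "fits" by weight size alone can still OOM at long contexts, and why context-aware planning matters.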

📦 Installation

pip install oprel
# For server mode
pip install oprel[server]

⚡ Quick Start

CLI Usage

# Chat with a model (auto-downloaded)
oprel run qwencoder "Explain recursion in one sentence"

# Interactive chat mode
oprel run llama3.1

# Server mode for persistent caching
oprel serve
oprel run llama3.1 "Hello"  # Instant response!

# Vision models
oprel vision qwen3-vl-7b "What's in this image?" --images photo.jpg

# Start Oprel Studio (Web UI)
oprel start

Python API

from oprel import Model

# Auto-optimized loading
model = Model("qwencoder") 
print(model.generate("Write a binary search in Python"))

🌐 Oprel Studio: The Ultimate Local AI Workspace

Oprel Studio is a premium, browser-based command center for your local AI models. Designed for engineers and researchers, it provides a state-of-the-art interface that transforms raw inference into a productive workspace.

✨ Immersive Chat Experience

  • Fluid Streaming: Ultra-fast Server-Sent Events (SSE) for instant, typewriter-style responses.
  • Thinking Process Visualization: DeepSeek-R1 and other reasoning models show their internal "chain of thought" in a beautiful, expandable workspace.
  • Rich Markdown & Code: Full GFM support with syntax highlighting for 50+ languages.
  • Artifacts Canvas: Generate Mermaid diagrams or HTML/Tailwind previews and view them in a dedicated side-panel next to your chat.
  • Multi-modal Support: Drag and drop images for vision-capable models (Qwen-VL, Llama-3.2 Vision).

🔌 Beyond Local: External Cloud Providers

Manage your local models alongside industry-leading cloud APIs in one unified interface:

  • Google Gemini: Full support for 2.0 Flash/Pro with free-tier quota integration.
  • NVIDIA NIM: High-performance inference via NVIDIA's accelerated cloud.
  • Groq: Record-breaking inference speeds via LPU™ technology.
  • OpenRouter: Access 200+ models from a single API key.
  • Custom OpenAI: Connect any internal or third-party OpenAI-compatible server.

🏛️ Visual Model Registry

  • One-Click Deployment: Pull, load, and switch between models without ever touching the terminal.
  • Quantization Intelligence: See available quants (Q4_K, Q8_0, etc.) and their memory footprint before loading.
  • Smart Status: Real-time indicators show which model is currently taking up VRAM/RAM.

📊 Real-time Hardware Analytics

Monitor your system's performance as the model generates:

  • Tokens per Second (TPS): Live tracking of inference performance.
  • VRAM & RAM: Precise graphs showing memory consumption across CPU and GPU.
  • CPU/GPU Utilization: Monitor load to ensure your system is running optimally.
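
The TPS readout is simply token count over wall-clock time. A minimal sketch of such a meter (illustrative; not Studio's actual metrics code):

```python
import time

class TpsMeter:
    """Track live tokens-per-second across a generation run."""
    def __init__(self):
        self.start = None
        self.count = 0

    def tick(self, n=1, now=None):
        """Record n newly generated tokens."""
        now = time.monotonic() if now is None else now
        if self.start is None:
            self.start = now
        self.count += n

    def rate(self, now=None):
        """Tokens per second since the first token."""
        now = time.monotonic() if now is None else now
        if self.start is None or now <= self.start:
            return 0.0
        return self.count / (now - self.start)

meter = TpsMeter()
meter.tick(now=0.0)           # first token at t=0
meter.tick(n=255, now=8.0)    # 256 tokens total after 8 seconds
print(round(meter.rate(now=8.0), 1))  # 32.0
```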

🚀 Usage

Start Oprel Studio and it will automatically open in your default browser:

oprel start

The interface is hosted at http://localhost:11435/gui/.

🎨 Image & Video Generation

ComfyUI is embedded: it is installed and its models are downloaded automatically on first use.

Usage

# Specify model in command
oprel gen-image sdxl-turbo "a cyberpunk city at night"

# High quality with FLUX
oprel gen-image flux-1-schnell "a majestic dragon" --width 1024 --height 1024 --steps 30

# With negative prompt
oprel gen-image sdxl-turbo "a cute cat" --negative "blurry, low quality"

# First time downloads model automatically
oprel gen-image flux-1-dev "stunning landscape"  # Auto-downloads 23GB

Download Models

# List available image models
oprel list-models --category text-to-image

# Pre-download model
oprel pull flux-1-schnell

# Pull video model
oprel pull svd-xt

🔍 Text Embeddings

Generate embeddings for semantic search and RAG applications:

CLI Usage

# Single text embedding
oprel embed nomic-embed-text "Hello world"

# Process files (PDF, DOCX, TXT, JSON)
oprel embed nomic-embed-text --files document.pdf report.docx notes.txt

# Batch processing from file (one text per line)
oprel embed nomic-embed-text --batch texts.txt --output embeddings.json

# JSON output format
oprel embed nomic-embed-text "Machine learning" --format json

Python API

from oprel import embed

# Single embedding
vector = embed("Hello world", model="nomic-embed-text")
print(f"Dimensions: {len(vector)}")

# Batch embeddings
vectors = embed(
    ["Document 1", "Document 2", "Document 3"],
    model="nomic-embed-text"
)

# Semantic search
import math

def cosine_similarity(a, b):
    dot = sum(x*y for x,y in zip(a,b))
    mag_a = math.sqrt(sum(x*x for x in a))
    mag_b = math.sqrt(sum(x*x for x in b))
    return dot / (mag_a * mag_b)

query = embed("machine learning topic", model="nomic-embed-text")
docs = embed(["AI concepts", "cooking recipes", "ML algorithms"], model="nomic-embed-text")
similarities = [cosine_similarity(query, doc) for doc in docs]
best_match = similarities.index(max(similarities))  # 0-based index
print(f"Best match: Document {best_match}")
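
For more than a handful of documents, the argmax above generalizes to a top-k ranking. A small pure-Python helper (the toy 2-D vectors here stand in for real embedding vectors):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]

# Toy 2-D vectors in place of real embeddings:
docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.0], docs, k=2))  # document 0 first, then document 2
```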

Available Embedding Models

  • nomic-embed-text: General-purpose (768 dims)
  • bge-m3: Multilingual support (1024 dims)
  • all-minilm-l6-v2: Lightweight & fast (384 dims)
  • snowflake-arctic: Optimized for RAG (1024 dims)
# List all embedding models
oprel list-models --category embeddings

Available image generation models:

  • sdxl-turbo - Fastest (1-4 steps, 7GB) ⚡
  • flux-1-schnell - Fast + quality (4 steps, 23GB)
  • flux-1-dev - Best quality (28 steps, 23GB)
  • sd-1.5 - Lightweight (4GB)

Vision Models

# Ask about an image
oprel vision qwen3-vl-7b "What's in this image?" --images photo.jpg

# Multi-image analysis
oprel vision qwen3-vl-14b "Compare these images" --images img1.jpg img2.jpg img3.jpg

🛠️ Advanced Features

Hybrid GPU/CPU Offloading

Run larger models on limited VRAM by intelligently splitting layers.

# Automatically calculated during load
# Example: "20/40 layers on GPU, 20 on CPU"
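
The split itself is a budget calculation: fit as many layers as the VRAM budget allows and leave the rest on CPU. A simplified sketch of the idea (the sizes are rough assumptions; Oprel's real planner also reserves room for the KV cache and activations):

```python
def plan_layer_split(n_layers, layer_bytes, vram_budget_bytes):
    """How many transformer layers fit in the VRAM budget; the rest run
    on CPU. Simplified: a real planner also reserves space for the KV
    cache, activations, and the output head."""
    n_gpu = int(min(n_layers, vram_budget_bytes // layer_bytes))
    return n_gpu, n_layers - n_gpu

# A 13B model at Q4 is roughly 7 GB of weights across 40 layers
# (~175 MB/layer). With ~3.5 GB of usable VRAM on a 4GB card:
gpu, cpu = plan_layer_split(40, 175 * 2**20, int(3.5 * 2**30))
print(f"{gpu}/40 layers on GPU, {cpu} on CPU")  # 20/40 layers on GPU, 20 on CPU
```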

Smart Quantization

Auto-selects the best quantization that fits your hardware.

oprel run llama3.1 --quantization auto  # Default
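
Conceptually, "auto" walks a quality-ordered list of quantizations and picks the first one whose weights fit your VRAM. An illustrative sketch (the quant list, bytes-per-parameter figures, and overhead factor are assumptions for the example, not Oprel's actual tables):

```python
# Quantizations from highest to lowest quality, with approximate
# bytes-per-parameter (illustrative figures only).
QUANTS = [("Q8_0", 1.07), ("Q6_K", 0.86), ("Q5_K_M", 0.72),
          ("Q4_K_M", 0.60), ("Q3_K_M", 0.48)]

def pick_quant(n_params, vram_bytes, overhead=1.2):
    """First (best) quant whose weights, padded by ~20% runtime
    overhead, fit in VRAM; None if even the smallest doesn't fit."""
    for name, bytes_per_param in QUANTS:
        if n_params * bytes_per_param * overhead <= vram_bytes:
            return name
    return None

# An 8B-parameter model on an 8 GB card:
print(pick_quant(8e9, 8 * 2**30))  # Q6_K
```

When nothing fits, this is exactly the point where hybrid GPU/CPU offloading takes over.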

OpenAI & Ollama Compatible Server

Production-ready API server with smart model management

Start the server:

oprel serve --host 127.0.0.1 --port 11435

The server provides:

  • OpenAI API compatibility: /v1/chat/completions, /v1/completions, /v1/models
  • Ollama API compatibility: /api/chat, /api/generate, /api/tags
  • Smart Model Management:
    • Models stay loaded for 15 minutes after last use
    • Automatic unload/reload when you switch between models
    • Zero manual load/unload needed
  • Fast SSE Streaming: Server-Sent Events for instant token delivery
  • CORS Support: Use from web applications

OpenAI API Examples

Python (using OpenAI SDK):

from openai import OpenAI

# Point to local Oprel server
client = OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="not-needed"  # Oprel doesn't require API keys
)

# Chat completion
response = client.chat.completions.create(
    model="qwen3-14b",
    messages=[
        {"role": "user", "content": "Write a Python function to reverse a string"}
    ],
    stream=True  # Enable streaming for fast responses
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

cURL:

# Chat completions (streaming)
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-0.5b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Text Completions
curl http://localhost:11435/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-0.5b",
    "prompt": "Once upon a time",
    "max_tokens": 50
  }'

# List Models
curl http://localhost:11435/v1/models

Ollama API Examples

Python (using Ollama SDK):

import ollama

# Works directly with Ollama SDK
client = ollama.Client(host='http://localhost:11435')
response = client.chat(
    model='llama3', 
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True
)

for chunk in response:
    print(chunk['message']['content'], end='')

cURL:

# Ollama-style chat
curl http://localhost:11435/api/chat \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

# List models (Ollama format)
curl http://localhost:11435/api/tags

Model Management Behavior

The server automatically manages models with these rules:

  1. First Request: Model is loaded (takes ~5-30s depending on size)
  2. Subsequent Requests: Model is already loaded (instant response)
  3. Model Switch: Old model unloads, new model loads automatically
  4. Idle Timeout: After 15 minutes of no requests, model is unloaded to free memory
  5. No Manual Management: You never need to call load/unload - it's automatic!
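
The five rules above amount to a small state machine. A minimal sketch of the behavior (not Oprel's implementation; load/unload are stubbed so the sequence of events is visible):

```python
import time

IDLE_TIMEOUT = 15 * 60  # seconds

class ModelManager:
    """Keep at most one model resident; swap on demand, evict when idle."""
    def __init__(self, load, unload):
        self.load, self.unload = load, unload
        self.current = None
        self.last_used = 0.0

    def ensure(self, name, now=None):
        now = time.monotonic() if now is None else now
        if self.current != name:
            if self.current is not None:
                self.unload(self.current)   # rule 3: switching unloads the old model
            self.load(name)                 # rule 1: first request loads
            self.current = name
        self.last_used = now                # rule 2: already loaded -> just touch it

    def maybe_evict(self, now=None):
        now = time.monotonic() if now is None else now
        if self.current and now - self.last_used >= IDLE_TIMEOUT:
            self.unload(self.current)       # rule 4: idle timeout frees memory
            self.current = None

events = []
mgr = ModelManager(load=lambda m: events.append(("load", m)),
                   unload=lambda m: events.append(("unload", m)))
mgr.ensure("qwen3-14b", now=0)       # loads
mgr.ensure("qwen3-14b", now=10)      # cached -> instant, no event
mgr.ensure("llama3.1", now=20)       # swaps models
mgr.maybe_evict(now=20 + 15 * 60)    # idle -> unloads
print(events)
```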

Example workflow:

# Start server
oprel serve

# In another terminal:
# First request - loads qwen3-14b (~10s load time)
curl http://localhost:11435/v1/chat/completions -d '{"model":"qwen3-14b","messages":[{"role":"user","content":"Hi"}]}'

# Second request - instant! Model already loaded
curl http://localhost:11435/v1/chat/completions -d '{"model":"qwen3-14b","messages":[{"role":"user","content":"Tell me a joke"}]}'

# Switch to different model - automatically unloads qwen3-14b and loads llama3.1
curl http://localhost:11435/v1/chat/completions -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hi"}]}'

# After 15 minutes of inactivity, llama3.1 is automatically unloaded

Health Check

curl http://localhost:11435/health
# Returns: {"status":"healthy","timestamp":1234567890,"current_model":"qwen3-14b"}

📊 Benchmarks vs Ollama

Feature             Ollama          Oprel SDK
Model Discovery     10-30s          Instant (<100ms)
Memory Planning     Basic           Precise (KV-Cache aware)
Low VRAM Support    Fails/Slow      Hybrid Offloading
CPU Speed           Standard        30-50% Faster (AVX)
Vision Models       Partial         Full Support
Image/Video Gen     No              ComfyUI Integration
Crash Safety        Frequent OOM    Proactive Warnings
Auto-Optimization   Manual config   Fully Automatic

🧩 Supported Models

Text Generation Models (GGUF - llama.cpp backend)

  • Qwen 3 / 2.5: Best all-around models (32B, 14B, 8B, 3B)
  • Qwen 3 Coder: SOTA for code generation (32B, 14B, 8B)
  • DeepSeek R1: Advanced reasoning (14B, 8B, 7B, 1.5B)
  • Llama 3.3 / 3.1: Meta's flagship (70B, 8B)
  • Gemma 3 / 2: Google's efficient models (27B, 12B, 9B, 4B)
  • Phi-4: Microsoft's compact powerhouse (14B)

Vision Models (VLMs) - GGUF + mmproj

  • Qwen3-VL: Multi-image understanding (32B, 14B, 7B - supports up to 8 images)
  • Qwen2.5-VL: Proven vision model (7B, 3B)
  • Llama 3.2 Vision: Meta's VLM (11B)
  • MiniCPM-V: Efficient mobile-ready VLM (2.6B)
  • Moondream 2: Lightweight vision (1.8B)

Image Generation (Safetensors - ComfyUI backend)

Requires ComfyUI running:

  • FLUX.1-dev: Best quality
  • FLUX.1-schnell: Fast generation
  • SDXL Turbo: Fastest (1-4 steps)

Video Generation (ComfyUI + AnimateDiff)

Requires ComfyUI with video nodes:

  • AnimateDiff
  • Stable Video Diffusion (SVD)
  • Custom workflows

View all available GGUF models:

oprel list-models --category text-generation
oprel list-models --category vision
oprel list-models --category coding
oprel list-models --category reasoning

License

MIT License. Made with ❤️ for local AI developers.
