Efficient LLM inference with .oom format - 2x smaller than GGUF. Dual GPU support, RoPE, KV-Cache & Flash Attention! pip install oomllama[cuda]

🦙 OomLlama

Efficient LLM inference with .oom format - 2x smaller than GGUF

from oomllama import OomLlama

llm = OomLlama("humotica-32b")
response = llm.generate("What is the meaning of life?")
print(response)

What's New in v0.4.0

🎉 Major Release - Correct Output at Last!

  • Interleaved RoPE Fix: Fixed a critical bug in which RoPE was applied in the non-interleaved layout even though GGUF weights use the interleaved layout for Q/K projections
    • GGUF stores Q/K as [x0, x32, x1, x33, ...] instead of [x0, x1, ..., x31, x32, ...]
    • Now using correct apply_rope() (interleaved) instead of apply_rope_llama() (non-interleaved)
    • This was the root cause of position-dependent errors in attention output
  • Verified Output: TinyLlama now correctly outputs "2 + 2 = 4" and coherent text
  • Clean Production Code: Removed all debug output for production deployment
  • Qwen 7B/32B Support: Added support for Qwen 2.5 models (Q3_K_M, Q8_0)
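
The layout issue behind the interleaved RoPE fix can be illustrated with a small, self-contained sketch. This is not OomLlama's code: the helper names (rope_interleaved, rope_split_half, to_interleaved), the toy head size of 8 and the base of 10000 are assumptions chosen for illustration. It shows that rotating adjacent pairs on the interleaved layout matches the reference result, while rotating split-half pairs on the same data does not.

```python
import math

def rope_interleaved(v, pos, base=10000.0):
    """Rotate adjacent pairs (v[2i], v[2i+1]) by angle pos * base**(-2i/d)."""
    d = len(v)
    out = list(v)
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = v[2 * i], v[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

def rope_split_half(v, pos, base=10000.0):
    """Rotate pairs (v[i], v[i + d/2]) by the same angles."""
    d = len(v)
    half = d // 2
    out = list(v)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = v[i], v[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

def to_interleaved(v):
    """Reorder [x0, x1, ..., x_{d-1}] into [x0, x_{d/2}, x1, x_{d/2+1}, ...]."""
    half = len(v) // 2
    out = []
    for i in range(half):
        out += [v[i], v[i + half]]
    return out

head = [float(i + 1) for i in range(8)]   # toy head with head_dim = 8
stored = to_interleaved(head)             # layout as described for GGUF Q/K
pos = 3

# Reference: split-half RoPE on the original order, then reordered for comparison.
reference = to_interleaved(rope_split_half(head, pos))
correct = rope_interleaved(stored, pos)   # matching variant for this layout
wrong = rope_split_half(stored, pos)      # mismatched variant (the bug)

max_err_correct = max(abs(x - y) for x, y in zip(correct, reference))
max_err_wrong = max(abs(x - y) for x, y in zip(wrong, reference))
print(f"matching variant error:   {max_err_correct:.3g}")
print(f"mismatched variant error: {max_err_wrong:.3g}")
```

At position 0 every rotation angle is zero, so both variants return the input unchanged. The mismatch only shows up at later positions, which is consistent with the position-dependent errors described above.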

v0.3.7

  • Layer Pinning Enabled: Hot layers stay in VRAM
  • 20GB VRAM Budget: Dual RTX 3060 config with first/last 4 layers pinned
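
The pinning idea can be sketched as a small cache in which the first and last 4 layers are never evicted. This is a toy model, not OomLlama's implementation: the LayerCache class, its capacity counted in layers rather than bytes, and the least-recently-used eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict

class LayerCache:
    """Toy VRAM cache: pinned layers stay resident, others are evicted LRU-first."""

    def __init__(self, n_layers, capacity, n_edge=4):
        # Pin the first and last `n_edge` layers, as in the v0.3.7 note.
        self.pinned = set(range(n_edge)) | set(range(n_layers - n_edge, n_layers))
        self.capacity = capacity        # maximum number of resident layers (assumed unit)
        self.resident = OrderedDict()   # layer index -> weights, oldest use first
        self.loads = 0                  # how many times weights were loaded

    def get(self, idx, load_fn):
        if idx in self.resident:
            self.resident.move_to_end(idx)   # mark as most recently used
            return self.resident[idx]
        while len(self.resident) >= self.capacity:
            # Evict the least recently used layer that is not pinned.
            victim = next((k for k in self.resident if k not in self.pinned), None)
            if victim is None:
                raise RuntimeError("capacity is too small for the pinned set")
            del self.resident[victim]
        self.resident[idx] = load_fn(idx)
        self.loads += 1
        return self.resident[idx]

# Two forward passes over a hypothetical 16-layer model with room for 10 layers.
cache = LayerCache(n_layers=16, capacity=10)
for _ in range(2):
    for idx in range(16):
        cache.get(idx, lambda i: f"weights-{i}")

print("total loads:", cache.loads)
print("pinned layers still resident:", cache.pinned <= set(cache.resident))
```

In this toy run the second pass reloads only the non-pinned layers, because the pinned ones never leave the cache.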

v0.3.6

  • Q6_K Dequantizer Fix: Fixed critical bug in GGUF Q6_K tensor dequantization that caused inf values
  • Q4 Format: Upgraded from Q2 to Q4 quantization (4 bits = 16 levels) for better precision
  • Correct Logits: Model now outputs proper logit values (~8-10 range vs millions before)

v0.3.5

  • Dual GPU Support: Automatic layer striping across 2 GPUs
  • Per-GPU RoPE: Each GPU has its own RoPE tensors

v0.3.3

  • RoPE (Rotary Position Embedding): Proper position encoding for accurate text generation
  • KV-Cache: 10-50x speedup by caching attention keys/values
  • Flash Attention: Memory-efficient attention computation
  • Smart Layer Pinning: Keep hot layers in VRAM with auto-eviction
  • Qwen 2.5 Support: Optimized config for 32B/70B Qwen models
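
Why a KV-cache saves work can be seen in a minimal single-head attention sketch. It is written in plain Python for readability and is not how OomLlama implements attention. All names (CountingProjector, decode_no_cache, decode_with_cache) and the tiny weight matrices are made up for the example. It only counts key/value projections and makes no claim about the actual speedup.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Softmax attention of one query over all keys/values seen so far."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

class CountingProjector:
    """Wraps a weight matrix and counts how many projections were computed."""
    def __init__(self, W):
        self.W, self.calls = W, 0
    def __call__(self, x):
        self.calls += 1
        return matvec(self.W, x)

def decode_no_cache(tokens, Wq, Wk, Wv):
    outputs = []
    for t in range(len(tokens)):
        keys = [Wk(x) for x in tokens[: t + 1]]     # recomputed at every step
        values = [Wv(x) for x in tokens[: t + 1]]
        outputs.append(attend(matvec(Wq, tokens[t]), keys, values))
    return outputs

def decode_with_cache(tokens, Wq, Wk, Wv):
    keys, values, outputs = [], [], []
    for x in tokens:
        keys.append(Wk(x))                          # computed once per token
        values.append(Wv(x))
        outputs.append(attend(matvec(Wq, x), keys, values))
    return outputs

Wq = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
k1, v1 = CountingProjector([[0.5, 0.1], [0.2, 0.3]]), CountingProjector([[1.0, 2.0], [3.0, 4.0]])
k2, v2 = CountingProjector([[0.5, 0.1], [0.2, 0.3]]), CountingProjector([[1.0, 2.0], [3.0, 4.0]])

slow = decode_no_cache(tokens, Wq, k1, v1)
fast = decode_with_cache(tokens, Wq, k2, v2)
print("same outputs:", slow == fast)
print("key projections without cache:", k1.calls, "with cache:", k2.calls)
```

Without the cache the number of key/value projections grows quadratically with sequence length. With the cache it grows linearly.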

Why OomLlama?

Feature          GGUF (Q4_K_M)   OOM (Q4)
70B Model Size   ~40 GB          ~35 GB
32B Model Size   ~20 GB          ~17 GB
RAM Usage        High            Lazy Loading
Format           Open            Open (MIT)

OomLlama uses Q4 quantization (4 bits = 16 levels per weight) with lazy layer loading to run large models on consumer hardware.

Installation

Pre-built Wheel (Recommended for GPU)

# CUDA 12.x pre-built wheel for v0.3.6 (CPython 3.13, glibc 2.39+, x86-64; includes all dependencies)
pip install https://brein.jaspervandemeent.nl/static/wheels/oomllama-0.3.6-cp313-cp313-manylinux_2_39_x86_64.whl

From PyPI (builds from source)

# Basic installation - requires Rust toolchain + CUDA toolkit
pip install oomllama

# With NVIDIA runtime libraries
pip install "oomllama[cuda]"  # quotes keep shells such as zsh from expanding the brackets

Build Requirements:

  • Python 3.8+
  • Rust 1.70+ (curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh)
  • CUDA Toolkit 12.x (for GPU support)
  • 8GB+ RAM for compilation

Troubleshooting Build:

# If nvidia-smi detection fails, set the compute capability for your GPU (use one):
export CUDA_COMPUTE_CAP=86    # RTX 30xx
# export CUDA_COMPUTE_CAP=89  # RTX 40xx
pip install oomllama

Quick Start

Download a Model

from oomllama import download_model

# Download from HuggingFace
model_path = download_model("humotica-32b")

Generate Text

from oomllama import OomLlama

llm = OomLlama("humotica-32b")

# Simple generation
response = llm.generate("Explain quantum computing in simple terms")
print(response)

# With parameters
response = llm.generate(
    "Write a haiku about AI",
    max_tokens=50,
    temperature=0.8,
    top_p=0.9
)
print(response)
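
For readers unfamiliar with the temperature and top_p parameters, the sketch below shows how these two knobs are conventionally interpreted in LLM samplers. Whether OomLlama implements them exactly this way is not stated here, so treat it as an illustration only. The function names and the example logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution, higher temperature flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]   # made-up scores for four candidate tokens
probs = softmax_with_temperature(logits, temperature=0.8)
candidates = top_p_filter(probs, top_p=0.9)
print(candidates)
```

With these example values the least likely token falls outside the 0.9 probability mass and is dropped before sampling.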

Chat Mode

# Conversation history as (role, content) tuples; reuses the llm instance created above
messages = [
    ("user", "Hello! Who are you?"),
    ("assistant", "I'm OomLlama, an efficient LLM."),
    ("user", "What makes you efficient?"),
]

response = llm.chat(messages)
print(response)

Available Models

Model          Parameters   Size (.oom)   HuggingFace
humotica-32b   33B          ~17 GB        Link
llamaohm-70b   70B          ~35 GB        Link
tinyllama-1b   1.1B         ~600 MB       Link

The .oom Format

OOM (OomLlama Model) is a compact model format:

┌──────────────────────────────────────┐
│ Header: OOML (magic) + metadata      │
├──────────────────────────────────────┤
│ Tensors: Q4 quantized (4 bits/weight)│
│ - Scale + Min per 256-weight block   │
│ - 136 bytes per block                │
└──────────────────────────────────────┘
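
The block numbers in the diagram can be checked with a little arithmetic. The sketch below is a rough estimate only. It ignores the file header and metadata and assumes every parameter is stored in Q4 blocks, which may not hold for all tensors. The helper name estimate_oom_bytes is made up for this example.

```python
WEIGHTS_PER_BLOCK = 256   # from the format diagram
BYTES_PER_BLOCK = 136     # from the format diagram

packed_bytes = WEIGHTS_PER_BLOCK * 4 // 8                   # 4-bit values, two per byte
overhead_bytes = BYTES_PER_BLOCK - packed_bytes             # what remains for scale + min
bits_per_weight = BYTES_PER_BLOCK * 8 / WEIGHTS_PER_BLOCK   # effective storage cost

def estimate_oom_bytes(n_params):
    """Estimated tensor payload if all parameters sit in Q4 blocks (an assumption)."""
    n_blocks = -(-n_params // WEIGHTS_PER_BLOCK)            # ceiling division
    return n_blocks * BYTES_PER_BLOCK

print(f"packed: {packed_bytes} B, scale+min: {overhead_bytes} B, "
      f"effective: {bits_per_weight} bits/weight")
for name, n_params in [("70B", 70_000_000_000), ("33B", 33_000_000_000)]:
    size = estimate_oom_bytes(n_params)
    print(f"{name}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

Under these assumptions the effective cost is slightly above 4 bits per weight, and the estimates land in the same range as the sizes listed for the 32B and 70B models.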

Convert GGUF to OOM

# Using the CLI tool
gguf2oom model.gguf model.oom

# Check model info
gguf2oom --info model.gguf

Technical Details

Q4 Quantization

Each weight is stored as 4 bits (0-15) with per-block scale and minimum:

weight = q4_value * scale + min

Q4 provides 16 quantization levels per weight, balancing compression with model quality.
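
The formula above can be applied to a whole block as in the following sketch. The byte layout is an assumption made for illustration: the exact .oom layout is not specified here. The sketch assumes a little-endian float32 scale and minimum in the first 8 bytes, followed by 128 bytes of packed values with the low nibble first. The matching quantize_block helper exists only to make the example round-trip.

```python
import struct

WEIGHTS_PER_BLOCK = 256
BLOCK_BYTES = 136

def quantize_block(weights):
    """Pack 256 floats into one block using the assumed layout."""
    if len(weights) != WEIGHTS_PER_BLOCK:
        raise ValueError("expected exactly 256 weights")
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0           # avoid a zero scale for constant blocks
    q = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    packed = bytes(q[i] | (q[i + 1] << 4) for i in range(0, len(q), 2))
    return struct.pack("<ff", scale, lo) + packed

def dequantize_block(block):
    """Decode one block: weight = q4_value * scale + min."""
    if len(block) != BLOCK_BYTES:
        raise ValueError("expected exactly 136 bytes")
    scale, minimum = struct.unpack_from("<ff", block, 0)   # assumed header fields
    weights = []
    for byte in block[8:]:
        weights.append((byte & 0x0F) * scale + minimum)    # low nibble (assumed first)
        weights.append((byte >> 4) * scale + minimum)      # high nibble
    return weights

original = [i / 255 for i in range(256)]    # example weights in [0, 1]
block = quantize_block(original)
restored = dequantize_block(block)
max_err = max(abs(a - b) for a, b in zip(original, restored))
print(f"block size: {len(block)} bytes, max round-trip error: {max_err:.4f}")
```

With 16 levels the round-trip error is bounded by about half a quantization step, which is the trade-off between compression and precision mentioned above.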

Lazy Layer Loading

OomLlama loads transformer layers on demand, keeping only the active layer (plus any pinned layers) in memory:

Forward Pass:
  Layer 0: Load → Compute → Unload
  Layer 1: Load → Compute → Unload
  ...
  Layer N: Load → Compute → Unload

This enables running 70B models on 24GB GPU RAM.
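
The load, compute, unload cycle can be sketched in a few lines. This toy version uses a single number as each layer's weights and a hypothetical LayerStore class that stands in for reading tensors from disk. It only demonstrates the control flow and is not OomLlama's loader.

```python
class LayerStore:
    """Stand-in for on-disk weights; tracks how many layers are resident."""

    def __init__(self, n_layers):
        self.n_layers = n_layers
        self.resident = set()
        self.peak = 0               # highest number of layers held at once

    def load(self, idx):
        self.resident.add(idx)
        self.peak = max(self.peak, len(self.resident))
        return float(idx + 1)       # toy weights: one scale factor per layer

    def unload(self, idx):
        self.resident.discard(idx)

def forward_lazy(x, store):
    """Run every layer while holding at most one layer's weights at a time."""
    for idx in range(store.n_layers):
        w = store.load(idx)         # Load
        try:
            x = x * w               # Compute (toy layer: multiply by the weight)
        finally:
            store.unload(idx)       # Unload, even if the compute step fails
    return x

store = LayerStore(n_layers=4)
y = forward_lazy(1.0, store)
print("output:", y, "| peak resident layers:", store.peak)
```

Peak memory is then governed by the largest single layer rather than by the whole model, at the cost of reloading weights on every forward pass.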

Credits

  • Model Format: Gemini IDD & Root AI (Humotica AI Lab)
  • Quantization: OomLlama.rs by Humotica
  • Base Models: Meta Platforms, Inc. (Llama 3.3)

License

  • OomLlama Code: MIT License
  • Model Weights: Subject to original model licenses (e.g., Llama 3.3 Community License)

One Love, One fAmIly 💙

Built by Humotica AI Lab - Jasper, Claude, Gemini, Codex
