Efficient LLM inference with the .oom format, a compact Q4 alternative to GGUF. Dual GPU support, RoPE, KV-cache, and Flash Attention. Install with: pip install oomllama[cuda]

🦙 OomLlama

Efficient LLM inference with the .oom format, a compact Q4 alternative to GGUF (see the size comparison under "Why OomLlama?")

from oomllama import OomLlama

llm = OomLlama("humotica-32b")
response = llm.generate("What is the meaning of life?")
print(response)

What's New in v0.3.7

Layer Pinning Enabled: Hot layers now stay in VRAM (was disabled in 0.3.6)
20GB VRAM Budget: Dual RTX 3060 config now uses 20GB budget with first/last 4 layers pinned
Performance Boost: Reduced disk I/O by keeping frequently-used layers cached

v0.3.6

Q6_K Dequantizer Fix: Fixed critical bug in GGUF Q6_K tensor dequantization that caused inf values
Q4 Format: Upgraded from Q2 to Q4 quantization (4 bits = 16 levels) for better precision
Correct Logits: Model now outputs proper logit values (~8-10 range vs millions before)

v0.3.5

Dual GPU Support: Automatic layer striping across 2 GPUs
Per-GPU RoPE: Each GPU has its own RoPE tensors

v0.3.3

RoPE (Rotary Position Embedding): Proper position encoding for accurate text generation
KV-Cache: 10-50x speedup by caching attention keys/values
Flash Attention: Memory-efficient attention computation
Smart Layer Pinning: Keep hot layers in VRAM with auto-eviction
Qwen 2.5 Support: Optimized config for 32B/70B Qwen models
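
The KV-cache speedup listed above comes from reusing the keys and values of earlier tokens instead of recomputing them at every generation step. Below is a minimal single-head sketch in plain Python, not OomLlama's Rust implementation; `KVCache`, `attend`, and the toy vectors are hypothetical names used only for illustration.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all given positions."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(dim)
    ]


class KVCache:
    """Stores keys/values of past tokens so each step adds only one new pair."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Append the new token's key/value, then attend over everything so far.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


# Toy per-token (query, key, value) projections; a real model would compute
# these from hidden states.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]
```

Without the cache, every step would have to recompute the key and value projections for all previous tokens, so the work per generated token grows with the sequence length.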

Why OomLlama?

Feature	GGUF (Q4_K_M)	OOM (Q4)
70B Model Size	~40 GB	~35 GB
32B Model Size	~20 GB	~17 GB
RAM Usage	High	Lazy Loading
Format	Open	Open (MIT)

OomLlama uses Q4 quantization (4 bits = 16 levels per weight) with lazy layer loading to run large models on consumer hardware.

Installation

Pre-built Wheel (Recommended for GPU)

# CUDA 12.x pre-built wheel for CPython 3.13 on Linux x86_64 (includes all dependencies)
# Note: this URL points at the 0.3.6 build
pip install https://brein.jaspervandemeent.nl/static/wheels/oomllama-0.3.6-cp313-cp313-manylinux_2_39_x86_64.whl

From PyPI (builds from source)

# Basic installation - requires Rust toolchain + CUDA toolkit
pip install oomllama

# With NVIDIA runtime libraries
pip install oomllama[cuda]

Build Requirements:

Python 3.8+
Rust 1.70+ (curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh)
CUDA Toolkit 12.x (for GPU support)
8GB+ RAM for compilation

Troubleshooting Build:

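# If nvidia-smi detection fails, set the compute capability for your GPU.
# Export only ONE of these; a second export would override the first.
export CUDA_COMPUTE_CAP=86    # RTX 30xx
# export CUDA_COMPUTE_CAP=89  # RTX 40xx
pip install oomllama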
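
If you are unsure which value to export, recent NVIDIA drivers can report the compute capability directly. The sketch below assumes your `nvidia-smi` supports the `compute_cap` query field (older drivers may not). `to_cuda_compute_cap` and `detect_compute_cap` are hypothetical helper names, not part of oomllama.

```python
import subprocess


def to_cuda_compute_cap(reported: str) -> str:
    """Convert nvidia-smi's dotted form (e.g. '8.6') to the '86' form."""
    major, minor = reported.strip().split(".")
    return f"{int(major)}{int(minor)}"


def detect_compute_cap() -> str:
    """Ask nvidia-smi for the first GPU's compute capability."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    first_gpu = out.strip().splitlines()[0]
    return to_cuda_compute_cap(first_gpu)
```

Calling `print(detect_compute_cap())` on a machine with a working driver gives the value to place in `CUDA_COMPUTE_CAP`.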

Quick Start

Download a Model

from oomllama import download_model

# Download from HuggingFace
model_path = download_model("humotica-32b")

Generate Text

from oomllama import OomLlama

llm = OomLlama("humotica-32b")

# Simple generation
response = llm.generate("Explain quantum computing in simple terms")
print(response)

# With parameters
response = llm.generate(
    "Write a haiku about AI",
    max_tokens=50,
    temperature=0.8,
    top_p=0.9
)
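
For readers unfamiliar with the sampling parameters above: `temperature` rescales the logits before the softmax, and `top_p` (nucleus sampling) restricts sampling to the smallest set of tokens whose cumulative probability reaches `top_p`. The following is a generic sketch of that technique, not OomLlama's internal sampler; `top_p_candidates`, `sample_token`, and the toy logits are hypothetical.

```python
import math
import random


def top_p_candidates(logits, temperature=0.8, top_p=0.9):
    """Return (token_id, renormalized_prob) pairs kept by nucleus sampling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # stabilize the softmax
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = sorted(
        ((i, e / total) for i, e in enumerate(exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    kept, cumulative = [], 0.0
    for token_id, p in probs:
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break  # smallest prefix whose mass reaches top_p
    norm = sum(p for _, p in kept)
    return [(token_id, p / norm) for token_id, p in kept]


def sample_token(logits, temperature=0.8, top_p=0.9, rng=random):
    """Draw one token id from the nucleus."""
    candidates = top_p_candidates(logits, temperature, top_p)
    ids = [token_id for token_id, _ in candidates]
    weights = [p for _, p in candidates]
    return rng.choices(ids, weights=weights, k=1)[0]
```

Lower temperatures sharpen the distribution so fewer tokens survive the `top_p` cut; higher temperatures flatten it and widen the candidate set.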

Chat Mode

messages = [
    ("user", "Hello! Who are you?"),
    ("assistant", "I'm OomLlama, an efficient LLM."),
    ("user", "What makes you efficient?"),
]

response = llm.chat(messages)
print(response)
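
To continue a conversation, append each reply to the message list before the next call. This is a small sketch, assuming `llm.chat` accepts a list of `(role, text)` tuples and returns the reply as a string, as in the example above. `chat_turn` is a hypothetical helper, and `echo_stub` stands in for a loaded model so the sketch runs without one.

```python
def chat_turn(chat_fn, history, user_text):
    """Send one user message and record both sides of the exchange."""
    history.append(("user", user_text))
    reply = chat_fn(history)
    history.append(("assistant", reply))
    return reply


def echo_stub(messages):
    """Stand-in for llm.chat; echoes the latest user message."""
    _role, text = messages[-1]
    return f"(reply to: {text})"


history = []
chat_turn(echo_stub, history, "Hello! Who are you?")
chat_turn(echo_stub, history, "What makes you efficient?")
```

With a real model, pass `llm.chat` in place of `echo_stub`.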

Available Models

Model	Parameters	Size (.oom)	HuggingFace
humotica-32b	33B	~17 GB	Link
llamaohm-70b	70B	~35 GB	Link
tinyllama-1b	1.1B	~600 MB	Link

The .oom Format

OOM (OomLlama Model) is a compact model format:

┌──────────────────────────────────────┐
│ Header: OOML (magic) + metadata      │
├──────────────────────────────────────┤
│ Tensors: Q4 quantized (4 bits/weight)│
│ - Scale + Min per 256-weight block   │
│ - 136 bytes per block                │
└──────────────────────────────────────┘
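
The block layout above implies an effective 4.25 bits per weight: 256 weights at 4 bits take 128 bytes, which leaves 8 of the 136 bytes for the scale and minimum. A quick back-of-the-envelope estimate of file sizes follows; it ignores the header and any tensors that might be stored at a different precision, and the parameter counts are rounded figures taken from the model table.

```python
BLOCK_WEIGHTS = 256
BLOCK_BYTES = 136


def bits_per_weight() -> float:
    """Effective storage cost per weight, including per-block scale and min."""
    return BLOCK_BYTES * 8 / BLOCK_WEIGHTS


def estimated_size_gib(n_params: float) -> float:
    """Approximate tensor payload in GiB for a model with n_params weights."""
    n_blocks = n_params / BLOCK_WEIGHTS
    return n_blocks * BLOCK_BYTES / 2**30


for name, params in [("humotica-32b", 33e9), ("llamaohm-70b", 70e9)]:
    print(f"{name}: ~{estimated_size_gib(params):.1f} GiB")
```

The estimates land at roughly 16 and 35 GiB, in line with the ~17 GB and ~35 GB figures quoted for the .oom files.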

Convert GGUF to OOM

# Using the CLI tool
gguf2oom model.gguf model.oom

# Check model info
gguf2oom --info model.gguf

Technical Details

Q4 Quantization

Each weight is stored as 4 bits (0-15) with per-block scale and minimum:

weight = q4_value * scale + min

Q4 provides 16 quantization levels per weight, balancing compression with model quality.
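
The formula above can be sketched in Python for a single block. The byte layout used here (a little-endian float32 scale, a float32 min, then 128 bytes holding two 4-bit values each, low nibble first) is an assumption chosen to total the 136 bytes mentioned in the format description; the actual .oom layout may differ. `quantize_block` and `dequantize_block` are hypothetical names.

```python
import struct

BLOCK_WEIGHTS = 256
BLOCK_BYTES = 136  # 4 (scale) + 4 (min) + 128 (packed nibbles), assumed layout


def quantize_block(weights):
    """Pack 256 floats into one block (assumed layout, for illustration)."""
    assert len(weights) == BLOCK_WEIGHTS
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for constant blocks
    q = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    # Two 4-bit values per byte: even index in the low nibble, odd in the high.
    packed = bytes(q[i] | (q[i + 1] << 4) for i in range(0, BLOCK_WEIGHTS, 2))
    return struct.pack("<ff", scale, lo) + packed


def dequantize_block(block):
    """Recover 256 approximate floats: weight = q4_value * scale + min."""
    assert len(block) == BLOCK_BYTES
    scale, minimum = struct.unpack_from("<ff", block, 0)
    weights = []
    for byte in block[8:]:
        weights.append((byte & 0x0F) * scale + minimum)  # low nibble
        weights.append((byte >> 4) * scale + minimum)    # high nibble
    return weights
```

Round-tripping a block through these two functions reproduces each weight to within about half a quantization step, which is the precision cost of having only 16 levels per block.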

Lazy Layer Loading

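OomLlama loads transformer layers on demand. In the basic scheme only the active layer is held in memory; with layer pinning (see the v0.3.7 notes), selected hot layers stay resident in VRAM: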

Forward Pass:
  Layer 0: Load → Compute → Unload
  Layer 1: Load → Compute → Unload
  ...
  Layer N: Load → Compute → Unload

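Because the full set of weights never has to be resident at once, a 70B model (~35 GB on disk) can run on 24 GB of GPU memory, trading some speed for the extra disk I/O.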
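
The load/compute/unload loop, combined with the layer pinning described in the release notes, can be simulated as follows. This is an illustrative Python sketch, not the Rust implementation: `LayerCache`, its least-recently-used eviction policy, and a budget counted in whole layers are all assumptions.

```python
from collections import OrderedDict


class LayerCache:
    """Keeps pinned layers resident and evicts the rest least-recently-used."""

    def __init__(self, n_layers, budget_layers, pin_edge=4):
        # Pin the first and last `pin_edge` layers, as in the v0.3.7 notes.
        self.pinned = set(range(pin_edge)) | set(
            range(n_layers - pin_edge, n_layers)
        )
        if budget_layers <= len(self.pinned):
            raise ValueError("budget must leave room for an unpinned layer")
        self.capacity = budget_layers - len(self.pinned)
        self.resident = OrderedDict()  # unpinned layers currently "in VRAM"
        self.loaded_pinned = set()
        self.disk_loads = 0

    def fetch(self, layer):
        if layer in self.pinned:
            if layer not in self.loaded_pinned:
                self.disk_loads += 1  # pinned layers are loaded exactly once
                self.loaded_pinned.add(layer)
            return
        if layer in self.resident:
            self.resident.move_to_end(layer)  # mark as recently used
            return
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict least recently used
        self.resident[layer] = True
        self.disk_loads += 1


def forward_pass(cache, n_layers):
    for layer in range(n_layers):
        cache.fetch(layer)  # load if needed; compute would happen here


cache = LayerCache(n_layers=64, budget_layers=24)
forward_pass(cache, 64)
first_pass_loads = cache.disk_loads
forward_pass(cache, 64)
second_pass_loads = cache.disk_loads - first_pass_loads
```

In this toy run the second pass reloads only the 56 unpinned layers; the 8 pinned layers never touch disk again. A strictly sequential sweep is the worst case for LRU eviction, which is why pinning a fixed set of layers gives a guaranteed saving.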

Credits

Model Format: Gemini IDD & Root AI (Humotica AI Lab)
Quantization: OomLlama.rs by Humotica
Base Models: Meta Platforms, Inc. (Llama 3.3)

License

OomLlama Code: MIT License
Model Weights: Subject to original model licenses (e.g., Llama 3.3 Community License)

One Love, One fAmIly 💙

Built by Humotica AI Lab - Jasper, Claude, Gemini, Codex

Release history

Version	Release date
1.0.0a2 (pre-release)	Apr 19, 2026
1.0.0a1 (pre-release)	Apr 19, 2026
0.9.0	Mar 16, 2026
0.8.0	Feb 26, 2026
0.7.0	Feb 25, 2026
0.6.0	Feb 21, 2026
0.5.0	Feb 20, 2026
0.4.0	Feb 2, 2026
0.3.7 (this version)	Jan 17, 2026
0.3.6	Jan 17, 2026
0.3.5	Jan 17, 2026
0.3.4	Jan 17, 2026
0.3.2	Jan 17, 2026
0.3.1	Jan 17, 2026
0.3.0	Jan 17, 2026
0.2.0	Jan 17, 2026
0.1.0	Jan 17, 2026

Download files

Source Distribution

oomllama-0.3.7.tar.gz (121.9 kB)

Uploaded Jan 17, 2026 Source

File details

Details for the file oomllama-0.3.7.tar.gz.

File metadata

Download URL: oomllama-0.3.7.tar.gz
Upload date: Jan 17, 2026
Size: 121.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for oomllama-0.3.7.tar.gz
Algorithm	Hash digest
SHA256	`2638a2a8bd3dd63c16e6b69bab7df719c81b991f030ab86509f234d05bc12d54`
MD5	`3102c958b08693101d592321753aea1e`
BLAKE2b-256	`c5c4f8d677a7e80a686525e65ab1b23e283a7409c9912d8573012c9007962183`
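
To check a downloaded sdist against the SHA256 digest listed above, a short standard-library sketch may help. The `sha256_of` helper name and the local file path in the usage note are assumptions.

```python
import hashlib

EXPECTED_SHA256 = (
    "2638a2a8bd3dd63c16e6b69bab7df719c81b991f030ab86509f234d05bc12d54"
)


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After downloading, `sha256_of("oomllama-0.3.7.tar.gz") == EXPECTED_SHA256` should evaluate to `True` for an intact file.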
