Local LLM inference + embedding & search in one package. Run 30B on consumer hardware, RAG without ChromaDB.

These details have not been verified by PyPI

Project description

Hippo 🦛

pip install hippo-llm | Python 3.10+ | MIT | Chinese documentation

Run 30B models on a ¥3800 GPU at 78 tok/s. Then search through your documents without installing ChromaDB.

30-second setup

hippo-pipeline serve --model qwen3-30b-a3b-q3 --mode standalone
# → OpenAI-compatible API at localhost:8000/v1/chat/completions

import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
r = client.chat.completions.create(
    model="qwen3-30b-a3b-q3",
    messages=[{"role": "user", "content": "Explain pipeline parallelism"}],
    max_tokens=500
)
print(r.choices[0].message.content)

Two-machine setup

# Machine 1
hippo-pipeline serve --model gemma-3-12b --mode pipeline --rank 0

# Machine 2
hippo-pipeline serve --model gemma-3-12b --mode pipeline --rank 1 \
  --coordinator http://192.168.1.10:9000

Split the model across machines. Run what doesn't fit on one GPU.
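To make the idea of splitting a model concrete, here is a minimal sketch of how contiguous layer ranges could be assigned to devices in proportion to their memory. This is not Hippo's actual partitioning code: `split_layers` is a hypothetical helper, and the layer count and memory sizes are illustrative values only.

```python
def split_layers(num_layers: int, mem_gb: list[float]) -> list[range]:
    """Assign contiguous layer ranges to devices, proportional to memory.

    Hypothetical helper for illustration; not part of the hippo package.
    """
    if not mem_gb or num_layers < len(mem_gb):
        raise ValueError("need at least one layer per device")
    total = sum(mem_gb)
    bounds = [0]
    acc = 0.0
    for i, mem in enumerate(mem_gb):
        acc += mem
        # Ideal cumulative boundary, rounded to a whole layer.
        end = round(num_layers * acc / total)
        # Every device gets at least one layer, and enough layers
        # are left over for the devices that follow.
        devices_left = len(mem_gb) - i - 1
        end = max(end, bounds[-1] + 1)
        end = min(end, num_layers - devices_left)
        bounds.append(end)
    return [range(bounds[i], bounds[i + 1]) for i in range(len(mem_gb))]


# Illustrative values: a 48-layer model on a 16 GB and an 8 GB device.
for rank, layers in enumerate(split_layers(48, [16.0, 8.0])):
    print(f"rank {rank}: layers {layers.start}..{layers.stop - 1}")
```

Each rank then only needs to load the weights for its own range and forward activations to the next rank.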

One install for inference + search

Most RAG setups need two services: Ollama for inference + ChromaDB for vectors. Hippo gives you both in one pip install.

from hippo.embedding import EmbeddingEngine, VectorStore

engine = EmbeddingEngine(model="nomic-embed-text")  # uses local Ollama
store = VectorStore("docs.db", mode="hybrid")  # BM25 + dense RRF fusion

# Add documents
store.add_batch([
    {"text": "Pipeline parallelism splits layers across devices", "metadata": {"source": "readme"}},
    {"text": "BM25 handles exact keyword matches", "metadata": {"source": "docs"}},
    {"text": "Speculative decoding improves latency by 2-3x", "metadata": {"source": "benchmarks"}},
], engine=engine)

# Hybrid search (BM25 + semantic, RRF fused)
results = store.search("how to run big models on small GPUs", engine=engine, top_k=5)
for doc in results:
    print(f"[{doc.score:.3f}] {doc.text}")

No external vector DB. SQLite for persistence, numpy for similarity. Works offline.
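For readers unfamiliar with RRF (reciprocal rank fusion), the sketch below shows the general technique of merging a keyword ranking and a dense ranking. It is an illustration under stated assumptions, not Hippo's internal code: `rrf_fuse` is a hypothetical name, the document ids are made up, and `k=60` is the constant commonly used in the RRF literature rather than a value taken from this project.

```python
def rrf_fuse(
    rankings: list[list[str]], k: int = 60, top_k: int = 5
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Hypothetical helper for illustration.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


bm25_ranking = ["b", "a", "c"]   # made-up keyword ranking
dense_ranking = ["a", "c", "b"]  # made-up embedding ranking
for doc_id, score in rrf_fuse([bm25_ranking, dense_ranking]):
    print(f"{doc_id}: {score:.4f}")
```

Because RRF only uses ranks, it needs no score normalization between BM25 and cosine similarity, which is one reason it is a popular choice for hybrid search.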

Full RAG example with local LLM

from hippo.embedding import EmbeddingEngine, VectorStore
import openai

# 1. Index your documents (one-time)
engine = EmbeddingEngine(model="nomic-embed-text")
store = VectorStore("knowledge.db", mode="hybrid")

documents = [
    "Hippo splits model layers across multiple devices using TCP.",
    "Each device only loads its shard of layers, reducing memory per device.",
    "The loop detector catches semantic repetition using Jaccard similarity.",
    "BM25 hybrid search combines keyword matching with semantic similarity.",
]
store.add_batch([{"text": d} for d in documents], engine=engine)

# 2. RAG query
query = "how does hippo handle memory?"
results = store.search(query, engine=engine, top_k=2)
context = "\n".join(doc.text for doc in results)

# 3. Generate answer with local LLM
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="qwen3-30b-a3b-q3",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": query}
    ]
)
print(response.choices[0].message.content)

What's inside

Feature	Details
Pipeline Parallelism	Split any HF model across N machines. Mac + PC mixed. Plain TCP, no MPI.
Loop Detection	Jaccard-similarity detector catches semantic repetition that `repeat_penalty` misses.
Embedding & Search	Dense + BM25 + hybrid RRF fusion. SQLite-backed, sub-ms queries.
Chinese-optimized BM25	Built-in Chinese tokenizer with stop words. No jieba needed.
ANN Index	Approximate nearest neighbor for large collections (>10K docs).
OpenAI-Compatible API	Drop-in `/v1/chat/completions`. Works with LangChain, LlamaIndex, and other OpenAI-compatible clients.
Auto Memory Budget	Calculates shard splits from available VRAM automatically.
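The loop detection row can be illustrated with a small sketch of a Jaccard-based check. This is an assumption about how such a detector might work, not Hippo's implementation: `jaccard` and `is_looping` are hypothetical names, and the window size and threshold are arbitrary illustrative defaults.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_looping(tokens: list[str], window: int = 8, threshold: float = 0.8) -> bool:
    """Compare the last `window` tokens with the window just before them.

    Hypothetical detector for illustration. Treating each window as a set
    means reordered or lightly reworded repetition still scores high.
    """
    if len(tokens) < 2 * window:
        return False
    recent = set(tokens[-window:])
    previous = set(tokens[-2 * window:-window])
    return jaccard(recent, previous) >= threshold


text = "the model splits layers across devices and " * 2
print(is_looping(text.split(), window=7))
```

A penalty on exact token repeats can miss output that repeats the same words in a slightly different order, whereas a set-based overlap score does not depend on order.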

When to use Hippo

You want...	Use this
Local inference on one machine	`--mode standalone` with any GGUF model
Run a model too big for one device	`--mode pipeline` across 2+ machines
RAG without installing ChromaDB	`VectorStore(mode="hybrid")`
Search Chinese documents	BM25 with built-in tokenizer
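As background for the BM25 rows, here is a minimal sketch of Okapi BM25 scoring over Chinese text. The source does not show Hippo's built-in tokenizer, so the character-bigram tokenizer below is a stand-in assumption, and `char_bigrams` and `bm25_scores` are hypothetical names. The parameters `k1=1.5` and `b=0.75` are common textbook defaults, not values taken from this project.

```python
import math
from collections import Counter


def char_bigrams(text: str) -> list[str]:
    """Stand-in tokenizer (assumption): overlapping character bigrams."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)] or chars


def bm25_scores(
    query: list[str], docs: list[list[str]], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    if not docs:
        return []
    n = len(docs)
    avgdl = (sum(len(d) for d in docs) / n) or 1.0
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["流水线并行将模型层拆分到多台设备", "关键词检索使用BM25算法"]
scores = bm25_scores(char_bigrams("模型拆分"), [char_bigrams(d) for d in docs])
print(scores)
```

Character bigrams avoid the need for a word segmenter, at the cost of some noise; a dictionary-based tokenizer with stop words would normally give cleaner terms.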

Install

pip install hippo-llm

Requirements: Python 3.10+, Ollama running locally for model weights and embeddings.

Roadmap

v0.3: ANN index for >10K document collections ✅
v0.4: Multi-shard support (>2 devices), automatic layer balancing
v0.5: Speculative decoding across shards
v0.6: Built-in model download + GGUF auto-conversion

Benchmarks

Setup	Model	Speed
Mac Mini M2 (16GB)	Qwen3-4B-Q4	41 tok/s
RTX 5060 Ti (16GB)	Qwen3-14B-Q4	41 tok/s
2× Mac Mini (16GB each)	Qwen3-30B-A3B-Q3	78 tok/s
Mac Mini M2 (16GB)	Qwen3-30B-A3B-Q3	24 tok/s

License

MIT

Author

lawcontinue — GitHub

Release history

0.3.1: Jun 11, 2026
0.3.0: May 28, 2026 (this version)
0.2.0: May 3, 2026

Download files

Source Distribution

hippo_llm-0.3.0.tar.gz (61.0 kB), uploaded May 28, 2026

Built Distribution

hippo_llm-0.3.0-py3-none-any.whl (68.7 kB), uploaded May 28, 2026, Python 3

File details

Details for the file hippo_llm-0.3.0.tar.gz.

File metadata

Download URL: hippo_llm-0.3.0.tar.gz
Upload date: May 28, 2026
Size: 61.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for hippo_llm-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`37fd05480576c0c7bc277ceede0a34e73926620a3c1f874eef4bee3c920b4f16`
MD5	`378d02b22723361c31f23e60f7ecdd9a`
BLAKE2b-256	`5d223acc40987143ef11794636de934ecd0e7e55a181985e584c012fb7f4ffcd`


File details

Details for the file hippo_llm-0.3.0-py3-none-any.whl.

File metadata

Download URL: hippo_llm-0.3.0-py3-none-any.whl
Upload date: May 28, 2026
Size: 68.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for hippo_llm-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0d80a945c225d65b0b67d0d636f3e52c999e322cf8f5d27f1b12a0cf49b171d9`
MD5	`5928956d153b0231662886d93133e26e`
BLAKE2b-256	`20c9c194d232e93296e0db0d998a3cdfeb366598acf2a0f295110a53c0bfd93c`

