Local LLM inference + embedding & search in one package. Run 30B on consumer hardware, RAG without ChromaDB.

Hippo 🦛

pip install hippo-llm | Python 3.10+ | MIT | Chinese documentation (中文文档)

Run 30B models on a ¥3800 GPU at 78 tok/s. Then search through your documents without installing ChromaDB.

30-second setup

hippo-pipeline serve --model qwen3-30b-a3b-q3 --mode standalone
# → OpenAI-compatible API at localhost:8000/v1/chat/completions

import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
r = client.chat.completions.create(
    model="qwen3-30b-a3b-q3",
    messages=[{"role": "user", "content": "Explain pipeline parallelism"}],
    max_tokens=500
)
print(r.choices[0].message.content)

Two-machine setup

# Machine 1
hippo-pipeline serve --model gemma-3-12b --mode pipeline --rank 0

# Machine 2
hippo-pipeline serve --model gemma-3-12b --mode pipeline --rank 1 \
  --coordinator http://192.168.1.10:9000

Split the model across machines. Run what doesn't fit on one GPU.
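
As a rough illustration of what splitting a model across machines involves, the sketch below assigns contiguous blocks of layers to ranks in proportion to each device's free memory. The helper `split_layers` and its parameters are hypothetical and are not part of Hippo's API; Hippo's actual sharding logic may work differently.

```python
def split_layers(num_layers: int, free_mem_gb: list[float]) -> list[range]:
    """Assign contiguous layer ranges to ranks, proportional to free memory.

    Hypothetical helper for illustration only; not Hippo's API.
    """
    if num_layers < len(free_mem_gb):
        raise ValueError("need at least one layer per rank")
    total = sum(free_mem_gb)
    # Rounded-down proportional share; every rank gets at least one layer.
    counts = [max(1, int(num_layers * m / total)) for m in free_mem_gb]
    # Fix rounding drift, visiting ranks with the most memory first.
    order = sorted(range(len(counts)), key=lambda i: free_mem_gb[i], reverse=True)
    i = 0
    while sum(counts) != num_layers:
        rank = order[i % len(order)]
        if sum(counts) < num_layers:
            counts[rank] += 1
        elif counts[rank] > 1:
            counts[rank] -= 1
        i += 1
    # Turn counts into contiguous ranges: rank 0 holds the first layers.
    ranges, start = [], 0
    for c in counts:
        ranges.append(range(start, start + c))
        start += c
    return ranges


# Example: a 48-layer model on two devices with 16 GB free each.
shards = split_layers(48, [16.0, 16.0])
for rank, layers in enumerate(shards):
    print(f"rank {rank}: layers {layers.start}-{layers.stop - 1}")
```

Keeping each shard contiguous means only one activation tensor has to cross the network per boundary between ranks, which is why pipeline splits are usually made along layer order.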

One install for inference + search

Most RAG setups run two separate services: an inference server such as Ollama and a vector database such as ChromaDB. Hippo provides both inference and vector search in one pip install.

from hippo.embedding import EmbeddingEngine, VectorStore

engine = EmbeddingEngine(model="nomic-embed-text")  # uses local Ollama
store = VectorStore("docs.db", mode="hybrid")  # BM25 + dense RRF fusion

# Add documents
store.add_batch([
    {"text": "Pipeline parallelism splits layers across devices", "metadata": {"source": "readme"}},
    {"text": "BM25 handles exact keyword matches", "metadata": {"source": "docs"}},
    {"text": "Speculative decoding improves latency by 2-3x", "metadata": {"source": "benchmarks"}},
], engine=engine)

# Hybrid search (BM25 + semantic, RRF fused)
results = store.search("how to run big models on small GPUs", engine=engine, top_k=5)
for doc in results:
    print(f"[{doc.score:.3f}] {doc.text}")

No external vector DB. SQLite for persistence, numpy for similarity. Works offline.
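
For readers unfamiliar with RRF, here is a minimal sketch of Reciprocal Rank Fusion, the technique named above for merging the BM25 and dense rankings. The function `rrf_fuse` is hypothetical, and the constant `k=60` is the commonly used default from the RRF literature; the value Hippo uses internally is not stated in this README.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_k: int = 5) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Illustrative only; not Hippo's implementation.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every document it contains.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


bm25_ranking = ["doc_b", "doc_a", "doc_c"]   # keyword ranking (made-up IDs)
dense_ranking = ["doc_a", "doc_c", "doc_b"]  # embedding ranking (made-up IDs)
fused = rrf_fuse([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(f"[{score:.4f}] {doc_id}")
```

Because RRF works on ranks rather than raw scores, the BM25 and cosine-similarity scales never need to be normalized against each other.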

Full RAG example with local LLM

from hippo.embedding import EmbeddingEngine, VectorStore
import openai

# 1. Index your documents (one-time)
engine = EmbeddingEngine(model="nomic-embed-text")
store = VectorStore("knowledge.db", mode="hybrid")

documents = [
    "Hippo splits model layers across multiple devices using TCP.",
    "Each device only loads its shard of layers, reducing memory per device.",
    "The loop detector catches semantic repetition using Jaccard similarity.",
    "BM25 hybrid search combines keyword matching with semantic similarity.",
]
store.add_batch([{"text": d} for d in documents], engine=engine)

# 2. RAG query
query = "how does hippo handle memory?"
results = store.search(query, engine=engine, top_k=2)
context = "\n".join(doc.text for doc in results)

# 3. Generate answer with local LLM
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="qwen3-30b-a3b-q3",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": query}
    ]
)
print(response.choices[0].message.content)

What's inside

| Feature | Details |
| --- | --- |
| Pipeline Parallelism | Split any HF model across N machines. Mac + PC mixed. Plain TCP, no MPI. |
| Loop Detection | Jaccard-similarity detector catches semantic repetition that repeat_penalty misses. |
| Embedding & Search | Dense + BM25 + hybrid RRF fusion. SQLite-backed, sub-ms queries. |
| Chinese-optimized BM25 | Built-in Chinese tokenizer with stop words. No jieba needed. |
| ANN Index | Approximate nearest neighbor for large collections (>10K docs). |
| OpenAI-Compatible API | Drop-in /v1/chat/completions. Works with LangChain, LlamaIndex, anything. |
| Auto Memory Budget | Calculates shard splits from available VRAM automatically. |
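
To make the loop-detection idea concrete, the sketch below compares the word set of the most recent window of output with the window before it using Jaccard similarity. The names `jaccard` and `is_looping`, the window size, and the threshold are all assumptions chosen for illustration; Hippo's detector may use different units and settings.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_looping(text: str, window: int = 8, threshold: float = 0.8) -> bool:
    """Return True if the last `window` words largely repeat the window before.

    Hypothetical helper for illustration only; not Hippo's API.
    """
    words = text.lower().split()
    if len(words) < 2 * window:
        return False  # not enough output yet to compare two windows
    recent = set(words[-window:])
    previous = set(words[-2 * window:-window])
    return jaccard(recent, previous) >= threshold


repeated = "hippo splits model layers across devices over tcp " * 2
varied = ("Pipeline parallelism splits layers across devices and each device "
          "only loads its own shard which reduces memory use")
print(is_looping(repeated))
print(is_looping(varied))
```

Comparing sets rather than exact token sequences means a window that reorders or lightly rephrases the previous one can still score high, which a per-token repetition penalty would not catch.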

When to use Hippo

| You want... | Use this |
| --- | --- |
| Local inference on one machine | --mode standalone with any GGUF model |
| Run a model too big for one device | --mode pipeline across 2+ machines |
| RAG without installing ChromaDB | VectorStore(mode="hybrid") |
| Search Chinese documents | BM25 with built-in tokenizer |
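
For context on the keyword side of hybrid search, here is a minimal sketch of Okapi BM25 scoring over pre-tokenized documents. The function `bm25_scores` and the parameter defaults `k1=1.5` and `b=0.75` are assumptions for illustration, and the example uses whitespace-split English tokens. Chinese text would first need word segmentation, which this README says Hippo's built-in tokenizer handles.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query with Okapi BM25.

    Illustrative only; not Hippo's implementation.
    """
    n = len(docs)
    if n == 0:
        return []
    avgdl = (sum(len(d) for d in docs) / n) or 1.0  # average document length
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "pipeline parallelism splits layers across devices".split(),
    "bm25 handles exact keyword matches".split(),
    "speculative decoding improves latency".split(),
]
scores = bm25_scores("keyword matches".split(), docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, scores)
```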

Install

pip install hippo-llm

Requirements: Python 3.10+, Ollama running locally for model weights and embeddings.

Roadmap

  • v0.3: ANN index for >10K document collections ✅
  • v0.4: Multi-shard support (>2 devices), automatic layer balancing
  • v0.5: Speculative decoding across shards
  • v0.6: Built-in model download + GGUF auto-conversion

Benchmarks

| Setup | Model | Speed |
| --- | --- | --- |
| Mac Mini M2 (16GB) | Qwen3-4B-Q4 | 41 tok/s |
| RTX 5060 Ti (16GB) | Qwen3-14B-Q4 | 41 tok/s |
| 2× Mac Mini (16GB each) | Qwen3-30B-A3B-Q3 | 78 tok/s |
| Mac Mini M2 (16GB) | Qwen3-30B-A3B-Q3 | 24 tok/s |

License

MIT

Author

lawcontinue — GitHub
