Local LLM inference + embedding & search in one package. Run 30B on consumer hardware, RAG without ChromaDB.

Hippo 🦛

pip install hippo-llm | Python 3.10+ | MIT | Chinese documentation

Run 30B models on a ¥3800 GPU at 78 tok/s. Then search through your documents without installing ChromaDB.

30-second setup

hippo-pipeline serve --model qwen3-30b-a3b-q3 --mode standalone
# → OpenAI-compatible API at localhost:8000/v1/chat/completions

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
r = client.chat.completions.create(
    model="qwen3-30b-a3b-q3",
    messages=[{"role": "user", "content": "Explain pipeline parallelism"}],
    max_tokens=500
)
print(r.choices[0].message.content)

Two-machine setup

# Machine 1
hippo-pipeline serve --model gemma-3-12b --mode pipeline --rank 0

# Machine 2
hippo-pipeline serve --model gemma-3-12b --mode pipeline --rank 1 \
  --coordinator http://192.168.1.10:9000

Split the model across machines. Run what doesn't fit on one GPU.
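
To make the split concrete, here is a minimal sketch of how contiguous layer ranges could be divided between ranks in proportion to each machine's memory. It is not Hippo's actual scheduler: `split_layers`, the 48-layer model, and the memory figures are assumptions chosen for illustration.

```python
def split_layers(num_layers: int, memory_gb: list[float]) -> list[range]:
    """Assign each rank a contiguous range of layers, sized by its share of memory.

    Hypothetical helper for illustration; not part of the hippo-llm API.
    """
    ranks = len(memory_gb)
    if ranks == 0 or num_layers < ranks:
        raise ValueError("need at least one layer per rank")

    total = sum(memory_gb)
    bounds = [0]
    acc = 0.0
    for i, mem in enumerate(memory_gb):
        acc += mem
        # Ideal cumulative boundary for this rank, rounded to a whole layer.
        bound = round(num_layers * acc / total)
        # Every rank gets at least one layer, and enough layers are left
        # over for the ranks that follow.
        remaining_ranks = ranks - i - 1
        bound = max(bound, bounds[-1] + 1)
        bound = min(bound, num_layers - remaining_ranks)
        bounds.append(bound)

    return [range(start, stop) for start, stop in zip(bounds, bounds[1:])]


# Two machines with equal memory share a 48-layer model evenly.
even = split_layers(48, [16, 16])
# A 16 GB machine paired with an 8 GB machine takes two thirds of the layers.
uneven = split_layers(48, [16, 8])
```

Each rank then loads only the layers in its own range and passes activations on to the next rank.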

One install for inference + search

Most RAG setups need two services: Ollama for inference + ChromaDB for vectors. Hippo gives you both in one pip install.

from hippo.embedding import EmbeddingEngine, VectorStore

engine = EmbeddingEngine(model="nomic-embed-text")  # uses local Ollama
store = VectorStore("docs.db", mode="hybrid")  # BM25 + dense RRF fusion

# Add documents
store.add_batch([
    {"text": "Pipeline parallelism splits layers across devices", "metadata": {"source": "readme"}},
    {"text": "BM25 handles exact keyword matches", "metadata": {"source": "docs"}},
    {"text": "Speculative decoding improves latency by 2-3x", "metadata": {"source": "benchmarks"}},
], engine=engine)

# Hybrid search (BM25 + semantic, RRF fused)
results = store.search("how to run big models on small GPUs", engine=engine, top_k=5)
for doc in results:
    print(f"[{doc.score:.3f}] {doc.text}")

No external vector DB. SQLite for persistence, numpy for similarity. Works offline.
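
For readers unfamiliar with RRF (Reciprocal Rank Fusion), the sketch below shows the general technique of merging a keyword ranking with a dense ranking. It is an illustration under assumptions, not Hippo's internal code: `rrf_fuse` is a hypothetical name, and `k=60` is the constant commonly used in the RRF literature, which may differ from what Hippo uses.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Hypothetical helper, not the hippo-llm API.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for stability.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["doc_b", "doc_a", "doc_c"]   # keyword match order
dense_ranking = ["doc_a", "doc_c", "doc_b"]  # embedding similarity order
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

RRF uses only rank positions, so the BM25 and cosine scores never need to be normalized onto a common scale.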

Full RAG example with local LLM

from hippo.embedding import EmbeddingEngine, VectorStore
import openai

# 1. Index your documents (one-time)
engine = EmbeddingEngine(model="nomic-embed-text")
store = VectorStore("knowledge.db", mode="hybrid")

documents = [
    "Hippo splits model layers across multiple devices using TCP.",
    "Each device only loads its shard of layers, reducing memory per device.",
    "The loop detector catches semantic repetition using Jaccard similarity.",
    "BM25 hybrid search combines keyword matching with semantic similarity.",
]
store.add_batch([{"text": d} for d in documents], engine=engine)

# 2. RAG query
query = "how does hippo handle memory?"
results = store.search(query, engine=engine, top_k=2)
context = "\n".join(doc.text for doc in results)

# 3. Generate answer with local LLM
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="qwen3-30b-a3b-q3",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": query}
    ]
)
print(response.choices[0].message.content)

What's inside

Feature | Details
Pipeline Parallelism | Split any HF model across N machines. Mac + PC mixed. Plain TCP, no MPI.
Loop Detection | Jaccard-similarity detector catches semantic repetition that repeat_penalty misses.
Embedding & Search | Dense + BM25 + hybrid RRF fusion. SQLite-backed, sub-ms queries.
Chinese-optimized BM25 | Built-in Chinese tokenizer with stop words. No jieba needed.
ANN Index | Approximate nearest neighbor for large collections (>10K docs).
OpenAI-Compatible API | Drop-in /v1/chat/completions. Works with LangChain, LlamaIndex, anything.
Auto Memory Budget | Calculates shard splits from available VRAM automatically.
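
As a rough illustration of the loop-detection idea, the sketch below compares the most recent window of generated words with the window before it, using Jaccard similarity on word sets. The function names, window size, and threshold are assumptions; Hippo's real detector may tokenize and compare differently.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_looping(text: str, window: int = 12, threshold: float = 0.8) -> bool:
    """Flag output whose last `window` words largely repeat the previous window.

    Hypothetical helper for illustration; not part of the hippo-llm API.
    """
    words = text.lower().split()
    if len(words) < 2 * window:
        return False  # not enough output yet to compare two windows
    recent = set(words[-window:])
    previous = set(words[-2 * window:-window])
    return jaccard(recent, previous) >= threshold


repeated = "the model splits layers across devices " * 4
fresh = " ".join(f"word{i}" for i in range(24))
loop_found = is_looping(repeated)
loop_missed = is_looping(fresh)
```

Word sets ignore order, so a model that repeats the same idea with shuffled wording still scores high. A token-level repeat penalty does not catch that case.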

When to use Hippo

You want... | Use this
Local inference on one machine | --mode standalone with any GGUF model
Run a model too big for one device | --mode pipeline across 2+ machines
RAG without installing ChromaDB | VectorStore(mode="hybrid")
Search Chinese documents | BM25 with built-in tokenizer
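
The sketch below shows one common way to run BM25 over Chinese text without a segmenter such as jieba: split each run of Chinese characters into overlapping character bigrams after dropping stop characters. This is an assumption about the general approach, not a description of Hippo's built-in tokenizer. The stop-word list, function names, and BM25 parameters (`k1=1.5`, `b=0.75`) are illustrative.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"的", "了", "是", "在", "和"}  # tiny illustrative stop list


def tokenize(text: str) -> list[str]:
    """Character bigrams for Chinese runs, whole words for Latin/digit runs."""
    tokens: list[str] = []
    for chunk in re.findall(r"[\u4e00-\u9fff]+|[a-z0-9]+", text.lower()):
        if "\u4e00" <= chunk[0] <= "\u9fff":
            chars = [c for c in chunk if c not in STOP_WORDS]
            if len(chars) == 1:
                tokens.append(chars[0])
            tokens.extend(a + b for a, b in zip(chars, chars[1:]))
        else:
            tokens.append(chunk)
    return tokens


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with the standard BM25 formula."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "流水线并行将模型的层拆分到多台设备",  # pipeline parallelism splits layers across devices
    "BM25 处理精确的关键词匹配",            # BM25 handles exact keyword matches
    "推测解码可以降低延迟",                # speculative decoding can reduce latency
]
scores = bm25_scores("模型并行", docs)  # query: "model parallelism"
```

Only the first document shares bigrams with the query, so it is the only one with a non-zero score.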

Install

pip install hippo-llm

Requirements: Python 3.10+, Ollama running locally for model weights and embeddings.

Roadmap

  • v0.3: ANN index for >10K document collections ✅
  • v0.4: Multi-shard support (>2 devices), automatic layer balancing
  • v0.5: Speculative decoding across shards
  • v0.6: Built-in model download + GGUF auto-conversion

Benchmarks

Setup | Model | Speed
Mac Mini M2 (16GB) | Qwen3-4B-Q4 | 41 tok/s
RTX 5060 Ti (16GB) | Qwen3-14B-Q4 | 41 tok/s
2× Mac Mini (16GB each) | Qwen3-30B-A3B-Q3 | 78 tok/s
Mac Mini M2 (16GB) | Qwen3-30B-A3B-Q3 | 24 tok/s

License

MIT

Author

lawcontinue — GitHub

