Local-first multi-format document ingestion engine with semantic search using sentence-transformers and ChromaDB
---
title: Zaza Semantic Engine
emoji: 🧠
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
short_description: Local-first multilingual semantic search (50+ languages)
---
Zaza Semantic Engine
Local-first multi-format document ingestion engine with real semantic search.
Why Zaza?
Most document tools fall into two camps: cloud-based SaaS (your docs leave your machine) or dumb keyword search (finds exact word matches, misses the point). Zaza avoids both: it runs locally and searches by meaning.
- Local-first: your documents never leave your machine. No API keys, no data leaks.
- Semantic search: find documents by meaning, not just keywords. Search "budget" and it finds "financial analysis" and "quarterly results".
- Multi-format: TXT, PDF, Markdown, DOCX, JSON, YAML, EPUB, CSV, HTML, XML.
- 50+ languages: built on `paraphrase-multilingual-MiniLM-L12-v2`. Search in French, English, Arabic, or any supported language.
- Zero config: run `zaza ingest ./docs/` and you're done.
Installation
# Core package
pip install -e .
# With API support
pip install -e ".[api]"
# With semantic search (embeddings + multilingual model)
pip install -e ".[semantic]"
# Full installation
pip install -e ".[all]"
Quick Start
# Ingest documents
zaza ingest ./my-documents/
# Keyword search (by filename)
zaza search "report"
# Semantic search (by meaning)
zaza search-semantic "financial analysis quarterly results" --top 5
# View stats
zaza stats
# Start API server (V3: either form works)
zaza api
zaza server
Semantic Search in Action
This project uses sentence-transformers (paraphrase-multilingual-MiniLM-L12-v2) to generate embeddings and ChromaDB for vector storage.
Unlike keyword search, semantic search finds documents with related concepts even when the exact words differ:
| Query | Keyword Search | Semantic Search |
|---|---|---|
| "budget" | Only files named "budget" | Finds "financial report", "quarterly analysis", "cost breakdown" |
| "rapport financier" | Only French files with exact match | Finds "financial analysis", "balance sheet", "revenue summary" |
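The ranking step behind a semantic query can be sketched in a few lines. This is a minimal illustration, not Zaza's implementation: the three-dimensional vectors are made-up stand-ins for real model embeddings, the function names are hypothetical, and cosine similarity is used as one common choice (the distance metric Zaza configures in ChromaDB is not stated here).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, top=5):
    """Return the `top` (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top]


# Toy vectors invented for illustration; a real embedding model produces
# much higher-dimensional vectors from the document text.
TOY_DOCS = {
    "financial_report.pdf": [0.9, 0.1, 0.0],
    "quarterly_analysis.docx": [0.8, 0.3, 0.1],
    "holiday_photos.md": [0.0, 0.1, 0.9],
}
TOY_QUERY = [1.0, 0.2, 0.0]  # stands in for the embedding of "budget"

if __name__ == "__main__":
    for name, score in rank_by_similarity(TOY_QUERY, TOY_DOCS, top=2):
        print(f"{name}: {score:.3f}")
```

The query never has to share a word with a file name: documents rank highly because their vectors point in a similar direction to the query vector.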
Demo
Try it live on Hugging Face: Zaza Semantic Search Space
CLI Commands
| Command | Description |
|---|---|
| `zaza ingest <path>` | Index documents from a directory or file |
| `zaza search <query>` | Search documents by filename (keyword) |
| `zaza search-semantic <query>` | Semantic search using embeddings |
| `zaza stats` | Show indexing statistics |
| `zaza documents` | List all indexed documents |
| `zaza report [format]` | Generate report (json/csv) |
| `zaza api` | Start the REST API server |
| `zaza server` | V3 alias, same as `zaza api` |
API Endpoints
| Method | Path | Description |
|---|---|---|
| GET | `/health` | Health check |
| GET | `/summary` | Engine summary |
| GET | `/documents` | List documents |
| GET | `/search?q=` | Keyword search |
| GET | `/search-semantic?q=&top=10` | Semantic search |
| GET | `/embeddings/status` | Check embedding store |
| POST | `/analyze` | Analyze raw text |
| POST | `/ingest/file` | Upload and ingest a file |
| POST | `/ingest/directory` | Ingest all files from directory |
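As a rough sketch, the semantic search endpoint could be called from Python with only the standard library. The base URL and port below are assumptions (this README does not state where the server listens), the helper names are hypothetical, and the response is assumed to be JSON; check the running server for the actual address and response shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the address the API server listens on is not given in this README.
BASE_URL = "http://localhost:8000"


def semantic_search_url(query, top=10, base_url=BASE_URL):
    """Build the GET /search-semantic URL with URL-encoded q and top parameters."""
    return f"{base_url}/search-semantic?{urlencode({'q': query, 'top': top})}"


def semantic_search(query, top=10, base_url=BASE_URL):
    """Call the endpoint and decode the body, assuming the server returns JSON."""
    with urlopen(semantic_search_url(query, top, base_url), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds the URL; call semantic_search(...) once the server is running.
    print(semantic_search_url("rapport financier", top=5))
```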
Supported Formats
| Format | Extension | Method |
|---|---|---|
| Plain text | .txt | Direct read |
| Markdown | .md, .markdown | Syntax stripped |
| PDF | .pdf | via pypdf |
| CSV | .csv | Converted to key-value |
| HTML | .html, .htm | via BeautifulSoup |
| XML | .xml | Standard library |
| Word | .docx | via python-docx |
| JSON | .json | Recursive key-value (V3) |
| YAML | .yaml, .yml | Recursive key-value (V3) |
| ePUB | .epub | via ebooklib (V3, requires [semantic]) |
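To make "recursive key-value" concrete, here is one way nested JSON could be flattened into searchable text lines. This is an illustrative sketch only: the function names and the `path: value` line format are assumptions, and Zaza's actual output format may differ.

```python
import json


def flatten(value, prefix=""):
    """Yield (path, leaf) pairs for every scalar in a nested JSON-like value."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten(child, child_prefix)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        # Scalars (str, int, float, bool, None) end the recursion.
        yield prefix, value


def json_to_text(raw):
    """Turn a JSON string into one 'path: value' line per leaf."""
    return "\n".join(f"{path}: {leaf}" for path, leaf in flatten(json.loads(raw)))


if __name__ == "__main__":
    sample = '{"project": {"name": "zaza", "tags": ["search", "local"]}, "version": 3}'
    print(json_to_text(sample))
```

Keeping the full key path on each line preserves context, so a value such as `local` stays attached to `project.tags` when the text is later embedded.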
Model Caching (V3)
The embedding model is cached globally within a single process. When ingestion and semantic search run in the same process, the second step reuses the cached instance instead of reloading the model, so the model load time is paid only once.
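A process-wide cache of this kind can be sketched with `functools.lru_cache`. The `FakeModel` class and `get_model` function are hypothetical stand-ins used so the example runs without downloading anything; the technique Zaza actually uses for its cache is not described here.

```python
from functools import lru_cache


class FakeModel:
    """Stand-in for an embedding model; counts how many times it is constructed."""

    load_count = 0

    def __init__(self, name):
        FakeModel.load_count += 1  # a real model would do its expensive load here
        self.name = name


@lru_cache(maxsize=None)
def get_model(name):
    """Return one shared model instance per name for the lifetime of the process."""
    return FakeModel(name)


if __name__ == "__main__":
    first = get_model("paraphrase-multilingual-MiniLM-L12-v2")
    second = get_model("paraphrase-multilingual-MiniLM-L12-v2")
    print(first is second, FakeModel.load_count)
```

The cache lives in process memory, so it does not carry over between separate command invocations; it helps most in a long-running process such as the API server.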
Configuration
Edit config.yaml to customize paths, embedding models, and search settings.
semantic:
  enabled: true                   # Set false to disable embeddings
  model_name: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
  embed_dir: "./data/embeddings"  # ChromaDB persist directory
  max_search_results: 10
License
MIT
File details
Details for the file zaza_semantic_engine-3.2.0-py3-none-any.whl.
File metadata
- Download URL: zaza_semantic_engine-3.2.0-py3-none-any.whl
- Upload date:
- Size: 29.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c03d522e9727c3cfc2b4b13b8346d421d66d1e7b31a1aed3a60163bc15b4fe23 |
| MD5 | 515b0fc225146b0a04b71956a1d30a0f |
| BLAKE2b-256 | 95f1a24beb24c1909dcd1b4ee9c921d8ec48729707b3b66bfaf4e37fe6ed5262 |