Local-first multi-format document ingestion engine with semantic search using sentence-transformers and ChromaDB

Zaza Semantic Engine

Local-first multi-format document ingestion engine with real semantic search.

License: MIT | Python 3.10+

Why Zaza?

Most document tools fall into two camps: cloud-based SaaS (your docs leave your machine) or dumb keyword search (finds exact word matches, misses the point). Zaza avoids both: it runs locally and searches by meaning.

  • Local-first: your documents never leave your machine. No API keys, no data leaks.
  • Semantic search: find documents by meaning, not just keywords. Search "budget" and it finds "financial analysis", "quarterly results".
  • Multi-format: TXT, PDF, Markdown, DOCX, JSON, YAML, EPUB, CSV, HTML, XML.
  • 50+ languages: built on paraphrase-multilingual-MiniLM-L12-v2. Search in French, English, Arabic, or any supported language.
  • Zero config: run zaza ingest ./docs/ and you're done.

Installation

# Core package (editable install, run from a checkout of the source tree)
pip install -e .

# With API support
pip install -e ".[api]"

# With semantic search (embeddings + multilingual model)
pip install -e ".[semantic]"

# Full installation
pip install -e ".[all]"

Quick Start

# Ingest documents
zaza ingest ./my-documents/

# Keyword search (by filename)
zaza search "report"

# Semantic search (by meaning)
zaza search-semantic "financial analysis quarterly results" --top 5

# View stats
zaza stats

# Start API server (V3: either form works)
zaza api
zaza server

Semantic Search in Action

This project uses sentence-transformers (paraphrase-multilingual-MiniLM-L12-v2) to generate embeddings and ChromaDB for vector storage.
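To make the idea of comparing embeddings concrete, here is a minimal sketch of ranking documents by cosine similarity. The three-dimensional vectors are hand-made for illustration only (real vectors from paraphrase-multilingual-MiniLM-L12-v2 have 384 dimensions), the helper names are hypothetical, and this README does not state which distance metric Zaza configures in ChromaDB.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, top=5):
    """Return the `top` (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]


# Toy vectors, NOT real model output: the first axis loosely stands for "money".
toy_docs = {
    "holiday photos": [0.0, 0.1, 0.9],
    "cost breakdown": [0.8, 0.2, 0.1],
    "financial report": [0.9, 0.1, 0.0],
}
budget_query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(budget_query, toy_docs, top=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

None of the toy document names contains the word "budget", yet the two finance-related entries rank first because their vectors point in a similar direction to the query.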

Unlike keyword search, semantic search finds documents with related concepts even when the exact words differ:

| Query | Keyword Search | Semantic Search |
|---|---|---|
| "budget" | Only files named "budget" | Finds "financial report", "quarterly analysis", "cost breakdown" |
| "rapport financier" | Only French files with exact match | Finds "financial analysis", "balance sheet", "revenue summary" |

CLI Commands

| Command | Description |
|---|---|
| zaza ingest <path> | Index documents from a directory or file |
| zaza search <query> | Search documents by filename (keyword) |
| zaza search-semantic <query> | Semantic search using embeddings |
| zaza stats | Show indexing statistics |
| zaza documents | List all indexed documents |
| zaza report [format] | Generate report (json/csv) |
| zaza api | Start the REST API server |
| zaza server | V3 alias for zaza api |

API Endpoints

| Method | Path | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /summary | Engine summary |
| GET | /documents | List documents |
| GET | /search?q= | Keyword search |
| GET | /search-semantic?q=&top=10 | Semantic search |
| GET | /embeddings/status | Check embedding store |
| POST | /analyze | Analyze raw text |
| POST | /ingest/file | Upload and ingest a file |
| POST | /ingest/directory | Ingest all files from directory |
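As a rough sketch of calling the semantic search endpoint from Python using only the standard library: the base URL (host and port) is an assumption, since this README does not say where the server listens, and the JSON response format is assumed as well. The function names are hypothetical; only the /search-semantic path and its q and top parameters come from the table above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: adjust to wherever `zaza api` actually listens.
BASE_URL = "http://localhost:8000"


def semantic_search_url(query, top=10, base_url=BASE_URL):
    """Build the GET URL for /search-semantic with a URL-encoded query."""
    params = urlencode({"q": query, "top": top})
    return f"{base_url}/search-semantic?{params}"


def semantic_search(query, top=10, base_url=BASE_URL):
    """Send the request and decode the body, assuming the server returns JSON."""
    with urlopen(semantic_search_url(query, top, base_url), timeout=10) as response:
        return json.load(response)


# Building the URL needs no running server:
print(semantic_search_url("financial analysis", top=5))
```

With the server running, `semantic_search("financial analysis", top=5)` would perform the actual request.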

Supported Formats

| Format | Extension | Method |
|---|---|---|
| Plain text | .txt | Direct read |
| Markdown | .md, .markdown | Syntax stripped |
| PDF | .pdf | via pypdf |
| CSV | .csv | Converted to key-value |
| HTML | .html, .htm | via BeautifulSoup |
| XML | .xml | Standard library |
| Word | .docx | via python-docx |
| JSON | .json | Recursive key-value (V3) |
| YAML | .yaml, .yml | Recursive key-value (V3) |
| EPUB | .epub | via ebooklib (V3, requires [semantic]) |
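The "Recursive key-value" method for JSON and YAML is not spelled out here. One plausible reading, sketched below, is to walk the nested structure and emit one "key: value" line per leaf so the result can be indexed as plain text. The function names and the dotted-path format are hypothetical and may differ from what Zaza actually produces.

```python
import json


def flatten(value, prefix=""):
    """Yield (path, leaf) pairs from nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten(child, path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        # Scalars (str, int, float, bool, None) end the recursion.
        yield prefix, value


def json_to_text(raw):
    """Turn a JSON string into newline-separated 'path: value' lines."""
    return "\n".join(f"{path}: {leaf}" for path, leaf in flatten(json.loads(raw)))


sample = '{"report": {"title": "Q3 budget", "tags": ["finance", "draft"]}}'
print(json_to_text(sample))
```

The same `flatten` helper would work on the Python objects returned by a YAML parser, since those are also nested dicts, lists, and scalars.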

Model Caching (V3)

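The embedding model is cached globally within a single process, so code that ingests documents and then runs a semantic search in that process reuses the loaded instance instead of reloading the model. This avoids paying the model load time more than once per process. Separate CLI invocations are separate processes, so each one loads the model once.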

Configuration

Edit config.yaml to customize paths, embedding models, and search settings.

semantic:
  enabled: true                    # Set false to disable embeddings
  model_name: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
  embed_dir: "./data/embeddings"   # ChromaDB persist directory
  max_search_results: 10

License

MIT

