
Local-first multi-format document ingestion engine with semantic search using sentence-transformers and ChromaDB


Zaza Semantic Engine

Local-first multi-format document ingestion engine with real semantic search.

License: MIT | Python 3.10+

Why Zaza?

Most document tools fall into two camps: cloud-based SaaS (your docs leave your machine) or dumb keyword search (finds exact word matches, misses the point). Zaza avoids both: it runs entirely on your machine and searches by meaning.

  • Local-first: your documents never leave your machine. No API keys, no data leaks.
  • Semantic search: find documents by meaning, not just keywords. A search for "budget" also finds documents about "financial analysis" or "quarterly results".
  • Multi-format: TXT, PDF, Markdown, DOCX, JSON, YAML, EPUB, CSV, HTML, XML.
  • 50+ languages: built on paraphrase-multilingual-MiniLM-L12-v2. Search in French, English, Arabic, or any supported language.
  • Zero config: run zaza ingest ./docs/ and you're done.

Installation

# Core package (editable install, run from a source checkout)
pip install -e .

# With API support
pip install -e ".[api]"

# With semantic search (embeddings + multilingual model)
pip install -e ".[semantic]"

# Full installation
pip install -e ".[all]"

Quick Start

# Ingest documents
zaza ingest ./my-documents/

# Keyword search (by filename)
zaza search "report"

# Semantic search (by meaning)
zaza search-semantic "financial analysis quarterly results" --top 5

# View stats
zaza stats

# Start API server (V3: either form works)
zaza api
zaza server

Semantic Search in Action

This project uses sentence-transformers (paraphrase-multilingual-MiniLM-L12-v2) to generate embeddings and ChromaDB for vector storage.

Unlike keyword search, semantic search finds documents with related concepts even when the exact words differ:

Query                Keyword Search                      Semantic Search
"budget"             Only files named "budget"           Finds "financial report", "quarterly analysis", "cost breakdown"
"rapport financier"  Only French files with exact match  Finds "financial analysis", "balance sheet", "revenue summary"

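The ranking idea behind this comparison can be sketched with the standard library alone. The three-dimensional vectors below are hand-written for illustration. In Zaza the vectors would come from the paraphrase-multilingual-MiniLM-L12-v2 model and be stored in ChromaDB. The function names are hypothetical and are not the engine's API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_meaning(query_vec, doc_vecs, top=5):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]

# Toy embeddings (hand-written, NOT real model output).
DOCS = {
    "financial_report.pdf": [0.9, 0.1, 0.0],
    "cost_breakdown.csv": [0.8, 0.3, 0.1],
    "holiday_photos.md": [0.0, 0.2, 0.9],
}
QUERY_BUDGET = [1.0, 0.2, 0.0]  # stand-in for the embedding of "budget"

results = rank_by_meaning(QUERY_BUDGET, DOCS, top=2)
for name, score in results:
    print(f"{score:.3f}  {name}")
```

Neither of the top two file names contains the word "budget". They rank first only because their vectors point in nearly the same direction as the query vector, which is the property keyword matching lacks.
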
Demo

Try it live on Hugging Face: Zaza Semantic Search Space

CLI Commands

Command                       Description
zaza ingest <path>            Index documents from a directory or file
zaza search <query>           Search documents by filename (keyword)
zaza search-semantic <query>  Semantic search using embeddings
zaza stats                    Show indexing statistics
zaza documents                List all indexed documents
zaza report [format]          Generate report (json/csv)
zaza api                      Start the REST API server
zaza server                   V3 alias, same as zaza api

API Endpoints

Method  Path                        Description
GET     /health                     Health check
GET     /summary                    Engine summary
GET     /documents                  List documents
GET     /search?q=                  Keyword search
GET     /search-semantic?q=&top=10  Semantic search
GET     /embeddings/status          Check embedding store
POST    /analyze                    Analyze raw text
POST    /ingest/file                Upload and ingest a file
POST    /ingest/directory           Ingest all files from directory
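
A minimal client sketch for the semantic search endpoint, using only the standard library. The base URL (host and port) and the assumption that the response body is JSON are not stated in this README, so treat both as placeholders. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the server address is not given in this README; adjust as needed.
BASE_URL = "http://localhost:8000"

def build_semantic_search_url(query, top=10, base_url=BASE_URL):
    """Build the GET /search-semantic URL with URL-encoded parameters."""
    params = urlencode({"q": query, "top": top})
    return f"{base_url}/search-semantic?{params}"

def semantic_search(query, top=10, base_url=BASE_URL, timeout=30):
    """Call the endpoint and decode the body as JSON (assumed response format)."""
    url = build_semantic_search_url(query, top, base_url)
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Only the URL is built here; semantic_search() needs a running server.
print(build_semantic_search_url("rapport financier", top=5))
```

Building the URL through urlencode keeps queries with spaces or accented characters valid, which matters for multilingual searches.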

Supported Formats

Format      Extension       Method
Plain text  .txt            Direct read
Markdown    .md, .markdown  Syntax stripped
PDF         .pdf            via pypdf
CSV         .csv            Converted to key-value
HTML        .html, .htm     via BeautifulSoup
XML         .xml            Standard library
Word        .docx           via python-docx
JSON        .json           Recursive key-value (V3)
YAML        .yaml, .yml     Recursive key-value (V3)
EPUB        .epub           via ebooklib (V3, requires [semantic])
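
To make "recursive key-value" concrete, here is one way nested JSON could be flattened into indexable text. This is an illustrative sketch, not Zaza's actual extractor: the dotted-key and [index] notation and both function names are assumptions.

```python
import json

def flatten(value, prefix=""):
    """Yield (dotted_key, leaf_value) pairs for nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value

def to_key_value_text(raw_json):
    """Turn a JSON string into 'key: value' lines suitable for indexing."""
    data = json.loads(raw_json)
    return "\n".join(f"{key}: {val}" for key, val in flatten(data))

sample = '{"report": {"title": "Q3 budget", "tags": ["finance", "quarterly"]}}'
text = to_key_value_text(sample)
print(text)
# report.title: Q3 budget
# report.tags[0]: finance
# report.tags[1]: quarterly
```

Keeping the full key path next to each value preserves context ("report.title") that a plain dump of the leaf values would lose.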

Model Caching (V3)

The embedding model is cached globally within a single process. Within that process, ingesting documents and then running a semantic search reuses the cached instance instead of reloading the model, so the load cost is paid only once.

Configuration

Edit config.yaml to customize paths, embedding models, and search settings.

semantic:
  enabled: true                    # Set false to disable embeddings
  model_name: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
  embed_dir: "./data/embeddings"   # ChromaDB persist directory
  max_search_results: 10

License

MIT
