
Local-first multi-format document ingestion engine with semantic search using sentence-transformers and ChromaDB


Zaza Semantic Engine

Local-first multi-format document ingestion engine with real semantic search.

License: MIT · Python 3.10+

Why Zaza?

Most document tools fall into two camps: cloud-based SaaS (your docs leave your machine) or dumb keyword search (finds exact word matches, misses the point). Zaza avoids both: it runs locally and searches by meaning.

  • Local-first — your documents never leave your machine. No API keys, no data leaks.
  • Semantic search — find documents by meaning, not just keywords. Search "budget" and it finds "financial analysis", "quarterly results".
  • Multi-format — TXT, PDF, Markdown, DOCX, JSON, YAML, EPUB, CSV, HTML, XML. Ingest anything.
  • 50+ languages — built on paraphrase-multilingual-MiniLM-L12-v2. Search in French, English, Arabic, or any supported language.
  • Zero config: run zaza ingest ./docs/ and you're done.

Installation

# Core package (editable install; run from the repository root)
pip install -e .

# With API support
pip install -e ".[api]"

# With semantic search (embeddings + multilingual model)
pip install -e ".[semantic]"

# Full installation
pip install -e ".[all]"

Quick Start

# Ingest documents
zaza ingest ./my-documents/

# Keyword search (by filename)
zaza search "report"

# Semantic search (by meaning)
zaza search-semantic "financial analysis quarterly results" --top 5

# View stats
zaza stats

# Start API server (V3: either form works)
zaza api
zaza server

Semantic Search in Action

This project uses sentence-transformers (paraphrase-multilingual-MiniLM-L12-v2) to generate embeddings and ChromaDB for vector storage.

Unlike keyword search, semantic search finds documents with related concepts even when the exact words differ:

| Query | Keyword Search | Semantic Search |
| --- | --- | --- |
| "budget" | Only files named "budget" | Finds "financial report", "quarterly analysis", "cost breakdown" |
| "rapport financier" | Only French files with exact match | Finds "financial analysis", "balance sheet", "revenue summary" |

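To make the ranking idea concrete, here is a minimal sketch of similarity-based retrieval. The three-dimensional vectors and file names are hand-made, hypothetical stand-ins for real model embeddings, and this is not Zaza's actual code path; it only illustrates how a query vector can be compared against stored document vectors with cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs, top=5):
    """Return (name, score) pairs sorted by descending similarity."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]

# Hypothetical toy embeddings. Axes, loosely: finance, cooking, travel.
DOCS = {
    "financial_report.pdf": [0.9, 0.1, 0.0],
    "pasta_recipes.md": [0.0, 0.95, 0.1],
    "quarterly_analysis.docx": [0.8, 0.0, 0.2],
}
QUERY_BUDGET = [1.0, 0.05, 0.05]  # stand-in for the embedding of "budget"

results = rank(QUERY_BUDGET, DOCS, top=2)
for name, score in results:
    print(f"{score:.3f}  {name}")
```

Neither top result contains the word "budget" in its name; they rank first only because their vectors point in a similar direction to the query vector.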
Demo

Try it live on Hugging Face: Zaza Semantic Search Space

CLI Commands

| Command | Description |
| --- | --- |
| zaza ingest <path> | Index documents from a directory or file |
| zaza search <query> | Search documents by filename (keyword) |
| zaza search-semantic <query> | Semantic search using embeddings |
| zaza stats | Show indexing statistics |
| zaza documents | List all indexed documents |
| zaza report [format] | Generate report (json/csv) |
| zaza api | Start the REST API server |
| zaza server | V3 alias, same as zaza api |

API Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Health check |
| GET | /summary | Engine summary |
| GET | /documents | List documents |
| GET | /search?q= | Keyword search |
| GET | /search-semantic?q=&top=10 | Semantic search |
| GET | /embeddings/status | Check embedding store |
| POST | /ingest/file | Upload and ingest a file |
| POST | /ingest/directory | Ingest all files from directory |
| POST | /analyze | Analyze raw text |
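
As an illustration of calling the semantic search endpoint from Python, the sketch below builds the request URL with the standard library. The base URL (host and port) is an assumption, since the README does not state where the server listens, and the JSON response handling is likewise assumed rather than documented.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the server address is not given in the README; adjust as needed.
BASE_URL = "http://localhost:8000"

def semantic_search_url(query, top=10, base_url=BASE_URL):
    """Build the GET /search-semantic URL with a percent-encoded query string."""
    return f"{base_url}/search-semantic?{urlencode({'q': query, 'top': top})}"

def semantic_search(query, top=10, base_url=BASE_URL):
    """Call the endpoint and decode the body, assuming it returns JSON."""
    with urlopen(semantic_search_url(query, top, base_url), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Only the URL is built here, so the example runs without a server.
print(semantic_search_url("rapport financier", top=5))
```

With the API server running, `semantic_search("rapport financier", top=5)` would perform the actual request.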

Supported Formats

| Format | Extension | Method |
| --- | --- | --- |
| Plain text | .txt | Direct read |
| Markdown | .md, .markdown | Syntax stripped |
| PDF | .pdf | via pypdf |
| CSV | .csv | Converted to key-value |
| HTML | .html, .htm | via BeautifulSoup |
| XML | .xml | Standard library |
| Word | .docx | via python-docx |
| JSON | .json | Recursive key-value (V3) |
| YAML | .yaml, .yml | Recursive key-value (V3) |
| EPUB | .epub | via ebooklib (V3, requires [semantic]) |
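
The "recursive key-value" method for JSON and YAML could look roughly like the sketch below, which walks nested structures and emits one line per scalar. The dotted-key and `[index]` notation is an assumption chosen for illustration; Zaza's actual text layout may differ.

```python
import json

def flatten(value, prefix=""):
    """Yield (dotted_key, scalar) pairs from nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value

def json_to_text(raw):
    """Render a JSON document as one 'key: value' line per scalar."""
    return "\n".join(f"{key}: {val}" for key, val in flatten(json.loads(raw)))

SAMPLE = '{"report": {"title": "Q3 results", "tags": ["finance", "budget"]}, "year": 2024}'
TEXT = json_to_text(SAMPLE)
print(TEXT)
```

Flattening to plain lines like this gives the embedding model ordinary text to work with, while the key path preserves where each value came from.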

Model Caching (V3)

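The embedding model is cached globally within a single process. When ingestion and semantic search run in the same process (for example, through the API server), the second operation reuses the cached instance instead of reloading the model, which removes the repeated load time. Separate CLI invocations are separate processes, so each one loads the model once.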

Configuration

Edit config.yaml to customize paths, embedding models, and search settings.

semantic:
  enabled: true                    # Set false to disable embeddings
  model_name: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
  embed_dir: "./data/embeddings"   # ChromaDB persist directory
  max_search_results: 10
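
For readers who want to consume these settings from their own code, the sketch below validates an already parsed config mapping. The `SemanticSettings` class and `load_semantic_settings` function are hypothetical, and treating the sample values as defaults is an assumption; the mapping is what a YAML parser would return for the file above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SemanticSettings:
    # Field names mirror the sample keys; using its values as defaults is an assumption.
    enabled: bool = True
    model_name: str = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    embed_dir: str = "./data/embeddings"
    max_search_results: int = 10

def load_semantic_settings(config):
    """Build settings from a parsed config mapping, rejecting unknown keys."""
    section = config.get("semantic", {})
    unknown = set(section) - set(SemanticSettings.__dataclass_fields__)
    if unknown:
        raise ValueError(f"unknown semantic keys: {sorted(unknown)}")
    settings = SemanticSettings(**section)
    if settings.max_search_results < 1:
        raise ValueError("max_search_results must be at least 1")
    return settings

# A parsed config with two overrides; the other keys fall back to the defaults.
parsed = {"semantic": {"enabled": False, "max_search_results": 3}}
settings = load_semantic_settings(parsed)
print(settings)
```

Rejecting unknown keys catches typos such as `model-name` early instead of silently ignoring them.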

License

MIT
