Local-first multi-format document ingestion engine with semantic search using sentence-transformers and ChromaDB

Zaza Semantic Engine

Local-first multi-format document ingestion engine with real semantic search.

License: MIT. Requires Python 3.10+.

Why Zaza?

Most document tools fall into two camps: cloud-based SaaS (your docs leave your machine) or dumb keyword search (finds exact word matches, misses the point). Zaza avoids both problems: it runs entirely on your machine and searches by meaning.

  • Local-first: your documents never leave your machine. No API keys, no data leaks.
  • Semantic search: find documents by meaning, not just keywords. Search "budget" and it finds "financial analysis" and "quarterly results".
  • Multi-format: TXT, PDF, Markdown, DOCX, JSON, YAML, EPUB, CSV, HTML, XML. Ingest anything.
  • 50+ languages: built on paraphrase-multilingual-MiniLM-L12-v2. Search in French, English, Arabic, or any supported language.
  • Zero config: run zaza ingest ./docs/ and you're done.

Installation

# Core package (editable install, run from a clone of the repository)
pip install -e .

# With API support
pip install -e ".[api]"

# With semantic search (embeddings + multilingual model)
pip install -e ".[semantic]"

# Full installation
pip install -e ".[all]"

Quick Start

# Ingest documents
zaza ingest ./my-documents/

# Keyword search (by filename)
zaza search "report"

# Semantic search (by meaning)
zaza search-semantic "financial analysis quarterly results" --top 5

# View stats
zaza stats

# Start API server (V3: either form works)
zaza api
zaza server

Semantic Search in Action

This project uses sentence-transformers (paraphrase-multilingual-MiniLM-L12-v2) to generate embeddings and ChromaDB for vector storage.

Unlike keyword search, semantic search finds documents with related concepts even when the exact words differ:

| Query | Keyword Search | Semantic Search |
| --- | --- | --- |
| "budget" | Only files named "budget" | Finds "financial report", "quarterly analysis", "cost breakdown" |
| "rapport financier" | Only French files with exact match | Finds "financial analysis", "balance sheet", "revenue summary" |
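To make the contrast in the table above concrete, here is a minimal sketch of embedding-based ranking by cosine similarity. The three-dimensional vectors are hand-made stand-ins for real model embeddings, and the function names and file names are hypothetical; this is not Zaza's actual implementation, which delegates embedding to sentence-transformers and storage to ChromaDB.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, top=5):
    """Return the `top` (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]


# Toy vectors (assumption): a real model would produce these from the text.
TOY_DOCS = {
    "financial_report.pdf": [0.9, 0.1, 0.0],
    "cost_breakdown.csv": [0.8, 0.3, 0.1],
    "holiday_photos.md": [0.0, 0.2, 0.9],
}
TOY_QUERY = [1.0, 0.2, 0.0]  # stands in for the embedding of "budget"

results = rank_by_similarity(TOY_QUERY, TOY_DOCS, top=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

None of the toy file names contains the word "budget", yet the two finance-related files rank first because their vectors point in nearly the same direction as the query vector.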

Demo

Try it live on Hugging Face: Zaza Semantic Search Space

CLI Commands

| Command | Description |
| --- | --- |
| zaza ingest <path> | Index documents from a directory or file |
| zaza search <query> | Search documents by filename (keyword) |
| zaza search-semantic <query> | Semantic search using embeddings |
| zaza stats | Show indexing statistics |
| zaza documents | List all indexed documents |
| zaza report [format] | Generate report (json/csv) |
| zaza api | Start the REST API server |
| zaza server | V3 alias, same as zaza api |

API Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Health check |
| GET | /summary | Engine summary |
| GET | /documents | List documents |
| GET | /search?q= | Keyword search |
| GET | /search-semantic?q=&top=10 | Semantic search |
| GET | /embeddings/status | Check embedding store |
| POST | /analyze | Analyze raw text |
| POST | /ingest/file | Upload and ingest a file |
| POST | /ingest/directory | Ingest all files from directory |
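As an illustration of calling the semantic search endpoint listed above, the following sketch builds the request URL with the standard library. The base URL and port are assumptions (the README does not state where the server listens), the helper names are hypothetical, and the response is assumed to be JSON.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: adjust to wherever `zaza api` actually listens.
BASE_URL = "http://localhost:8000"


def semantic_search_url(query, top=10, base_url=BASE_URL):
    """Build the GET URL for /search-semantic with an encoded query string."""
    return f"{base_url}/search-semantic?{urlencode({'q': query, 'top': top})}"


def semantic_search(query, top=10, base_url=BASE_URL):
    """Call the endpoint and decode the body (assumed to be JSON)."""
    with urlopen(semantic_search_url(query, top, base_url), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the URL needs no running server:
print(semantic_search_url("financial analysis", top=5))
```

Using urlencode keeps spaces and non-ASCII characters in multilingual queries correctly escaped.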

Supported Formats

| Format | Extension | Method |
| --- | --- | --- |
| Plain text | .txt | Direct read |
| Markdown | .md, .markdown | Syntax stripped |
| PDF | .pdf | via pypdf |
| CSV | .csv | Converted to key-value |
| HTML | .html, .htm | via BeautifulSoup |
| XML | .xml | Standard library |
| Word | .docx | via python-docx |
| JSON | .json | Recursive key-value (V3) |
| YAML | .yaml, .yml | Recursive key-value (V3) |
| EPUB | .epub | via ebooklib (V3, requires [semantic]) |
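The "recursive key-value" method listed for JSON and YAML can be pictured as flattening nested data into one "path: value" line per leaf. The sketch below shows one plausible way to do that; the function names and the dotted-path notation are assumptions, and Zaza's actual output format may differ.

```python
import json


def flatten(value, prefix=""):
    """Yield (path, leaf) pairs for every scalar in a nested structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


def json_to_text(raw):
    """Turn a JSON document into plain 'path: value' lines for indexing."""
    return "\n".join(f"{path}: {leaf}" for path, leaf in flatten(json.loads(raw)))


sample = '{"report": {"title": "Q3 results", "tags": ["finance", "budget"]}}'
print(json_to_text(sample))
```

Flattening this way keeps each value next to the keys that give it context, so an embedding model sees "report.title: Q3 results" rather than a bare string.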

Model Caching (V3)

The embedding model is cached globally within a single process. When zaza ingest and zaza search-semantic work run in the same process, the second operation reuses the already-loaded instance instead of reloading the model, so the model load cost is paid only once.

Configuration

Edit config.yaml to customize paths, embedding models, and search settings.

semantic:
  enabled: true                    # Set false to disable embeddings
  model_name: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
  embed_dir: "./data/embeddings"   # ChromaDB persist directory
  max_search_results: 10
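The semantic section above maps naturally onto a small typed settings object. The sketch below assumes the YAML has already been parsed into a dictionary (for example with PyYAML's yaml.safe_load) and treats the values shown in the example as defaults; the class name, the defaults, and the validation rule are assumptions, not Zaza's actual loader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticConfig:
    # Defaults mirror the example config above (assumption).
    enabled: bool = True
    model_name: str = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    embed_dir: str = "./data/embeddings"
    max_search_results: int = 10

    @classmethod
    def from_dict(cls, section):
        """Build a config from a parsed 'semantic' mapping, ignoring unknown keys."""
        known = {name: section[name]
                 for name in cls.__dataclass_fields__ if name in section}
        config = cls(**known)
        if config.max_search_results < 1:
            raise ValueError("max_search_results must be at least 1")
        return config


# A dictionary as a YAML parser might return it for a partial config file.
parsed = {"semantic": {"enabled": True, "max_search_results": 3}}
config = SemanticConfig.from_dict(parsed.get("semantic", {}))
print(config)
```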

License

MIT
