Deep Semantic Search

A Python library for embedding, indexing, and semantically searching text and image data.

Features

  • Multi-modal Semantic Search

    • Embed and index images using SigLIP SO400M (1152-dim, 384×384)
    • Embed and index text using BGE-M3 (1024-dim dense + sparse vectors)
    • Search images by image or text queries
    • Search text by semantic similarity with hybrid dense+sparse fusion
    • Cross-modal unified search across images and text in a shared embedding space
  • Clustering & Captioning

    • Cluster image embeddings using KMeans (specify k) or HDBSCAN (auto-detect)
    • Caption images using Florence-2 (detailed captions, object detection, OCR)
    • Customizable LLM-powered topic labeling via callback
  • Retrieval-Augmented Generation (RAG)

    • Answer questions based on text data using LiteLLM + Ollama
    • Semantic chunking with BGE-M3 embeddings
    • Cross-encoder reranking with BGE-reranker-v2-m3
    • Pluggable LLM via callback pattern
  • Duplicate Detection

    • Find near-duplicate images or text above a similarity threshold

Installation

pip install deep-semantic-search

Install with optional extras:

pip install "deep-semantic-search[llm]"          # RAG / question answering (LiteLLM)
pip install "deep-semantic-search[clustering]"   # Image clustering (scikit-learn)
pip install "deep-semantic-search[viz]"          # Plotting / visualization
pip install "deep-semantic-search[all]"          # Everything

For development:

pip install "deep-semantic-search[dev]"

Quick Start

Image Search

from deep_semantic_search import LoadImageData, ImageIndexer, ImageSearcher

# Load and index images
loader = LoadImageData()
image_paths = loader.from_folder(["path/to/images"])

indexer = ImageIndexer(image_paths)
indexer.run_index()

# Search by text
searcher = ImageSearcher(indexer)
results = searcher.search_by_text("cat on a sofa", n=5)
for r in results:
    print(f"{r['score']:.3f}  {r['path']}")

# Search by image
results = searcher.search_by_image("query.jpg", n=5)

# Find duplicate images
duplicates = searcher.find_duplicates(threshold=0.95)
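
The threshold passed to find_duplicates is a similarity cutoff. As a rough illustration of the idea only (not the library's implementation, which this README does not describe), the sketch below compares every pair of toy embedding vectors by cosine similarity and keeps pairs at or above the threshold. The helper names, the toy vectors, and the inclusive comparison are all assumptions made for the example.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicate_pairs(embeddings, threshold=0.95):
    """Return (key_a, key_b, score) for each pair scoring at or above threshold."""
    pairs = []
    for (key_a, vec_a), (key_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:  # assumption: the boundary is inclusive
            pairs.append((key_a, key_b, score))
    # Most similar pairs first
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical 3-dimensional embeddings; real image embeddings are much larger.
toy_embeddings = {
    "a.jpg": [1.0, 0.0, 0.0],
    "a_copy.jpg": [0.99, 0.05, 0.0],
    "b.jpg": [0.0, 1.0, 0.0],
}
dupes = find_duplicate_pairs(toy_embeddings, threshold=0.95)
for key_a, key_b, score in dupes:
    print(f"{score:.3f}  {key_a}  {key_b}")
```

Comparing all pairs is quadratic in the number of items, so a real index would typically use approximate nearest-neighbor lookups instead.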

Text Search

from deep_semantic_search import LoadTextData, TextEmbedder, TextSearch

# Load and embed text data
loader = LoadTextData()
corpus = loader.from_folder("path/to/text/files")

embedder = TextEmbedder()
embedder.embed(corpus)

# Search with hybrid dense+sparse fusion
search = TextSearch(embedder)
results = search.find_similar("your search query", top_n=5, hybrid=True)

# With cross-encoder reranking
results = search.find_similar("query", top_n=5, rerank=True)
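
Hybrid search combines a dense (semantic) score with a sparse (lexical) score for each document. The README does not state which fusion formula the library uses, so the following is only a minimal sketch of one common choice: min-max normalize each score set, then take a weighted sum. The function names, the 0.7 weight, and the toy scores are assumptions.

```python
def normalize(scores):
    """Min-max scale scores to [0, 1] so dense and sparse values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {doc: ((s - lo) / span if span else 1.0) for doc, s in scores.items()}


def fuse_scores(dense, sparse, dense_weight=0.7):
    """Rank documents by a weighted sum of normalized dense and sparse scores."""
    d, s = normalize(dense), normalize(sparse)
    fused = {
        doc: dense_weight * d.get(doc, 0.0) + (1.0 - dense_weight) * s.get(doc, 0.0)
        for doc in set(d) | set(s)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores: doc1 wins slightly on dense similarity,
# doc2 wins clearly on lexical overlap.
dense_scores = {"doc1": 0.82, "doc2": 0.80, "doc3": 0.40}
sparse_scores = {"doc1": 1.0, "doc2": 6.0, "doc3": 0.5}
ranked = fuse_scores(dense_scores, sparse_scores)
for doc, score in ranked:
    print(f"{score:.3f}  {doc}")
```

In this toy case the sparse signal lifts doc2 above doc1, which is the kind of correction hybrid fusion is meant to provide when a query contains exact terms.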

Unified Cross-Modal Search

from deep_semantic_search import UnifiedIndexer, UnifiedSearcher

# Index images and texts in a shared embedding space
# (image_paths is the list returned by LoadImageData.from_folder, as in Image Search)
indexer = UnifiedIndexer()
indexer.add_images(image_paths)
indexer.add_texts(["description 1", "description 2"], labels=["doc1", "doc2"])
indexer.build_index()

# Search across modalities
searcher = UnifiedSearcher(indexer)
results = searcher.search("sunset over mountains", n=10)
# Filter by modality
results = searcher.search("sunset", modality_filter="image")
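
Conceptually, a unified index stores image and text vectors side by side, each tagged with its modality, and a filter narrows the candidates before ranking. The sketch below illustrates that idea with toy 2-dimensional vectors. The record layout and the search function are hypothetical and do not reflect UnifiedSearcher's internals.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vec, n=10, modality_filter=None):
    """Rank items by similarity to query_vec, optionally keeping one modality."""
    candidates = [
        item for item in index
        if modality_filter is None or item["modality"] == modality_filter
    ]
    scored = [
        {"label": item["label"], "modality": item["modality"],
         "score": cosine(query_vec, item["vector"])}
        for item in candidates
    ]
    scored.sort(key=lambda result: result["score"], reverse=True)
    return scored[:n]


# Hypothetical shared-space vectors for two images and one text.
toy_index = [
    {"label": "sunset.jpg", "modality": "image", "vector": [0.9, 0.1]},
    {"label": "doc1", "modality": "text", "vector": [0.8, 0.3]},
    {"label": "cat.jpg", "modality": "image", "vector": [0.1, 0.9]},
]
query = [1.0, 0.0]
all_results = search(toy_index, query)
image_results = search(toy_index, query, modality_filter="image")
```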

Image Clustering

from deep_semantic_search import ImageIndexer, ImageClusterer, ImageCaptioner

# image_paths is the list returned by LoadImageData.from_folder, as in Image Search
indexer = ImageIndexer(image_paths)
indexer.run_index()

# Auto-detect clusters with HDBSCAN
clusterer = ImageClusterer(indexer)
result = clusterer.cluster()  # n_clusters=None → HDBSCAN

# Or specify exact number with KMeans
result = clusterer.cluster(n_clusters=5)

# With Florence-2 captioning for topic labels
captioner = ImageCaptioner()
result = clusterer.cluster(n_clusters=5, captioner=captioner)

# Save organized clusters to disk
clusterer.save_clusters("./output/clusters")
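
To make the role of n_clusters concrete, here is a small pure-Python sketch of the KMeans idea: assign each point to its nearest centroid, recompute the centroids, and repeat. The library relies on scikit-learn for clustering (per the clustering extra), so this is an illustration of the technique only. The naive "first k points" initialization and the toy data are assumptions.

```python
def kmeans(points, k, iterations=20):
    """Very small Lloyd's-algorithm sketch; returns (labels, centroids)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid (squared Euclidean distance)
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Two well-separated toy groups standing in for image embeddings.
toy_points = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.2), (5.1, 4.9), (0.2, 0.1), (4.9, 5.2)]
labels, centroids = kmeans(toy_points, k=2)
```

HDBSCAN differs in that it infers the number of clusters from density, which is why the clusterer needs no k in that mode.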

RAG (Question Answering)

Requires the llm extra, which installs LiteLLM: pip install "deep-semantic-search[llm]"

from deep_semantic_search import RAG

texts = ["Document 1 content...", "Document 2 content..."]

# With semantic chunking and reranking
rag = RAG(rerank=True)
answer = rag.ask(texts, "What is the main topic?", semantic_chunking=True)

# With a custom LLM (my_custom_llm is your own callable; it is not defined here)
answer = rag.ask(texts, "Summarize this.", llm_fn=my_custom_llm)

# Backward-compatible wrapper
from deep_semantic_search import ask_question
answer = ask_question(texts, "What is the main topic?", llm_fn=my_custom_llm)
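
The llm_fn parameter follows a callback pattern: the pipeline retrieves context, builds a prompt, and hands it to whatever callable you supply. The README does not document the callback's signature, so the sketch below assumes the simplest shape, a function that takes a prompt string and returns a string. Everything here (retrieve, build_prompt, ask, and the stub LLM) is a hypothetical stand-in, with word overlap replacing real embedding search.

```python
def retrieve(chunks, question, top_k=2):
    """Rank chunks by word overlap with the question (stand-in for embedding search)."""
    question_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(question_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(context, question):
    """Format retrieved chunks and the question into a single prompt string."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}"
    )


def ask(chunks, question, llm_fn):
    """Retrieve context, build a prompt, and delegate generation to llm_fn."""
    return llm_fn(build_prompt(retrieve(chunks, question), question))


def first_context_line_llm(prompt):
    """Stub 'LLM' for testing: returns the top-ranked context line verbatim."""
    for line in prompt.splitlines():
        if line.startswith("- "):
            return line[2:]
    return ""


docs = [
    "Paris is the capital of France.",
    "Bananas are yellow.",
    "France borders Spain.",
]
answer = ask(docs, "What is the capital of France?", llm_fn=first_context_line_llm)
```

Because the model is just a callable, a stub like this can stand in for a real LLM in unit tests without network access.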

Custom Data Paths

By default, metadata is stored in ~/.deep-semantic-search/. Override per instance:

indexer = ImageIndexer(image_paths, metadata_dir="./my_project/index")
embedder = TextEmbedder(metadata_dir="./my_project/text_index")
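
The default-versus-override behavior can be pictured as a small path-resolution helper. This is a sketch under assumptions: resolve_metadata_dir is a hypothetical name, and the library may resolve or create directories differently.

```python
from pathlib import Path

# Default location stated in this README
DEFAULT_METADATA_DIR = Path.home() / ".deep-semantic-search"


def resolve_metadata_dir(metadata_dir=None):
    """Return the per-instance override if given, else the default directory."""
    if metadata_dir is None:
        base = DEFAULT_METADATA_DIR
    else:
        base = Path(metadata_dir).expanduser()
    # Resolve to an absolute path; nothing is created on disk here
    return base.resolve()


default_dir = resolve_metadata_dir()
project_dir = resolve_metadata_dir("./my_project/index")
```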

API Reference

Image Module

  • LoadImageData — Load image paths from folders or CSV
  • ImageIndexer — SigLIP embedding + USearch indexing
  • ImageSearcher — Image/text similarity search + duplicate detection
  • ImageClusterer — KMeans/HDBSCAN clustering with topic labeling
  • ImageCaptioner — Florence-2 image captioning

Text Module

  • LoadTextData — Load text from folders (.txt/.html) or CSV
  • TextEmbedder — BGE-M3 dense + sparse embeddings
  • TextSearch — Hybrid search with optional reranking + duplicate detection

Unified Search

  • UnifiedIndexer — Cross-modal SigLIP indexing for images + text
  • UnifiedSearcher — Search across modalities

RAG

  • RAG — Object-oriented RAG with semantic chunking and reranking
  • ask_question() — Backward-compatible wrapper

Exceptions

  • DeepSemanticSearchError — Base exception
  • IndexNotFoundError, ModelLoadError, SearchError, EmbeddingError, ClusteringError, MigrationError, CaptioningError
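
A shared base exception lets callers handle specific failures first and fall back to a catch-all for anything else the library raises. The sketch below defines local stand-in classes so it runs without the package installed. It assumes the listed exceptions subclass DeepSemanticSearchError, which the "Base exception" label suggests but the README does not spell out. run_search and safe_search are hypothetical helpers.

```python
class DeepSemanticSearchError(Exception):
    """Stand-in for the library's base exception."""


class IndexNotFoundError(DeepSemanticSearchError):
    """Stand-in: raised when no index exists yet."""


class SearchError(DeepSemanticSearchError):
    """Stand-in: raised when a search fails."""


def run_search(index_exists, query_ok=True):
    """Hypothetical operation that can fail in two different ways."""
    if not index_exists:
        raise IndexNotFoundError("no index found; build one first")
    if not query_ok:
        raise SearchError("query could not be processed")
    return ["result"]


def safe_search(index_exists, query_ok=True):
    """Handle the specific case first, then any other library error."""
    try:
        return run_search(index_exists, query_ok)
    except IndexNotFoundError as exc:
        return f"recoverable: {exc}"
    except DeepSemanticSearchError as exc:
        return f"library error: {exc}"
```

With the real package, the same pattern would import these names from deep_semantic_search instead of defining them locally.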

CLI Tool

The package includes dss, a command-line interface for all major features.

General Usage

dss --help              # Show all commands
dss --version           # Show version
dss <command> --help    # Help for a specific command

Global flags: -v/--verbose for debug output, -q/--quiet to suppress progress.

Image Search

# Search by text
dss image-search --folder ./photos --query "sunset over the ocean" --top 5

# Search by image
dss image-search --folder ./photos --query ./photos/reference.jpg --top 10

# Multiple folders, JSON output
dss image-search -f ./photos -f ./vacation --query "mountains" --format json

Text Search

# Basic search (hybrid enabled by default)
dss text-search --folder ./documents "machine learning algorithms" --top 5

# With reranking
dss text-search -f ./docs "neural networks" --rerank

# Dense-only (no sparse fusion)
dss text-search -f ./docs "query" --no-hybrid

Image Clustering

# KMeans with explicit k
dss image-cluster --folder ./photos --clusters 5

# HDBSCAN auto-detection (omit -k)
dss image-cluster -f ./photos --min-cluster-size 3

# With Florence-2 captioning for topic labels
dss image-cluster -f ./photos -k 5 --caption

# Save clustered images
dss image-cluster -f ./photos -k 8 --save-dir ./output/clusters

Unified Search

# Search across images and text
dss unified-search --image-folder ./photos --text-folder ./docs --query "sunset"

# Filter by modality
dss unified-search --image-folder ./photos --query "sunset" --filter image

Duplicate Detection

dss find-duplicates --folder ./photos --threshold 0.95

RAG (Question Answering)

dss ask --folder ./documents "What is the main conclusion?"

# With reranking (semantic chunking is enabled by default)
dss ask -f ./docs "Summarize the findings" --rerank

# Fixed chunking
dss ask -f ./docs "question" --no-semantic-chunking
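
For contrast with semantic chunking, fixed chunking splits text at regular intervals regardless of meaning. The README does not say how the library sizes its fixed chunks, so the following is only a sketch of one simple variant: sliding word windows with overlap. The function name and the size and overlap parameters are assumptions.

```python
def fixed_chunks(text, size=50, overlap=10):
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    # Stop once the remaining words are already covered by the previous window
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[start:start + size]) for start in starts]


sample = " ".join(f"w{i}" for i in range(10))
chunks = fixed_chunks(sample, size=4, overlap=1)
```

The overlap keeps a sentence that straddles a boundary at least partly visible in both neighboring chunks, at the cost of some duplicated text.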

Configuration

The CLI respects environment variables:

  • OLLAMA_LLM_MODEL — LLM model for RAG (default: gemma4:e4b)
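
Reading an environment variable with a fallback is a one-liner in Python. The sketch below mirrors the documented variable name and default. resolve_llm_model is a hypothetical helper, and treating an empty value as unset is an assumption.

```python
import os

DEFAULT_MODEL = "gemma4:e4b"  # default stated in this README


def resolve_llm_model(env=None):
    """Return OLLAMA_LLM_MODEL if set and non-empty, else the default."""
    env = os.environ if env is None else env
    # Assumption: an empty string counts as "not set"
    return env.get("OLLAMA_LLM_MODEL") or DEFAULT_MODEL


model_from_default = resolve_llm_model({})
model_from_env = resolve_llm_model({"OLLAMA_LLM_MODEL": "llama3"})
```

Passing a mapping explicitly, as above, makes the lookup easy to test without touching the real process environment.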

Requirements

  • Python >= 3.10
  • PyTorch, Sentence Transformers, Transformers, USearch, FlagEmbedding, and other dependencies (installed automatically by pip)

License

MIT

Download files

Source Distribution

deep_semantic_search-3.0.1.tar.gz (33.8 kB)

Uploaded Source

Built Distribution

deep_semantic_search-3.0.1-py3-none-any.whl (33.0 kB)

Uploaded Python 3

File details

Details for the file deep_semantic_search-3.0.1.tar.gz.

File metadata

  • Download URL: deep_semantic_search-3.0.1.tar.gz
  • Size: 33.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for deep_semantic_search-3.0.1.tar.gz
Algorithm Hash digest
SHA256 e057fd14eda2422033f3c65ec4175c75ab41f4f9a7fbdb425f2ab7d510d879ad
MD5 eefd78c6b5012bd7b508a3bcd23c896f
BLAKE2b-256 1aa6f6afe3c1a64dad28608ae77f1b2d32dceacdd0d07190e18c6a49ddb0f9f6

Provenance

The following attestation bundles were made for deep_semantic_search-3.0.1.tar.gz:

Publisher: publish.yml on Harduex/deep-semantic-search

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file deep_semantic_search-3.0.1-py3-none-any.whl.

File hashes

Hashes for deep_semantic_search-3.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 6d7efc69e1c69a848d9769ddb8d7ca8ba33bbb56c95f59a878abf5877cd59cee
MD5 52cc3bde0e46f70035ca65dd9180363f
BLAKE2b-256 f9056e6e20afe0b5b68fba9d3eb3f9b5519e7f4daeffe46e1f112af8ced6d9dc

Provenance

The following attestation bundles were made for deep_semantic_search-3.0.1-py3-none-any.whl:

Publisher: publish.yml on Harduex/deep-semantic-search
