Deep Semantic Search
A Python library for embedding, indexing, and applying semantic search for text and image data.
Features
- Multi-modal Semantic Search
  - Embed and index images using SigLIP SO400M (1152-dim, 384×384)
  - Embed and index text using BGE-M3 (1024-dim dense + sparse vectors)
  - Search images by image or text queries
  - Search text by semantic similarity with hybrid dense+sparse fusion
  - Cross-modal unified search across images and text in a shared embedding space
- Clustering & Captioning
  - Cluster image embeddings using KMeans (specify k) or HDBSCAN (auto-detect)
  - Caption images using Florence-2 (detailed captions, object detection, OCR)
  - Customizable LLM-powered topic labeling via callback
- Retrieval-Augmented Generation (RAG)
  - Answer questions based on text data using LiteLLM + Ollama
  - Semantic chunking with BGE-M3 embeddings
  - Cross-encoder reranking with BGE-reranker-v2-m3
  - Pluggable LLM via callback pattern
- Duplicate Detection
  - Find near-duplicate images or text above a similarity threshold
Installation
pip install deep-semantic-search
Install with optional extras:
pip install "deep-semantic-search[llm]" # RAG / question answering (LiteLLM)
pip install "deep-semantic-search[clustering]" # Image clustering (scikit-learn)
pip install "deep-semantic-search[viz]" # Plotting / visualization
pip install "deep-semantic-search[all]" # Everything
For development:
pip install "deep-semantic-search[dev]"
Quick Start
Image Search
from deep_semantic_search import LoadImageData, ImageIndexer, ImageSearcher
# Load and index images
loader = LoadImageData()
image_paths = loader.from_folder(["path/to/images"])
indexer = ImageIndexer(image_paths)
indexer.run_index()
# Search by text
searcher = ImageSearcher(indexer)
results = searcher.search_by_text("cat on a sofa", n=5)
for r in results:
    print(f"{r['score']:.3f} {r['path']}")
# Search by image
results = searcher.search_by_image("query.jpg", n=5)
# Find duplicate images
duplicates = searcher.find_duplicates(threshold=0.95)
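To make the threshold in the duplicate check concrete, here is a minimal pure-Python sketch of the general idea: compare every pair of embeddings by cosine similarity and keep the pairs that reach the threshold. The helper names (`cosine`, `find_duplicate_pairs`) and the toy 3-dimensional vectors are illustrative assumptions, not the library's internals, and the library may treat the boundary value differently.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicate_pairs(embeddings, threshold=0.95):
    """Return (key_a, key_b, score) for every pair scoring at or above threshold."""
    pairs = []
    for (key_a, vec_a), (key_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            pairs.append((key_a, key_b, score))
    # Most similar pairs first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Toy vectors standing in for real image embeddings (assumption for the demo).
toy_embeddings = {
    "a.jpg": [1.0, 0.0, 0.0],
    "a_copy.jpg": [0.99, 0.05, 0.0],
    "b.jpg": [0.0, 1.0, 0.0],
}
duplicate_pairs = find_duplicate_pairs(toy_embeddings, threshold=0.95)
for path_a, path_b, score in duplicate_pairs:
    print(f"{score:.3f} {path_a} ~ {path_b}")
```

A lower threshold returns more pairs, including images that are merely similar rather than near-identical.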
Text Search
from deep_semantic_search import LoadTextData, TextEmbedder, TextSearch
# Load and embed text data
loader = LoadTextData()
corpus = loader.from_folder("path/to/text/files")
embedder = TextEmbedder()
embedder.embed(corpus)
# Search with hybrid dense+sparse fusion
search = TextSearch(embedder)
results = search.find_similar("your search query", top_n=5, hybrid=True)
# With cross-encoder reranking
results = search.find_similar("query", top_n=5, rerank=True)
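The README does not say how the dense and sparse scores are fused, so the following is only a sketch of one common approach: a weighted sum of a dense similarity and a sparse token-weight overlap. The function names, the `alpha` weight, and the toy vectors are assumptions made for illustration.

```python
def dense_score(query_vec, doc_vec):
    """Dot product of two L2-normalised dense vectors (equals cosine similarity)."""
    return sum(x * y for x, y in zip(query_vec, doc_vec))


def sparse_score(query_weights, doc_weights):
    """Sum of weight products over the tokens shared by query and document."""
    return sum(
        weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    )


def hybrid_rank(query, docs, alpha=0.7):
    """Rank docs by alpha * dense + (1 - alpha) * sparse, best first."""
    scored = []
    for doc_id, doc in docs.items():
        score = (alpha * dense_score(query["dense"], doc["dense"])
                 + (1 - alpha) * sparse_score(query["sparse"], doc["sparse"]))
        scored.append((doc_id, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: doc2 is closer in dense space, doc1 shares the query's keyword.
query = {"dense": [1.0, 0.0], "sparse": {"neural": 0.8}}
docs = {
    "doc1": {"dense": [0.6, 0.8], "sparse": {"neural": 0.9, "networks": 0.5}},
    "doc2": {"dense": [0.8, 0.6], "sparse": {"cooking": 0.7}},
}
hybrid_order = [doc_id for doc_id, _ in hybrid_rank(query, docs)]
dense_only_order = [doc_id for doc_id, _ in hybrid_rank(query, docs, alpha=1.0)]
print(hybrid_order, dense_only_order)
```

In this toy case the keyword match lifts doc1 above doc2 once the sparse term is included, which is the kind of difference the hybrid option is meant to capture.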
Unified Cross-Modal Search
from deep_semantic_search import UnifiedIndexer, UnifiedSearcher
# Index images and texts in a shared embedding space
indexer = UnifiedIndexer()
indexer.add_images(image_paths)  # image_paths as returned by LoadImageData().from_folder(...)
indexer.add_texts(["description 1", "description 2"], labels=["doc1", "doc2"])
indexer.build_index()
# Search across modalities
searcher = UnifiedSearcher(indexer)
results = searcher.search("sunset over mountains", n=10)
# Filter by modality
results = searcher.search("sunset", modality_filter="image")
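As a rough illustration of what searching a shared embedding space with a modality filter involves, the sketch below stores images and texts side by side, drops entries of the wrong modality, and ranks the rest by dot product. The `Entry` class, the `search` helper, and the shape of the result dictionaries are hypothetical and are not taken from the library.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    label: str
    modality: str  # "image" or "text"
    vector: list


def search(entries, query_vector, n=10, modality_filter=None):
    """Rank entries by dot product with the query, optionally keeping one modality."""
    candidates = [e for e in entries if modality_filter in (None, e.modality)]
    scored = [
        (entry, sum(q * v for q, v in zip(query_vector, entry.vector)))
        for entry in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [
        {"label": entry.label, "modality": entry.modality, "score": score}
        for entry, score in scored[:n]
    ]


# Toy 2-dimensional vectors standing in for shared-space embeddings.
entries = [
    Entry("sunset.jpg", "image", [0.9, 0.1]),
    Entry("doc1", "text", [0.8, 0.2]),
    Entry("cat.jpg", "image", [0.1, 0.9]),
]
all_hits = search(entries, [1.0, 0.0])
image_hits = search(entries, [1.0, 0.0], modality_filter="image")
print([hit["label"] for hit in all_hits])
print([hit["label"] for hit in image_hits])
```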
Image Clustering
from deep_semantic_search import ImageIndexer, ImageClusterer, ImageCaptioner
# image_paths as returned by LoadImageData().from_folder(...)
indexer = ImageIndexer(image_paths)
indexer.run_index()
# Auto-detect clusters with HDBSCAN
clusterer = ImageClusterer(indexer)
result = clusterer.cluster() # n_clusters=None → HDBSCAN
# Or specify exact number with KMeans
result = clusterer.cluster(n_clusters=5)
# With Florence-2 captioning for topic labels
captioner = ImageCaptioner()
result = clusterer.cluster(n_clusters=5, captioner=captioner)
# Save organized clusters to disk
clusterer.save_clusters("./output/clusters")
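For readers unfamiliar with what specifying the number of clusters means, here is a small pure-Python sketch of Lloyd's k-means algorithm followed by grouping file names by cluster label. It is a stand-in for illustration only: the clustering extra is listed as depending on scikit-learn, and the seeding strategy, function name, and toy data below are assumptions.

```python
from collections import defaultdict


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Toy 2-dimensional vectors standing in for image embeddings.
paths = ["beach1.jpg", "city1.jpg", "beach2.jpg", "city2.jpg"]
vectors = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
labels = kmeans(vectors, k=2)

# Group file names by cluster label, as one would before copying them to folders.
groups = defaultdict(list)
for path, label in zip(paths, labels):
    groups[label].append(path)
print(dict(groups))
```

HDBSCAN differs in that it infers the number of clusters from density and can leave outliers unassigned, which is why no k is passed in that mode.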
RAG (Question Answering)
Requires the llm extra for LiteLLM: pip install "deep-semantic-search[llm]"
from deep_semantic_search import RAG
texts = ["Document 1 content...", "Document 2 content..."]
# With semantic chunking and reranking
rag = RAG(rerank=True)
answer = rag.ask(texts, "What is the main topic?", semantic_chunking=True)
# With a custom LLM (my_custom_llm is a callable you supply)
answer = rag.ask(texts, "Summarize this.", llm_fn=my_custom_llm)
# Backward-compatible wrapper (my_fn is likewise a callable you supply)
from deep_semantic_search import ask_question
answer = ask_question(texts, "What is the main topic?", llm_fn=my_fn)
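The snippets above pass an LLM in through `llm_fn` without showing what such a callable looks like. The sketch below illustrates the callback pattern in isolation, assuming the callable takes a prompt string and returns the answer string; that signature is an assumption, not something the README states. Retrieval here is a toy word-overlap ranking instead of embeddings, and every function name is hypothetical.

```python
def retrieve(chunks, question, top_k=2):
    """Rank chunks by shared words with the question (toy stand-in for embeddings)."""
    question_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(question_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(context_chunks, question):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


def answer_with_callback(texts, question, llm_fn):
    """Retrieve context, build the prompt, and delegate generation to llm_fn."""
    return llm_fn(build_prompt(retrieve(texts, question), question))


def my_custom_llm(prompt: str) -> str:
    # Hypothetical stub: a real callback would send `prompt` to a model client.
    return f"stub answer ({len(prompt.splitlines())} prompt lines)"


texts = [
    "Paris is the capital of France.",
    "Bananas are yellow.",
    "The capital of Japan is Tokyo.",
]
print(answer_with_callback(texts, "What is the capital of France?", my_custom_llm))
```

Keeping generation behind a plain callable means the retrieval code never needs to know which model or provider produces the answer.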
Custom Data Paths
By default, metadata is stored in ~/.deep-semantic-search/. Override per instance:
indexer = ImageIndexer(image_paths, metadata_dir="./my_project/index")
embedder = TextEmbedder(metadata_dir="./my_project/text_index")
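The default-with-override behaviour can be pictured with a few lines of `pathlib`. This is a sketch of the general pattern, not the library's code: `resolve_metadata_dir` is a hypothetical helper, and only the default location `~/.deep-semantic-search/` comes from the text above.

```python
from pathlib import Path

# Default location named in the README.
DEFAULT_METADATA_DIR = Path.home() / ".deep-semantic-search"


def resolve_metadata_dir(metadata_dir=None):
    """Return the per-instance override as an absolute path, else the default."""
    if metadata_dir is None:
        return DEFAULT_METADATA_DIR
    return Path(metadata_dir).expanduser().resolve()


print(resolve_metadata_dir())
print(resolve_metadata_dir("./my_project/index"))
```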
API Reference
Image Module
- LoadImageData: Load image paths from folders or CSV
- ImageIndexer: SigLIP embedding + USearch indexing
- ImageSearcher: Image/text similarity search + duplicate detection
- ImageClusterer: KMeans/HDBSCAN clustering with topic labeling
- ImageCaptioner: Florence-2 image captioning
Text Module
- LoadTextData: Load text from folders (.txt/.html) or CSV
- TextEmbedder: BGE-M3 dense + sparse embeddings
- TextSearch: Hybrid search with optional reranking + duplicate detection
Unified Search
- UnifiedIndexer: Cross-modal SigLIP indexing for images + text
- UnifiedSearcher: Search across modalities
RAG
- RAG: Object-oriented RAG with semantic chunking and reranking
- ask_question(): Backward-compatible wrapper
Exceptions
- DeepSemanticSearchError: Base exception
- IndexNotFoundError, ModelLoadError, SearchError, EmbeddingError, ClusteringError, MigrationError, CaptioningError
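A base exception makes it possible to handle one specific failure first and fall back to a catch-all for everything else the package raises. The sketch below uses local stand-in classes so that it runs without the package installed; it assumes the specific errors subclass the base exception, which the README implies but does not state, and `load_index` is a hypothetical helper that exists only to trigger the handlers.

```python
class DeepSemanticSearchError(Exception):
    """Local stand-in for the library's base exception."""


class IndexNotFoundError(DeepSemanticSearchError):
    """Local stand-in; assumed to subclass the base exception."""


def load_index(path):
    # Hypothetical helper that always fails, used only for the demonstration.
    raise IndexNotFoundError(f"no index at {path}")


def safe_load(path):
    """Handle the specific error first, then anything else from the package."""
    try:
        return load_index(path)
    except IndexNotFoundError as exc:
        return f"rebuild needed: {exc}"
    except DeepSemanticSearchError as exc:
        return f"library error: {exc}"


print(safe_load("./my_project/index"))
```

With the real package, the two stand-in classes would be replaced by imports of the same names.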
CLI Tool
The package includes dss, a command-line interface for all major features.
General Usage
dss --help # Show all commands
dss --version # Show version
dss <command> --help # Help for a specific command
Global flags: -v/--verbose for debug output, -q/--quiet to suppress progress.
Image Search
# Search by text
dss image-search --folder ./photos --query "sunset over the ocean" --top 5
# Search by image
dss image-search --folder ./photos --query ./photos/reference.jpg --top 10
# Multiple folders, JSON output
dss image-search -f ./photos -f ./vacation --query "mountains" --format json
Text Search
# Basic search (hybrid enabled by default)
dss text-search --folder ./documents "machine learning algorithms" --top 5
# With reranking
dss text-search -f ./docs "neural networks" --rerank
# Dense-only (no sparse fusion)
dss text-search -f ./docs "query" --no-hybrid
Image Clustering
# KMeans with explicit k
dss image-cluster --folder ./photos --clusters 5
# HDBSCAN auto-detection (omit -k)
dss image-cluster -f ./photos --min-cluster-size 3
# With Florence-2 captioning for topic labels
dss image-cluster -f ./photos -k 5 --caption
# Save clustered images
dss image-cluster -f ./photos -k 8 --save-dir ./output/clusters
Unified Search
# Search across images and text
dss unified-search --image-folder ./photos --text-folder ./docs --query "sunset"
# Filter by modality
dss unified-search --image-folder ./photos --query "sunset" --filter image
Duplicate Detection
dss find-duplicates --folder ./photos --threshold 0.95
RAG (Question Answering)
dss ask --folder ./documents "What is the main conclusion?"
# With reranking (semantic chunking is enabled by default)
dss ask -f ./docs "Summarize the findings" --rerank
# Fixed chunking
dss ask -f ./docs "question" --no-semantic-chunking
Configuration
The CLI respects environment variables:
- OLLAMA_LLM_MODEL: LLM model for RAG (default: gemma4:e4b)
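An environment variable with a fallback default is usually read along the following lines. This is a generic sketch, not the CLI's actual code: `resolve_llm_model` is a hypothetical helper, while the variable name and default value are the ones listed above.

```python
import os


def resolve_llm_model(environ=None, default="gemma4:e4b"):
    """Return OLLAMA_LLM_MODEL if it is set and non-empty, else the default."""
    environ = os.environ if environ is None else environ
    return environ.get("OLLAMA_LLM_MODEL") or default


# Explicit mappings are passed here so the demo does not depend on the real environment.
print(resolve_llm_model({}))
print(resolve_llm_model({"OLLAMA_LLM_MODEL": "another-model"}))
```

To change the model for a single run, set the variable in the shell before invoking the command.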
Requirements
- Python >= 3.10
- PyTorch, Sentence Transformers, Transformers, USearch, FlagEmbedding, and other dependencies (installed automatically by pip)
License
MIT
File details
Details for the file deep_semantic_search-3.0.3.tar.gz.
File metadata
- Download URL: deep_semantic_search-3.0.3.tar.gz
- Upload date:
- Size: 34.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2c5fc3ff9f469c49d94099999af506c83247a489ed5109f48f3591fb62e9066e |
| MD5 | a857e5f8dba62a3fbb2f3970bd1e009a |
| BLAKE2b-256 | 1c34c14217e28d86eeebdc6cb3aabb5d3d63addc56754bd778b51f400c7f4d22 |
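A downloaded archive can be checked against the published SHA256 digest with the standard library alone. The sketch below assumes the source distribution was saved under its original file name in the current directory; the helper names are illustrative.

```python
import hashlib
import hmac

# SHA256 digest published above for deep_semantic_search-3.0.3.tar.gz.
EXPECTED_SDIST_SHA256 = "2c5fc3ff9f469c49d94099999af506c83247a489ed5109f48f3591fb62e9066e"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True when the file's SHA256 digest matches the expected hex string."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())


# Example call (requires the file to exist locally):
# verify_download("deep_semantic_search-3.0.3.tar.gz", EXPECTED_SDIST_SHA256)
```

pip can enforce the same check at install time through its hash-checking mode when hashes are pinned in a requirements file.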
Provenance
The following attestation bundles were made for deep_semantic_search-3.0.3.tar.gz:
Publisher: publish.yml on Harduex/deep-semantic-search
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: deep_semantic_search-3.0.3.tar.gz
  - Subject digest: 2c5fc3ff9f469c49d94099999af506c83247a489ed5109f48f3591fb62e9066e
- Sigstore transparency entry: 1288279651
- Sigstore integration time:
- Permalink: Harduex/deep-semantic-search@2cc43948c53bd72747b1cf48d55aedf94c459539
- Branch / Tag: refs/tags/v3.0.3
- Owner: https://github.com/Harduex
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2cc43948c53bd72747b1cf48d55aedf94c459539
- Trigger Event: push
File details
Details for the file deep_semantic_search-3.0.3-py3-none-any.whl.
File metadata
- Download URL: deep_semantic_search-3.0.3-py3-none-any.whl
- Upload date:
- Size: 33.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a2075095d2d254666858aeab1f6906d8b4cc6b500d5ed954b1cddb102823034f |
| MD5 | 1afb7ef34bfd22536e808a6b408700a7 |
| BLAKE2b-256 | af07fb85732c9f10f9e7ff58aea6fc3e05c75efd0426591fe428890ffdf339ab |
Provenance
The following attestation bundles were made for deep_semantic_search-3.0.3-py3-none-any.whl:
Publisher: publish.yml on Harduex/deep-semantic-search
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: deep_semantic_search-3.0.3-py3-none-any.whl
  - Subject digest: a2075095d2d254666858aeab1f6906d8b4cc6b500d5ed954b1cddb102823034f
- Sigstore transparency entry: 1288279685
- Sigstore integration time:
- Permalink: Harduex/deep-semantic-search@2cc43948c53bd72747b1cf48d55aedf94c459539
- Branch / Tag: refs/tags/v3.0.3
- Owner: https://github.com/Harduex
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2cc43948c53bd72747b1cf48d55aedf94c459539
- Trigger Event: push