# Deep Semantic Search

A Python library for embedding, indexing, and applying semantic search for text and image data.
## Features
- **Multi-modal Semantic Search**
  - Embed and index text data using Sentence Transformers (`paraphrase-multilingual-MiniLM-L12-v2`)
  - Embed and index image data using CLIP
  - Search images by image or text queries
  - Search text by semantic similarity
- **Clustering & Captioning**
  - Cluster image embeddings using PyTorch KMeans (GPU support)
  - Caption images using BLIP
  - Customizable LLM-powered topic labeling via a callback
- **Retrieval-Augmented Generation (RAG)**
  - Answer questions based on text data
  - Pluggable LLM via a callback pattern
## Installation

```bash
pip install deep-semantic-search
```

For development (quote the extras so the brackets survive shells like zsh):

```bash
pip install "deep-semantic-search[dev]"
```
## Quick Start
### Image Search

```python
from deep_semantic_search import LoadImageData, ImageIndexer, ImageSearcher

# Load images
loader = LoadImageData()
image_paths = loader.from_folder(["path/to/images"])

# Index images
indexer = ImageIndexer(image_paths)
indexer.run_index()

# Search by text
searcher = ImageSearcher(indexer)
results = searcher.search_by_text("cat on a sofa", n=5)
for path, score in results.items():
    print(f"{score:.3f}  {path}")

# Search by image
results = searcher.search_by_image("query.jpg", n=5)
```
### Text Search

```python
from deep_semantic_search import LoadTextData, TextEmbedder, TextSearch

# Load text data
loader = LoadTextData()
corpus = loader.from_folder("path/to/text/files")

# Embed
embedder = TextEmbedder()
embedder.embed(corpus)

# Search
search = TextSearch(embedder)
results = search.find_similar("your search query", top_n=5)
for r in results:
    print(f"Score: {r['score']:.3f}  {r['path']}")
```
### Image Clustering

```python
from deep_semantic_search import ImageIndexer, ImageClusterer, ImageCaptioner

indexer = ImageIndexer(image_paths)
indexer.run_index()

# Optional: use a captioner for topic labels
captioner = ImageCaptioner()

clusterer = ImageClusterer(indexer)
result = clusterer.cluster(n_clusters=5, captioner=captioner)

# Save organized clusters to disk
clusterer.save_clusters("./output/clusters")
```
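The LLM-powered topic labeling mentioned under Features is pluggable via a callback. Below is a minimal sketch of what such a callback could look like; the name `label_topic` and its signature (a list of a cluster's BLIP captions in, a short label out) are assumptions for illustration, not the library's documented API, so check the package docs for the exact hook:

```python
from collections import Counter

def label_topic(captions: list[str]) -> str:
    """Hypothetical topic-labeling callback for a cluster.

    In practice you would send the captions to an LLM and return its
    one-line summary; this stub just picks the most frequent word so
    the shape of the contract is clear.
    """
    words = [w.lower() for caption in captions for w in caption.split()]
    if not words:
        return "unlabeled"
    return Counter(words).most_common(1)[0][0]

print(label_topic(["a cat on a sofa", "a sleeping cat"]))
```

The key design point is that the library never needs to know which LLM you use; it only calls your function with captions and stores the returned label.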
### RAG (Question Answering)

```python
from deep_semantic_search import ask_question

texts = ["Document 1 content...", "Document 2 content..."]
answer = ask_question(texts, "What is the main topic?")
print(answer)

# With a custom LLM
answer = ask_question(texts, "Summarize this.", llm_fn=my_custom_llm)
```
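Since `ask_question` accepts any callable for `llm_fn`, you can plug in whichever model you like. A minimal sketch of the expected shape, assuming the callback receives the assembled prompt (retrieved context plus question) and returns the answer as a string; verify the exact signature against the library before relying on it:

```python
def my_custom_llm(prompt: str) -> str:
    """Hypothetical callback for ask_question's llm_fn parameter.

    Swap the stub body for a real LLM call (an OpenAI, Ollama, or
    other client); the contract is simply string in, string out.
    """
    # A real implementation would send `prompt` to an LLM endpoint here.
    return f"stub answer ({len(prompt)} prompt chars)"

print(my_custom_llm("Context: ...\nQuestion: Summarize this."))
```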
### Custom Data Paths

By default, metadata is stored in `~/.deep-semantic-search/`. Override it per instance:

```python
indexer = ImageIndexer(image_paths, metadata_dir="./my_project/index")
embedder = TextEmbedder(metadata_dir="./my_project/text_index")
```
## API Reference

### Image Module

- `LoadImageData` — Load image paths from folders or CSV
- `ImageIndexer` — CLIP embedding + FAISS indexing
- `ImageSearcher` — Image/text similarity search
- `ImageClusterer` — KMeans clustering with topic labeling
- `ImageCaptioner` — BLIP image captioning

### Text Module

- `LoadTextData` — Load text from folders (`.txt`/`.html`) or CSV
- `TextEmbedder` — Sentence Transformer embeddings
- `TextSearch` — Cosine similarity search

### RAG

- `ask_question()` — RAG Q&A with a pluggable LLM

### Exceptions

- `DeepSemanticSearchError` — Base exception
- `IndexNotFoundError`, `ModelLoadError`, `SearchError`, `EmbeddingError`, `ClusteringError`
## CLI Tool

The package includes `dss`, a command-line interface for all major features. After installing the package, the `dss` command is available globally.

### General Usage

```bash
dss --help              # Show all commands
dss --version           # Show version
dss <command> --help    # Help for a specific command
```

Global flags: `-v`/`--verbose` for debug output, `-q`/`--quiet` to suppress progress output.
### Image Search

Search images by text query or by image similarity:

```bash
# Search by text
dss image-search --folder ./photos --query "sunset over the ocean" --top 5

# Search by image
dss image-search --folder ./photos --query ./photos/reference.jpg --top 10

# Multiple folders, JSON output
dss image-search -f ./photos -f ./vacation --query "mountains" --format json

# Force re-indexing
dss image-search -f ./photos --query "cat" --reindex
```
### Text Search

Search text documents by semantic similarity:

```bash
dss text-search --folder ./documents "machine learning algorithms" --top 5

# CSV output
dss text-search -f ./docs "neural networks" --format csv

# Custom model
dss text-search -f ./docs "query" --model sentence-transformers/all-MiniLM-L6-v2
```
### Image Clustering

Cluster images using KMeans on CLIP embeddings:

```bash
# Basic clustering
dss image-cluster --folder ./photos --clusters 5

# With BLIP captioning for topic labels
dss image-cluster -f ./photos -k 5 --caption

# Save clustered images into organized folders
dss image-cluster -f ./photos -k 8 --caption --save-dir ./output/clusters

# JSON output
dss image-cluster -f ./photos -k 3 --format json
```
### RAG (Question Answering)

Ask questions over text documents using Retrieval-Augmented Generation:

```bash
dss ask --folder ./documents "What is the main conclusion?"

# Custom Ollama model
dss ask -f ./research "Summarize the findings" --model llama2:13b

# Adjust chunking
dss ask -f ./docs "question" --chunk-size 2000 --chunk-overlap 200
```
### Configuration

The CLI respects the following environment variables:

- `OLLAMA_LLM_MODEL` — LLM model for RAG (default: `gemma4:e4b`)
- `DEFAULT_SEARCH_FOLDER_PATH` — Default folder path

CLI flags override environment variables when provided.
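For example, to point the RAG command at a different model for a whole shell session (the model tag and folder below are placeholders, not defaults):

```shell
# Placeholder values; substitute your own model tag and folder
export OLLAMA_LLM_MODEL="llama2:13b"
export DEFAULT_SEARCH_FOLDER_PATH="$HOME/documents"
```

After this, `dss ask "question"` would use these values unless overridden by flags.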
## Requirements

- Python >= 3.10
- PyTorch, Sentence Transformers, Transformers, FAISS, LangChain, and more (installed automatically)

## License

MIT