Surrogate based Concept Retrieval for Large Datasets
Surrogate Concept Retrieval
surrogate_concept_retrieval
Implementation for the paper "Concept Retrieval - What and How?"
Package Status
✅ Added:
- Project URLs and Documentation links
- Keywords and classifiers for PyPI
- Populated __init__.py for proper imports
- Documentation structure with Sphinx
- Example code
- Improved README with usage examples
🔄 In Progress:
- Comprehensive documentation
- Test coverage
- CI/CD setup
Getting Started
# Install the package
pip install -e .
See RECOMMENDATIONS.md for full details on package improvements.
Overview
This package provides tools for extracting concepts from large datasets using the surrogate concept retrieval method.
Features
- Fast embedding indexing using FAISS
- GPU-accelerated similarity computation
- Automatic concept extraction from embedding spaces
- Flexible concept filtering and refinement
- Support for projection-based concept analysis
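To make the first two features concrete, the following is a minimal sketch of the underlying operation: nearest-neighbor search by cosine similarity over unit-normalized embeddings. It uses brute-force NumPy as a stand-in and is not the package's FAISS or GPU code path; the helper name `top_k_neighbors` is hypothetical.

```python
import numpy as np


def top_k_neighbors(embeddings: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k rows of `embeddings` most similar to `query`.

    Assumes the rows of `embeddings` and `query` are L2-normalized, so the
    dot product equals cosine similarity. Hypothetical helper, not part of coret.
    """
    scores = embeddings @ query                 # cosine similarity per row
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k candidates
    return top[np.argsort(-scores[top])]        # order candidates by score


rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 64)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

neighbors = top_k_neighbors(embeddings, embeddings[42], k=5)
print(neighbors)  # the query row itself (index 42) is ranked first
```

Brute force costs one matrix-vector product per query, which is fine for a few thousand vectors; an index such as FAISS becomes worthwhile when the dataset is much larger.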
Installation
# Install from PyPI
pip install coret
# Install with development dependencies
pip install "coret[dev]"
Quick Start
import numpy as np
from coret import ConceptRetrieval
# Load your embeddings (example uses random data)
embeddings = np.random.randn(1000, 768)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
# Initialize concept retrieval
concept_retriever = ConceptRetrieval()
# Fit the model with embeddings
concept_retriever.fit(embeddings=embeddings)
# Select a random query embedding for demonstration
query_index = np.random.randint(0, len(embeddings))
query_embedding = embeddings[query_index]
# Retrieve concepts for the query
concepts = concept_retriever.retrieve(
    query=query_embedding,
    number_of_concepts=5,
    number_of_samples_per_concept=5
)
# Print retrieved concepts
top_k_concepts_indices_s = concepts['top_k_concepts_indices_s']
print(f"Query index: {query_index}")
for i, concept_indices in enumerate(top_k_concepts_indices_s):
    print(f"Concept {i+1}:")
    print(f" Indices: {concept_indices}")
    print()
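As a follow-up to the Quick Start, the sketch below summarizes each retrieved concept by the mean cosine similarity between its member embeddings and the query. It assumes, based only on the printing loop above, that `top_k_concepts_indices_s` is a sequence of integer index collections into `embeddings`. The helper `concept_similarity_summary` and the stand-in index lists are hypothetical and not part of coret.

```python
import numpy as np


def concept_similarity_summary(embeddings, query, concept_indices_list):
    """Mean cosine similarity to `query` for each concept's member samples.

    Assumes unit-normalized embeddings and that each entry of
    `concept_indices_list` holds row indices into `embeddings`.
    """
    summary = []
    for indices in concept_indices_list:
        members = embeddings[np.asarray(indices, dtype=int)]
        summary.append(float(np.mean(members @ query)))
    return summary


# Stand-in data; in practice pass concepts['top_k_concepts_indices_s'].
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((100, 16)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
query = embeddings[0]
fake_concepts = [[0, 1, 2], [3, 4, 5, 6]]

scores = concept_similarity_summary(embeddings, query, fake_concepts)
for i, score in enumerate(scores, start=1):
    print(f"Concept {i}: mean similarity to query = {score:.3f}")
```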
Requirements
- Python 3.9+
- CUDA-compatible GPU (recommended for large datasets)
- Dependencies:
  - numpy
  - faiss-gpu (or faiss-cpu)
  - scipy
  - scikit-learn
  - tqdm
  - cupy (for GPU acceleration)
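Since the GPU dependencies may not be present on every machine, it can help to check which ones are importable before running on a large dataset. The following sketch uses only the standard library. The helper name `available_backends` is hypothetical, and it assumes the import names `faiss` and `cupy`, which are the usual module names for the faiss-gpu/faiss-cpu and cupy distributions.

```python
import importlib.util


def available_backends(modules=("numpy", "faiss", "cupy")):
    """Map each module name to True if it can be located without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


status = available_backends()
for name, found in status.items():
    print(f"{name}: {'available' if found else 'missing'}")

if not status["cupy"]:
    print("cupy not found; GPU acceleration will likely be unavailable.")
```

Note that finding `faiss` does not tell you whether the GPU or the CPU build is installed; this check only reports that the module can be located.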
Documentation
For detailed API documentation and examples, please visit our documentation site.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Built Distribution
File details
Details for the file coret-0.1.1-py3-none-any.whl.
File metadata
- Download URL: coret-0.1.1-py3-none-any.whl
- Upload date:
- Size: 41.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 497c356eb1285389a3e5798a8aa42a25a523fd93794fe71613dd816f75737531 |
| MD5 | 9b4a0827fd35ec7829a1a6e2d68f0e38 |
| BLAKE2b-256 | 9b9a7364d92e6fa18b502d04a65a37da20362072092a4cfaae6fb678228d1df2 |