
Surrogate based Concept Retrieval for Large Datasets


Surrogate Concept Retrieval

Python 3.9+ | License: MIT

surrogate_concept_retrieval

Implementation for the paper Concept Retrieval - What and How?

Package Status

Added:

  • Project URLs and Documentation links
  • Keywords and classifiers for PyPI
  • Populated __init__.py for proper imports
  • Documentation structure with Sphinx
  • Example code
  • Improved README with usage examples

🔄 In Progress:

  • Comprehensive documentation
  • Test coverage
  • CI/CD setup

Getting Started

# Install the package in editable mode from a local checkout of the repository
pip install -e .

See RECOMMENDATIONS.md for full details on package improvements.

Overview

This package provides tools for extracting concepts from large datasets using the surrogate concept retrieval method.

Features

  • Fast embedding indexing using FAISS
  • GPU-accelerated similarity computation
  • Automatic concept extraction from embedding spaces
  • Flexible concept filtering and refinement
  • Support for projection-based concept analysis
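
To make the similarity-search features above more concrete, here is a minimal NumPy sketch of exact top-k search by inner product, which is conceptually what a flat inner-product index such as FAISS's `IndexFlatIP` computes. The helper name `top_k_inner_product` is hypothetical and is not part of this package; the package's actual indexing internals may differ.

```python
import numpy as np


def top_k_inner_product(database, queries, k):
    """Hypothetical helper: brute-force top-k search by inner product."""
    # Score every query against every database row: shape (n_queries, n_database).
    scores = queries @ database.T
    # argpartition puts the k largest scores per row first, in arbitrary order.
    candidates = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    candidate_scores = np.take_along_axis(scores, candidates, axis=1)
    # Sort only those k candidates by descending score.
    order = np.argsort(-candidate_scores, axis=1)
    indices = np.take_along_axis(candidates, order, axis=1)
    return np.take_along_axis(scores, indices, axis=1), indices


rng = np.random.default_rng(0)
database = rng.standard_normal((1000, 64)).astype(np.float32)
database /= np.linalg.norm(database, axis=1, keepdims=True)  # unit-norm rows

queries = database[:3]  # use the first three rows as queries
top_scores, top_indices = top_k_inner_product(database, queries, k=5)
print(top_indices)
```

With unit-norm rows the inner product equals cosine similarity, so each query's best match is itself. A dedicated index library performs the same search far more efficiently on large datasets.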

Installation

# Install from PyPI
pip install coret

# Install with development dependencies
pip install "coret[dev]"

Quick Start

import numpy as np
from coret import ConceptRetrieval

# Load your embeddings (example uses random data)
embeddings = np.random.randn(1000, 768)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)

# Initialize concept retrieval
concept_retriever = ConceptRetrieval()

# Fit the model with embeddings
concept_retriever.fit(embeddings=embeddings)

# Select a random query embedding for demonstration
query_index = np.random.randint(0, len(embeddings))
query_embedding = embeddings[query_index]

# Retrieve concepts for the query
concepts = concept_retriever.retrieve(
    query=query_embedding,
    number_of_concepts=5,
    number_of_samples_per_concept=5
)

# Print retrieved concepts
top_k_concepts_indices_s = concepts['top_k_concepts_indices_s']

print(f"Query index: {query_index}")
for i, concept_indices in enumerate(top_k_concepts_indices_s):
    print(f"Concept {i+1}:")
    print(f"  Indices: {concept_indices}")
    print()
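
The preprocessing steps at the top of the Quick Start (row-wise L2 normalization, conversion to float32, contiguous memory layout) can be bundled into one function. The sketch below is an illustration only: `prepare_embeddings` is a hypothetical helper, not part of the `coret` API, and whether the package strictly requires this preprocessing is an assumption based on the example above.

```python
import numpy as np


def prepare_embeddings(raw):
    """Hypothetical helper: L2-normalize rows, return C-contiguous float32."""
    arr = np.asarray(raw, dtype=np.float64)
    if arr.ndim != 2:
        raise ValueError("expected a 2D array of shape (n_samples, n_features)")
    norms = np.linalg.norm(arr, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("zero-norm rows cannot be normalized")
    # Normalize in float64 for accuracy, then cast once at the end.
    return np.ascontiguousarray(arr / norms, dtype=np.float32)


raw = np.random.default_rng(0).standard_normal((1000, 768))
embeddings = prepare_embeddings(raw)
print(embeddings.shape, embeddings.dtype)
```

Rejecting zero-norm rows up front avoids NaN values silently entering the index.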

Requirements

  • Python 3.9+
  • CUDA-compatible GPU (recommended for large datasets)
  • Dependencies:
    • numpy
    • faiss-gpu (or faiss-cpu)
    • scipy
    • scikit-learn
    • tqdm
    • cupy (for GPU acceleration)
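
Before running on a new machine, it can help to check which of the dependencies listed above are importable. The following is a small sketch using only the standard library; the helper `check_modules` is hypothetical. Note that the import names differ from some package names: both `faiss-gpu` and `faiss-cpu` are imported as `faiss`, and `scikit-learn` is imported as `sklearn`.

```python
from importlib.util import find_spec

# Import names corresponding to the dependencies in the Requirements list.
MODULE_NAMES = ["numpy", "faiss", "scipy", "sklearn", "tqdm", "cupy"]


def check_modules(names):
    """Hypothetical helper: map each module name to whether it can be found."""
    # find_spec returns None for a top-level module that is not installed.
    return {name: find_spec(name) is not None for name in names}


status = check_modules(MODULE_NAMES)
for name, available in status.items():
    print(f"{name:8s} {'found' if available else 'missing'}")
```

This only checks that each module can be located; it does not verify versions or that a GPU is actually usable.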

Documentation

For detailed API documentation and examples, please visit our documentation site.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

