
Qdrant Vector Aggregator

A Python library for aggregating embeddings in Qdrant collections with smart content concatenation. Reduce your vector database size while maintaining semantic search quality and preserving complete document content.

Key Features

  • 14 Aggregation Methods: Average, PCA, attention-based pooling, and more
  • Smart Content Concatenation: Automatically detects chunk ordering and concatenates text in proper sequence
  • Qdrant Cloud & Local Support: Works with both cloud and self-hosted instances
  • Batch Processing: Efficient handling of large collections with progress tracking
  • Flexible Grouping: Aggregate by any metadata field (document name, ID, category, etc.)
  • Production Ready: Includes error handling, logging, and verification tools

What It Does

Transform chunked embeddings into document-level embeddings:

Input Collection (many chunks)
├── Document A - Chunk 1 (embedding + text)
├── Document A - Chunk 2 (embedding + text)
├── Document A - Chunk 3 (embedding + text)
├── Document B - Chunk 1 (embedding + text)
└── ...

                    ↓ Aggregate

Output Collection (fewer documents)
├── Document A (averaged embedding + concatenated text)
├── Document B (averaged embedding + concatenated text)
└── ...

Result: Significant compression with preserved semantic meaning and complete document text!
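For intuition, the default "average" method is just an element-wise mean over a document's chunk vectors. A minimal numpy sketch with made-up numbers (illustration only, not the library's internal code):

```python
import numpy as np

# Hypothetical embeddings for one document's three chunks (4 dimensions).
chunk_embeddings = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.3, 0.2, 0.1, 0.0],
    [0.2, 0.2, 0.2, 0.2],
])

# "average" aggregation: one document-level vector, same dimensionality.
doc_embedding = chunk_embeddings.mean(axis=0)
assert doc_embedding.shape == (4,)
```

Three stored vectors collapse into one, which is where the compression comes from.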

Quick Start

Installation

# Clone or download this repository
cd qdrant_vector_aggregator

# Install dependencies
pip install qdrant-client numpy scikit-learn python-dotenv

Configuration

  1. Copy the example environment file:
cp .env.example .env
  2. Edit .env with your Qdrant credentials:
QDRANT_URL=https://your-cluster.cloud.qdrant.io
QDRANT_API_KEY=your-api-key-here

Basic Usage

from qdrant_vector_aggregator import aggregate_embeddings

# Aggregate embeddings by document name
aggregate_embeddings(
    input_collection_name="my_chunks_collection",
    column_name="metadata.document_name",  # Field to group by
    output_collection_name="my_documents_collection",
    method="average"  # Aggregation method
)

Smart Content Concatenation

The aggregator automatically handles page_content concatenation:

How It Works

  1. Detects Ordering Fields: Checks for common ordering fields:

    • chunk_index, chunk_number, chunk_id
    • page, page_number, page_num
    • sequence, order, index, position
    • id (if sequential)
  2. Sorts & Concatenates: If ordering found, sorts chunks and concatenates text in proper order

  3. Adds Metadata: Includes aggregation statistics:

    • chunk_count: Number of chunks aggregated
    • has_ordered_content: Whether content was concatenated
    • ordering_field: Which field was used for ordering

Example Result

{
    "page_content": "Chapter 1...\n\nChapter 2...\n\nChapter 3...",  # Concatenated in order
    "metadata": {
        "name": "Document Title",
        "id": 12345
    },
    "chunk_count": 34,
    "has_ordered_content": True,
    "ordering_field": "metadata.id"
}

If no ordering field is found, page_content is set to an empty string.
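Put together, the detect-sort-concatenate flow (including the empty-string fallback) can be sketched roughly as follows; `concatenate_chunks` is a hypothetical helper for illustration, not the library's actual API:

```python
# Ordering fields checked in priority order (list taken from the docs above).
ORDERING_FIELDS = ["chunk_index", "chunk_number", "chunk_id",
                   "page", "page_number", "page_num",
                   "sequence", "order", "index", "position", "id"]

def concatenate_chunks(chunks):
    """Sort chunks by the first ordering field present in all of them, then join text."""
    ordering_field = next(
        (f for f in ORDERING_FIELDS if all(f in c for c in chunks)), None
    )
    if ordering_field is None:
        # No reliable order: fall back to empty content, as described above.
        return "", {"chunk_count": len(chunks), "has_ordered_content": False}
    ordered = sorted(chunks, key=lambda c: c[ordering_field])
    text = "\n\n".join(c["page_content"] for c in ordered)
    return text, {
        "chunk_count": len(chunks),
        "has_ordered_content": True,
        "ordering_field": ordering_field,
    }

chunks = [
    {"chunk_index": 1, "page_content": "Chapter 2..."},
    {"chunk_index": 0, "page_content": "Chapter 1..."},
]
text, stats = concatenate_chunks(chunks)
# text is "Chapter 1..." followed by "Chapter 2...", despite the input order.
```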

Available Aggregation Methods

Method | Description | Best For
------ | ----------- | --------
average | Arithmetic mean (default) | General purpose, balanced
weighted_average | Weighted mean | When chunks have different importance
pca | Principal Component Analysis | Dimensionality reduction
centroid | K-Means centroid | Cluster-based aggregation
attentive_pooling | Attention-based pooling | Context-aware aggregation
max_pooling | Maximum values per dimension | Highlighting key features
min_pooling | Minimum values per dimension | Conservative aggregation
median | Element-wise median | Robust to outliers
trimmed_mean | Mean after trimming extremes | Outlier-resistant
geometric_mean | Geometric mean | Multiplicative relationships
harmonic_mean | Harmonic mean | Rate-based data
power_mean | Generalized mean | Flexible aggregation
soft_dtw | Soft Dynamic Time Warping | Sequence alignment
procrustes | Procrustes analysis | Shape-based alignment
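Several of the element-wise methods are easy to picture with numpy (illustrative only; the library's implementations live in embedding_methods.py and may differ in detail):

```python
import numpy as np

# Four chunk vectors (3 dims); note the 5.0 outlier in the last dimension.
vecs = np.array([
    [0.0, 1.0, 5.0],
    [0.2, 0.8, 0.1],
    [0.4, 0.9, 0.2],
    [0.6, 1.1, 0.3],
])

avg = vecs.mean(axis=0)           # "average": last dim pulled up to 1.4 by the outlier
med = np.median(vecs, axis=0)     # "median": last dim stays at 0.25, robust to the outlier
max_pool = vecs.max(axis=0)       # "max_pooling": keeps the strongest signal per dimension
# "trimmed_mean": drop the lowest and highest value per dimension, then average.
trimmed = np.sort(vecs, axis=0)[1:-1].mean(axis=0)
```

Comparing `avg` and `med` on the last dimension shows why median and trimmed_mean are listed as outlier-resistant.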

๐Ÿ› ๏ธ Included Tools

1. Test Connection

python3 test_connection.py

Verifies Qdrant connection and displays available collections.

2. Example Usage

python3 example_usage.py

Example script showing how to aggregate a collection.

3. Verify Aggregation

python3 verify_aggregation.py

Checks aggregation results and content concatenation statistics.

4. Debug Aggregation

python3 debug_aggregation.py

Detailed debugging information for troubleshooting.

Advanced Usage

Custom Aggregation

from qdrant_vector_aggregator import aggregate_embeddings
from qdrant_client.models import Distance

# PCA-based aggregation with custom settings
aggregate_embeddings(
    input_collection_name="source_collection",
    column_name="metadata.category",
    output_collection_name="aggregated_collection",
    method="pca",
    distance_metric=Distance.COSINE,
    qdrant_url="https://your-cluster.cloud.qdrant.io",
    api_key="your-api-key"
)

Weighted Average

# Aggregate with custom weights (e.g., by chunk importance)
aggregate_embeddings(
    input_collection_name="source_collection",
    column_name="metadata.document_id",
    output_collection_name="weighted_collection",
    method="weighted_average",
    weights=[0.5, 0.3, 0.2]  # Weights for first 3 chunks
)
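Conceptually, a weighted average normalizes the weights and takes a weight-proportional mean; exactly how the library matches weights to chunks (ordering, chunks beyond the weight list) should be confirmed against embedding_methods.py. A numpy sketch of the arithmetic:

```python
import numpy as np

# Three hypothetical chunk vectors and the weights from the example above.
chunk_vecs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
weights = np.array([0.5, 0.3, 0.2])

# np.average normalizes the weights internally (here they already sum to 1.0).
weighted_vec = np.average(chunk_vecs, axis=0, weights=weights)
```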

Attention-Based Pooling

# Context-aware aggregation
aggregate_embeddings(
    input_collection_name="source_collection",
    column_name="metadata.document_id",
    output_collection_name="attention_collection",
    method="attentive_pooling"
)
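One common formulation of attention pooling scores each chunk against the mean vector, softmaxes the scores, and returns the attention-weighted sum; the library's exact mechanism may differ, so treat this as a sketch of the idea:

```python
import numpy as np

def attentive_pool(vecs: np.ndarray) -> np.ndarray:
    """Attention-weighted sum: chunks closer to the mean get more weight."""
    query = vecs.mean(axis=0)
    scores = vecs @ query                # similarity of each chunk to the mean
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                   # softmax over chunks
    return (vecs * attn[:, None]).sum(axis=0)

vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
pooled = attentive_pool(vecs)
```

Unlike a plain average, the weighting adapts to each document's own chunk distribution, which is what "context-aware" refers to in the method table.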

๐Ÿ” Searching Aggregated Collections

from qdrant_client import QdrantClient

client = QdrantClient(url="your-url", api_key="your-key")

# Search the aggregated collection
results = client.search(
    collection_name="aggregated_collection",
    query_vector=your_query_embedding,  # must match the collection's vector size
    limit=5
)

# Each result now represents a complete document
for result in results:
    print(f"Document: {result.payload['metadata']['name']}")
    print(f"Score: {result.score}")
    print(f"Chunks: {result.payload['chunk_count']}")
    print(f"Content: {result.payload['page_content'][:200]}...")

๐Ÿ“ Project Structure

qdrant_vector_aggregator/
├── .env                          # Your credentials (not in git)
├── .env.example                  # Template
├── .gitignore                    # Git ignore rules
├── README.md                     # This file
├── SETUP_INSTRUCTIONS.md         # Detailed setup guide
├── LICENSE                       # MIT License
├── setup.py                      # Installation script
│
├── qdrant_vector_aggregator/     # Main package
│   ├── __init__.py              # Package initialization
│   ├── aggregator.py            # Core aggregation logic
│   ├── config.py                # Configuration management
│   ├── embedding_methods.py     # All 14 aggregation methods
│   ├── qdrant_collection_helpers.py  # Qdrant utilities
│   └── utils.py                 # Helper functions
│
├── test_connection.py           # Connection testing
├── example_usage.py             # Usage examples
├── debug_aggregation.py         # Debugging tool
└── verify_aggregation.py        # Verification tool

Real-World Example

from qdrant_vector_aggregator import aggregate_embeddings

# Aggregate document chunks into complete documents
result = aggregate_embeddings(
    input_collection_name="my_document_chunks",
    column_name="metadata.document_name",  # Group by document name
    output_collection_name="my_complete_documents",
    method="average"
)

# Example results:
# ✅ Significant compression ratio
# ✅ Content automatically concatenated in proper order
# ✅ Semantic meaning preserved
# ✅ Ready for document-level semantic search

Troubleshooting

Connection Issues

# Test your connection
python3 test_connection.py

Timeout Errors

The aggregator uses batch processing (100 points per batch) to prevent timeouts. For very large collections, you can adjust the batch size in utils.py.
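The batching itself is a plain fixed-size chunking of the point list; a generic sketch (hypothetical helper — the library's actual batching lives in utils.py):

```python
def batched(items, batch_size=100):
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 250 stand-in points would be upserted as batches of 100, 100, and 50.
sizes = [len(batch) for batch in batched(list(range(250)))]
```

Smaller batches mean more round trips but a lower chance of hitting a per-request timeout.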

Content Not Concatenating

Run the verification tool to check:

python3 verify_aggregation.py

This will show:

  • Which ordering field was detected (if any)
  • How many documents have concatenated content
  • Average content length

๐Ÿ“ Requirements

  • Python 3.7+
  • qdrant-client
  • numpy
  • scikit-learn
  • python-dotenv

๐Ÿค Contributing

Contributions are welcome! Feel free to:

  • Add new aggregation methods
  • Improve content concatenation logic
  • Add more examples
  • Report issues

License

MIT License - see LICENSE file for details.

๐Ÿ™ Acknowledgments

Based on the original faiss_vector_aggregator project, adapted for Qdrant with enhanced features including smart content concatenation.

Repository

GitHub: qdrant_vector_aggregator

Support

For issues or questions:

  1. Check SETUP_INSTRUCTIONS.md for detailed setup help
  2. Run debug_aggregation.py for troubleshooting
  3. Review the example scripts for usage patterns

Made with ❤️ for the Qdrant community
