Aggregate embeddings in Qdrant collections with smart content concatenation

Qdrant Vector Aggregator

A Python library for aggregating embeddings in Qdrant collections with smart content concatenation. Reduce your vector database size while maintaining semantic search quality and preserving complete document content.

🌟 Key Features

  • 14 Aggregation Methods: Average, PCA, attention-based pooling, and more
  • Smart Content Concatenation: Automatically detects chunk ordering and concatenates text in proper sequence
  • Qdrant Cloud & Local Support: Works with both cloud and self-hosted instances
  • Batch Processing: Efficient handling of large collections with progress tracking
  • Flexible Grouping: Aggregate by any metadata field (document name, ID, category, etc.)
  • Production Ready: Includes error handling, logging, and verification tools

📊 What It Does

Transform chunked embeddings into document-level embeddings:

Input Collection (2,707 chunks)
├── Document A - Chunk 1 (embedding + text)
├── Document A - Chunk 2 (embedding + text)
├── Document A - Chunk 3 (embedding + text)
├── Document B - Chunk 1 (embedding + text)
└── ...

                    ↓ Aggregate

Output Collection (114 documents)
├── Document A (averaged embedding + concatenated text)
├── Document B (averaged embedding + concatenated text)
└── ...

Result: 23.75x compression (2,707 chunks down to 114 documents), with document-level semantics and the complete document text preserved.
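The transformation above can be sketched in plain Python. This is a minimal illustration of the default "average" idea, not the library's implementation: the record layout (`vector`, `payload`) and the helper name `average_by_group` are assumptions made for the example.

```python
from collections import defaultdict


def average_by_group(chunks, group_key):
    """Group chunk records by payload[group_key] and average their vectors.

    Hypothetical helper: the record shape is assumed for illustration only.
    """
    groups = defaultdict(list)
    for chunk in chunks:
        groups[chunk["payload"][group_key]].append(chunk["vector"])

    averaged = {}
    for name, vectors in groups.items():
        # zip(*vectors) walks the vectors one dimension at a time.
        averaged[name] = [sum(column) / len(vectors) for column in zip(*vectors)]
    return averaged


chunks = [
    {"vector": [1.0, 0.0], "payload": {"document_name": "A"}},
    {"vector": [0.0, 1.0], "payload": {"document_name": "A"}},
    {"vector": [2.0, 2.0], "payload": {"document_name": "B"}},
]
doc_vectors = average_by_group(chunks, "document_name")
print(doc_vectors)  # {'A': [0.5, 0.5], 'B': [2.0, 2.0]}

# The compression figure quoted above is simply input points / output points.
compression = 2707 / 114
print(round(compression, 2))  # 23.75
```

Three chunk vectors go in and two document vectors come out, one per distinct `document_name`.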

🚀 Quick Start

Installation

# Clone or download this repository
cd qdrant_vector_aggregator

# Install dependencies
pip install qdrant-client numpy scikit-learn python-dotenv

Configuration

  1. Copy the example environment file:
cp .env.example .env
  2. Edit .env with your Qdrant credentials:
QDRANT_URL=https://your-cluster.cloud.qdrant.io
QDRANT_API_KEY=your-api-key-here
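If you want to check these two variables from your own script before running an aggregation, something like the following may help. It is a sketch only: `load_qdrant_settings` is a hypothetical helper, not part of the package, and treating the API key as optional (for a local instance without authentication) is an assumption.

```python
import os


def load_qdrant_settings(env=None):
    """Return (url, api_key) from an environment mapping.

    Hypothetical helper for illustration. With python-dotenv installed,
    calling load_dotenv() beforehand copies .env entries into os.environ.
    """
    env = os.environ if env is None else env
    url = (env.get("QDRANT_URL") or "").strip()
    if not url:
        raise ValueError("QDRANT_URL is not set; copy .env.example to .env and fill it in")
    # Assumption: an unauthenticated local instance needs no key, so a missing key becomes None.
    api_key = env.get("QDRANT_API_KEY") or None
    return url, api_key


# Passing a plain dict makes the helper easy to try without touching the real environment.
settings = load_qdrant_settings({"QDRANT_URL": "http://localhost:6333"})
print(settings)  # ('http://localhost:6333', None)
```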

Basic Usage

from qdrant_vector_aggregator import aggregate_embeddings

# Aggregate embeddings by document name
aggregate_embeddings(
    input_collection_name="my_chunks_collection",
    column_name="metadata.document_name",  # Field to group by
    output_collection_name="my_documents_collection",
    method="average"  # Aggregation method
)

🎯 Smart Content Concatenation

The aggregator automatically handles page_content concatenation:

How It Works

  1. Detects Ordering Fields: Checks for common ordering fields:

    • chunk_index, chunk_number, chunk_id
    • page, page_number, page_num
    • sequence, order, index, position
    • id (if sequential)
  2. Sorts & Concatenates: If ordering found, sorts chunks and concatenates text in proper order

  3. Adds Metadata: Includes aggregation statistics:

    • chunk_count: Number of chunks aggregated
    • has_ordered_content: Whether content was concatenated
    • ordering_field: Which field was used for ordering

Example Result

{
    "page_content": "Chapter 1...\n\nChapter 2...\n\nChapter 3...",  # Concatenated in order
    "metadata": {
        "name": "Document Title",
        "id": 12345
    },
    "chunk_count": 34,
    "has_ordered_content": True,
    "ordering_field": "metadata.id"
}

If no ordering field is found, page_content is set to an empty string.
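The detect, sort, and concatenate steps described in this section can be sketched as follows. This is an approximation under stated assumptions, not the package's code: it assumes the ordering field sits under each chunk's `metadata` key, joins text with a blank line (as in the example result), and skips the "id (if sequential)" check. The name `concatenate_chunks` is hypothetical.

```python
# Candidate ordering fields, in the order listed in this section.
ORDERING_FIELDS = [
    "chunk_index", "chunk_number", "chunk_id",
    "page", "page_number", "page_num",
    "sequence", "order", "index", "position",
]


def concatenate_chunks(chunks, separator="\n\n"):
    """Return (page_content, ordering_field) for one group of chunk payloads.

    Hypothetical helper: uses the first candidate field present on every chunk.
    """
    for field in ORDERING_FIELDS:
        if chunks and all(field in c.get("metadata", {}) for c in chunks):
            ordered = sorted(chunks, key=lambda c: c["metadata"][field])
            return separator.join(c["page_content"] for c in ordered), field
    # No usable ordering field: fall back to empty content, as described above.
    return "", None


chunks = [
    {"page_content": "second", "metadata": {"chunk_index": 1}},
    {"page_content": "first", "metadata": {"chunk_index": 0}},
]
text, field = concatenate_chunks(chunks)
print(repr(text), field)  # 'first\n\nsecond' chunk_index
```

Requiring the field on every chunk in the group is a conservative choice for the sketch; a partial ordering would otherwise place some chunks arbitrarily.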

📚 Available Aggregation Methods

Method             Description                   Best For
average            Arithmetic mean (default)     General purpose, balanced
weighted_average   Weighted mean                 When chunks have different importance
pca                Principal Component Analysis  Dimensionality reduction
centroid           K-Means centroid              Cluster-based aggregation
attentive_pooling  Attention-based pooling       Context-aware aggregation
max_pooling        Maximum values per dimension  Highlighting key features
min_pooling        Minimum values per dimension  Conservative aggregation
median             Element-wise median           Robust to outliers
trimmed_mean       Mean after trimming extremes  Outlier-resistant
geometric_mean     Geometric mean                Multiplicative relationships
harmonic_mean      Harmonic mean                 Rate-based data
power_mean         Generalized mean              Flexible aggregation
soft_dtw           Soft Dynamic Time Warping     Sequence alignment
procrustes         Procrustes analysis           Shape-based alignment
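To make the difference between a few of the element-wise methods concrete, here is a small standard-library sketch. It illustrates the general techniques (mean, median, max, min, trimmed mean per dimension) and is not taken from the package; `pool` and `trimmed_mean` are hypothetical names, and trimming one value from each end is an arbitrary choice for the example.

```python
from statistics import mean, median


def pool(vectors, reducer):
    """Apply a reducer to each dimension across all vectors."""
    return [reducer(column) for column in zip(*vectors)]


def trimmed_mean(values, trim=1):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    kept = sorted(values)[trim:len(values) - trim]
    return mean(kept)


# The last vector is an outlier in the first dimension.
vectors = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, 40.0]]

print(pool(vectors, mean))          # [26.5, 25.0]
print(pool(vectors, median))        # [2.5, 25.0]
print(pool(vectors, max))           # [100.0, 40.0]
print(pool(vectors, min))           # [1.0, 10.0]
print(pool(vectors, trimmed_mean))  # [2.5, 25.0]
```

The outlier pulls the plain mean of the first dimension to 26.5, while the median and trimmed mean both stay at 2.5, which is the behavior the "Robust to outliers" and "Outlier-resistant" entries refer to.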

🛠️ Included Tools

1. Test Connection

python3 test_connection.py

Verifies Qdrant connection and displays available collections.

2. Aggregate Collections

python3 aggregate_conventions.py

Example script showing how to aggregate a collection.

3. Verify Aggregation

python3 verify_aggregation.py

Checks aggregation results and content concatenation statistics.

4. Debug Aggregation

python3 debug_aggregation.py

Detailed debugging information for troubleshooting.

📖 Advanced Usage

Custom Aggregation

from qdrant_vector_aggregator import aggregate_embeddings
from qdrant_client.models import Distance

# PCA-based aggregation with custom settings
aggregate_embeddings(
    input_collection_name="source_collection",
    column_name="metadata.category",
    output_collection_name="aggregated_collection",
    method="pca",
    distance_metric=Distance.COSINE,
    qdrant_url="https://your-cluster.cloud.qdrant.io",
    api_key="your-api-key"
)

Weighted Average

# Aggregate with custom weights (e.g., by chunk importance)
aggregate_embeddings(
    input_collection_name="source_collection",
    column_name="metadata.document_id",
    output_collection_name="weighted_collection",
    method="weighted_average",
    weights=[0.5, 0.3, 0.2]  # Weights for first 3 chunks
)
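For reference, a weighted mean over chunk vectors can be written as below. This is a sketch of the general formula only. How the package handles a weights list whose length differs from the number of chunks in a group is not documented here, so the sketch simply requires one weight per chunk; `weighted_average` is a hypothetical standalone function.

```python
def weighted_average(vectors, weights):
    """Weighted mean per dimension: sum(w_i * v_i) / sum(w_i)."""
    if len(vectors) != len(weights):
        raise ValueError("need exactly one weight per chunk vector")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return [
        sum(w * x for w, x in zip(weights, column)) / total
        for column in zip(*vectors)
    ]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.3, 0.2]
result = weighted_average(vectors, weights)
print(result)  # approximately [0.7, 0.5]
```

Dividing by the sum of the weights means they do not have to be normalized beforehand.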

Attention-Based Pooling

# Context-aware aggregation
aggregate_embeddings(
    input_collection_name="source_collection",
    column_name="metadata.document_id",
    output_collection_name="attention_collection",
    method="attentive_pooling"
)

๐Ÿ” Searching Aggregated Collections

from qdrant_client import QdrantClient

client = QdrantClient(url="your-url", api_key="your-key")

# Search the aggregated collection
results = client.search(
    collection_name="aggregated_collection",
    query_vector=your_query_embedding,  # 1536-dim vector
    limit=5
)

# Each result now represents a complete document
for result in results:
    print(f"Document: {result.payload['metadata']['name']}")
    print(f"Score: {result.score}")
    print(f"Chunks: {result.payload['chunk_count']}")
    print(f"Content: {result.payload['page_content'][:200]}...")
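When the collection uses cosine distance, the score attached to each result reflects the cosine similarity between the query vector and the document-level vector. The following dependency-free sketch shows that ranking on toy two-dimensional vectors; the document names and vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy document-level vectors (illustrative data only).
documents = {"Document A": [1.0, 0.0], "Document B": [0.6, 0.8]}
query = [0.0, 1.0]

ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)  # ['Document B', 'Document A']
```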

๐Ÿ“ Project Structure

qdrant_vector_aggregator/
├── .env                          # Your credentials (not in git)
├── .env.example                  # Template
├── .gitignore                    # Git ignore rules
├── README.md                     # This file
├── SETUP_INSTRUCTIONS.md         # Detailed setup guide
├── LICENSE                       # MIT License
├── setup.py                      # Installation script
│
├── qdrant_vector_aggregator/     # Main package
│   ├── __init__.py              # Package initialization
│   ├── aggregator.py            # Core aggregation logic
│   ├── config.py                # Configuration management
│   ├── embedding_methods.py     # All 14 aggregation methods
│   ├── qdrant_collection_helpers.py  # Qdrant utilities
│   └── utils.py                 # Helper functions
│
├── test_connection.py           # Connection testing
├── example_usage.py             # Usage examples
├── aggregate_conventions.py     # Working example
├── debug_aggregation.py         # Debugging tool
└── verify_aggregation.py        # Verification tool

🎓 Real-World Example

From the included aggregate_conventions.py:

from qdrant_vector_aggregator import aggregate_embeddings

# Aggregate 2,707 convention chunks into 114 documents
result = aggregate_embeddings(
    input_collection_name="conventions_cadre_sectorielles",
    column_name="metadata.metadata.name",  # Nested field
    output_collection_name="conventions_aggregated",
    method="average"
)

# Results:
# ✅ Input: 2,707 chunks
# ✅ Output: 114 documents
# ✅ Compression: 23.75x
# ✅ Content: 100% concatenated in order
# ✅ Average length: 37,669 characters per document
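The grouping field in this example, metadata.metadata.name, is a dotted path into a nested payload. One plausible way to resolve such a path is sketched below; `get_nested` is a hypothetical helper rather than the package's own function, and the payload value is invented for the demonstration.

```python
def get_nested(payload, dotted_path, default=None):
    """Walk a nested dict along a dotted path such as 'metadata.metadata.name'."""
    current = payload
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Illustrative payload; the name value is made up.
payload = {"metadata": {"metadata": {"name": "Convention 42"}}}

print(get_nested(payload, "metadata.metadata.name"))  # Convention 42
print(get_nested(payload, "metadata.missing"))        # None
```

Returning a default instead of raising keeps chunks with incomplete metadata from aborting a long run; whether to skip or group such chunks is a separate decision.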

🔧 Troubleshooting

Connection Issues

# Test your connection
python3 test_connection.py

Timeout Errors

The aggregator uses batch processing (100 points per batch) to prevent timeouts. For very large collections, you can adjust the batch size in utils.py.
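As a rough picture of what batched reading looks like, the sketch below pages through a collection with the Qdrant client's `scroll` method, which returns a page of points together with the offset of the next page (or None when there are no more). The generator name `iter_points` is hypothetical, and this is not necessarily how utils.py is written.

```python
def iter_points(client, collection_name, batch_size=100):
    """Yield every point in a collection, fetching batch_size points per request.

    `client` is expected to behave like qdrant_client.QdrantClient.
    """
    offset = None
    while True:
        points, offset = client.scroll(
            collection_name=collection_name,
            limit=batch_size,
            offset=offset,
            with_payload=True,
            with_vectors=True,
        )
        yield from points
        if offset is None:  # the server reported no further pages
            break
```

With a real client this would be used as `for point in iter_points(QdrantClient(url=..., api_key=...), "my_chunks_collection"): ...`. A smaller `batch_size` trades more round trips for shorter individual requests, which is the usual lever when requests time out.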

Content Not Concatenating

Run the verification tool to check:

python3 verify_aggregation.py

This will show:

  • Which ordering field was detected (if any)
  • How many documents have concatenated content
  • Average content length

๐Ÿ“ Requirements

  • Python 3.7+
  • qdrant-client
  • numpy
  • scikit-learn
  • python-dotenv

๐Ÿค Contributing

Contributions are welcome! Feel free to:

  • Add new aggregation methods
  • Improve content concatenation logic
  • Add more examples
  • Report issues

📄 License

MIT License - see LICENSE file for details.

๐Ÿ™ Acknowledgments

Based on the original faiss_vector_aggregator project, adapted for Qdrant with enhanced features including smart content concatenation.

🔗 Repository

GitHub: qdrant_vector_aggregator

📞 Support

For issues or questions:

  1. Check SETUP_INSTRUCTIONS.md for detailed setup help
  2. Run debug_aggregation.py for troubleshooting
  3. Review the example scripts for usage patterns

Made with ❤️ for the Qdrant community
