Aggregate embeddings in Qdrant collections with smart content concatenation

Qdrant Vector Aggregator

A Python library for aggregating embeddings in Qdrant collections with smart content concatenation. Reduce your vector database size while maintaining semantic search quality and preserving complete document content.

🌟 Key Features

14 Aggregation Methods: Average, PCA, attention-based pooling, and more
Smart Content Concatenation: Automatically detects chunk ordering and concatenates text in proper sequence
Qdrant Cloud & Local Support: Works with both cloud and self-hosted instances
Batch Processing: Efficient handling of large collections with progress tracking
Flexible Grouping: Aggregate by any metadata field (document name, ID, category, etc.)
Production Ready: Includes error handling, logging, and verification tools

📊 What It Does

Transform chunked embeddings into document-level embeddings:

Input Collection (2,707 chunks)
├── Document A - Chunk 1 (embedding + text)
├── Document A - Chunk 2 (embedding + text)
├── Document A - Chunk 3 (embedding + text)
├── Document B - Chunk 1 (embedding + text)
└── ...

                    ↓ Aggregate

Output Collection (114 documents)
├── Document A (averaged embedding + concatenated text)
├── Document B (averaged embedding + concatenated text)
└── ...

Result: about 23.75x fewer points (2,707 chunks reduced to 114 documents), with document-level embeddings and the complete document text preserved.
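As a rough illustration of the transformation above, the sketch below groups chunk embeddings by a document key and averages each group element-wise. It is plain Python and not the library's implementation; the `average_by_document` helper and the toy `chunks` data are made up for this example.

```python
from collections import defaultdict


def average_by_document(chunks):
    """Group (doc_key, embedding) pairs by key and average each group element-wise."""
    groups = defaultdict(list)
    for doc_key, embedding in chunks:
        groups[doc_key].append(embedding)

    averaged = {}
    for doc_key, vectors in groups.items():
        # zip(*vectors) walks the vectors one dimension at a time.
        averaged[doc_key] = [sum(dim) / len(vectors) for dim in zip(*vectors)]
    return averaged


# Hypothetical toy data: three chunks of Document A, one chunk of Document B.
chunks = [
    ("Document A", [1.0, 0.0]),
    ("Document A", [0.0, 1.0]),
    ("Document A", [2.0, 2.0]),
    ("Document B", [4.0, 6.0]),
]
document_vectors = average_by_document(chunks)

# Compression ratio: input points divided by output points.
compression = len(chunks) / len(document_vectors)
```

With the figures quoted above, the same ratio works out to 2,707 / 114 ≈ 23.75.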

🚀 Quick Start

Installation

# Clone or download this repository
cd qdrant_vector_aggregator

# Install dependencies
pip install qdrant-client numpy scikit-learn python-dotenv

Configuration

Copy the example environment file:

cp .env.example .env

Edit .env with your Qdrant credentials:

QDRANT_URL=https://your-cluster.cloud.qdrant.io
QDRANT_API_KEY=your-api-key-here
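How the library reads these variables is not shown here. The following is a minimal sketch, assuming they have already been loaded into the process environment (for instance with python-dotenv's `load_dotenv()`). The `load_qdrant_settings` helper is hypothetical and only illustrates validating the two values before a client is created.

```python
import os


def load_qdrant_settings(env=os.environ):
    """Read QDRANT_URL and QDRANT_API_KEY from a mapping of environment variables."""
    url = env.get("QDRANT_URL", "").strip()
    if not url:
        raise ValueError("QDRANT_URL is not set; check your .env file")
    # Assumption: a local, unauthenticated instance may have no API key at all.
    api_key = env.get("QDRANT_API_KEY", "").strip() or None
    return {"url": url, "api_key": api_key}


# Passing a plain dict keeps the example independent of the real environment.
settings = load_qdrant_settings(
    {
        "QDRANT_URL": "https://your-cluster.cloud.qdrant.io",
        "QDRANT_API_KEY": "your-api-key-here",
    }
)
```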

Basic Usage

from qdrant_vector_aggregator import aggregate_embeddings

# Aggregate embeddings by document name
aggregate_embeddings(
    input_collection_name="my_chunks_collection",
    column_name="metadata.document_name",  # Field to group by
    output_collection_name="my_documents_collection",
    method="average"  # Aggregation method
)

🎯 Smart Content Concatenation

The aggregator automatically handles page_content concatenation:

How It Works

1. Detects Ordering Fields: Checks for common ordering fields:
   - chunk_index, chunk_number, chunk_id
   - page, page_number, page_num
   - sequence, order, index, position
   - id (if sequential)
2. Sorts & Concatenates: If an ordering field is found, sorts the chunks by it and concatenates their text in that order
3. Adds Metadata: Includes aggregation statistics:
   - chunk_count: Number of chunks aggregated
   - has_ordered_content: Whether content was concatenated
   - ordering_field: Which field was used for ordering

Example Result

{
    "page_content": "Chapter 1...\n\nChapter 2...\n\nChapter 3...",  # Concatenated in order
    "metadata": {
        "name": "Document Title",
        "id": 12345
    },
    "chunk_count": 34,
    "has_ordered_content": True,
    "ordering_field": "metadata.id"
}

If no ordering field is found, page_content is set to an empty string.
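The detect, sort, and concatenate steps described in this section could look roughly like the sketch below. It is not the library's code. It assumes the ordering fields live under each chunk's `metadata` dict, that chunks are joined with a blank line (as the example result suggests), and it skips the "if sequential" check for `id`. The function names are hypothetical.

```python
ORDERING_FIELDS = [
    "chunk_index", "chunk_number", "chunk_id",
    "page", "page_number", "page_num",
    "sequence", "order", "index", "position",
    "id",
]


def detect_ordering_field(payloads):
    """Return the first candidate field that holds a number in every chunk, else None."""
    if not payloads:
        return None
    for field in ORDERING_FIELDS:
        values = [p.get("metadata", {}).get(field) for p in payloads]
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
            return field
    return None


def concatenate_chunks(payloads):
    """Build the aggregated payload fields for one group of chunk payloads."""
    field = detect_ordering_field(payloads)
    result = {
        "page_content": "",  # stays empty when no ordering field is found
        "chunk_count": len(payloads),
        "has_ordered_content": field is not None,
        "ordering_field": None,
    }
    if field is not None:
        ordered = sorted(payloads, key=lambda p: p["metadata"][field])
        result["page_content"] = "\n\n".join(p["page_content"] for p in ordered)
        result["ordering_field"] = f"metadata.{field}"
    return result


# Hypothetical chunks stored out of order.
example = concatenate_chunks([
    {"page_content": "Chapter 2...", "metadata": {"chunk_index": 1}},
    {"page_content": "Chapter 1...", "metadata": {"chunk_index": 0}},
])
```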

📚 Available Aggregation Methods

Method	Description	Best For
`average`	Arithmetic mean (default)	General purpose, balanced
`weighted_average`	Weighted mean	When chunks have different importance
`pca`	Principal Component Analysis	Dimensionality reduction
`centroid`	K-Means centroid	Cluster-based aggregation
`attentive_pooling`	Attention-based pooling	Context-aware aggregation
`max_pooling`	Maximum values per dimension	Highlighting key features
`min_pooling`	Minimum values per dimension	Conservative aggregation
`median`	Element-wise median	Robust to outliers
`trimmed_mean`	Mean after trimming extremes	Outlier-resistant
`geometric_mean`	Geometric mean	Multiplicative relationships
`harmonic_mean`	Harmonic mean	Rate-based data
`power_mean`	Generalized mean	Flexible aggregation
`soft_dtw`	Soft Dynamic Time Warping	Sequence alignment
`procrustes`	Procrustes analysis	Shape-based alignment
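To make a few of the element-wise methods in the table concrete, here is a small plain-Python sketch of max pooling, min pooling, median, and trimmed mean. These are generic definitions and not the library's implementations; the `trim_fraction` parameter and its default are assumptions.

```python
from statistics import median


def max_pooling(vectors):
    """Largest value in each dimension."""
    return [max(dim) for dim in zip(*vectors)]


def min_pooling(vectors):
    """Smallest value in each dimension."""
    return [min(dim) for dim in zip(*vectors)]


def median_pooling(vectors):
    """Median value in each dimension."""
    return [median(dim) for dim in zip(*vectors)]


def trimmed_mean(vectors, trim_fraction=0.25):
    """Per dimension, drop the lowest and highest trim_fraction of values, then average."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    k = int(len(vectors) * trim_fraction)  # values dropped at each end
    pooled = []
    for dim in zip(*vectors):
        kept = sorted(dim)[k:len(dim) - k]
        pooled.append(sum(kept) / len(kept))
    return pooled


# Hypothetical chunk embeddings; the last one is an outlier.
vectors = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, -50.0]]
```

On this toy input the median and trimmed mean ignore the outlier chunk, while max and min pooling are driven by it.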

🛠️ Included Tools

1. Test Connection

python3 test_connection.py

Verifies Qdrant connection and displays available collections.

2. Aggregate Collections

python3 aggregate_conventions.py

Example script showing how to aggregate a collection.

3. Verify Aggregation

python3 verify_aggregation.py

Checks aggregation results and content concatenation statistics.

4. Debug Aggregation

python3 debug_aggregation.py

Detailed debugging information for troubleshooting.

📖 Advanced Usage

Custom Aggregation

from qdrant_vector_aggregator import aggregate_embeddings
from qdrant_client.models import Distance

# PCA-based aggregation with custom settings
aggregate_embeddings(
    input_collection_name="source_collection",
    column_name="metadata.category",
    output_collection_name="aggregated_collection",
    method="pca",
    distance_metric=Distance.COSINE,
    qdrant_url="https://your-cluster.cloud.qdrant.io",
    api_key="your-api-key"
)

Weighted Average

# Aggregate with custom weights (e.g., by chunk importance)
aggregate_embeddings(
    input_collection_name="source_collection",
    column_name="metadata.document_id",
    output_collection_name="weighted_collection",
    method="weighted_average",
    weights=[0.5, 0.3, 0.2]  # Weights for first 3 chunks
)
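For reference, a weighted mean of chunk vectors can be sketched as follows. This is a generic formulation and not the library's code. It requires exactly one weight per chunk and normalizes by the weight sum; how the library treats groups whose size differs from the length of `weights` is not documented here.

```python
def weighted_average(vectors, weights):
    """Element-wise weighted mean; weights are normalized by their sum."""
    if len(vectors) != len(weights):
        raise ValueError("need exactly one weight per vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [
        sum(w * x for w, x in zip(weights, dim)) / total
        for dim in zip(*vectors)
    ]


# Hypothetical 2-dimensional embeddings for three chunks of one document.
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = weighted_average(chunk_vectors, [0.5, 0.3, 0.2])
```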

Attention-Based Pooling

# Context-aware aggregation
aggregate_embeddings(
    input_collection_name="source_collection",
    column_name="metadata.document_id",
    output_collection_name="attention_collection",
    method="attentive_pooling"
)

🔍 Searching Aggregated Collections

from qdrant_client import QdrantClient

client = QdrantClient(url="your-url", api_key="your-key")

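# Search the aggregated collection.
# Note: recent qdrant-client releases deprecate search() in favor of query_points().
results = client.search(
    collection_name="aggregated_collection",
    query_vector=your_query_embedding,  # Must match the collection's vector size (1536 in this example)
    limit=5
)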

# Each result now represents a complete document
for result in results:
    print(f"Document: {result.payload['metadata']['name']}")
    print(f"Score: {result.score}")
    print(f"Chunks: {result.payload['chunk_count']}")
    print(f"Content: {result.payload['page_content'][:200]}...")

📁 Project Structure

qdrant_vector_aggregator/
├── .env                          # Your credentials (not in git)
├── .env.example                  # Template
├── .gitignore                    # Git ignore rules
├── README.md                     # This file
├── SETUP_INSTRUCTIONS.md         # Detailed setup guide
├── LICENSE                       # MIT License
├── setup.py                      # Installation script
│
├── qdrant_vector_aggregator/     # Main package
│   ├── __init__.py              # Package initialization
│   ├── aggregator.py            # Core aggregation logic
│   ├── config.py                # Configuration management
│   ├── embedding_methods.py     # All 14 aggregation methods
│   ├── qdrant_collection_helpers.py  # Qdrant utilities
│   └── utils.py                 # Helper functions
│
├── test_connection.py           # Connection testing
├── example_usage.py             # Usage examples
├── aggregate_conventions.py     # Working example
├── debug_aggregation.py         # Debugging tool
└── verify_aggregation.py        # Verification tool

🎓 Real-World Example

From the included aggregate_conventions.py:

from qdrant_vector_aggregator import aggregate_embeddings

# Aggregate 2,707 convention chunks into 114 documents
result = aggregate_embeddings(
    input_collection_name="conventions_cadre_sectorielles",
    column_name="metadata.metadata.name",  # Nested field
    output_collection_name="conventions_aggregated",
    method="average"
)

# Results:
# ✅ Input: 2,707 chunks
# ✅ Output: 114 documents
# ✅ Compression: 23.75x
# ✅ Content: 100% concatenated in order
# ✅ Average length: 37,669 characters per document
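The grouping field in this example is a dotted path into a nested payload. One plausible way to resolve such a path is sketched below; the `get_nested` helper and the sample payload are hypothetical and not taken from the library.

```python
def get_nested(payload, dotted_path, default=None):
    """Follow a dotted path such as 'metadata.metadata.name' through nested dicts."""
    current = payload
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical chunk payload with doubly nested metadata.
payload = {
    "page_content": "Article 1",
    "metadata": {"metadata": {"name": "Convention 42"}},
}
name = get_nested(payload, "metadata.metadata.name")
```

Chunks for which the path resolves to the same value would end up in the same group.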

🔧 Troubleshooting

Connection Issues

# Test your connection
python3 test_connection.py

Timeout Errors

The aggregator uses batch processing (100 points per batch) to prevent timeouts. For very large collections, you can adjust the batch size in utils.py.
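The batching idea can be sketched with a small generator that yields fixed-size groups of points. The contents of utils.py are not shown here, so the `batched` helper and its `batch_size` parameter are assumptions made for illustration.

```python
from itertools import islice


def batched(items, batch_size=100):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 2,707 points split into batches of 100 points each.
batch_sizes = [len(batch) for batch in batched(range(2707), batch_size=100)]
```

A smaller batch size means more round trips but less data per request, which is the usual trade-off when a request times out.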

Content Not Concatenating

Run the verification tool to check:

python3 verify_aggregation.py

This will show:

Which ordering field was detected (if any)
How many documents have concatenated content
Average content length

📝 Requirements

Python 3.7+
qdrant-client
numpy
scikit-learn
python-dotenv

🤝 Contributing

Contributions are welcome! Feel free to:

Add new aggregation methods
Improve content concatenation logic
Add more examples
Report issues

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Based on the original faiss_vector_aggregator project, adapted for Qdrant with enhanced features including smart content concatenation.

🔗 Repository

GitHub: qdrant_vector_aggregator

📞 Support

For issues or questions:

Check SETUP_INSTRUCTIONS.md for detailed setup help
Run debug_aggregation.py for troubleshooting
Review the example scripts for usage patterns

Made with ❤️ for the Qdrant community

Release history

1.0.2 (Nov 3, 2025)
1.0.1 (Nov 3, 2025)
1.0.0 (Nov 3, 2025, this version)

Download files

Source Distribution

qdrant_vector_aggregator-1.0.0.tar.gz (17.4 kB)

Uploaded Nov 3, 2025 Source

Built Distribution

qdrant_vector_aggregator-1.0.0-py3-none-any.whl (13.4 kB)

Uploaded Nov 3, 2025 Python 3

File details

Details for the file qdrant_vector_aggregator-1.0.0.tar.gz.

File metadata

Download URL: qdrant_vector_aggregator-1.0.0.tar.gz
Upload date: Nov 3, 2025
Size: 17.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for qdrant_vector_aggregator-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`7a592668ee32d61658e57837bddb9b25f44f3ba1e8e2612203c368c18578df7d`
MD5	`c6b588da0b0cf82958b3127accc11e80`
BLAKE2b-256	`5fbcc008c5d2351be595774f81492a1acd960023c455d50134a579f944808286`

File details

Details for the file qdrant_vector_aggregator-1.0.0-py3-none-any.whl.

File metadata

Download URL: qdrant_vector_aggregator-1.0.0-py3-none-any.whl
Upload date: Nov 3, 2025
Size: 13.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for qdrant_vector_aggregator-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f2d80cdf7fe9835d779c86f53f680fa99c0da3fc2dadc1d0b2f82418897c22cc`
MD5	`7bde35202037095f29d37f3594e20681`
BLAKE2b-256	`6932e3588e47c34806d7501bc0c27d85fb7ab2ceabcb605e15f63f47c1ffd40e`
