Aggregate embeddings in Qdrant collections with smart content concatenation
Qdrant Vector Aggregator
A Python library for aggregating embeddings in Qdrant collections with smart content concatenation. Reduce your vector database size while maintaining semantic search quality and preserving complete document content.
Key Features
- 14 Aggregation Methods: Average, PCA, attention-based pooling, and more
- Smart Content Concatenation: Automatically detects chunk ordering and concatenates text in proper sequence
- Qdrant Cloud & Local Support: Works with both cloud and self-hosted instances
- Batch Processing: Efficient handling of large collections with progress tracking
- Flexible Grouping: Aggregate by any metadata field (document name, ID, category, etc.)
- Production Ready: Includes error handling, logging, and verification tools
What It Does
Transform chunked embeddings into document-level embeddings:
Input Collection (2,707 chunks)
├── Document A - Chunk 1 (embedding + text)
├── Document A - Chunk 2 (embedding + text)
├── Document A - Chunk 3 (embedding + text)
├── Document B - Chunk 1 (embedding + text)
└── ...
↓ Aggregate
Output Collection (114 documents)
├── Document A (averaged embedding + concatenated text)
├── Document B (averaged embedding + concatenated text)
└── ...
Result: 23.75x compression with preserved semantic meaning and complete document text!
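The transformation above can be sketched in plain Python. This is only an illustration of the averaging idea under simplified assumptions, not the library's implementation: the `average_by_document` helper and the toy `chunks` data are made up for the example, and real points would come from a Qdrant collection.

```python
from collections import defaultdict


def average_by_document(chunks):
    """Group (doc_name, vector) pairs by document and average each group element-wise."""
    groups = defaultdict(list)
    for doc_name, vector in chunks:
        groups[doc_name].append(vector)

    averaged = {}
    for doc_name, vectors in groups.items():
        count = len(vectors)
        # zip(*vectors) walks the vectors dimension by dimension.
        averaged[doc_name] = [sum(dim) / count for dim in zip(*vectors)]
    return averaged


# Hypothetical toy data: three chunks for Document A, one for Document B.
chunks = [
    ("Document A", [1.0, 0.0]),
    ("Document A", [0.0, 1.0]),
    ("Document A", [2.0, 2.0]),
    ("Document B", [4.0, 6.0]),
]
documents = average_by_document(chunks)

# Compression ratio = number of input points / number of output points.
toy_ratio = len(chunks) / len(documents)
readme_ratio = 2707 / 114  # the figures quoted above, roughly 23.75
```

The compression figure is simply the chunk count divided by the document count, so it depends entirely on how many chunks each document was split into.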
Quick Start
Installation
# Clone or download this repository
cd qdrant_vector_aggregator
# Install dependencies
pip install qdrant-client numpy scikit-learn python-dotenv
Configuration
- Copy the example environment file:
cp .env.example .env
- Edit .env with your Qdrant credentials:
QDRANT_URL=https://your-cluster.cloud.qdrant.io
QDRANT_API_KEY=your-api-key-here
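As a rough sketch of how these two variables might be consumed, the snippet below reads them from an environment-like mapping using only the standard library. The `load_qdrant_settings` helper is hypothetical and is not part of the package; in practice, python-dotenv's `load_dotenv()` would first copy the values from `.env` into `os.environ`.

```python
import os


def load_qdrant_settings(env=None):
    """Read the Qdrant URL and API key from an environment-like mapping."""
    env = os.environ if env is None else env
    url = env.get("QDRANT_URL", "").strip()
    if not url:
        raise ValueError("QDRANT_URL is not set; check your .env file")
    # Assumption: an empty key is allowed, e.g. for a local instance without auth.
    api_key = env.get("QDRANT_API_KEY", "").strip() or None
    return {"url": url, "api_key": api_key}


# Example with an explicit mapping instead of the real process environment.
settings = load_qdrant_settings(
    {
        "QDRANT_URL": "https://your-cluster.cloud.qdrant.io",
        "QDRANT_API_KEY": "your-api-key-here",
    }
)
```

Failing early on a missing URL tends to give a clearer error than a connection failure later on.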
Basic Usage
from qdrant_vector_aggregator import aggregate_embeddings
# Aggregate embeddings by document name
aggregate_embeddings(
input_collection_name="my_chunks_collection",
column_name="metadata.document_name", # Field to group by
output_collection_name="my_documents_collection",
method="average" # Aggregation method
)
Smart Content Concatenation
The aggregator automatically handles page_content concatenation:
How It Works
- Detects Ordering Fields: Checks for common ordering fields:
  - chunk_index, chunk_number, chunk_id
  - page, page_number, page_num
  - sequence, order, index, position
  - id (if sequential)
- Sorts & Concatenates: If an ordering field is found, sorts the chunks and concatenates their text in proper order
- Adds Metadata: Includes aggregation statistics:
  - chunk_count: Number of chunks aggregated
  - has_ordered_content: Whether content was concatenated
  - ordering_field: Which field was used for ordering
Example Result
{
"page_content": "Chapter 1...\n\nChapter 2...\n\nChapter 3...", # Concatenated in order
"metadata": {
"name": "Document Title",
"id": 12345
},
"chunk_count": 34,
"has_ordered_content": True,
"ordering_field": "metadata.id"
}
If no ordering field is found, page_content is set to an empty string.
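The detection and concatenation steps described in this section can be approximated as follows. This is a simplified sketch, not the package's code: the helper names are hypothetical, the lookup only handles top-level payload keys (the example result shows a nested `metadata.id`), the "id (if sequential)" check is omitted, and the `"\n\n"` separator is inferred from the example result.

```python
# Candidate fields, in the order listed above.
ORDERING_FIELDS = [
    "chunk_index", "chunk_number", "chunk_id",
    "page", "page_number", "page_num",
    "sequence", "order", "index", "position",
]


def detect_ordering_field(payloads):
    """Return the first candidate field that is numeric in every payload, else None."""
    for field in ORDERING_FIELDS:
        if all(isinstance(p.get(field), (int, float)) for p in payloads):
            return field
    return None


def concatenate_chunks(payloads, separator="\n\n"):
    """Build an aggregated payload from the chunk payloads of one document."""
    field = detect_ordering_field(payloads)
    if field is None:
        text = ""  # no ordering field: content is left empty
    else:
        ordered = sorted(payloads, key=lambda p: p[field])
        text = separator.join(p["page_content"] for p in ordered)
    return {
        "page_content": text,
        "chunk_count": len(payloads),
        "has_ordered_content": field is not None,
        "ordering_field": field,
    }


# Chunks stored out of order are put back in sequence before joining.
result = concatenate_chunks([
    {"page_content": "Chapter 2...", "chunk_index": 1},
    {"page_content": "Chapter 1...", "chunk_index": 0},
])
unordered = concatenate_chunks([{"page_content": "orphan chunk"}])
```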
Available Aggregation Methods
| Method | Description | Best For |
|---|---|---|
| average | Arithmetic mean (default) | General purpose, balanced |
| weighted_average | Weighted mean | When chunks have different importance |
| pca | Principal Component Analysis | Dimensionality reduction |
| centroid | K-Means centroid | Cluster-based aggregation |
| attentive_pooling | Attention-based pooling | Context-aware aggregation |
| max_pooling | Maximum values per dimension | Highlighting key features |
| min_pooling | Minimum values per dimension | Conservative aggregation |
| median | Element-wise median | Robust to outliers |
| trimmed_mean | Mean after trimming extremes | Outlier-resistant |
| geometric_mean | Geometric mean | Multiplicative relationships |
| harmonic_mean | Harmonic mean | Rate-based data |
| power_mean | Generalized mean | Flexible aggregation |
| soft_dtw | Soft Dynamic Time Warping | Sequence alignment |
| procrustes | Procrustes analysis | Shape-based alignment |
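To give a feel for how the simpler methods in the table differ, here is a NumPy sketch of five of them applied to the same group of chunk vectors. It is an illustration only: the `aggregate` function, its `trim_fraction` parameter, and the sample data are assumptions made for this example and may not match how `embedding_methods.py` implements them.

```python
import numpy as np


def aggregate(vectors, method="average", trim_fraction=0.25):
    """Collapse an (n_chunks, dim) array into a single (dim,) vector."""
    x = np.asarray(vectors, dtype=float)
    if method == "average":
        return x.mean(axis=0)
    if method == "max_pooling":
        return x.max(axis=0)
    if method == "min_pooling":
        return x.min(axis=0)
    if method == "median":
        return np.median(x, axis=0)
    if method == "trimmed_mean":
        if not 0 <= trim_fraction < 0.5:
            raise ValueError("trim_fraction must be in [0, 0.5)")
        # Sort each dimension, drop k values from both ends, average the rest.
        k = int(len(x) * trim_fraction)
        ordered = np.sort(x, axis=0)
        return ordered[k:len(x) - k].mean(axis=0)
    raise ValueError(f"unsupported method: {method}")


# Hypothetical group of four 2-dimensional chunk vectors with one outlier (100.0).
chunks = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, 40.0]])
mean_vec = aggregate(chunks, "average")
median_vec = aggregate(chunks, "median")
trimmed_vec = aggregate(chunks, "trimmed_mean")
```

In this toy data the outlier pulls the first dimension of the plain average far above the median and trimmed mean, which is the behaviour the "Robust to outliers" and "Outlier-resistant" entries refer to.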
Included Tools
1. Test Connection
python3 test_connection.py
Verifies Qdrant connection and displays available collections.
2. Aggregate Collections
python3 aggregate_conventions.py
Example script showing how to aggregate a collection.
3. Verify Aggregation
python3 verify_aggregation.py
Checks aggregation results and content concatenation statistics.
4. Debug Aggregation
python3 debug_aggregation.py
Detailed debugging information for troubleshooting.
Advanced Usage
Custom Aggregation
from qdrant_vector_aggregator import aggregate_embeddings
from qdrant_client.models import Distance
# PCA-based aggregation with custom settings
aggregate_embeddings(
input_collection_name="source_collection",
column_name="metadata.category",
output_collection_name="aggregated_collection",
method="pca",
distance_metric=Distance.COSINE,
qdrant_url="https://your-cluster.cloud.qdrant.io",
api_key="your-api-key"
)
Weighted Average
# Aggregate with custom weights (e.g., by chunk importance)
aggregate_embeddings(
input_collection_name="source_collection",
column_name="metadata.document_id",
output_collection_name="weighted_collection",
method="weighted_average",
weights=[0.5, 0.3, 0.2] # Weights for first 3 chunks
)
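For reference, a weighted average of one group of chunk vectors can be computed as in the sketch below. The `weighted_average` helper is hypothetical and assumes exactly one weight per chunk; how the library treats groups with more or fewer chunks than weights is not covered here.

```python
def weighted_average(vectors, weights):
    """Weighted element-wise mean; weights are normalized by their sum."""
    if len(vectors) != len(weights):
        raise ValueError("need exactly one weight per chunk vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(vectors[0])
    return [
        sum(w * v[i] for w, v in zip(weights, vectors)) / total
        for i in range(dim)
    ]


# Three hypothetical 2-dimensional chunk vectors with the weights used above.
doc_vector = weighted_average(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [0.5, 0.3, 0.2],
)
```

Dividing by the sum of the weights means they do not have to add up to 1 beforehand.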
Attention-Based Pooling
# Context-aware aggregation
aggregate_embeddings(
input_collection_name="source_collection",
column_name="metadata.document_id",
output_collection_name="attention_collection",
method="attentive_pooling"
)
Searching Aggregated Collections
from qdrant_client import QdrantClient
client = QdrantClient(url="your-url", api_key="your-key")
# Search the aggregated collection
results = client.search(
collection_name="aggregated_collection",
query_vector=your_query_embedding, # 1536-dim vector
limit=5
)
# Each result now represents a complete document
for result in results:
print(f"Document: {result.payload['metadata']['name']}")
print(f"Score: {result.score}")
print(f"Chunks: {result.payload['chunk_count']}")
print(f"Content: {result.payload['page_content'][:200]}...")
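To make the ranking step concrete, the following sketch scores a query against a handful of document-level vectors using cosine similarity, assuming the aggregated collection was created with a cosine distance metric. It runs entirely in memory with made-up vectors; `cosine_similarity` and `rank_documents` are illustrative helpers, not Qdrant or package APIs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query, documents, limit=5):
    """Return the `limit` best (name, score) pairs, highest score first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in documents.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


# Hypothetical aggregated vectors, one per document.
documents = {
    "Document A": [1.0, 0.0],
    "Document B": [0.6, 0.8],
    "Document C": [0.0, 1.0],
}
top = rank_documents([1.0, 0.0], documents, limit=2)
```

Because there is one vector per document, each hit corresponds to a whole document rather than to one of its chunks.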
Project Structure
qdrant_vector_aggregator/
├── .env                              # Your credentials (not in git)
├── .env.example                      # Template
├── .gitignore                        # Git ignore rules
├── README.md                         # This file
├── SETUP_INSTRUCTIONS.md             # Detailed setup guide
├── LICENSE                           # MIT License
├── setup.py                          # Installation script
│
├── qdrant_vector_aggregator/         # Main package
│   ├── __init__.py                   # Package initialization
│   ├── aggregator.py                 # Core aggregation logic
│   ├── config.py                     # Configuration management
│   ├── embedding_methods.py          # All 14 aggregation methods
│   ├── qdrant_collection_helpers.py  # Qdrant utilities
│   └── utils.py                      # Helper functions
│
├── test_connection.py                # Connection testing
├── example_usage.py                  # Usage examples
├── aggregate_conventions.py          # Working example
├── debug_aggregation.py              # Debugging tool
└── verify_aggregation.py             # Verification tool
Real-World Example
From the included aggregate_conventions.py:
from qdrant_vector_aggregator import aggregate_embeddings
# Aggregate 2,707 convention chunks into 114 documents
result = aggregate_embeddings(
input_collection_name="conventions_cadre_sectorielles",
column_name="metadata.metadata.name", # Nested field
output_collection_name="conventions_aggregated",
method="average"
)
# Results:
# Input: 2,707 chunks
# Output: 114 documents
# Compression: 23.75x
# Content: 100% concatenated in order
# Average length: 37,669 characters per document
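The nested `column_name` in this example is a dotted path into the point payload. One plausible way to resolve such a path is shown below; `get_nested` is a hypothetical helper written for illustration, and the sample payload is invented, so the library's own lookup may behave differently.

```python
def get_nested(payload, dotted_path, default=None):
    """Follow a dotted path such as 'metadata.metadata.name' through nested dicts."""
    current = payload
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical payload with the same nesting as the example above.
point_payload = {
    "page_content": "...",
    "metadata": {"metadata": {"name": "Convention X"}},
}
group_key = get_nested(point_payload, "metadata.metadata.name")
missing = get_nested(point_payload, "metadata.missing.name")
```

Points whose payload lacks the field resolve to the default here; whether the library skips such points or groups them together is not specified above.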
Troubleshooting
Connection Issues
# Test your connection
python3 test_connection.py
Timeout Errors
The aggregator uses batch processing (100 points per batch) to prevent timeouts. For very large collections, you can adjust the batch size in utils.py.
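The batching idea can be pictured with a small generator like the one below. It is a generic sketch rather than the code in utils.py: the `batched` helper and its `batch_size` parameter name are assumptions, with the default of 100 taken from the description above.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 2,707 hypothetical point IDs split into batches of 100.
point_ids = list(range(2707))
batch_sizes = [len(batch) for batch in batched(point_ids)]
```

A smaller batch size means more round trips but less data per request, which is the usual trade-off when tuning for timeouts.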
Content Not Concatenating
Run the verification tool to check:
python3 verify_aggregation.py
This will show:
- Which ordering field was detected (if any)
- How many documents have concatenated content
- Average content length
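The three statistics listed above could be computed from the aggregated payloads roughly as follows. This is an illustrative sketch, not the contents of verify_aggregation.py; `summarize_aggregation` and the sample payloads are hypothetical, while the payload keys match the aggregation metadata described earlier.

```python
def summarize_aggregation(payloads):
    """Summarize ordering fields, concatenation coverage, and average content length."""
    with_content = [p for p in payloads if p.get("has_ordered_content")]
    lengths = [len(p.get("page_content", "")) for p in with_content]
    fields = sorted({p["ordering_field"] for p in with_content if p.get("ordering_field")})
    return {
        "documents": len(payloads),
        "with_concatenated_content": len(with_content),
        "ordering_fields": fields,
        "average_content_length": sum(lengths) / len(lengths) if lengths else 0.0,
    }


# Two hypothetical aggregated documents: one concatenated, one left empty.
summary = summarize_aggregation([
    {"page_content": "abcd", "has_ordered_content": True, "ordering_field": "metadata.id"},
    {"page_content": "", "has_ordered_content": False, "ordering_field": None},
])
```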
Requirements
- Python 3.7+
- qdrant-client
- numpy
- scikit-learn
- python-dotenv
Contributing
Contributions are welcome! Feel free to:
- Add new aggregation methods
- Improve content concatenation logic
- Add more examples
- Report issues
License
MIT License - see LICENSE file for details.
Acknowledgments
Based on the original faiss_vector_aggregator project, adapted for Qdrant with enhanced features including smart content concatenation.
Repository
GitHub: qdrant_vector_aggregator
Support
For issues or questions:
- Check SETUP_INSTRUCTIONS.md for detailed setup help
- Run debug_aggregation.py for troubleshooting
- Review the example scripts for usage patterns
Made with ❤️ for the Qdrant community
File details
Details for the file qdrant_vector_aggregator-1.0.0.tar.gz.
File metadata
- Download URL: qdrant_vector_aggregator-1.0.0.tar.gz
- Upload date:
- Size: 17.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7a592668ee32d61658e57837bddb9b25f44f3ba1e8e2612203c368c18578df7d |
| MD5 | c6b588da0b0cf82958b3127accc11e80 |
| BLAKE2b-256 | 5fbcc008c5d2351be595774f81492a1acd960023c455d50134a579f944808286 |
File details
Details for the file qdrant_vector_aggregator-1.0.0-py3-none-any.whl.
File metadata
- Download URL: qdrant_vector_aggregator-1.0.0-py3-none-any.whl
- Upload date:
- Size: 13.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f2d80cdf7fe9835d779c86f53f680fa99c0da3fc2dadc1d0b2f82418897c22cc |
| MD5 | 7bde35202037095f29d37f3594e20681 |
| BLAKE2b-256 | 6932e3588e47c34806d7501bc0c27d85fb7ab2ceabcb605e15f63f47c1ffd40e |