A library for aggregating and re-ranking information from multi-source federated retrieval systems
Project description
Federated Aggregation
A Python library for aggregating and re-ranking information from multi-source federated retrieval systems. Combine search results from multiple sources with different embedding models and ranking systems into a unified, ranked output.
Installation
pip install federated_aggregation
Or install from source:
git clone https://github.com/your-repo/federated_aggregation.git
cd federated_aggregation
pip install -r requirements.txt
pip install .
Dependencies
numpy- Numerical computingfastembed- Fast embedding model inferencescikit-learn- Machine learning utilitiesrank-bm25- BM25 ranking algorithm
Quick Start
from federated_aggregation import Aggregate
# Initialize the aggregator
aggregator = Aggregate()
# Your retrieved results from multiple sources
retrieved_nodes = {
"source_1": {
"sources": [
{"document": {"content": "Document text..."}, "score": 0.95},
{"document": {"content": "Another document..."}, "score": 0.87},
],
"query_embedding": [...], # Optional, required for procrustes
"embedding_model_name": "BAAI/bge-small-en-v1.5",
"similarity_metric": "cosine"
},
"source_2": {
"sources": [...],
"query_embedding": [...],
...
}
}
# Perform aggregation using method constants (recommended)
results = aggregator.perform_aggregation(
query="your search query",
retrieved_nodes=retrieved_nodes,
method=Aggregate.PROCRUSTES, # or a list: [Aggregate.PROCRUSTES, Aggregate.RRP_BM25]
top_k=10
)
# Access results
for doc in results["procrustes"]["reranked_nodes"]:
print(f"Score: {doc['score']}, Content: {doc['document']['content'][:100]}")
print(f"Time taken: {results['procrustes']['time_taken']:.3f}s")
Method Constants
Use class-level constants to specify aggregation methods (similar to OpenCV's cv2.COLOR_BGR2GRAY style):
| Constant | Value | Description |
|---|---|---|
Aggregate.CENTRAL_REEMBEDDING |
"central_re_embedding" |
Central re-embedding method |
Aggregate.RRP_BM25 |
"rrp_bm25" |
BM25 re-ranking method |
Aggregate.NAIVE_TOPK |
"naive_topk" |
Naive top-k aggregation |
Aggregate.PROCRUSTES |
"procrustes" |
Procrustes alignment method |
# Recommended: Use constants
results = aggregator.perform_aggregation(query, nodes, method=Aggregate.PROCRUSTES)
# Multiple methods
results = aggregator.perform_aggregation(
query, nodes,
method=[Aggregate.PROCRUSTES, Aggregate.RRP_BM25, Aggregate.CENTRAL_REEMBEDDING]
)
Aggregation Methods
1. Central Re-Embedding (central_re_embedding)
Re-embeds all retrieved documents using a unified central embedding model and ranks by cosine similarity to the query.
Best for: When you don't trust the original embeddings from federated sources or want a consistent embedding space.
results = aggregator.perform_aggregation(
query="search query",
retrieved_nodes=retrieved_nodes,
method=Aggregate.CENTRAL_REEMBEDDING,
top_k=10,
model_name="BAAI/bge-small-en-v1.5", # FastEmbed compatible model
device="cpu" # or "cuda"
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
top_k |
int | 10 | Number of top results to return |
model_name |
str | "BAAI/bge-small-en-v1.5" | FastEmbed model identifier |
device |
str | "cpu" | Inference device ("cpu" or "cuda") |
How it works:
- Initializes a FastEmbed TextEmbedding model
- Computes fresh embeddings for all document texts
- Computes query embedding
- Ranks documents by cosine similarity between query and document embeddings
2. BM25 Re-Ranking (rrp_bm25)
Uses the BM25 algorithm (Okapi variant) for lexical/keyword-based ranking of aggregated documents.
Best for: Text-heavy documents where keyword matching is important; works without relying on embeddings.
results = aggregator.perform_aggregation(
query="search query",
retrieved_nodes=retrieved_nodes,
method=Aggregate.RRP_BM25,
top_k=10
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
top_k |
int | 10 | Number of top results to return |
How it works:
- Tokenizes all documents and query by whitespace
- Builds a BM25Okapi model on the document corpus
- Scores all documents against the query
- Returns top-k documents by BM25 score
3. Procrustes Alignment (procrustes)
Aligns embedding spaces from different sources to a common anchor space using orthogonal Procrustes transformation. Enables meaningful comparison across sources that use different embedding models.
Best for: Federated systems where sources use different embedding models with potentially different dimensionalities.
results = aggregator.perform_aggregation(
query="search query",
retrieved_nodes=retrieved_nodes,
method=Aggregate.PROCRUSTES,
top_k=10,
apply_scaling=True # Normalize scores per source
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
top_k |
int | 10 | Number of top results to return |
apply_scaling |
bool | True | Apply min-max normalization per source |
How it works:
- Selects a random source as the anchor embedding space
- For each other source, computes an orthogonal Procrustes transformation matrix
- Projects all document embeddings into the anchor space
- Computes cosine similarity with the anchor query embedding
- Optionally applies min-max scaling per source for fair comparison
- Returns globally ranked top-k documents
Requirements: Each source must include query_embedding for alignment computation.
4. Naive Top-K (naive_topk)
Simple aggregation that sorts all documents by their original confidence scores.
Best for: Baseline comparison; quick aggregation when original scores are trustworthy.
results = aggregator.perform_aggregation(
query="search query",
retrieved_nodes=retrieved_nodes,
method=Aggregate.NAIVE_TOPK,
top_k=10
)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
top_k |
int | 10 | Number of top results to return |
How it works:
- Collects all documents from all sources
- Sorts by the existing
scorefield (descending) - Returns top-k documents
Running Multiple Methods
Compare different aggregation strategies by passing a list of methods:
results = aggregator.perform_aggregation(
query="search query",
retrieved_nodes=retrieved_nodes,
method=[Aggregate.PROCRUSTES, Aggregate.RRP_BM25, Aggregate.CENTRAL_REEMBEDDING, Aggregate.NAIVE_TOPK],
top_k=10
)
# Compare results
for method_name, result in results.items():
print(f"\n{method_name}:")
print(f" Time: {result['time_taken']:.3f}s")
print(f" Top result: {result['reranked_nodes'][0]['document']['content'][:50]}...")
Input Data Format
The retrieved_nodes parameter expects the following structure:
{
"source_name": {
"sources": [
{
"document": {
"content": "The document text content...",
# ... other document fields
},
"document_embedding": [0.1, 0.2, ...], # Optional
"score": 0.95 # Required for naive_topk
},
# ... more documents
],
"query_embedding": [0.1, 0.2, ...], # Required for procrustes
"document_embeddings": [[...], [...]], # Optional
"embedding_model_name": "model-name",
"similarity_metric": "cosine"
},
# ... more sources
}
Output Format
All methods return a dictionary with:
{
"method_name": {
"reranked_nodes": [
{
"document": {...},
"score": 0.95,
# Additional fields depending on method
},
# ... top_k documents
],
"time_taken": 0.123 # Execution time in seconds
}
}
Method Comparison
| Method | Embedding Required | Handles Multi-Dimensional | Speed | Use Case |
|---|---|---|---|---|
central_re_embedding |
No (re-computes) | Yes | Slow | Unified embedding space |
rrp_bm25 |
No | N/A | Fast | Keyword-based ranking |
procrustes |
Yes (with query) | Yes | Medium | Cross-model alignment |
naive_topk |
No | N/A | Fastest | Baseline/quick aggregation |
License
MIT License
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file federated_aggregation-0.1.0.tar.gz.
File metadata
- Download URL: federated_aggregation-0.1.0.tar.gz
- Upload date:
- Size: 11.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b2a098a66aa19d7fdf23594cdcf6b4ae037d7f5191878b469d5aa1f5b93a7251
|
|
| MD5 |
15f3ad1a2910a4633ffb6fd2e02f666a
|
|
| BLAKE2b-256 |
c19cd9aa4221146bb9fce80806770a9c8bdd91e544bdc375a0562a7ea1bc8f1b
|
File details
Details for the file federated_aggregation-0.1.0-py3-none-any.whl.
File metadata
- Download URL: federated_aggregation-0.1.0-py3-none-any.whl
- Upload date:
- Size: 10.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
00bbbb278bf79d458d8b1ce03b4b93c84d3d2d81442207900617a762d0239ed8
|
|
| MD5 |
21ee77131e950bb4618cbbfc2a7d2a45
|
|
| BLAKE2b-256 |
ff6b6c8c2e85a06368b17d5fda9c0fa3e64ce4913f02fb624b675b8c2870281e
|