Skip to main content

Visualize and evaluate rank evolution across retrieval steps - rankflow plots, retrieval metrics, and pipeline comparison for RAG.

Project description

RankFlow

PyPI version Python versions License Downloads

See exactly how each document moves through your retrieval pipeline -- and measure whether re-ranking actually helps.

RankFlow visualizes and evaluates rank evolution across retrieval and re-ranking steps in RAG pipelines. Instead of staring at log files, you get a clear picture of what each pipeline stage does to your search results.

RankFlow

Why RankFlow?

  • Your re-ranker might be making things worse -- RankFlow shows per-step metrics (NDCG, MRR, Precision, Recall, MAP) so you know exactly which stage helps and which hurts.
  • You can't optimize what you can't see -- rank evolution plots reveal how documents shuffle between stages, which ones get dropped, and where relevant results fall off.
  • Evaluating one query isn't enough -- BatchRankFlow aggregates across hundreds of queries with dashboards, win/loss analysis, and automatic failure detection.
  • Hybrid search is a black box -- MergeRankFlow models branching pipelines (BM25 + vector search), showing overlap between branches and whether they're complementary or redundant.
  • Experiment tracking shouldn't require MLflow -- save, load, and statistically compare retrieval experiments with a file-based store. No database, no infra.

Quick Start

pip install rankflow
import numpy as np
from rankflow import RankFlow

rf = RankFlow(
    ranks=np.array([[3, 0, 5, 1, 2], [1, 2, 4, 0, 3], [0, 3, 6, 1, 2]]),
    step_labels=["BM25", "Semantic", "Cross-Encoder"],
    chunk_labels=["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"],
    relevant_chunks=["doc_a", "doc_c"],  # ground truth
    show_metrics=True,
)
rf.plot()

Relevant documents are highlighted, irrelevant ones fade out, and retrieval metrics are annotated at each step.

More examples

Density mode (100+ docs) Batch evaluation dashboard Hybrid pipeline (MergeRankFlow)
Density Dashboard Merge

Features

Visualization

Feature Description
Rank evolution plot Line plot showing how each document's rank changes across pipeline stages
Density mode Percentile bands for 100+ documents with focus lines for top-K and relevant docs
Interactive Plotly Hover tooltips with document name, rank, score, and delta
A/B comparison Side-by-side rendering of two pipelines
Source provenance Distinct markers/colors for text vs. vector vs. hybrid results
NaN handling Dashed lines for documents that appear or disappear between steps

Evaluation

Feature Description
Per-step metrics Precision@K, Recall@K, MRR, NDCG@K, MAP computed at every pipeline stage
Graded relevance Numeric relevance grades with colormap visualization
Rank deltas Annotate rank changes between consecutive steps
Per-chunk summary Initial/final rank, max gain, max loss, total displacement
Top-K filtering Focus on top-K documents (modes: any, initial, final)

Multi-Query Batch Evaluation

Feature Description
Aggregated metrics Mean/std across all queries per step
Metrics dashboard Box plots for all metrics in a single figure
Win/loss/tie analysis Count queries that improved or degraded at each transition
Difficulty segmentation Group queries by difficulty, compare metric evolution per bucket
Improvement heatmap Queries x transitions matrix colored by metric delta
Failure detection Automatically find queries where the pipeline made results worse

Hybrid Pipelines

Feature Description
DAG pipeline model MergeRankFlow for branching/merging pipelines (BM25 + vector + RRF)
Overlap analysis Shared vs. exclusive documents at merge points
Rank correlation Spearman rho between branches -- are they complementary or redundant?
Branch visualization Distinct line styles per source branch with legend

Experiment Tracking

Feature Description
Experiment registry ExperimentStore saves/loads experiments as JSON files -- no database needed
Pipeline config Attach pipeline_config dicts to RankFlow, preserved through export/import
Experiment comparison compare_experiments() with config diff, per-metric deltas, paired t-test p-values
Win/loss/tie Count queries that improved, degraded, or stayed the same
Regression detection Identify specific queries where the challenger is worse
Web UI Interactive Streamlit dashboard for browsing and comparing experiments

Ecosystem Integration

Format Import Export
TREC run/qrels RankFlow.from_trec_run() rf.to_trec_run()
RankFlow JSON RankFlow.from_rankflow_json() rf.to_rankflow_json()
ranx RankFlow.from_ranx() rf.to_ranx_run()
RAGAS RankFlow.from_ragas() --
pandas RankFlow(df=df) rf.to_dataframe()
dict / JSON -- rf.to_dict() / rf.to_json()

Examples

Metrics at every step

rf = RankFlow(
    ranks=ranks,
    step_labels=["BM25", "Semantic", "Cross-Encoder"],
    chunk_labels=chunk_labels,
    relevant_chunks=["doc_a", "doc_c"],
    relevance_grades={"doc_a": 3, "doc_c": 1},
)

for step, m in zip(rf.step_labels, rf.metrics(k=5)):
    print(f"{step}: NDCG@5={m['ndcg_at_k']:.3f}  MRR={m['mrr']:.3f}")

Batch evaluation across 100 queries

from rankflow import BatchRankFlow

batch = BatchRankFlow(rankflows)  # list of RankFlow objects
batch.plot_dashboard(k=10)        # box plots for all metrics

# Find queries where re-ranking hurt performance
for f in batch.failure_cases(metric="ndcg_at_k", k=10, threshold=-0.1):
    print(f"{f['query_label']}: {f['initial_value']:.3f} -> {f['final_value']:.3f}")

Density plot for large result sets

rf = RankFlow(
    ranks=ranks_150_docs,
    step_labels=step_labels,
    chunk_labels=chunk_labels,
    relevant_chunks=relevant,
    density_focus_k=10,
)
rf.plot(mode="density")

Compare two experiments

from rankflow import Experiment, ExperimentStore, compare_experiments

store = ExperimentStore("./experiments")
baseline = store.load("bm25-baseline")
challenger = store.load("cross-encoder-v1")

report = compare_experiments(baseline, challenger, k=10)
print(f"Win/Loss/Tie: {report.wins}W / {report.losses}L / {report.ties}T")

for metric, data in report.metric_deltas.items():
    sig = "*" if data["p_value"] < 0.05 else ""
    print(f"  {metric}: {data['delta']:+.3f} (p={data['p_value']:.3f}) {sig}")

Load from TREC format

rf = RankFlow.from_trec_run(
    ["bm25.run", "reranker.run"],
    qrels_path="qrels.txt",
    query_id="q1",
)
rf.plot()

Tutorials

Step-by-step notebooks in notebooks/:

  1. Quick Start -- creating plots from arrays and DataFrames, customizing appearance
  2. Metrics & Analysis -- relevance highlighting, IR metrics, rank deltas, top-K filtering
  3. Advanced Visualization -- A/B comparison, density plots, source provenance, Plotly
  4. Batch Evaluation -- multi-query aggregation, dashboards, win/loss, failure detection
  5. Adapters & Export -- TREC/RAGAS/ranx import, JSON export, hybrid pipeline modeling
  6. Experiments & Comparison -- experiment registry, statistical comparison, regression detection
  7. Web UI Walkthrough -- synthetic data generation, launching the Streamlit dashboard, page-by-page guide

Installation

pip install rankflow              # core (matplotlib + numpy)
pip install rankflow[pandas]      # + pandas support
pip install rankflow[interactive] # + Plotly interactive plots
pip install rankflow[all]         # everything

For adapter and UI dependencies (optional):

pip install rankflow[ranx]        # ranx IR evaluation library
pip install rankflow[ragas]       # RAGAS RAG evaluation framework
pip install rankflow[ui]          # Streamlit web UI

Web UI

Launch the interactive experiment explorer:

rankflow ui ./experiments

The UI provides four views: experiment list, comparison dashboard with config diff and significance tests, per-query explorer with drill-down, and full BatchRankFlow deep dive.

Further Reading

RankFlow plot for retriever visual evaluation -- a blog post describing practical usage of RankFlow for tuning RAG retrieval pipelines.

License

MIT © Krystian Safjan.


If RankFlow helps your retrieval work, consider giving it a ⭐ on GitHub.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rankflow-0.2.1.tar.gz (47.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

rankflow-0.2.1-py3-none-any.whl (47.4 kB view details)

Uploaded Python 3

File details

Details for the file rankflow-0.2.1.tar.gz.

File metadata

  • Download URL: rankflow-0.2.1.tar.gz
  • Upload date:
  • Size: 47.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rankflow-0.2.1.tar.gz
Algorithm Hash digest
SHA256 0ee2409a0c2a76284f788f7219ab3baf92ebc1507a29a428a21d326d697d64a7
MD5 cade548dddaf05694639ce6fc015e584
BLAKE2b-256 58a0d7b6e20dd0d601726db0d6d067a825be528d0f6462b0a55517021c6c90d2

See more details on using hashes here.

File details

Details for the file rankflow-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: rankflow-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 47.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rankflow-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 be1420782a452ce1dca51cd1b482317131ce87f68afd4d871983bee7971bfa4d
MD5 3e3983c2666610fd526cd38248d64c35
BLAKE2b-256 a803efc8750b73863c2845d928a5880208a1546f9217fcd9b883320a98ea0112

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page