FlashRank

Ultra lite & Super fast SoTA cross-encoder based re-ranking for your search & retrieval pipelines.

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Re-rank your search results with SoTA Pairwise or Listwise rerankers before feeding into your LLMs

Ultra-lite & Super-fast Python library to add re-ranking to your existing search & retrieval pipelines. It is based on SoTA LLMs and cross-encoders, with gratitude to all the model owners.

Supports:

- Pairwise / Pointwise rerankers (cross-encoder based, i.e. max tokens = 512).
- Listwise LLM based rerankers (LLM based, i.e. max tokens = 8192).
See below for full list of supported models.

Features
Installation
Making ranking faster
Getting started
Deployment patterns
How to Cite?
Papers citing flashrank

Features

⚡ Ultra-lite:
- No Torch or Transformers needed. Runs on CPU.
- Boasts the tiniest reranking model in the world, ~4MB.
⏱️ Super-fast:
- Rerank speed is a function of # of tokens in passages, query + model depth (layers)
- To give an idea, the time taken by the example (in code) using the default model is below.
- Detailed benchmarking, TBD
💸 $ conscious:
- Lowest $ per invocation: Serverless deployments like Lambda are charged by memory & time per invocation.
- Smaller package size = shorter cold start times, quicker re-deployments for Serverless.
🎯 Based on SoTA Cross-encoders and other models:
- "How good are Zero-shot rerankers?" - look at the reference section.

Model Name	Description	Size	Notes
`ms-marco-TinyBERT-L-2-v2`	Default model	~4MB	Model card
`ms-marco-MiniLM-L-12-v2`	Best Cross-encoder reranker	~34MB	Model card
`rank-T5-flan`	Best non cross-encoder reranker	~110MB	Model card
`ms-marco-MultiBERT-L-12`	Multi-lingual, supports 100+ languages	~150MB	Supported languages
`ce-esci-MiniLM-L12-v2`	Fine-tuned on Amazon ESCI dataset	-	Model card
`rank_zephyr_7b_v1_full`	4-bit-quantised GGUF	~4GB	Model card
`miniReranker_arabic_v1`	Only dedicated Arabic Reranker	-	Model card

Models in roadmap:
- InRanker
Why are sleeker models preferred? Reranking is the final leg of larger retrieval pipelines, so the idea is to avoid any extra overhead, especially in user-facing scenarios. To that end, models with a really small footprint that don't need any specialised hardware and yet offer competitive performance are chosen. Feel free to raise issues to add support for new models as you see fit.

Installation:

If you need lightweight pairwise rerankers [default]

pip install flashrank

If you need LLM based listwise rerankers

pip install flashrank[listwise]

Making ranking faster:

The max_length value should be large enough to accommodate your longest passage. In other words, if your longest passage (100 tokens) + query (16 tokens) pair comes to 116 tokens by estimate, then setting max_length = 128 is good enough, including room for reserved tokens like [CLS] and [SEP]. Use libraries like OpenAI's tiktoken to estimate token density if performance per token is critical for you. Nonchalantly giving a longer max_length like 512 for smaller passage sizes will negatively affect response time.
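The arithmetic above can be sketched as a small helper. This is only an illustration, not part of flashrank: the function name `suggest_max_length`, the rounding step of 32 and the 3 reserved tokens (one [CLS] and two [SEP] for a BERT-style query/passage pair) are assumptions, and the token counts are expected to come from whatever tokenizer or estimator you already use.

```python
import math


def suggest_max_length(query_tokens, passage_token_counts,
                       reserved_tokens=3, multiple=32, model_limit=512):
    """Suggest a max_length that covers the longest query + passage pair.

    Hypothetical helper, not a flashrank API. reserved_tokens=3 assumes
    a BERT-style pair encoding: [CLS] query [SEP] passage [SEP].
    """
    if not passage_token_counts:
        raise ValueError("passage_token_counts must not be empty")
    # Tokens needed for the worst-case pair, including special tokens.
    needed = query_tokens + max(passage_token_counts) + reserved_tokens
    # Round up to a tidy multiple (an arbitrary choice for this sketch).
    rounded = math.ceil(needed / multiple) * multiple
    # Never exceed what the cross-encoder itself supports.
    return min(rounded, model_limit)


# The example from the text: 100-token passage + 16-token query.
print(suggest_max_length(16, [100]))  # 128
```

The returned value could then be passed as `Ranker(max_length=...)`. Capping at 512 reflects the cross-encoder limit mentioned earlier; a listwise LLM reranker would need a different `model_limit`.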

Getting started:

from flashrank import Ranker, RerankRequest

# Nano (~4MB), blazing fast model & competitive performance (ranking precision).

ranker = Ranker(max_length=128)

or 

# Small (~34MB), slightly slower & best performance (ranking precision).
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")

or 

# Medium (~110MB), slower model with best zeroshot performance (ranking precision) on out of domain data.
ranker = Ranker(model_name="rank-T5-flan", cache_dir="/opt")

or 

# Medium (~150MB), slower model with competitive performance (ranking precision) for 100+ languages (don't use for English)
ranker = Ranker(model_name="ms-marco-MultiBERT-L-12", cache_dir="/opt")

or 

ranker = Ranker(model_name="rank_zephyr_7b_v1_full", max_length=1024) # adjust max_length based on your passage length

# Metadata is optional, Id can be your DB ids from your retrieval stage or simple numeric indices.
query = "How to speedup LLMs?"
passages = [
   {
      "id":1,
      "text":"Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
      "meta": {"additional": "info1"}
   },
   {
      "id":2,
      "text":"LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
      "meta": {"additional": "info2"}
   },
   {
      "id":3,
      "text":"There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
      "meta": {"additional": "info3"}

   },
   {
      "id":4,
      "text":"Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
      "meta": {"additional": "info4"}
   },
   {
      "id":5,
      "text":"vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels",
      "meta": {"additional": "info5"}
   }
]

rerankrequest = RerankRequest(query=query, passages=passages)
results = ranker.rerank(rerankrequest)
print(results)

# Reranked output from default reranker
[
   {
      "id":4,
      "text":"Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
      "meta":{
         "additional":"info4"
      },
      "score":0.016847236
   },
   {
      "id":5,
      "text":"vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels",
      "meta":{
         "additional":"info5"
      },
      "score":0.011563735
   },
   {
      "id":3,
      "text":"There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
      "meta":{
         "additional":"info3"
      },
      "score":0.00081340264
   },
   {
      "id":1,
      "text":"Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
      "meta":{
         "additional":"info1"
      },
      "score":0.00063596206
   },
   {
      "id":2,
      "text":"LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
      "meta":{
         "additional":"info2"
      },
      "score":0.00024851
   }
]
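Since the goal is to feed only the best passages into an LLM, a common next step is to trim the reranked list. The sketch below assumes results shaped like the output above (dicts with "id", "text" and "score"); `select_passages`, `top_k` and `min_score` are hypothetical names for illustration, and the sample texts are shortened.

```python
def select_passages(results, top_k=3, min_score=None):
    """Keep at most top_k passages, optionally dropping low-scoring ones.

    Hypothetical helper operating on the list of dicts shown above.
    """
    # Sort defensively so the helper also works on unsorted input.
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    if min_score is not None:
        ranked = [r for r in ranked if r["score"] >= min_score]
    return ranked[:top_k]


# Scores copied from the sample output; texts abbreviated for brevity.
sample_results = [
    {"id": 4, "text": "Medusa ...", "score": 0.016847236},
    {"id": 5, "text": "vLLM ...", "score": 0.011563735},
    {"id": 3, "text": "Llama 2 tips ...", "score": 0.00081340264},
    {"id": 1, "text": "Lookahead decoding ...", "score": 0.00063596206},
    {"id": 2, "text": "LLM inference efficiency ...", "score": 0.00024851},
]

best = select_passages(sample_results, top_k=5, min_score=0.001)
# Join the surviving passages into a context string for the LLM prompt.
context = "\n\n".join(p["text"] for p in best)
print([p["id"] for p in best])  # [4, 5]
```

A score threshold is model dependent, so any `min_score` value should be tuned on your own data rather than copied from this sketch.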

You can use it with any search & retrieval pipeline:

Lexical Search (regular DBs that support full-text search or an inverted index)

Semantic Search / RAG usecases (VectorDBs)

Hybrid Search
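Whatever the retrieval stage is, the glue code looks roughly the same: convert the hits into passage dicts, then rerank. The following is a minimal sketch under stated assumptions. `retrieve_then_rerank`, `to_passages`, `toy_retrieve` and `toy_rerank` are hypothetical names, and the toy term-overlap scorer merely stands in for a real reranker so the example runs without any dependencies.

```python
def to_passages(hits):
    """Convert (doc_id, text) pairs from any retriever into passage dicts."""
    # "meta" is optional, so only "id" and "text" are set here.
    return [{"id": doc_id, "text": text} for doc_id, text in hits]


def retrieve_then_rerank(query, retrieve_fn, rerank_fn, top_n=3):
    """Run a first-stage retriever, then rerank its hits and keep top_n."""
    passages = to_passages(retrieve_fn(query))
    return rerank_fn(query, passages)[:top_n]


def toy_retrieve(query):
    # Stand-in for a lexical, vector or hybrid search call.
    return [
        (1, "vLLM is a fast library for LLM inference"),
        (2, "Recipes for sourdough bread"),
        (3, "How to speedup LLM inference with quantization"),
    ]


def toy_rerank(query, passages):
    # Stand-in scorer: counts shared lowercase terms. In a real pipeline
    # this would wrap ranker.rerank(RerankRequest(query=query, passages=passages)).
    q_terms = set(query.lower().split())
    scored = [
        dict(p, score=len(q_terms & set(p["text"].lower().split())))
        for p in passages
    ]
    return sorted(scored, key=lambda p: p["score"], reverse=True)


top = retrieve_then_rerank("how to speedup LLM inference",
                           toy_retrieve, toy_rerank, top_n=2)
print([p["id"] for p in top])  # [3, 1]
```

Keeping the retriever and reranker as plain callables means the same function can sit behind a lexical, semantic or hybrid first stage without changes.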

Deployment patterns

How to use it in an AWS Lambda function?

In AWS Lambda or other serverless environments the entire VM is read-only, so you might have to create your own custom directory. You can do so in your Dockerfile and use it for loading the models (and eventually as a cache between warm calls). Point the ranker at it during init with the cache_dir parameter.

ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")
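To benefit from warm calls, the ranker is usually created once per container and reused across invocations. The sketch below illustrates that pattern only; the event shape (`query` and `passages` keys), the `factory` parameter and the helper names are assumptions for illustration, not something prescribed by flashrank or AWS.

```python
import json

# Module-level state survives between warm invocations of one container.
_RERANK = None


def get_reranker(factory):
    """Build the reranker on the first call, then reuse it."""
    global _RERANK
    if _RERANK is None:
        _RERANK = factory()
    return _RERANK


def flashrank_factory():
    # Imported lazily so the module loads even where flashrank is absent.
    from flashrank import Ranker, RerankRequest
    ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")
    return lambda query, passages: ranker.rerank(
        RerankRequest(query=query, passages=passages))


def handler(event, context, factory=flashrank_factory):
    """Lambda-style entry point; the event keys are hypothetical."""
    rerank = get_reranker(factory)
    results = rerank(event["query"], event["passages"])
    # default=float is a guard in case scores are not native Python floats.
    return {"statusCode": 200, "body": json.dumps(results, default=float)}
```

The `factory` argument exists only so the caching logic can be exercised with a stub; in a deployed function the default would be used and the model would load during the first (cold) invocation.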

References:

In-domain and Zeroshot performance of Cross Encoders fine-tuned on MS-MARCO

In-domain and Zeroshot performance of RankT5 fine-tuned on MS-MARCO

How to Cite?

To cite this repository in your work, please click the "cite this repository" link on the right side (below the repo description and tags).

Papers citing flashrank

[IMPORTANT UPDATE]

~~A clone library called SwiftRank is pointing to our model buckets; we are working on an interim solution to avoid this stealing. Thank you for your patience and understanding.~~

This issue is resolved; the models are on HF now. Please upgrade to continue: pip install -U flashrank. Thank you for your patience and understanding.


Release history

Version	Release date
0.2.10 (this version)	Jan 6, 2025
0.2.9	Aug 14, 2024
0.2.8	Jul 24, 2024
0.2.7	Jul 24, 2024
0.2.6	Jun 20, 2024
0.2.5	May 14, 2024
0.2.4	Apr 30, 2024
0.2.3	Apr 30, 2024
0.2.2	Apr 30, 2024
0.2.1	Apr 30, 2024
0.2.0	Mar 19, 2024
0.1.69	Mar 17, 2024
0.1.68	Mar 17, 2024
0.1.67	Mar 17, 2024
0.1.66	Jan 29, 2024
0.1.65	Jan 29, 2024
0.1.64	Dec 13, 2023
0.1.63	Dec 13, 2023
0.1.62	Dec 13, 2023
0.1.61	Dec 13, 2023
0.1.6	Dec 13, 2023
0.1.5	Dec 12, 2023
0.1.4	Dec 10, 2023
0.1.3	Dec 7, 2023
0.1.2	Dec 7, 2023
0.1.1	Dec 5, 2023

Download files

Source Distribution

FlashRank-0.2.10.tar.gz (18.9 kB)

Uploaded Jan 6, 2025 Source

Built Distribution

FlashRank-0.2.10-py3-none-any.whl (14.5 kB)

Uploaded Jan 6, 2025 Python 3

File details

Details for the file FlashRank-0.2.10.tar.gz.

File metadata

Download URL: FlashRank-0.2.10.tar.gz
Upload date: Jan 6, 2025
Size: 18.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for FlashRank-0.2.10.tar.gz
Algorithm	Hash digest
SHA256	`f8f82a25c32fdfc668a09dc4089421d6aab8e7f71308424b541f40bb3f01d9db`
MD5	`2fb36fadd38247e330f6cc2823b07e12`
BLAKE2b-256	`551f176cb4a857a70c3538f637e19389ab6aed21548a1ba1d1424fccc8bba108`


File details

Details for the file FlashRank-0.2.10-py3-none-any.whl.

File metadata

Download URL: FlashRank-0.2.10-py3-none-any.whl
Upload date: Jan 6, 2025
Size: 14.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for FlashRank-0.2.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5d3272ae657d793c132d1e7917ed9e2adf49e0e1c60735583a67b051c6f0434a`
MD5	`b9825caedd9560cbc00b90773785593a`
BLAKE2b-256	`ec9972639cc1c9221c5bc77a2df1c2d352fe11965553bdf7d3e0856e7fcc8fd6`

