
Ultra-lite & super-fast SoTA cross-encoder based re-ranking for your search & retrieval pipelines.

Project description

🏎️ FlashRank

Ultra-lite & Super-fast Python library to add re-ranking to your existing search & retrieval pipelines. It is based on SoTA cross-encoders.

  1. Ultra-lite:

    • No Torch or Transformers needed. Runs on CPU.
    • Models as small as ~17MB.
  2. ⏱️ Super-fast:

    • Rerank speed is a function of the number of tokens in the passages and the query, plus the model depth (number of layers).
    • To give an idea of latency with the default model, you can time the usage example below on your own hardware; see the timing sketch right after this list.
    • Detailed benchmarking: TBD.
  3. 💸 $-conscious:

    • Lowest $ per invocation: serverless deployments like Lambda are charged by memory & time per invocation*
    • Smaller package size = fewer cold starts, quicker re-deployments.
  4. 🎯 Based on SoTA Cross-encoders:

    • Below is the list of models currently supported:
      • ms-marco-TinyBERT-L-2-v2 (default)
      • ms-marco-MiniLM-L-12-v2
    • Why only sleeker models? Reranking is the final leg of larger retrieval pipelines, and the idea is to avoid any extra overhead, especially in user-facing scenarios. To that end, models with a really small footprint that need no specialised hardware and yet offer competitive performance are chosen. Feel free to raise an issue to request support for a new model as you see fit.
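
Until detailed benchmarks are published, a quick way to get a latency number for your own hardware is to time the rerank call directly. The snippet below is a rough, illustrative sketch (the passages are shortened stand-ins and the numbers depend entirely on your CPU); only Ranker() and rerank() are FlashRank's API.

import time
from flashrank.Ranker import Ranker

ranker = Ranker()  # default ms-marco-TinyBERT-L-2-v2
query = "Tricks to accelerate LLM inference"
passages = ["vLLM is a fast and easy-to-use library for LLM inference and serving."] * 5

ranker.rerank(query, passages)   # warm-up: the first call also pays the model-load cost
start = time.perf_counter()
ranker.rerank(query, passages)
print(f"rerank latency: {(time.perf_counter() - start) * 1000:.1f} ms")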

🚀 Installation:

pip install flashrank

Usage:

from flashrank.Ranker import Ranker
# Default blazing fast model and competitive performance.
ranker = Ranker()

# or, for a larger (slower) model with slightly better performance:
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")
query = "Tricks to accelerate LLM inference"
passages = [
    "Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
    "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
    "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face.  This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second.  - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint.  - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run. ",
    "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup. ",
    "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels"
]
results = ranker.rerank(query, passages)
print(results)
[{'score': 0.99806124, 'passage': 'Introduce *lookahead decoding*: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.'}, 
{'score': 0.95966834, 'passage': "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face.  This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call `model.to_bettertransformer()` on your Hugging Face model for a modest improvement in tokens per second.  - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint.  - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run. "}, 
{'score': 0.620731, 'passage': 'vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels'}, 
{'score': 0.56146526, 'passage': 'LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper'}, 
{'score': 0.098350815, 'passage': 'Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup. '}]
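
The reranked list comes back sorted by descending score (as the output above shows), so downstream use is just a slice. For example, to keep only the strongest hits as context for an LLM prompt, a common RAG pattern:

top_k = 3
context = "\n\n".join(hit["passage"] for hit in results[:top_k])
print(context)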

You can use it with any search & retrieval pipeline; a sketch of plugging it into a hybrid setup follows the list:

  1. Lexical Search (regular DBs that support full-text search or an inverted index)

  2. Semantic Search / RAG use cases (vector DBs)

  3. Hybrid Search
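
As an illustration of the hybrid case, the sketch below fans out to a lexical and a vector retriever, de-duplicates the candidates, and lets FlashRank decide the final order. bm25_search and vector_search are hypothetical placeholders for whatever retrievers your stack provides; only the Ranker calls are FlashRank's API.

from flashrank.Ranker import Ranker

ranker = Ranker()  # default model, as in the usage section above

def hybrid_search(query, top_k=5):
    # bm25_search / vector_search are hypothetical stand-ins for your own retrievers
    # (e.g. a full-text index and a vector DB); each returns a list of passage strings.
    candidates = bm25_search(query, limit=25) + vector_search(query, limit=25)
    candidates = list(dict.fromkeys(candidates))   # de-duplicate while keeping order
    reranked = ranker.rerank(query, candidates)    # cross-encoder scores, sorted high to low
    return reranked[:top_k]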

Deployment patterns

How to use it in an AWS Lambda function?

  • TBD (a rough sketch is included below in the meantime)
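
While the official guide is TBD, the handler below is one plausible shape rather than an endorsed pattern: the model is loaded once at import time so warm invocations reuse it, and the event body is assumed to carry the query and passages as JSON. The cache directory and the event shape are assumptions; adapt them to how you package the model (e.g. a Lambda layer mounted at /opt, as in the usage example, or a runtime download into the writable /tmp).

# handler.py -- illustrative sketch only
import json
from flashrank.Ranker import Ranker

# Loaded outside the handler so warm invocations skip initialisation.
# Use cache_dir="/opt" if the model ships inside a Lambda layer instead.
ranker = Ranker(model_name="ms-marco-TinyBERT-L-2-v2", cache_dir="/tmp")

def lambda_handler(event, context):
    body = json.loads(event["body"])  # assumed payload: {"query": "...", "passages": ["...", ...]}
    results = ranker.rerank(body["query"], body["passages"])
    return {"statusCode": 200, "body": json.dumps(results)}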

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

FlashRank-0.1.2.tar.gz (9.5 kB)

Uploaded Source

Built Distribution

FlashRank-0.1.2-py3-none-any.whl (9.8 kB)

Uploaded Python 3

File details

Details for the file FlashRank-0.1.2.tar.gz.

File metadata

  • Download URL: FlashRank-0.1.2.tar.gz
  • Upload date:
  • Size: 9.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for FlashRank-0.1.2.tar.gz
  • SHA256: 8da646d9400a063ded6395c9c9172b31d66abc442e3cef4abab542b909090169
  • MD5: afac921ce93f39160ced2957509df910
  • BLAKE2b-256: 092c820e62fd5dd90483b7a75406d953f298ba57f934583f5189362e93fd877b


File details

Details for the file FlashRank-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: FlashRank-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 9.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for FlashRank-0.1.2-py3-none-any.whl
  • SHA256: 276ba6d531fb833d181921479cea6d502406d5cac3c009639aec5f8f6dae94f1
  • MD5: 2af75fb74c8069051626cd0ae5c473ba
  • BLAKE2b-256: 98c3843eeaa5988f245eee6a3cf6f9369b3736eafb3793675169b119867b05a7

