
AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark

Project description

Motivation | Features | Documentation | Leaderboard | Citing

☁️ Motivation

Evaluation is crucial for the development of information retrieval models. In recent years, a series of milestone works have been introduced to the community, such as MS MARCO, Natural Questions (open-domain QA), MIRACL (multilingual retrieval), BEIR and MTEB (general-domain zero-shot retrieval). However, the existing benchmarks are severely limited in the following respects.

  • Inability to deal with new domains. All of the existing benchmarks are static: they are built for pre-defined domains based on human-labeled data. As a result, they cannot cover new domains that users are interested in.
  • Potential risk of over-fitting and data leakage. Existing retrievers are intensively fine-tuned to achieve strong performance on popular benchmarks like BEIR and MTEB. Although these benchmarks were originally designed for zero-shot, out-of-domain (OOD) evaluation, their in-domain training data is widely used during fine-tuning. Worse still, because the existing evaluation datasets are publicly available, the test data can accidentally leak into a retriever's training set.

☁️ Features

  • 🤖 Automated. The testing data is automatically generated by large language models without human intervention. As a result, new domains can be supported almost instantly and at very low cost. Moreover, the newly generated testing data is very unlikely to be covered by the training sets of any existing retriever.
  • 🔍 Retrieval and RAG-oriented. The new benchmark is dedicated to the evaluation of retrieval performance. In addition to typical evaluation scenarios, such as open-domain question answering or paraphrase retrieval, it also incorporates a new setting called inner-document retrieval, which is closely related to today's LLM and RAG applications. In this setting, the model is expected to retrieve the relevant chunks of a very long document, i.e. the chunks that contain the information needed to answer the input question (a minimal sketch of this setting follows the list below).
  • 🔄 Heterogeneous and Dynamic. The testing data is generated for diverse and continually expanding domains and languages (i.e. multi-domain and multilingual). As a result, AIR-Bench provides an increasingly comprehensive evaluation benchmark for community developers.
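
To make the inner-document retrieval setting concrete, here is a minimal, hypothetical sketch (not AIR-Bench's data generation or evaluation code): a long document is split into overlapping chunks, the chunks and the question are embedded, and the chunks are ranked by cosine similarity. The embedding model name and the chunking parameters are illustrative assumptions only.

# Minimal illustration of the inner-document retrieval setting (not AIR-Bench's actual code).
# Assumptions: sentence-transformers is installed; the model name is only an example.
from sentence_transformers import SentenceTransformer, util

def chunk(text, size=200, overlap=50):
    # Split a long document into overlapping character-level chunks.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

long_document = "..."  # the very long document to search within
question = "..."       # the input question

chunks = chunk(long_document)
chunk_emb = model.encode(chunks, convert_to_tensor=True)
query_emb = model.encode(question, convert_to_tensor=True)

# Rank chunks by cosine similarity and keep the top-k as retrieval candidates.
scores = util.cos_sim(query_emb, chunk_emb)[0]
top_k = scores.topk(k=min(5, len(chunks)))
for score, idx in zip(top_k.values, top_k.indices):
    print(f"{score:.3f}  {chunks[idx][:80]}")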

☁️ Versions

We plan to release new test datasets on a regular basis. The latest version is AIR-Bench_24.05.

| Version | Release Date | # of domains | # of languages | # of datasets | Details |
| --- | --- | --- | --- | --- | --- |
| AIR-Bench_24.05 | Oct 17, 2024 | 9 [1] | 13 [2] | 69 | here |
| AIR-Bench_24.04 | May 21, 2024 | 8 [3] | 2 [4] | 28 | here |

[1] wiki, web, news, healthcare, law, finance, arxiv, book, science.

[2] en, zh, es, fr, de, ru, ja, ko, ar, fa, id, hi, bn (English, Chinese, Spanish, French, German, Russian, Japanese, Korean, Arabic, Persian, Indonesian, Hindi, Bengali).

[3] wiki, web, news, healthcare, law, finance, arxiv, book.

[4] en, zh (English, Chinese).

For the differences between versions, please refer to here.

☁️ Results

You can check out the results on the AIR-Bench Leaderboard. Detailed results are available in eval_results.

Some brief analysis results are available here. The technical report is coming soon. Please stay tuned for updates!

☁️ Usage

Installation

This repo maintains the codebase for running AIR-Bench evaluations. To run an evaluation, please install air-benchmark:

pip install air-benchmark
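
A quick way to confirm that the installation succeeded, using only the standard library (a minimal sketch; it checks the installed distribution's version rather than any package-specific API):

# Sanity-check that air-benchmark is installed, without touching its API.
from importlib.metadata import version, PackageNotFoundError

try:
    print("air-benchmark version:", version("air-benchmark"))
except PackageNotFoundError:
    print("air-benchmark is not installed; run `pip install air-benchmark` first.")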

Evaluations

Follow the steps below to run evaluations and submit the results to the leaderboard (see here for more detailed information).

  1. Run evaluations

    • See the scripts to run evaluations on AIR-Bench for your models.
  2. Submit search results (Only for test set)

    • Package the output files

      • For results without a reranking model:
      cd scripts
      python zip_results.py \
      --results_dir search_results \
      --retriever_name [YOUR_RETRIEVAL_MODEL] \
      --save_dir search_results
      
      • For results with a reranking model:
      cd scripts
      python zip_results.py \
      --results_dir search_results \
      --retriever_name [YOUR_RETRIEVAL_MODEL] \
      --reranker_name [YOUR_RERANKING_MODEL] \
      --save_dir search_results
      
    • Upload the output .zip and fill in the model information at AIR-Bench Leaderboard
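
Before uploading, it can help to check that the packaged archive is intact and actually contains your search results. Below is a minimal sketch using only the standard library; the archive path is a hypothetical example, since the exact file name written by zip_results.py depends on your model names and --save_dir.

# List the contents of the packaged search results before uploading.
# NOTE: the archive path below is a hypothetical example; use the .zip file
# that zip_results.py actually wrote into your --save_dir.
import zipfile

archive = "search_results/YOUR_RETRIEVAL_MODEL.zip"

with zipfile.ZipFile(archive) as zf:
    print("first corrupt member (None means OK):", zf.testzip())
    for info in zf.infolist():
        print(f"{info.file_size:>10}  {info.filename}")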

☁️ Documentation

| Documentation | |
| --- | --- |
| 🏭 Pipeline | The data generation pipeline of AIR-Bench |
| 📋 Tasks | Overview of available tasks in AIR-Bench |
| 📈 Leaderboard | The interactive leaderboard of AIR-Bench |
| 🚀 Submit | Information related to how to submit a model to AIR-Bench |
| 🤝 Contributing | How to contribute to AIR-Bench |

☁️ Acknowledgement

This work is inspired by MTEB and BEIR. Many thanks for the early feedback from @tomaarsen, @Muennighoff, @takatost, @chtlp.

☁️ Citing

The technical report is coming soon. Please stay tuned for updates!


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

air_benchmark-0.1.0.tar.gz (38.7 kB view details)

Uploaded Source

Built Distribution

air_benchmark-0.1.0-py3-none-any.whl (48.0 kB view details)

Uploaded Python 3

File details

Details for the file air_benchmark-0.1.0.tar.gz.

File metadata

  • Download URL: air_benchmark-0.1.0.tar.gz
  • Upload date:
  • Size: 38.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.12

File hashes

Hashes for air_benchmark-0.1.0.tar.gz
Algorithm Hash digest
SHA256 6cd40c86d03ed7ba805a934582911fa39afed6269290ef0bab71efcd43ede137
MD5 a64719d1e6b9db509ed437e7a150da1e
BLAKE2b-256 df15dce34be9b2f304880bf132a4da27eff51a71778cd75ff94352cf16cfbeb9

See more details on using hashes here.
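
If you download the sdist manually, you can verify it against the SHA256 digest listed above. This is a minimal sketch using only the standard library; the file name assumes the download kept its original name and sits in the current directory.

# Verify a downloaded file against the SHA256 digest published above.
import hashlib

EXPECTED_SHA256 = "6cd40c86d03ed7ba805a934582911fa39afed6269290ef0bab71efcd43ede137"

def sha256_of(path):
    # Hash the file in blocks so large files do not need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(8192), b""):
            h.update(block)
    return h.hexdigest()

digest = sha256_of("air_benchmark-0.1.0.tar.gz")  # assumes the file is in the current directory
print("OK" if digest == EXPECTED_SHA256 else f"MISMATCH: {digest}")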

File details

Details for the file air_benchmark-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for air_benchmark-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e619d3cfe1d9a5a434e9fad8e9dba3f0a569961e66a6a4091e49ab13e4d4f37f
MD5 be86ebcb9c504abb1e5a0f15df424ae1
BLAKE2b-256 e48b6e4732d2367a63c2cf36e7e1b2703c7f11d63232f2e80ad2a6a51e0160ec

See more details on using hashes here.
