
Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases

Project description

STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases

License: MIT

NEWS

  • [Oct 2024] Leaderboard: We have constructed the official leaderboard on Hugging Face and are calling for submissions!
  • [Oct 2024] The camera-ready paper is out! We added multiple retrieval models, including BM25, Colbertv2, and GritLM.
  • [Sep 2024] STaRK is accepted to the NeurIPS 2024 Datasets & Benchmarks Track!
  • [Jun 2024] We released our benchmark as a pip package, stark-qa. You can now load the data directly from the package!
  • [Jun 2024] We migrated our data to Hugging Face! You don't need to change anything; the data will be downloaded automatically.
  • [May 2024] We have augmented our benchmark with three high-quality human-generated query datasets, which are openly accessible. See more details in our updated arXiv paper!
  • [May 9th 2024] We released the STaRK SKB Explorer, an interactive interface for exploring our knowledge bases!
  • [May 7th 2024] We presented STaRK at the 2024 Stanford Annual Affiliates Meeting and the 2024 Stanford Data Science Conference.
  • [May 5th 2024] STaRK was covered by Marketpost and the BAAI community (智源社区). Thanks for writing about our work!
  • [Apr 21st 2024] We released the STaRK benchmark.

What is STaRK?

STaRK is a large-scale Semi-structured Retrieval Benchmark on Textual and Relational Knowledge bases, covering applications in product search, academic paper search, and biomedical inquiries.

Featuring diverse, natural-sounding, and practical queries that require context-specific reasoning, STaRK sets a new standard for assessing real-world retrieval systems driven by LLMs and presents significant challenges for future research.

🔥 Check out our website for a full overview!

Access benchmark data

1) Env Setup

From pip (recommended)

With Python >=3.8 and <3.12:

pip install stark-qa

From source

Create a conda environment with Python >=3.8 and <3.12 and install the required packages from requirements.txt.

conda create -n stark python=3.11
conda activate stark
pip install -r requirements.txt

2) Data loading

from stark_qa import load_qa, load_skb

dataset_name = 'amazon'

# Load the retrieval dataset
qa_dataset = load_qa(dataset_name)
idx_split = qa_dataset.get_idx_split()

# Load the semi-structured knowledge base
skb = load_skb(dataset_name, download_processed=True, root=None)

The root argument of load_skb specifies where the SKB data is stored. With the default value None, the data is stored in the Hugging Face cache.
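
For example, if you prefer to keep the SKB data in a local project directory rather than the Hugging Face cache, you can pass an explicit root. This is a minimal sketch; the path below is only an illustration.

from stark_qa import load_skb

# Store/load the SKB under a local directory instead of the Hugging Face cache.
# The directory is an example; use any path you like.
skb = load_skb('amazon', download_processed=True, root='./skb_data/amazon')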

Data of the Retrieval Task

Question-answer pairs for the retrieval task are automatically downloaded to data/{dataset}/stark_qa by default. We provide the official splits in data/{dataset}/split.
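
As a minimal sketch (assuming each dataset item unpacks into a query, a query id, the answer ids, and meta info), you can access queries from the official test split like this:

from stark_qa import load_qa

qa_dataset = load_qa('amazon')
idx_split = qa_dataset.get_idx_split()

# Look at the first few queries in the official test split.
# Each item is assumed to unpack into (query, query_id, answer_ids, meta_info).
for query_idx in idx_split['test'][:3]:
    query, query_id, answer_ids, meta_info = qa_dataset[int(query_idx)]
    print(query_id, query, answer_ids)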

Data of the Knowledge Bases

There are two ways to load the knowledge base data:

  • (Recommended) Instant downloading: The knowledge base data for all three benchmarks is automatically downloaded and loaded when download_processed=True is set.
  • Process data from raw: We also provide all of our preprocessing code for transparency, so you can process the raw data from scratch by setting download_processed=False (see the sketch below). In this case, STaRK-PrimeKG takes around 5 minutes to download and load the processed data; STaRK-Amazon and STaRK-MAG may take around an hour to process from the raw data.
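
A minimal sketch of the second option, assuming only the download_processed flag changes (expect the longer processing times noted above):

from stark_qa import load_skb

# Build the knowledge base from the raw data instead of downloading the
# preprocessed version (slower; see the timings noted above).
skb = load_skb('prime', download_processed=False)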

3) Evaluation on benchmark

If you plan to run the evaluation, you may need to install the following packages:

pip install llm2vec gritlm bm25

  • Our evaluation requires embedding the node documents into candidate_emb_dict.pt, a dictionary mapping node_id -> torch.Tensor. Query embeddings are generated automatically if not available. You can run the following Python script to download the query and document embeddings generated by text-embedding-ada-002 (we provide them so you can run on our benchmark right away); a minimal sketch of how such embeddings can be used for vector similarity search is given after this list.

    python emb_download.py --dataset amazon --emb_dir emb/
    

    Alternatively, you can run the following code to generate the query or document embeddings yourself. E.g.,

    python emb_generate.py --dataset amazon --mode query --emb_dir emb/ --emb_model text-embedding-ada-002
    
    • dataset: one of amazon, mag or prime.
    • mode: the content to embed, one of query or doc (node documents).
    • emb_dir: the directory to store embeddings.
    • emb_model: the LLM used to generate embeddings, such as text-embedding-ada-002, text-embedding-3-large, voyage-large-2-instruct, GritLM/GritLM-7B, or McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp.
    • See emb_generate.py for other arguments.
  • Run the Python script for evaluation. E.g.,

    python eval.py --dataset amazon --model VSS --emb_dir emb/ --output_dir output/ --emb_model text-embedding-ada-002 --split test --save_pred 
    
    python eval.py --dataset amazon --model VSS --emb_dir emb/ --output_dir output/ --emb_model GritLM/GritLM-7B --split test-0.1 --save_pred 
    
    python eval.py --dataset amazon --model LLMReranker --emb_dir emb/ --output_dir output/ --emb_model text-embedding-ada-002 --split human_generated_eval --llm_model gpt-4-1106-preview --save_pred
    

    Key args:

    • dataset: the dataset to evaluate on, one of amazon, mag or prime.
    • model: the model to be evaluated, one of BM25, Colbertv2, VSS, MultiVSS, LLMReranker.
      • Please specify the name of the embedding model with the argument --emb_model.
      • If you are using LLMReranker, please specify the LLM name with the argument --llm_model.
      • Specify API keys on the command line:
        export ANTHROPIC_API_KEY=YOUR_API_KEY
        
        or
        export OPENAI_API_KEY=YOUR_API_KEY
        export OPENAI_ORG=YOUR_ORGANIZATION
        
        or
        export VOYAGE_API_KEY=YOUR_API_KEY
        
    • emb_dir: the directory to store embeddings.
    • split: the split to evaluate on, one of train, val, test, test-0.1 (10% random sample), and human_generated_eval (to be evaluated on the human generated query dataset).
    • output_dir: the directory to store evaluation outputs.
    • surfix: specify this when the stored embeddings are in a folder named doc{surfix} or query{surfix}, e.g., _no_compact.
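
As mentioned above, candidate_emb_dict.pt is simply a dictionary mapping node_id -> torch.Tensor. The following is a minimal sketch, not the benchmark's own evaluation code, of how such a file can be used for vector similarity search (VSS) scoring; the file path and the query embedding are placeholders.

import torch
import torch.nn.functional as F

# Load the document embeddings: a dict mapping node_id -> torch.Tensor.
# The path below is illustrative; point it at your own emb_dir layout.
candidate_emb_dict = torch.load('emb/amazon/text-embedding-ada-002/doc/candidate_emb_dict.pt')

node_ids = list(candidate_emb_dict.keys())
doc_embs = torch.stack([candidate_emb_dict[nid].view(-1) for nid in node_ids])  # (num_docs, dim)

# Placeholder query embedding; in practice it comes from the same embedding model.
query_emb = torch.randn(doc_embs.size(1))

# Cosine similarity between the query and every candidate document.
scores = F.cosine_similarity(query_emb.unsqueeze(0), doc_embs, dim=-1)

# Top-10 candidates ranked by similarity.
top_scores, top_idx = scores.topk(10)
print([node_ids[i] for i in top_idx.tolist()])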

Reference

Please consider citing our paper if you use our benchmark or code in your work:

@inproceedings{wu24stark,
    title        = {STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases},
    author       = {
        Shirley Wu and Shiyu Zhao and 
        Michihiro Yasunaga and Kexin Huang and 
        Kaidi Cao and Qian Huang and 
        Vassilis N. Ioannidis and Karthik Subbian and 
        James Zou and Jure Leskovec
    },
    booktitle    = {NeurIPS Datasets and Benchmarks Track},
    year         = {2024}
}

Download files

Download the file for your platform.

Source Distribution

stark_qa-0.1.3.tar.gz (52.0 kB)

Uploaded Source

Built Distribution

stark_qa-0.1.3-py3-none-any.whl (64.0 kB)

Uploaded Python 3

File details

Details for the file stark_qa-0.1.3.tar.gz.

File metadata

  • Download URL: stark_qa-0.1.3.tar.gz
  • Upload date:
  • Size: 52.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for stark_qa-0.1.3.tar.gz:

  • SHA256: 5daff6ee14f2ee6c095307f889c8f30803793517845dcf2e810c533f722b6cdd
  • MD5: b75bd6084788b37c6c5444a0adb2cf8d
  • BLAKE2b-256: 57e519f8fb753c36f9ea6d052e3062de3669d13b6245cebd0554f80c5f0bae4e


File details

Details for the file stark_qa-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: stark_qa-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 64.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for stark_qa-0.1.3-py3-none-any.whl:

  • SHA256: e146cc5c752417c3756561f481c2b051ae9db4fc23ad9e7e0fb827e0115efe99
  • MD5: 4e84d52d0bfd0e65ece8ca6b3458535a
  • BLAKE2b-256: eb5755fe883261996bbad3704a74529fbc4de8fc34574fad27f3c0415f23768a

