On-Demand Datasets for Reasoning and Retrieval Evaluation

Project description

PhantomWiki

PhantomWiki generates on-demand datasets to evaluate reasoning and retrieval capabilities of LLMs.

Contents

🚀 Quickstart

First install Prolog on your machine (see 🔗 Installing dependencies below), then install PhantomWiki with pip:

pip install phantom-wiki

[!NOTE] This package has been tested with Python 3.12. We require Python 3.10+ to support match statements.

To build from source, you can clone this repository and run pip install . from the repository root.

Generate PhantomWiki datasets with random generation seed 1:

  1. In Python:
import phantom_wiki as pw

pw.generate_dataset(
    output_dir="/path/to/output",
    seed=1,
    use_multithreading=True,
)
  2. In a terminal:
phantom-wiki-generate -od "/path/to/output" --seed 1 --use-multithreading

(You can also use the shorthand alias pw-generate.)

[!NOTE] We do not support --use-multithreading on macOS yet, so skip this flag (or pass use_multithreading=False in Python).

The following generation script creates datasets of various sizes with random generation seed 1:

./data/generate-v1.sh /path/to/output/ 1 --use-multithreading
  • Universe sizes 25, 50, 500, ..., 5K, 500K, 1M (number of documents)
  • Question template depth 20 (proportional to difficulty)

For example, it executes the following command to generate a size 5K universe (5000 = --max-family-tree-size * --num-family-trees):

pw-generate \
   -od /path/to/output/depth_20_size_5000_seed_1 \
   --seed 1 \
   --question-depth 20 \
   --num-family-trees 100 \
   --max-family-tree-size 50 \
   --max-family-tree-depth 20 \
   --article-format json \
   --question-format json \
   --use-multithreading

Pre-generated PhantomWiki datasets on HuggingFace

For convenience, we provide pre-generated PhantomWiki datasets on HuggingFace (sizes 50, 500, and 5000, each with seeds 1, 2, and 3).

from datasets import load_dataset

# Download the document corpus
ds_corpus = load_dataset("kilian-group/phantom-wiki-v1", "text-corpus")
# Download the question-answer pairs
ds_qa = load_dataset("kilian-group/phantom-wiki-v1", "question-answer")
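
Both configs load as standard datasets.DatasetDict objects. The snippet below is just a quick way to see what you downloaded; the split naming (e.g. depth_20_size_50_seed_1) is an assumption based on the generation script above, so check the printed split names rather than hard-coding one.

# List the available splits, then peek at one document and one QA pair.
print(ds_corpus)
corpus_split = list(ds_corpus.keys())[0]
qa_split = list(ds_qa.keys())[0]
print(ds_corpus[corpus_split][0])
print(ds_qa[qa_split][0])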

🔗 Installing dependencies

PhantomWiki uses the Prolog logic programming language, available on all major operating systems through SWI-Prolog. We recommend installing SWI-Prolog through your OS package manager or through conda, for example:

# On macOS: with homebrew
brew install swi-prolog

# On Linux: with apt
sudo add-apt-repository ppa:swi-prolog/stable
sudo apt-get update
sudo apt-get install swi-prolog

# On Linux: with conda
conda install conda-forge::swi-prolog

# On Windows: download and install binary from https://www.swi-prolog.org/download/stable

Installing PhantomWiki in development mode

There are two options:

  1. (Recommended) Install the package in editable mode using pip:

    pip install -e .
    
  2. If you use VSCode, you can add the source directory to the Python path without installing the package:

    1. Create a file in the repo root called .env
    2. Add PYTHONPATH=src
    3. Restart VSCode

🔢 Evaluating LLMs on PhantomWiki

First, install the evaluation dependencies and a vLLM build that matches your hardware (GPU, CPU, etc.):

pip install phantom-wiki[eval]
pip install "vllm>=0.6.6"

If you're installing from source, use pip install -e ".[eval]".

Setting up API keys

Anthropic
  1. Register an account with your cornell.edu email and join "Kilian's Group"
  2. Create an API key at https://console.anthropic.com/settings/keys under your name
  3. Set your Anthropic API key in your conda environment:
conda env config vars set ANTHROPIC_API_KEY=xxxxx

Rate limits: https://docs.anthropic.com/en/api/rate-limits#updated-rate-limits

🚨 The Anthropic API has particularly low rate limits, so generating predictions takes longer.

Google Gemini
  1. Create an API key at https://aistudio.google.com/app/apikey (NOTE: for some reason, Google AI Studio is disabled for cornell.edu accounts, so use your personal account)
  2. Set your Gemini API key:
conda env config vars set GEMINI_API_KEY=xxxxx

OpenAI
  1. Register an account with your cornell.edu email at https://platform.openai.com/ and join "Kilian's Group"
  2. Create an API key at https://platform.openai.com/settings/organization/api-keys under your name
  3. Set your OpenAI API key in your conda environment:
conda env config vars set OPENAI_API_KEY=xxxxx

Rate limits: https://platform.openai.com/docs/guides/rate-limits#usage-tiers

TogetherAI
  1. Register for an account at https://api.together.ai
  2. Set your TogetherAI API key:
conda env config vars set TOGETHER_API_KEY=xxxxx
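
After setting the keys, it is worth confirming they are actually visible to Python; variables set with conda env config vars are only exported after the environment is re-activated. A minimal check (only the keys for the providers you use need to be set):

import os

# Print which provider API keys are currently exported in this environment.
for key in ("ANTHROPIC_API_KEY", "GEMINI_API_KEY", "OPENAI_API_KEY", "TOGETHER_API_KEY"):
    print(key, "is set" if os.environ.get(key) else "is missing")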

vLLM

Original setup instructions: https://docs.vllm.ai/en/stable/getting_started/installation.html#install-the-latest-code

Additional notes:

  • It's recommended to download the model manually:
huggingface-cli download MODEL_REPO_ID
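
To confirm that the downloaded model and your vLLM build work together, a quick smoke test along these lines can help. This is only a sketch: the model ID is an example (substitute the repo you downloaded), and phantom_eval drives vLLM itself during evaluation.

from vllm import LLM, SamplingParams

# Load the locally cached model and generate a short completion.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model ID
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)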

Reproducing LLM evaluation results in the paper

[!NOTE] For vLLM inference, make sure to request access for Gemma, Llama 3.1, 3.2, and 3.3 models on HuggingFace before proceeding.

🧪 To generate the predictions, run the following command from the root directory:

python -m phantom_eval --method METHOD --model_name MODEL_NAME --split_list SPLIT_LIST -od OUTPUT_DIRECTORY

[!TIP] To generate a Slurm script with the appropriate GPU allocation and inference config, run the create_eval.sh script and follow the prompts.

📊 To generate the tables and figures, run the following command from the root directory:

./eval/icml.sh OUTPUT_DIRECTORY METHOD

where OUTPUT_DIRECTORY and METHOD are the same as when generating the predictions. This script will create the following subdirectories in OUTPUT_DIRECTORY: scores/ and figures/.

📃 Citation

@article{gong2025phantomwiki,
  title={{PhantomWiki}: On-Demand Datasets for Reasoning and Retrieval Evaluation},
  author={Gong, Albert and Stankevi{\v{c}}i{\=u}t{\.e}, Kamil{\.e} and Wan, Chao and Kabra, Anmol and Thesmar, Raphael and Lee, Johann and Klenke, Julius and Gomes, Carla P and Weinberger, Kilian Q},
  journal={arXiv preprint arXiv:2502.20377},
  year={2025}
}


Download files

Download the file for your platform.

Source Distribution

phantom_wiki-0.5.2.tar.gz (159.2 kB)

Built Distribution

phantom_wiki-0.5.2-py3-none-any.whl (155.1 kB)

File details

Details for the file phantom_wiki-0.5.2.tar.gz.

File metadata

  • Download URL: phantom_wiki-0.5.2.tar.gz
  • Size: 159.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for phantom_wiki-0.5.2.tar.gz:
  • SHA256: f6d9b85229232db0a37fb7affd2d1564dc83aa830d8de1c866d99a0ac5340d23
  • MD5: 1a04a3d0ed3c5d0cc0df54ae93f9ad4a
  • BLAKE2b-256: 5bb70c5708c916526b55a8345d140a6e3318e9d01a1aeac2162b72fed352e65f

Provenance

The following attestation bundles were made for phantom_wiki-0.5.2.tar.gz:

Publisher: python-publish.yml on kilian-group/phantom-wiki

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file phantom_wiki-0.5.2-py3-none-any.whl.

File metadata

  • Download URL: phantom_wiki-0.5.2-py3-none-any.whl
  • Size: 155.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for phantom_wiki-0.5.2-py3-none-any.whl:
  • SHA256: 545384f4ea95042c0632f59a46298e4c95185fe3e37752d918ad669e55140c9a
  • MD5: 2ad5fcd01d24502446545fcb628f08ca
  • BLAKE2b-256: a8a2a0133ecbfde8405924c271a68f769a21992c125c6b362be504af79595e3e

Provenance

The following attestation bundles were made for phantom_wiki-0.5.2-py3-none-any.whl:

Publisher: python-publish.yml on kilian-group/phantom-wiki

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
