
POPPER: Automated Hypothesis Testing with Agentic Sequential Falsifications

This repository hosts the code base for the paper

Automated Hypothesis Validation with Agentic Sequential Falsifications

Kexin Huang*, Ying Jin*, Ryan Li*, Michael Y. Li, Emmanuel Candès, Jure Leskovec
Link to Paper

If you find this work useful, please consider citing:

@misc{popper,
      title={Automated Hypothesis Validation with Agentic Sequential Falsifications}, 
      author={Kexin Huang and Ying Jin and Ryan Li and Michael Y. Li and Emmanuel Candès and Jure Leskovec},
      year={2025},
      eprint={2502.09858},
      archivePrefix={arXiv}
}

Overview

Hypotheses are central to information acquisition, decision-making, and discovery. However, many real-world hypotheses are abstract, high-level statements that are difficult to validate directly. This challenge is further intensified by the rise of hypothesis generation from Large Language Models (LLMs), which are prone to hallucination and produce hypotheses in volumes that make manual validation impractical. Here we propose Popper, an agentic framework for rigorous automated validation of free-form hypotheses. Guided by Karl Popper's principle of falsification, Popper validates a hypothesis using LLM agents that design and execute falsification experiments targeting its measurable implications. A novel sequential testing framework ensures strict Type-I error control while actively gathering evidence from diverse observations, whether drawn from existing data or newly conducted procedures. We demonstrate Popper on six domains including biology, economics, and sociology. Popper delivers robust error control, high power, and scalability. Furthermore, compared to human scientists, Popper achieved comparable performance in validating complex biological hypotheses while reducing time tenfold, providing a scalable, rigorous solution for hypothesis validation.
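The statistical idea behind the sequential testing framework can be sketched in a few lines. The following is an illustrative toy, not Popper's implementation: each falsification experiment yields an e-value (expectation at most 1 under the null), successive e-values are multiplied, and by Ville's inequality rejecting once the running product reaches 1/alpha keeps the Type-I error at level alpha.

```python
# Illustrative sketch of e-value aggregation for sequential testing.
# Popper's actual e-values come from its falsification experiments.

def sequential_e_test(e_values, alpha=0.1):
    """Multiply e-values in order; reject H0 once the product reaches 1/alpha.

    If each e-value has expectation <= 1 under H0, Ville's inequality bounds
    the probability that the running product ever exceeds 1/alpha by alpha.
    """
    product = 1.0
    for i, e in enumerate(e_values, start=1):
        product *= e
        if product >= 1.0 / alpha:
            return {"reject": True, "num_tests": i, "e_product": product}
    return {"reject": False, "num_tests": len(e_values), "e_product": product}

# Strong evidence accumulates across experiments (2 * 3 * 4 = 24 >= 1/0.1):
print(sequential_e_test([2.0, 3.0, 4.0], alpha=0.1))
```

Note the test can stop early: evidence is assessed after every experiment, so a hypothesis with clear support (or a clear falsification) does not require running all budgeted tests.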


Installation

We highly recommend using a virtual environment to manage the dependencies.

conda create -n popper_env python=3.10
conda activate popper_env

For direct usage of Popper, you can install the package via pip:

pip install popper_agent

For source code development, you can clone the repository and install the package:

git clone https://github.com/snap-stanford/POPPER.git
cd POPPER
pip install -r requirements.txt

Add the OpenAI/Anthropic API key to the environment variables:

export OPENAI_API_KEY="YOUR_API_KEY"
export ANTHROPIC_API_KEY="YOUR_API_KEY"

Datasets will be automatically downloaded to the specified data folder when you run the code.
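Since a missing key typically surfaces only once the agent makes its first LLM call, it can be worth checking the environment up front. A small stdlib helper for this (hypothetical, not part of the popper API):

```python
import os

# Hypothetical helper (not part of popper): report which expected API keys
# are missing from the environment before launching the agent.
def missing_keys(env=os.environ, keys=("OPENAI_API_KEY", "ANTHROPIC_API_KEY")):
    return [k for k in keys if not env.get(k)]

missing = missing_keys()
if missing:
    print(f"Warning: missing API keys: {missing}")
```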

Core API Usage

from popper import Popper

# Initialize the Popper agent
agent = Popper(llm="claude-3-5-sonnet-20240620")

# Register data for hypothesis testing; 
# for bio/discoverybench data in the paper, 
# it will be automatically downloaded to your specified data_path
agent.register_data(data_path='path/to/data', loader_type='bio')

# Configure the agent with custom parameters
agent.configure(
    alpha=0.1,
    max_num_of_tests=5,
    max_retry=3,
    time_limit=2,
    aggregate_test='E-value',
    relevance_checker=True,
    use_react_agent=True
)

# Validate a hypothesis
results = agent.validate(hypothesis="Your hypothesis here")

# Print the results
print(results)

Run on your own hypothesis and database

You can simply drop in a set of datasets from your domain (e.g., business, economics, political science) and run Popper on your own hypothesis. The only requirement is that every file is in CSV or pickle (pkl) format.

from popper import Popper   

agent = Popper(llm="claude-3-5-sonnet-20240620")
agent.configure(alpha=0.1)
agent.register_data(data_path='path/to/data', loader_type='custom')
agent.validate(hypothesis='YOUR HYPOTHESIS')
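As a concrete (hypothetical) example of what such a custom data folder might look like, the snippet below writes one small CSV dataset with the standard library; the file name and columns are invented for illustration. Popper would then be pointed at the folder via register_data as above.

```python
import csv
import os

# Hypothetical example: create a custom data folder containing one CSV file.
data_path = "my_custom_data"
os.makedirs(data_path, exist_ok=True)

rows = [
    {"country": "A", "gdp_growth": 2.1, "unemployment": 5.0},
    {"country": "B", "gdp_growth": -0.4, "unemployment": 8.2},
]
with open(os.path.join(data_path, "economy.csv"), "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["country", "gdp_growth", "unemployment"])
    writer.writeheader()
    writer.writerows(rows)

# Popper can then be pointed at this folder:
# agent.register_data(data_path="my_custom_data", loader_type="custom")
```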


Hypotheses in Popper

You can define any free-form hypothesis. In the paper, we provide two types of hypotheses: biological hypotheses and DiscoveryBench hypotheses.

You can load the biological hypothesis with:

from popper.benchmark import gene_perturb_hypothesis
bm = gene_perturb_hypothesis(num_of_samples=samples, permuted=False, dataset='IL2', path=path)
example = bm.get_example(0)

It will return something like:

{'prompt': 'Gene VAV1 regulates the production of Interleukin-2 (IL-2).',
 'gene': 'VAV1',
 'answer': 2.916,
 'binary_answer': True}

Here, num_of_samples is the number of samples to generate; permuted specifies whether to permute the dataset for Type-I error estimation; and dataset selects which dataset to use, either IL2 or IFNG.
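The permuted option exists because permuting the dataset breaks the true gene-outcome association, so every hypothesis becomes null and the fraction of rejections estimates the Type-I error. A self-contained sketch of that logic (illustrative only; toy_test stands in for a real falsification test and is not part of popper):

```python
import random

# Illustrative only: estimate Type-I error by running a calibrated test on
# permuted (null) data and counting false rejections at level alpha.
random.seed(0)

def toy_test(effect, alpha=0.1):
    # Stand-in for a falsification test: under the null (effect == 0) a
    # calibrated test rejects with probability alpha.
    p_reject = alpha if effect == 0 else 0.9
    return random.random() < p_reject

null_effects = [0] * 1000          # permuted data: no real associations remain
rejections = sum(toy_test(e, alpha=0.1) for e in null_effects)
type_i_error = rejections / len(null_effects)
print(f"estimated Type-I error: {type_i_error:.3f}")  # should be near 0.10
```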

For discovery-bench, you can load the hypothesis with:

from popper.benchmark import discovery_bench_hypothesis
bm = discovery_bench_hypothesis(num_samples=samples, path=path)
example = bm.get_example(0)

It will return something like:

{'task': 'archaeology',
 'domain': 'humanities',
 'metadataid': 5,
 'query_id': 0,
 'prompt': 'From 1700 BCE onwards, the use of hatchets and swords increased while the use of daggers decreased.',
 'data_loader': <popper.utils.DiscoveryBenchDataLoader at 0x7c20793e9f70>,
 'answer': True}

As each hypothesis in DiscoveryBench has its own associated dataset, each example includes a data_loader for that dataset.

Run benchmarks in the paper

Bash scripts for reproducing the paper are provided: benchmark_scripts/run_targetval.sh for the TargetVal benchmark and benchmark_scripts/run_discoverybench.sh for the DiscoveryBench benchmark.

Note: the Popper agent can read and write files on your filesystem. We recommend running the benchmark scripts inside a containerized environment. We provide a working Dockerfile and an example script, benchmark_scripts/run_discoverybench_docker.sh, that launches a Docker container and executes the scripts.

Acknowledgement

The DiscoveryBench benchmark and some of the baseline agents are built on top of allenai/discoverybench. Thanks for their awesome work!

UI interface [Coming soon!]

You can deploy a simple UI with one line of code, using your own datasets or our bio dataset. A Gradio UI will be generated that you can interact with to validate your hypothesis.

agent.launch_ui(share=True)

Contact

For any questions, please open an issue on GitHub or contact Kexin Huang (kexinh@cs.stanford.edu).
