
POPPER: Automated Hypothesis Testing with Agentic Sequential Falsifications

This repository hosts the code base for the paper

Automated Hypothesis Validation with Agentic Sequential Falsifications

Kexin Huang*, Ying Jin*, Ryan Li*, Michael Y. Li, Emmanuel Candès, Jure Leskovec
Link to Paper

If you find this work useful, please consider citing:

@misc{popper,
      title={Automated Hypothesis Validation with Agentic Sequential Falsifications}, 
      author={Kexin Huang and Ying Jin and Ryan Li and Michael Y. Li and Emmanuel Candès and Jure Leskovec},
      year={2025},
      eprint={2502.09858},
      archivePrefix={arXiv}
}

Overview

Hypotheses are central to information acquisition, decision-making, and discovery. However, many real-world hypotheses are abstract, high-level statements that are difficult to validate directly. This challenge is further intensified by the rise of hypothesis generation from Large Language Models (LLMs), which are prone to hallucination and produce hypotheses in volumes that make manual validation impractical. Here we propose Popper, an agentic framework for rigorous automated validation of free-form hypotheses. Guided by Karl Popper's principle of falsification, Popper validates a hypothesis using LLM agents that design and execute falsification experiments targeting its measurable implications. A novel sequential testing framework ensures strict Type-I error control while actively gathering evidence from diverse observations, whether drawn from existing data or newly conducted procedures. We demonstrate Popper on six domains including biology, economics, and sociology. Popper delivers robust error control, high power, and scalability. Furthermore, compared to human scientists, Popper achieved comparable performance in validating complex biological hypotheses while reducing time tenfold, providing a scalable, rigorous solution for hypothesis validation.
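The statistical idea behind the sequential testing framework can be sketched in a few lines. The following is an illustrative toy, not Popper's implementation: each falsification experiment yields an e-value (expectation at most 1 under the null), successive e-values are multiplied, and by Ville's inequality rejecting once the running product reaches 1/alpha keeps the Type-I error at level alpha.

```python
# Illustrative sketch of e-value aggregation for sequential testing.
# Popper's actual e-values come from its falsification experiments.

def sequential_e_test(e_values, alpha=0.1):
    """Multiply e-values in order; reject H0 once the product reaches 1/alpha.

    If each e-value has expectation <= 1 under H0, Ville's inequality bounds
    the probability that the running product ever exceeds 1/alpha by alpha.
    """
    product = 1.0
    for i, e in enumerate(e_values, start=1):
        product *= e
        if product >= 1.0 / alpha:
            return {"reject": True, "num_tests": i, "e_product": product}
    return {"reject": False, "num_tests": len(e_values), "e_product": product}

# Strong evidence accumulates across experiments (2 * 3 * 4 = 24 >= 1/0.1):
print(sequential_e_test([2.0, 3.0, 4.0], alpha=0.1))
```

Note the test can stop early: evidence is assessed after every experiment, so a hypothesis with clear support (or a clear falsification) does not require running all budgeted tests.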


Installation

We highly recommend using a virtual environment to manage the dependencies.

conda create -n popper_env python=3.10
conda activate popper_env

For direct usage of Popper, you can install the package via pip:

pip install popper_agent

For source code development, you can clone the repository and install the package:

git clone https://github.com/snap-stanford/POPPER.git
cd POPPER
pip install -r requirements.txt

Add the OpenAI/Anthropic API key to the environment variables:

export OPENAI_API_KEY="YOUR_API_KEY"
export ANTHROPIC_API_KEY="YOUR_API_KEY"

Datasets will be automatically downloaded to the specified data folder when you run the code.
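Since a missing key typically surfaces only once the agent makes its first LLM call, it can be worth checking the environment up front. A small stdlib helper for this (hypothetical, not part of the popper API):

```python
import os

# Hypothetical helper (not part of popper): report which expected API keys
# are missing from the environment before launching the agent.
def missing_keys(env=os.environ, keys=("OPENAI_API_KEY", "ANTHROPIC_API_KEY")):
    return [k for k in keys if not env.get(k)]

missing = missing_keys()
if missing:
    print(f"Warning: missing API keys: {missing}")
```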

Core API Usage

from popper import Popper

# Initialize the Popper agent
agent = Popper(llm="claude-3-5-sonnet-20240620")

# Register data for hypothesis testing; 
# for bio/discoverybench data in the paper, 
# it will be automatically downloaded to your specified data_path
agent.register_data(data_path='path/to/data', loader_type='bio')

# Configure the agent with custom parameters
agent.configure(
    alpha=0.1,
    max_num_of_tests=5,
    max_retry=3,
    time_limit=2,
    aggregate_test='E-value',
    relevance_checker=True,
    use_react_agent=True
)

# Validate a hypothesis
results = agent.validate(hypothesis="Your hypothesis here")

# Print the results
print(results)

Run on your own hypothesis and database

You can simply drop in a set of datasets from your domain (e.g., business, economics, political science) and run Popper on your own hypothesis. The only requirement is that every file is in CSV or pickle (pkl) format.

from popper import Popper   

agent = Popper(llm="claude-3-5-sonnet-20240620")
agent.configure(alpha=0.1)
agent.register_data(data_path='path/to/data', loader_type='custom')
agent.validate(hypothesis='YOUR HYPOTHESIS')
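As a concrete (hypothetical) example of what such a custom data folder might look like, the snippet below writes one small CSV dataset with the standard library; the file name and columns are invented for illustration. Popper would then be pointed at the folder via register_data as above.

```python
import csv
import os

# Hypothetical example: create a custom data folder containing one CSV file.
data_path = "my_custom_data"
os.makedirs(data_path, exist_ok=True)

rows = [
    {"country": "A", "gdp_growth": 2.1, "unemployment": 5.0},
    {"country": "B", "gdp_growth": -0.4, "unemployment": 8.2},
]
with open(os.path.join(data_path, "economy.csv"), "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["country", "gdp_growth", "unemployment"])
    writer.writeheader()
    writer.writerows(rows)

# Popper can then be pointed at this folder:
# agent.register_data(data_path="my_custom_data", loader_type="custom")
```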


Hypotheses in Popper

You can define any free-form hypothesis. In the paper, we provide two types of hypotheses: biological hypotheses and DiscoveryBench hypotheses.

You can load the biological hypothesis with:

from popper.benchmark import gene_perturb_hypothesis
bm = gene_perturb_hypothesis(num_of_samples=samples, permuted=False, dataset='IL2', path=path)
example = bm.get_example(0)

It will return something like:

{'prompt': 'Gene VAV1 regulates the production of Interleukin-2 (IL-2).',
 'gene': 'VAV1',
 'answer': 2.916,
 'binary_answer': True}

Here, num_of_samples is the number of samples to generate; permuted specifies whether to permute the dataset for Type-I error estimation; and dataset selects which dataset to use, either IL2 or IFNG.
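The permuted option exists because permuting the dataset breaks the true gene-outcome association, so every hypothesis becomes null and the fraction of rejections estimates the Type-I error. A self-contained sketch of that logic (illustrative only; toy_test stands in for a real falsification test and is not part of popper):

```python
import random

# Illustrative only: estimate Type-I error by running a calibrated test on
# permuted (null) data and counting false rejections at level alpha.
random.seed(0)

def toy_test(effect, alpha=0.1):
    # Stand-in for a falsification test: under the null (effect == 0) a
    # calibrated test rejects with probability alpha.
    p_reject = alpha if effect == 0 else 0.9
    return random.random() < p_reject

null_effects = [0] * 1000          # permuted data: no real associations remain
rejections = sum(toy_test(e, alpha=0.1) for e in null_effects)
type_i_error = rejections / len(null_effects)
print(f"estimated Type-I error: {type_i_error:.3f}")  # should be near 0.10
```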

For discovery-bench, you can load the hypothesis with:

from popper.benchmark import discovery_bench_hypothesis
bm = discovery_bench_hypothesis(num_samples=samples, path=path)
example = bm.get_example(0)

It will return something like:

{'task': 'archaeology',
 'domain': 'humanities',
 'metadataid': 5,
 'query_id': 0,
 'prompt': 'From 1700 BCE onwards, the use of hatchets and swords increased while the use of daggers decreased.',
 'data_loader': <popper.utils.DiscoveryBenchDataLoader at 0x7c20793e9f70>,
 'answer': True}

As each hypothesis in DiscoveryBench has its own associated dataset, each example includes a data_loader for that dataset.

Run benchmarks in the paper

Bash scripts for reproducing the paper are provided: benchmark_scripts/run_targetval.sh for the TargetVal benchmark and benchmark_scripts/run_discoverybench.sh for the DiscoveryBench benchmark.

Note: the Popper agent can read and write files on your filesystem. We recommend running the benchmark scripts inside a containerized environment. We provide a working Dockerfile and an example script, benchmark_scripts/run_discoverybench_docker.sh, that launches a Docker container and executes the scripts.

Acknowledgement

The DiscoveryBench benchmark and some of the baseline agents are built on top of allenai/discoverybench. Thanks for their awesome work!

UI interface [Coming soon!]

You can deploy a simple UI with one line of code, using your own datasets or our bio dataset. A Gradio UI will be generated that you can interact with to validate your hypothesis.

agent.launch_ui(share=True)

Contact

For any questions, please open an issue on GitHub or contact Kexin Huang (kexinh@cs.stanford.edu).
