Convert academic papers into benchmark tasks for evaluating AI agents.

Paper2Bench

Convert academic papers into benchmark tasks for evaluating AI coding agents.

Paper2Bench offers two complementary workflows:

Core pipeline — a reproducible task per paper: extract what the paper used (models, datasets, budget), hand it to an agent, and score the agent's conclusion against the paper's.
Benchmark-variant generator — turn a paper into a family of new research problems (perturbations, ablations, future-work tasks) that test whether an agent can transfer the paper's logic to new settings.

Both workflows auto-detect the paper's archetype (llm_evaluation / novel_architecture / empirical_study) and route extraction through a matching prompt and template — so papers like CGCNN or SchNet don't get shoehorned into an LLM-evaluation schema.

Contents

Install
Quick start
Pipelines at a glance
Commands
Usage guide
Worked examples
Output files
How it works

Install

For development:

git clone https://github.com/AbhayAnandUCSD/Paper2Bench.git
cd Paper2Bench
pip install -e .

Or install directly from GitHub:

pip install git+https://github.com/AbhayAnandUCSD/Paper2Bench.git@v0.3.0

If you want to use Claude models, install the optional anthropic extra:

pip install "paper2bench[anthropic] @ git+https://github.com/AbhayAnandUCSD/Paper2Bench.git@v0.3.0"

Requires Python 3.10+. Set at least one provider key (both may coexist in a .env file):

export OPENAI_API_KEY=sk-...           # required for gpt-* / o1-* / o3-* / o4-* models
export ANTHROPIC_API_KEY=sk-ant-...    # required for claude-* models

Provider selection is automatic based on the model name:

gpt-*, o1-*, o3-*, o4-* → OpenAI
claude-*, anthropic/claude-* → Anthropic

So --model claude-opus-4-6 just works if ANTHROPIC_API_KEY is set and the anthropic extra is installed. The default model is gpt-4o. Each stage that accepts --model (or --agent-model / --eval-model) can be pointed at either provider independently.
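
The routing rule above can be sketched in a few lines. This is a minimal illustration, not Paper2Bench's actual code: the function name pick_provider and the error raised for unknown names are assumptions; only the prefixes come from the list above.

```python
# Hypothetical sketch of prefix-based provider routing; not the library's real code.
OPENAI_PREFIXES = ("gpt-", "o1-", "o3-", "o4-")
ANTHROPIC_PREFIXES = ("claude-", "anthropic/claude-")


def pick_provider(model: str) -> str:
    """Return "openai" or "anthropic" based on the model-name prefix."""
    if model.startswith(OPENAI_PREFIXES):
        return "openai"
    if model.startswith(ANTHROPIC_PREFIXES):
        return "anthropic"
    # Assumption: unknown names are rejected rather than silently defaulted.
    raise ValueError(f"Cannot infer provider for model {model!r}")


print(pick_provider("gpt-4o"))           # openai
print(pick_provider("claude-opus-4-6"))  # anthropic
```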

Note on Claude + --split-rqs: with --split-rqs, a run makes ~6 LLM calls per RQ, each with ~45k input tokens. On lower Anthropic tiers (30k input tokens/min), papers with many RQs will hit rate limits. Workarounds: run one RQ at a time via --research-question "..." instead of --split-rqs, or use OpenAI for batch runs.
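
To make the numbers in this note concrete, here is a back-of-the-envelope estimate. It uses only the approximate figures quoted above; the function name min_minutes is made up for illustration, and the result is a lower bound that ignores latency, retries, and output tokens. Note that a single ~45k-token call is already larger than a 30k per-minute budget.

```python
# Rough estimate built from the approximate figures in the note above.
CALLS_PER_RQ = 6             # ~6 LLM calls per research question
TOKENS_PER_CALL = 45_000     # ~45k input tokens per call
TIER_LIMIT_PER_MIN = 30_000  # input tokens/min on a lower Anthropic tier


def min_minutes(num_rqs: int) -> float:
    """Lower bound on minutes if the per-minute token budget were the only limit."""
    total_tokens = num_rqs * CALLS_PER_RQ * TOKENS_PER_CALL
    return total_tokens / TIER_LIMIT_PER_MIN


print(min_minutes(1))  # 9.0
print(min_minutes(5))  # 45.0
```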

Python SDK

The pipeline is also importable as a library, which is useful when you want to drive Paper2Bench programmatically rather than through the CLI:

import paper2bench

result = paper2bench.run(
    pdf="paper.pdf",
    research_question="Does X improve Y?",
    api_key="sk-...",                  # or rely on OPENAI_API_KEY / ANTHROPIC_API_KEY env vars
    output_dir="./output",
)

print(result.paper_type)               # auto-classified, or pass paper_type=... to override
print(result.instruction_path)         # ./output/tasks/<task_id>/instruction.txt
print(result.instruction_gt_path)      # ground-truth experimental plan
print(result.config)                   # parsed task_config.yaml as dict

Stage functions (parse_paper_to_tree, classify_paper, extract_yaml_from_pdf, render_instruction, generate_supplementary_plan, build_instruction_gt, verify_instruction_gt, extract_paper_spec, generate_variants, chat) are also importable from paper2bench for power users who want to compose the pipeline manually.

Quick start

One-shot benchmark from an arXiv query:

paper2bench run \
  --task-id reversal_curse \
  --query "The Reversal Curse Berglund" \
  --research-question "If an LLM is fine-tuned on 'A is B,' does it learn 'B is A'?" \
  --output-dir ./output/ \
  --auto

--auto re-ranks arXiv results by title similarity and picks the top match if it scores at or above --min-similarity (default 0.4). If your query phrasing diverges from the canonical title and --auto aborts, pass a lower threshold like --min-similarity 0.3 — the abort message will tell you what value to try.
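
The threshold behaviour can be illustrated with a small sketch. The README does not say which similarity metric --auto uses, so this example swaps in difflib.SequenceMatcher from the standard library as a stand-in; the function name pick_best is hypothetical.

```python
from difflib import SequenceMatcher


def pick_best(query, titles, min_similarity=0.4):
    """Return (title, score) for the best-scoring title, or None if below threshold.

    Assumption: similarity is a SequenceMatcher ratio on lowercased strings;
    Paper2Bench's real metric may differ.
    """
    scored = [
        (title, SequenceMatcher(None, query.lower(), title.lower()).ratio())
        for title in titles
    ]
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= min_similarity else None


titles = ["The Reversal Curse", "Attention Is All You Need"]
print(pick_best("the reversal curse", titles))  # ('The Reversal Curse', 1.0)
print(pick_best("zzzz", titles))                # None
```

Returning None corresponds to the case where --auto aborts and a lower --min-similarity may be worth trying.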

Generate a family of variant benchmarks from an existing PDF:

paper2bench specextract --pdf paper.pdf -o spec.json
paper2bench generate    --spec spec.json --pdf paper.pdf -o ./variants/

Pipelines at a glance

Core pipeline — paper → single benchmark task:

Paper title/author
  → paper2bench download   → PDF
  → paper2bench parse      → problem tree JSON
  → paper2bench classify   → paper_type (auto)
  → paper2bench extract    → task_config.yaml
  → paper2bench render     → instruction.txt
  → paper2bench plan       → instruction_gt.txt   (ground-truth plan)
  → paper2bench verify     → F1 check             (optional)

Benchmark-variant generator — paper → family of variant tasks:

PDF
  → paper2bench specextract → paper_spec.json  (7-component spec)
  → paper2bench generate    → variants/<id>/{instruction.txt, task_config.yaml, metadata.json}

The two workflows are independent — you can use either on its own.

Commands

Command	Purpose
`download`	Search arXiv and download a paper PDF
`parse`	Parse a PDF into a 3-level problem tree
`classify`	Classify a paper as `llm_evaluation` / `novel_architecture` / `empirical_study`
`extract`	Extract task config YAML (paper-type aware)
`render`	Render YAML into `instruction.txt`
`plan`	Build `instruction_gt.txt` from instruction + PDF
`verify`	Run an agent with `instruction_gt` and check F1 ≥ 80
`run`	Core pipeline end-to-end
`specextract`	Extract 7-component paper specification
`generate`	Generate benchmark variant instances from a paper spec

Usage guide

Core pipeline, one command

paper2bench run \
  --task-id my_task \
  --query "Paper title and authors" \
  --research-question "The question the agent will try to answer" \
  --output-dir ./output/ \
  --auto

Useful run flags:

Flag	Effect
`--pdf PATH`	Use a local PDF instead of searching arXiv
`--paper-type TYPE`	Skip auto-classification and force `llm_evaluation` / `novel_architecture` / `empirical_study`
`--skip-hf-validation`	Skip the HuggingFace-Hub existence check on dataset loaders
`--split-rqs`	Emit a separate benchmark task per research question in the parsed tree (`--research-question` becomes optional)
`--verify`	After generating `instruction_gt.txt`, run an agent with it and check F1 ≥ 80
`--template PATH`	Use a custom instruction template
`--model MODEL`	Change the LLM (default `gpt-4o`)

Core pipeline, step by step

Run any stage on its own — each writes a self-contained artifact.

# 1. Download from arXiv
paper2bench download "Paper title and authors" --task-id my_task -o ./papers/

# 2. Parse into problem tree
paper2bench parse ./papers/my_task.pdf -o ./trees/my_task_tree.json

# 3. (optional) Classify standalone
paper2bench classify --pdf ./papers/my_task.pdf -o ./tasks/my_task/classification.json

# 4. Extract task config (paper-type aware)
paper2bench extract \
  --pdf ./papers/my_task.pdf \
  --research-question "The question" \
  --task-id my_task \
  --tree ./trees/my_task_tree.json \
  --paper-type llm_evaluation \
  -o ./tasks/my_task/task_config.yaml

# 5. Render instruction.txt
paper2bench render \
  --config ./tasks/my_task/task_config.yaml \
  -o ./tasks/my_task/instruction.txt

# 6. Generate ground-truth plan
paper2bench plan \
  --instruction ./tasks/my_task/instruction.txt \
  --pdf ./papers/my_task.pdf \
  --tree ./trees/my_task_tree.json \
  -o ./tasks/my_task/instruction_gt.txt

# 7. (optional) Verify clarity
paper2bench verify \
  --instruction-gt ./tasks/my_task/instruction_gt.txt \
  --pdf ./papers/my_task.pdf \
  -o ./tasks/my_task/verify_results.json

Paper types and auto-classification

Every paper is auto-classified into one of three archetypes; the type drives both the extraction prompt and the instruction template. Pass --paper-type to any command that accepts it (for example run, extract, or generate) to override the classifier.

Type	When it fires	Schema highlights	Template
`llm_evaluation`	Paper evaluates existing models on a task (e.g. Lost in the Middle, Chain-of-Thought)	`models.api`, `models.huggingface`, `datasets`	`default.txt`
`novel_architecture`	Paper introduces a new model / method (e.g. CGCNN, SchNet, Transformer)	`proposed_method`, `baselines`, `reference_implementation`	`novel_architecture.txt`
`empirical_study`	Observational / meta-study, no new model	`study_type`, `data_sources`, `analytical_tools`	`empirical_study.txt`

All three schemas also include:

references — author-year citations get their own field so they aren't mis-extracted as fake synthetic datasets.
HuggingFace loader validation — any source: huggingface loader is checked against the Hub; missing repos are demoted to source: unknown (bypass with --skip-hf-validation).
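
The demotion rule for HuggingFace loaders can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the existence check is injected as a hypothetical repo_exists callable so the example runs offline, the loader field is assumed to hold a repo id, and dropping the rejected loader is a guess about what happens to it.

```python
def demote_missing_loaders(datasets, repo_exists):
    """Return a copy of `datasets` with unverifiable HuggingFace loaders demoted.

    `repo_exists` is a hypothetical callable (repo_id -> bool); a real check
    would query the Hub instead.
    """
    checked = []
    for entry in datasets:
        entry = dict(entry)  # do not mutate the caller's data
        is_hf = entry.get("source") == "huggingface"
        if is_hf and not repo_exists(entry.get("loader", "")):
            entry["source"] = "unknown"
            entry.pop("loader", None)  # assumption: the rejected loader is dropped
        checked.append(entry)
    return checked


known_repos = {"example-org/example-dataset"}  # stand-in for the Hub
datasets = [
    {"name": "Example", "source": "huggingface", "loader": "example-org/example-dataset"},
    {"name": "QM9", "source": "huggingface", "loader": "qm9"},
    {"name": "Synthetic KV", "source": "synthetic"},
]
for entry in demote_missing_loaders(datasets, known_repos.__contains__):
    print(entry["name"], entry["source"])
```

Only the second entry changes: its loader is not found, so it becomes source: unknown, which is what the rendered instructions surface as "loader unverified".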

Verification (F1 gate)

paper2bench verify runs a coding agent with instruction_gt.txt, extracts its final conclusion, decomposes both the conclusion and a paper-derived reference answer into atomic claims, and computes claim-level precision / recall / F1. The task passes if F1 ≥ 80 (on a 0-100 scale).
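
The scoring step can be sketched once the claims have been matched. The sketch below assumes the claim matching (done by an LLM in the real pipeline) has already produced four counts; the function name claim_f1, its parameters, and the exact definitions of "supported" and "covered" are assumptions, and scores are reported on a 0-100 scale to match the threshold of 80.

```python
def claim_f1(agent_claims, agent_supported, ref_claims, ref_covered, threshold=80.0):
    """Claim-level precision / recall / F1 on a 0-100 scale.

    agent_supported: agent claims backed by the reference answer.
    ref_covered:     reference claims recovered by the agent's conclusion.
    """
    precision = agent_supported / agent_claims if agent_claims else 0.0
    recall = ref_covered / ref_claims if ref_claims else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * f1,
        "passed": 100 * f1 >= threshold,
    }


# 6 of 8 agent claims are supported; all 5 reference claims are covered.
result = claim_f1(agent_claims=8, agent_supported=6, ref_claims=5, ref_covered=5)
print(round(result["f1"], 2), result["passed"])  # 85.71 True
```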

Verification is opt-in and not suitable for papers whose experiments can't be executed in a Python sandbox.

paper2bench verify \
  --instruction-gt ./tasks/my_task/instruction_gt.txt \
  --pdf ./papers/my_task.pdf \
  --data-dir ./tasks/my_task/data \
  --agent-model gpt-4o \
  --eval-model gpt-4o

Splitting papers with multiple research questions

Pass --split-rqs on run to emit a separate benchmark task per research question found in the parsed problem tree. Each sub-task lands in tasks/<task_id>_rqN/ with its own config, instruction, and instruction_gt.

paper2bench run --task-id my_task --pdf paper.pdf --split-rqs
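
As a rough mental model of the split, the sketch below walks a toy three-level tree and derives one sub-task id per research question. The class and field names are hypothetical (the real tree JSON schema is not shown here), and numbering from 1 in tree order is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchQuestion:
    text: str
    experiments: list = field(default_factory=list)  # third tree level


@dataclass
class ProblemTree:
    root: str
    research_questions: list = field(default_factory=list)


def split_task_ids(task_id, tree):
    """Map each research question to a sub-task id of the form <task_id>_rqN."""
    # Assumption: N is 1-based and follows the order of the parsed tree.
    return {
        f"{task_id}_rq{n}": rq.text
        for n, rq in enumerate(tree.research_questions, start=1)
    }


tree = ProblemTree(
    root="Example paper",
    research_questions=[
        ResearchQuestion("Does X improve Y?", ["Experiment A"]),
        ResearchQuestion("Does the effect hold at scale?", ["Experiment B"]),
    ],
)
for sub_id, question in split_task_ids("my_task", tree).items():
    print(f"tasks/{sub_id}/ <- {question}")
```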

Benchmark-variant generator

For evaluating transfer rather than reproduction, specextract + generate turn a paper into a family of new research problems:

Perturbations — new dataset, tighter budget, different metric, shifted domain
Ablation-derived — does the key component still matter under a shift? Which fails first?
Future-work-derived — bounded instantiations of the paper's stated limitations

Each variant includes an automated leakage check (direct-answer / method / statistical leak) so instructions that accidentally reveal the paper's answer get flagged.

# Extract the 7-component paper specification
paper2bench specextract --pdf paper.pdf -o spec.json

# Generate a family of variants (auto-classifies paper type; override with --paper-type)
paper2bench generate --spec spec.json --pdf paper.pdf -o ./variants/

Each variant directory contains:

instruction.txt — a detailed research instruction, same format as the core pipeline's instruction.txt (research question + models/method + datasets with loaders + budget + constraints), rendered through the paper-type-aware template.
task_config.yaml — the resource spec that was rendered into the instruction.
metadata.json — transformation type, difficulty, rationale, curator-only expected outcome hint, and leakage-check verdict.

Because the generator reuses the paper-type branching, variants of a materials-science paper tell the agent "implement the method yourself" while variants of an LLM-evaluation paper list concrete API model IDs and HuggingFace loaders.

Custom templates

Pass --template to render (or run) to use your own template:

paper2bench render --config task.yaml --template my_template.txt -o instruction.txt

Template placeholders: {research_question}, {models_section}, {datasets_section}, {budget_per_model}, {constraints_section}.
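
A custom template is ordinary text containing those five placeholders. The sketch below fills a toy template with Python's str.format; whether Paper2Bench uses str.format internally is an assumption based on the brace syntax, and the template wording and values are invented for illustration.

```python
TEMPLATE = """You are a research agent. Investigate: "{research_question}"

Models:
{models_section}

Datasets:
{datasets_section}

Budget: {budget_per_model}

Constraints:
{constraints_section}
"""

values = {
    "research_question": "Does X improve Y?",
    "models_section": "- gpt-4o via API",
    "datasets_section": "- Example dataset  [generate programmatically]",
    "budget_per_model": "1000 API calls per model",
    "constraints_section": "- Do NOT use web search",
}

# Assumption: placeholders are filled with str.format-style substitution.
instruction = TEMPLATE.format(**values)
print(instruction)
```

Under this assumption, a literal brace in a template would need to be doubled ({{ and }}) to survive substitution.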

Worked examples

Three papers run end-to-end through the core pipeline, showing how the classifier routes each one to a different schema. Each example shows the full rendered instruction.txt (what the agent actually sees) and the extracted task_config.yaml that produced it.

Lost in the Middle (NLP / LLM)

Classified llm_evaluation (high confidence). The LLM-evaluation template lists concrete API + HuggingFace models, dataset loaders, and budget — the agent is expected to evaluate these models, not implement a new one.

Rendered instruction.txt (what the agent sees)

You are a research agent. Conduct research and experiment about the question: "How does model performance vary based on relevant information position in context?"

You have access to the following resources:

Models:
- gpt-3.5-turbo-0613 and gpt-3.5-turbo-16k-0613 and claude-1.3 and claude-1.3-100k via API
- Load with HuggingFace: mpt-30b-instruct
- Load with HuggingFace: longchat-13b-16k
- Load with HuggingFace: flan-t5-xxl
- Load with HuggingFace: flan-ul2
- Load with HuggingFace: Llama-2-7b-chat-hf
- Load with HuggingFace: Llama-2-13b-chat-hf
- Load with HuggingFace: Llama-2-70b-chat-hf
- Load with HuggingFace: Llama-2-7b-hf
- Load with HuggingFace: Llama-2-13b-hf
- Load with HuggingFace: Llama-2-70b-hf
- Computational budget: 1000 API calls per model

Datasets:
- NaturalQuestions-Open: A dataset containing historical queries issued to the Google search engine, coupled with human-annotated answers extracted from Wikipedia.  [loader unverified -- locate the data yourself]
- Synthetic JSON-formatted key-value pairs: A synthetic dataset of JSON-formatted key-value pairs with unique, randomly-generated UUIDs as keys and values.  [generate programmatically]
  Generation code:
    import uuid
    def generate_synthetic_kv_pairs(num_pairs):
        return {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(num_pairs)}
- Referenced but not loaded: Lee et al. 2019
- Referenced but not loaded: Kwiatkowski et al. 2019

Experimental constraints:
- Do NOT use web search
- Run FULL end-to-end experiments

Please design and execute experiments to investigate this research question. Document your experimental plan, run end-to-end experiments, and provide conclusions at different levels of detail.

Extracted task_config.yaml

paper_type: llm_evaluation
models:
  api:         [gpt-3.5-turbo-0613, gpt-3.5-turbo-16k-0613, claude-1.3, claude-1.3-100k]
  huggingface: [mpt-30b-instruct, longchat-13b-16k, flan-t5-xxl, flan-ul2,
                Llama-2-7b-chat-hf, Llama-2-13b-chat-hf, Llama-2-70b-chat-hf,
                Llama-2-7b-hf, Llama-2-13b-hf, Llama-2-70b-hf]
datasets:
  - name: NaturalQuestions-Open
    source: unknown                          # demoted — no canonical HF path
  - name: Synthetic JSON key-value pairs
    source: synthetic
    loader: |
      import uuid
      def generate_synthetic_kv_pairs(n):
          return {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(n)}
references: [Lee et al. 2019, Kwiatkowski et al. 2019]

CGCNN (materials science)

Classified novel_architecture (high confidence). The template tells the agent to implement the method from scratch — not call a library that bundles the paper's code — and hands it the method's key components, baselines, and a reference repo hint.

Rendered instruction.txt

You are a research agent. Conduct research about the question: "Can graph convolutional neural networks predict material properties directly from crystal structure?"

Your central task is to **implement a proposed method from scratch** and evaluate whether it answers the research question. Do NOT import a library that bundles the paper's exact code; the benchmark value comes from your own implementation.

Proposed method to implement:
- Crystal Graph Convolutional Neural Networks (CGCNN): CGCNN is a framework that represents crystal structures as graphs, where nodes represent atoms and edges represent bonds. A convolutional neural network is applied to these graphs to predict material properties directly from the crystal structure, achieving accuracy comparable to DFT calculations while providing interpretability by extracting contributions from local chemical environments.
  - Component: Crystal graph representation with nodes as atoms and edges as bonds
  - Component: Graph convolutional layers to update atom feature vectors
  - Component: Pooling layers to aggregate features into a crystal-level representation
  - Component: Fully-connected layers for property prediction
- Baseline for comparison: DFT calculations
- Reference hint: https://github.com/txie-93/cgcnn

Datasets for training and evaluation:
- Materials Project: A database of inorganic crystal structures and their properties, used for training and evaluating the CGCNN model.  [loader unverified -- locate the data yourself]
- Perovskite database: A dataset containing energy above hull data for perovskite crystals, used to demonstrate the interpretability of CGCNN.  [loader unverified -- locate the data yourself]
- Referenced but not loaded: Jain et al. 2013
- Referenced but not loaded: Kirklin et al. 2015
- Referenced but not loaded: De Jong et al. 2015

Experimental constraints:
- Do NOT use web search
- Run FULL end-to-end experiments
- Implement the proposed method yourself -- do not import it from a library that bundles the paper's exact code.
- Computational budget: 1000 training runs / evaluations total

Deliverables:
1. A working implementation of the proposed method in Python.
2. Training and evaluation runs on the specified datasets.
3. Quantitative results (metrics of your choice, justified against the research question).
4. A written conclusion tying the results back to the research question.

Design your own experimental protocol, metrics, and baselines. Run end-to-end experiments and report results honestly.

Extracted task_config.yaml

paper_type: novel_architecture
proposed_method:
  name: Crystal Graph Convolutional Neural Networks (CGCNN)
  summary: Represents crystal structures as graphs (atoms=nodes, bonds=edges);
    a graph CNN predicts material properties directly from structure.
  key_components:
    - Crystal graph representation
    - Graph convolutional layers
    - Pooling layer
    - Fully-connected prediction head
baselines:                 [DFT calculations]
reference_implementation:  [https://github.com/txie-93/cgcnn]
datasets:
  - {name: Materials Project,   source: unknown}
  - {name: Perovskite database, source: unknown}
references: [Jain et al. 2013, Kirklin et al. 2015, De Jong et al. 2015]

SchNet (quantum chemistry)

Classified novel_architecture (high confidence). The HF-loader validator rejected all three claimed dataset paths (qm9, md17, iso17) as non-existent on the Hub — they were demoted to source: unknown so the agent knows to locate the data itself rather than waste time on hallucinated loaders.

Rendered instruction.txt

You are a research agent. Conduct research about the question: "Can continuous-filter convolutional neural networks accurately model quantum-mechanical interactions in molecules?"

Your central task is to **implement a proposed method from scratch** and evaluate whether it answers the research question. Do NOT import a library that bundles the paper's exact code; the benchmark value comes from your own implementation.

Proposed method to implement:
- SchNet: SchNet is a deep learning architecture that uses continuous-filter convolutional layers to model quantum interactions in molecules. It respects essential quantum-chemical constraints, providing rotationally invariant energy predictions and rotationally equivariant force predictions. SchNet is designed to handle molecules with arbitrary atomic positions, ensuring a smooth potential energy surface and energy-conserving force fields.
  - Component: Continuous-filter convolutional layers
  - Component: Rotationally invariant energy prediction
  - Component: Rotationally equivariant force prediction
  - Component: Atom-wise layers and interaction blocks
- Baseline for comparison: Gradient-domain machine learning (GDML)
- Baseline for comparison: Deep tensor neural networks (DTNN)
- Baseline for comparison: enn-s2s

Datasets for training and evaluation:
- QM9: A benchmark dataset for predicting various molecular properties in equilibrium, consisting of approximately 130k organic molecules with up to 9 heavy atoms.  [loader unverified -- locate the data yourself]
- MD17: A collection of molecular dynamics simulations for small organic molecules, used for predicting energy-conserving force fields.  [loader unverified -- locate the data yourself]
- ISO17: A dataset consisting of short molecular dynamics trajectories of 129 isomers, used to evaluate the model's ability to represent complex potential energy surfaces with chemical and conformational changes.  [loader unverified -- locate the data yourself]
- Referenced but not loaded: Ramakrishnan et al. 2014

Experimental constraints:
- Do NOT use web search
- Run FULL end-to-end experiments
- Implement the proposed method yourself -- do not import it from a library that bundles the paper's exact code.
- Computational budget: 1000 training runs / evaluations total

Deliverables:
1. A working implementation of the proposed method in Python.
2. Training and evaluation runs on the specified datasets.
3. Quantitative results (metrics of your choice, justified against the research question).
4. A written conclusion tying the results back to the research question.

Design your own experimental protocol, metrics, and baselines. Run end-to-end experiments and report results honestly.

Extracted task_config.yaml

paper_type: novel_architecture
proposed_method:
  name: SchNet
  summary: Continuous-filter convolutional network modeling quantum interactions,
    with rotationally invariant energies and equivariant forces.
  key_components:
    - Continuous-filter convolutional layers
    - Rotationally invariant energy prediction
    - Rotationally equivariant force prediction
    - Atom-wise interaction blocks
baselines: [GDML, DTNN, enn-s2s]
datasets:
  - {name: QM9,   source: unknown}           # HF validator rejected bogus loader
  - {name: MD17,  source: unknown}
  - {name: ISO17, source: unknown}
references: [Ramakrishnan et al. 2014]

Before paper-type branching, CGCNN produced empty model lists and SchNet hallucinated meta-llama/Llama-3.1-8B-Instruct as a "model for the task" — neither paper has anything to do with LLMs.

Example variant (from the benchmark generator)

Running specextract + generate on CGCNN produced six variants across perturbation / ablation / future-work. Here is one — a perturbation that changes the empirical setting while preserving the paper's core hypothesis. Note the detailed instruction.txt uses the same novel_architecture template as the core pipeline, so the agent is still told to "implement the method yourself":

Variant: cgcnn_benchmark_elastic_properties — perturbation, medium difficulty, leakage-clean

instruction.txt

You are a research agent. Conduct research about the question: "Can CGCNN improve prediction accuracy for elastic properties with an increased amount of training data?"

Your central task is to **implement a proposed method from scratch** and evaluate whether it answers the research question. Do NOT import a library that bundles the paper's exact code; the benchmark value comes from your own implementation.

Proposed method to implement:
- Crystal Graph Convolutional Neural Network (CGCNN): The CGCNN framework represents crystal structures as graphs where nodes are atoms and edges are bonds. A convolutional neural network is applied to these graphs to learn features that predict material properties. The model is trained using DFT-calculated data and can extract contributions from local chemical environments to global properties.
  - Component: Crystal Graph
  - Component: Convolutional Layers
  - Component: Pooling Layer
  - Component: Fully-Connected Layers
- Baseline for comparison: DFT calculations compared to experimental data
- Reference hint: https://github.com/txie-93/cgcnn

Datasets for training and evaluation:
- Extended Materials Project Database: A larger set of inorganic crystals with DFT-calculated elastic properties, including bulk and shear moduli.  [loader unverified -- locate the data yourself]
- Referenced but not loaded: Xie-Grossman-2018

Experimental constraints:
- Do NOT use web search
- Run FULL end-to-end experiments
- Implement the proposed method yourself -- do not import a library that bundles the paper's exact code.
- Computational budget: 1000 training runs / evaluations total

Deliverables:
1. A working implementation of the proposed method in Python.
2. Training and evaluation runs on the specified datasets.
3. Quantitative results (metrics of your choice, justified against the research question).
4. A written conclusion tying the results back to the research question.

Design your own experimental protocol, metrics, and baselines. Run end-to-end experiments and report results honestly.

metadata.json (summary)

{
  "id": "cgcnn_benchmark_elastic_properties",
  "title": "Benchmarking CGCNN on Elastic Properties with Extended Dataset",
  "type": "perturbation",
  "difficulty": "medium",
  "scientific_question": "Can CGCNN improve prediction accuracy for elastic properties with an increased amount of training data?",
  "rationale": "The original paper noted higher errors for elastic properties due to limited data. Increasing the dataset size should test if the model's accuracy improves as hypothesized.",
  "expected_outcome_hint": "The CGCNN should show improved accuracy for elastic properties with the increased dataset size.",
  "leakage_check": { "leaked": false, "leakage_type": "none", "evidence": "" }
}
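
Because each variant carries its leakage verdict in metadata.json, a curator can filter a generated family before handing it to agents. The helper below is a minimal sketch: the function name is hypothetical, it relies only on the id and leakage_check.leaked fields shown above, and it expects to be pointed at the directory that directly contains the per-variant folders.

```python
import json
from pathlib import Path


def leakage_clean_variants(variants_dir):
    """Return ids of variants whose metadata.json reports no leakage."""
    clean = []
    for meta_path in sorted(Path(variants_dir).glob("*/metadata.json")):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        # Treat a missing leakage_check as suspect rather than clean.
        if meta.get("leakage_check", {}).get("leaked", True) is False:
            clean.append(meta.get("id", meta_path.parent.name))
    return clean


if __name__ == "__main__":
    print(leakage_clean_variants("./variants"))
```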

Output files

Core pipeline (tasks/<task_id>/):

File	Purpose
`<task_id>.pdf`	Downloaded paper
`<task_id>_tree.json`	Structured problem tree (Root → Research Questions → Experiments)
`task_config.yaml`	Extracted resources — paper-type-aware schema
`instruction.txt`	What the AI agent sees — research question + resources
`instruction_gt.txt`	Ground truth — detailed experimental procedures
`verify_results.json`	Precision / recall / F1 / pass verdict (if `verify` ran)

Benchmark-variant generator (variants/<paper>/):

File	Purpose
`<paper>.spec.json`	7-component paper specification
`instances.json`	Full manifest of all generated variants
`<instance>/instruction.txt`	Detailed research instruction for the variant
`<instance>/task_config.yaml`	Resource spec rendered into the variant's instruction
`<instance>/metadata.json`	Variant metadata: type, difficulty, rationale, leakage check

How it works

Core pipeline

Download — arXiv search by title/author with exponential backoff on HTTP 429.
Parse — LLM decomposes the paper into a 3-level problem tree (root → research questions → experiments).
Classify — a small LLM call over the paper's head tags it as llm_evaluation, novel_architecture, or empirical_study. Drives everything downstream.
Extract — one of three type-specific prompts pulls a structured resource spec (models / method / datasets / budget / constraints) into YAML. HuggingFace loaders are checked against the Hub; hallucinated IDs are demoted to source: unknown. Author-year citations go into a separate references field.
Render — a type-aware template converts the YAML into a standardized instruction.txt. Deliberately withholds the paper's methodology so the agent must design experiments.
Plan — LLM reads the paper again and writes detailed experimental procedures as a supplementary plan, combined with the base instruction to produce instruction_gt.txt (for evaluation, not for the agent).
Verify (optional) — runs a coding agent against instruction_gt.txt, then scores the agent's conclusion with claim-level precision / recall / F1 (adapted from FIRE-Bench's RAGChecker evaluator).
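
The Download step's retry behaviour, exponential backoff on HTTP 429, follows a common pattern that can be sketched generically. This is not Paper2Bench's code: fetch is a hypothetical callable returning (status_code, body), and the retry count and base delay are illustrative defaults.

```python
import time


def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it stops returning HTTP 429, doubling the wait each time.

    `fetch` is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("still rate-limited after retries")


# Simulated server: two 429 responses, then success.
responses = iter([(429, ""), (429, ""), (200, "<feed/>")])
delays = []
print(fetch_with_backoff(lambda: next(responses), sleep=delays.append))  # (200, '<feed/>')
print(delays)  # [1.0, 2.0]
```

Injecting sleep keeps the example instant to run; production code would usually also add jitter and honour a Retry-After header when the server sends one.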

Benchmark-variant generator

specextract — distills the paper into a 7-component specification: scientific question, method (summary + components), claims, evaluation protocol, assumptions, ablation structure, future work.
generate — given the spec, emits a JSON family of 6–12 instances across three transformation strategies (perturbation / ablation / future-work), each with a paper-type-matching task_config. Each config is rendered into a detailed instruction.txt via the same render_instruction function used by the core pipeline, then audited by an LLM judge for three leakage types.

Release history

0.3.2 (this version): May 22, 2026
0.3.1: May 22, 2026

Download files

Source Distribution

paper2bench-0.3.2.tar.gz (69.5 kB), uploaded May 22, 2026, Source

Built Distribution

paper2bench-0.3.2-py3-none-any.whl (70.0 kB), uploaded May 22, 2026, Python 3

File details

Details for the file paper2bench-0.3.2.tar.gz.

File metadata

Download URL: paper2bench-0.3.2.tar.gz
Upload date: May 22, 2026
Size: 69.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for paper2bench-0.3.2.tar.gz
Algorithm	Hash digest
SHA256	`c2486de99970706a986ebb342aed11b501c5a834c1b680f6b9929a5825eb09fd`
MD5	`19fd2591544f21e0ce5912451cc4125f`
BLAKE2b-256	`1344fbafd152bf3df3a9acc00498d4f329d2bab8b2da4ce813509e74b974d50a`

File details

Details for the file paper2bench-0.3.2-py3-none-any.whl.

File metadata

Download URL: paper2bench-0.3.2-py3-none-any.whl
Upload date: May 22, 2026
Size: 70.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for paper2bench-0.3.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`38c581500276d476cdef2606a1b82fe81ad1913ca74652268596031f5e7f0f6b`
MD5	`56b92f0da3b2d551ed092173820f3f90`
BLAKE2b-256	`598ef293268f4a018b9ad71c7a395adbf5eadcec8035249e53ac9936183db90f`
