LAB-Bench environments implemented with aviary

aviary.labbench

LAB-Bench environments implemented with aviary, allowing agents to perform question answering on scientific tasks.

Installation

To install the LAB-Bench environment, run:

pip install 'fhaviary[labbench]'
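
To confirm that the extra was installed, one option is to check whether the module can be located. The following is a minimal sketch using only the standard library. The module path `aviary.envs.labbench` is taken from the import in the image QA example on this page, and the helper name `is_importable` is illustrative, not part of aviary:

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located.

    The module itself is not imported, although its parent packages are.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. `aviary`) is missing or broken
        return False


if __name__ == "__main__":
    print(is_importable("aviary.envs.labbench"))
```

If this prints `False`, the `labbench` extra is most likely not installed in the active environment.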

Usage

In labbench/env.py, you will find:

GradablePaperQAEnvironment: a PaperQA-backed environment that can grade answers given an evaluation function.
ImageQAEnvironment: a GradablePaperQAEnvironment subclass for QA where image(s) are pre-added.

And in labbench/task.py, you will find:

TextQATaskDataset: a task dataset designed to pull down FigQA, LitQA2, or TableQA from Hugging Face, and create one GradablePaperQAEnvironment per question.
ImageQATaskDataset: a task dataset that pairs with ImageQAEnvironment for FigQA or TableQA.

Here is an example of how to use them:

import os

from ldp.agent import SimpleAgent
from ldp.alg import Evaluator, EvaluatorConfig, MeanMetricsCallback
from paperqa import Settings

from aviary.env import TaskDataset


async def evaluate(folder_of_litqa_v2_papers: str | os.PathLike) -> None:
    settings = Settings(paper_directory=folder_of_litqa_v2_papers)
    dataset = TaskDataset.from_name("litqa2", settings=settings)
    metrics_callback = MeanMetricsCallback(eval_dataset=dataset)

    evaluator = Evaluator(
        config=EvaluatorConfig(batch_size=3),
        agent=SimpleAgent(),
        dataset=dataset,
        callbacks=[metrics_callback],
    )
    await evaluator.evaluate()
    print(metrics_callback.eval_means)

Image Question-Answer

This is an environment/dataset for giving PaperQA a Docs object with the image(s) for one LAB-Bench question. It's designed to be a comparison with zero-shotting the question to an LLM: instead of sending a single prompt, the image is put through the PaperQA agent loop.

from typing import cast

import litellm
import pytest
from ldp.agent import Agent
from ldp.alg import (
    Evaluator,
    EvaluatorConfig,
    MeanMetricsCallback,
    StoreTrajectoriesCallback,
)
from paperqa.settings import AgentSettings, IndexSettings

from aviary.envs.labbench import (
    ImageQAEnvironment,
    ImageQATaskDataset,
    LABBenchDatasets,
)


@pytest.mark.asyncio
async def test_image_qa(tmp_path) -> None:
    litellm.num_retries = 8  # Mitigate connection-related failures
    settings = ImageQAEnvironment.make_base_settings()
    settings.agent = AgentSettings(
        agent_type="ldp.agent.SimpleAgent",
        index=IndexSettings(paper_directory=tmp_path),
        # TODO: add image support for paper_search
        tool_names={"gather_evidence", "gen_answer", "complete", "reset"},
        agent_evidence_n=3,  # Bumped up to collect several perspectives
    )
    dataset = ImageQATaskDataset(dataset=LABBenchDatasets.TABLE_QA, settings=settings)
    t_cb = StoreTrajectoriesCallback()
    m_cb = MeanMetricsCallback(eval_dataset=dataset, track_tool_usage=True)
    evaluator = Evaluator(
        config=EvaluatorConfig(
            batch_size=256,  # Use batch size greater than FigQA size and TableQA size
            max_rollout_steps=18,  # Match aviary paper's PaperQA setting
        ),
        agent=cast(Agent, await settings.make_ldp_agent(settings.agent.agent_type)),
        dataset=dataset,
        callbacks=[t_cb, m_cb],
    )
    await evaluator.evaluate()
    print(m_cb.eval_means)
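
The `batch_size=256` comment relies on the evaluator consuming the dataset in chunks of at most `batch_size` items, so a batch size larger than the dataset yields a single batch. Assuming that chunking behavior, the arithmetic can be sketched as follows. The helper `num_batches` is illustrative, and the dataset sizes are made-up placeholders, not the real FigQA or TableQA sizes:

```python
import math


def num_batches(dataset_size: int, batch_size: int) -> int:
    """Number of batches needed to cover `dataset_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(dataset_size / batch_size)


# Hypothetical sizes, chosen only to illustrate the arithmetic
print(num_batches(200, 256))  # 1: the whole dataset fits in one batch
print(num_batches(200, 3))  # 67
```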

Release history

0.35.0: Apr 16, 2026
0.34.0: Mar 18, 2026
0.33.0: Feb 18, 2026
0.32.0: Jan 15, 2026
0.31.0: Jan 6, 2026 (this version)
0.30.0: Dec 18, 2025

Download files

Source Distribution

aviary_labbench-0.31.0.tar.gz (1.5 MB), uploaded Jan 6, 2026, Source

Built Distribution

aviary_labbench-0.31.0-py3-none-any.whl (13.7 kB), uploaded Jan 6, 2026, Python 3

File details

Details for the file aviary_labbench-0.31.0.tar.gz.

File metadata

Download URL: aviary_labbench-0.31.0.tar.gz
Upload date: Jan 6, 2026
Size: 1.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for aviary_labbench-0.31.0.tar.gz
Algorithm	Hash digest
SHA256	`50d5ad56c62ecfcd57c0d81b1fa497d3e0ca58e34db8d6587fa93b7d2c5335cf`
MD5	`a015e879933b3fe3959bd6122fdf488e`
BLAKE2b-256	`a40e7258bd25c69f101bf969f40e046a807d6bd6a1a7c92780dddc57610efe71`

File details

Details for the file aviary_labbench-0.31.0-py3-none-any.whl.

File metadata

Download URL: aviary_labbench-0.31.0-py3-none-any.whl
Upload date: Jan 6, 2026
Size: 13.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for aviary_labbench-0.31.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0b434ebf57077ab77b25fba61ae1852f366f9e125457d1830ff003bd9109afe9`
MD5	`11729d56aa8224864013660d574de520`
BLAKE2b-256	`9a6aa947a14581216ca39f72963ea15eae390300e2173d2dbebd8d54a120dbb1`
