Skip to main content

LAB-Bench environments implemented with aviary

Project description

aviary.labbench

LAB-Bench environments implemented with aviary, allowing agents to perform question answering on scientific tasks.

Installation

To install the LAB-Bench environment, run:

pip install 'fhaviary[labbench]'

Usage

In labbench/env.py, you will find:

  • GradablePaperQAEnvironment: an PaperQA-backed environment that can grade answers given an evaluation function.
  • ImageQAEnvironment: an GradablePaperQAEnvironment subclass for QA where image(s) are pre-added.

And in labbench/task.py, you will find:

  • TextQATaskDataset: a task dataset designed to pull down FigQA, LitQA2, or TableQA from Hugging Face, and create one GradablePaperQAEnvironment per question.
  • ImageQATaskDataset: a task dataset that pairs with ImageQAEnvironment for FigQA or TableQA.

Here is an example of how to use them:

import os

from ldp.agent import SimpleAgent
from ldp.alg import Evaluator, EvaluatorConfig, MeanMetricsCallback
from paperqa import Settings

from aviary.env import TaskDataset


async def evaluate(folder_of_litqa_v2_papers: str | os.PathLike) -> None:
    settings = Settings(paper_directory=folder_of_litqa_v2_papers)
    dataset = TaskDataset.from_name("litqa2", settings=settings)
    metrics_callback = MeanMetricsCallback(eval_dataset=dataset)

    evaluator = Evaluator(
        config=EvaluatorConfig(batch_size=3),
        agent=SimpleAgent(),
        dataset=dataset,
        callbacks=[metrics_callback],
    )
    await evaluator.evaluate()
    print(metrics_callback.eval_means)

Image Question-Answer

This is an environment/dataset for giving PaperQA a Docs object with the image(s) for one LAB-Bench question. It's designed to be a comparison with zero-shotting the question to a LLM, but instead of a singular prompt the image is put through the PaperQA agent loop.

from typing import cast

import litellm
import pytest
from ldp.agent import Agent
from ldp.alg import (
    Evaluator,
    EvaluatorConfig,
    MeanMetricsCallback,
    StoreTrajectoriesCallback,
)
from paperqa.settings import AgentSettings, IndexSettings

from aviary.envs.labbench import (
    ImageQAEnvironment,
    ImageQATaskDataset,
    LABBenchDatasets,
)


@pytest.mark.asyncio
async def test_image_qa(tmp_path) -> None:
    litellm.num_retries = 8  # Mitigate connection-related failures
    settings = ImageQAEnvironment.make_base_settings()
    settings.agent = AgentSettings(
        agent_type="ldp.agent.SimpleAgent",
        index=IndexSettings(paper_directory=tmp_path),
        # TODO: add image support for paper_search
        tool_names={"gather_evidence", "gen_answer", "complete", "reset"},
        agent_evidence_n=3,  # Bumped up to collect several perspectives
    )
    dataset = ImageQATaskDataset(dataset=LABBenchDatasets.TABLE_QA, settings=settings)
    t_cb = StoreTrajectoriesCallback()
    m_cb = MeanMetricsCallback(eval_dataset=dataset, track_tool_usage=True)
    evaluator = Evaluator(
        config=EvaluatorConfig(
            batch_size=256,  # Use batch size greater than FigQA size and TableQA size
            max_rollout_steps=18,  # Match aviary paper's PaperQA setting
        ),
        agent=cast(Agent, await settings.make_ldp_agent(settings.agent.agent_type)),
        dataset=dataset,
        callbacks=[t_cb, m_cb],
    )
    await evaluator.evaluate()
    print(m_cb.eval_means)

References

[1] Skarlinski et al. Language agents achieve superhuman synthesis of scientific knowledge. ArXiv:2409.13740, 2024.

[2] Laurent et al. LAB-Bench: Measuring Capabilities of Language Models for Biology Research. ArXiv:2407.10362, 2024.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aviary_labbench-0.31.0.tar.gz (1.5 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

aviary_labbench-0.31.0-py3-none-any.whl (13.7 kB view details)

Uploaded Python 3

File details

Details for the file aviary_labbench-0.31.0.tar.gz.

File metadata

  • Download URL: aviary_labbench-0.31.0.tar.gz
  • Upload date:
  • Size: 1.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for aviary_labbench-0.31.0.tar.gz
Algorithm Hash digest
SHA256 50d5ad56c62ecfcd57c0d81b1fa497d3e0ca58e34db8d6587fa93b7d2c5335cf
MD5 a015e879933b3fe3959bd6122fdf488e
BLAKE2b-256 a40e7258bd25c69f101bf969f40e046a807d6bd6a1a7c92780dddc57610efe71

See more details on using hashes here.

File details

Details for the file aviary_labbench-0.31.0-py3-none-any.whl.

File metadata

File hashes

Hashes for aviary_labbench-0.31.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0b434ebf57077ab77b25fba61ae1852f366f9e125457d1830ff003bd9109afe9
MD5 11729d56aa8224864013660d574de520
BLAKE2b-256 9a6aa947a14581216ca39f72963ea15eae390300e2173d2dbebd8d54a120dbb1

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page