LAB-Bench environments implemented with aviary

aviary.labbench

LAB-Bench environments implemented with aviary, allowing agents to perform question answering on scientific tasks.

Installation

To install the LAB-Bench environment, run:

pip install 'fhaviary[labbench]'
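After installing, a quick sanity check is to confirm that Python can locate the package. The sketch below is not part of the library: `is_installed` is a hypothetical helper, and the module path `aviary.envs.labbench` is assumed from the import used in the Image Question-Answer example further down.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if the module can be located (hypothetical helper)."""
    try:
        # For dotted names, find_spec imports the parent packages first,
        # which raises ModuleNotFoundError if a parent is missing.
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


if __name__ == "__main__":
    # Module path assumed from the imports shown later in this README.
    print("labbench available:", is_installed("aviary.envs.labbench"))
```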

Usage

In labbench/env.py, you will find:

  • GradablePaperQAEnvironment: a PaperQA-backed environment that can grade answers given an evaluation function.
  • ImageQAEnvironment: a GradablePaperQAEnvironment subclass for QA where image(s) are pre-added.

And in labbench/task.py, you will find:

  • TextQATaskDataset: a task dataset designed to pull down FigQA, LitQA2, or TableQA from Hugging Face, and create one GradablePaperQAEnvironment per question.
  • ImageQATaskDataset: a task dataset that pairs with ImageQAEnvironment for FigQA or TableQA.
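To make the pattern concrete, here is a minimal, library-free sketch of "one gradable environment per question, graded by a pluggable evaluation function". Every name in it (`ToyQuestion`, `ToyGradableEnv`, `exact_match`, `make_envs`) is hypothetical and only illustrates the idea; it does not reproduce the actual classes or their signatures.

```python
from collections.abc import Callable, Iterable
from dataclasses import dataclass


@dataclass
class ToyQuestion:
    """Hypothetical question record: prompt text plus the ideal answer."""

    question: str
    ideal: str


@dataclass
class ToyGradableEnv:
    """Hypothetical stand-in for an environment that holds one question."""

    item: ToyQuestion
    grade_fn: Callable[[str, str], bool]

    def grade(self, proposed: str) -> bool:
        # Delegate grading to the injected evaluation function.
        return self.grade_fn(proposed, self.item.ideal)


def exact_match(proposed: str, ideal: str) -> bool:
    """A deliberately simple evaluation function (assumption, not the library's)."""
    return proposed.strip().lower() == ideal.strip().lower()


def make_envs(
    questions: Iterable[ToyQuestion],
    grade_fn: Callable[[str, str], bool] = exact_match,
) -> list[ToyGradableEnv]:
    """Create one environment per question, mirroring the dataset's role."""
    return [ToyGradableEnv(item=q, grade_fn=grade_fn) for q in questions]
```

Passing the evaluation function in, rather than hard-coding it, is what lets the same environment be reused with different grading strategies.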

Here is an example that evaluates a SimpleAgent on LitQA2:

import os

from ldp.agent import SimpleAgent
from ldp.alg import Evaluator, EvaluatorConfig, MeanMetricsCallback
from paperqa import Settings

from aviary.env import TaskDataset


async def evaluate(folder_of_litqa_v2_papers: str | os.PathLike) -> None:
    settings = Settings(paper_directory=folder_of_litqa_v2_papers)
    dataset = TaskDataset.from_name("litqa2", settings=settings)
    metrics_callback = MeanMetricsCallback(eval_dataset=dataset)

    evaluator = Evaluator(
        config=EvaluatorConfig(batch_size=3),
        agent=SimpleAgent(),
        dataset=dataset,
        callbacks=[metrics_callback],
    )
    await evaluator.evaluate()
    print(metrics_callback.eval_means)

Image Question-Answer

This environment/dataset gives PaperQA a Docs object containing the image(s) for one LAB-Bench question. It is designed as a comparison with zero-shotting the question to an LLM: instead of a single prompt, the image is put through the PaperQA agent loop.

from typing import cast

import litellm
import pytest
from ldp.agent import Agent
from ldp.alg import (
    Evaluator,
    EvaluatorConfig,
    MeanMetricsCallback,
    StoreTrajectoriesCallback,
)
from paperqa.settings import AgentSettings, IndexSettings

from aviary.envs.labbench import (
    ImageQAEnvironment,
    ImageQATaskDataset,
    LABBenchDatasets,
)


@pytest.mark.asyncio
async def test_image_qa(tmp_path) -> None:
    litellm.num_retries = 8  # Mitigate connection-related failures
    settings = ImageQAEnvironment.make_base_settings()
    settings.agent = AgentSettings(
        agent_type="ldp.agent.SimpleAgent",
        index=IndexSettings(paper_directory=tmp_path),
        # TODO: add image support for paper_search
        tool_names={"gather_evidence", "gen_answer", "complete", "reset"},
        agent_evidence_n=3,  # Bumped up to collect several perspectives
    )
    dataset = ImageQATaskDataset(dataset=LABBenchDatasets.TABLE_QA, settings=settings)
    t_cb = StoreTrajectoriesCallback()
    m_cb = MeanMetricsCallback(eval_dataset=dataset, track_tool_usage=True)
    evaluator = Evaluator(
        config=EvaluatorConfig(
            batch_size=256,  # Use batch size greater than FigQA size and TableQA size
            max_rollout_steps=18,  # Match aviary paper's PaperQA setting
        ),
        agent=cast(Agent, await settings.make_ldp_agent(settings.agent.agent_type)),
        dataset=dataset,
        callbacks=[t_cb, m_cb],
    )
    await evaluator.evaluate()
    print(m_cb.eval_means)
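For intuition about the value printed at the end, the sketch below averages per-run metric dictionaries into one dictionary of means. It assumes, without confirmation from this README, that `eval_means` behaves like a mapping from metric name to its mean over the evaluated runs. The function `mean_metrics` and the metric names in the usage note are invented for illustration and are not the library's implementation.

```python
from collections import defaultdict
from statistics import fmean


def mean_metrics(per_run: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric across runs (hypothetical helper)."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for metrics in per_run:
        for name, value in metrics.items():
            # Group values by metric name so each is averaged independently.
            buckets[name].append(value)
    return {name: fmean(values) for name, values in buckets.items()}
```

For example, `mean_metrics([{"reward": 1.0, "steps": 4}, {"reward": 0.0, "steps": 6}])` returns `{"reward": 0.5, "steps": 5.0}`.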
