aviary.labbench

LAB-Bench environments implemented with aviary, allowing agents to perform question answering on scientific tasks.

Installation

To install the LAB-Bench environment, run:

pip install 'fhaviary[labbench]'
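
To confirm the extra was installed, a quick import check can help. The snippet below is a minimal sketch, assuming the environments live under the `aviary.envs.labbench` module path used in the image example further down; `labbench_available` is a hypothetical helper name, not part of the package.

```python
import importlib


def labbench_available() -> bool:
    """Return True if the aviary.envs.labbench module can be imported."""
    try:
        # Module path taken from the imports used later in this README
        importlib.import_module("aviary.envs.labbench")
    except ImportError:
        # Covers ModuleNotFoundError too, e.g. when the extra is not installed
        return False
    return True


if __name__ == "__main__":
    print("LAB-Bench environment importable:", labbench_available())
```

If this reports `False`, re-running the `pip install` command above in the active virtual environment is the first thing to try.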

Usage

In labbench/env.py, you will find:

  • GradablePaperQAEnvironment: a PaperQA-backed environment that can grade answers given an evaluation function.
  • ImageQAEnvironment: a GradablePaperQAEnvironment subclass for QA where image(s) are pre-added.

And in labbench/task.py, you will find:

  • TextQATaskDataset: a task dataset designed to pull down FigQA, LitQA2, or TableQA from Hugging Face, and create one GradablePaperQAEnvironment per question.
  • ImageQATaskDataset: a task dataset that pairs with ImageQAEnvironment for FigQA or TableQA.

Here is an example of how to use them:

import os

from ldp.agent import SimpleAgent
from ldp.alg import Evaluator, EvaluatorConfig, MeanMetricsCallback
from paperqa import Settings

from aviary.env import TaskDataset


async def evaluate(folder_of_litqa_v2_papers: str | os.PathLike) -> None:
    # Point PaperQA at the local folder of LitQA2 papers
    settings = Settings(paper_directory=folder_of_litqa_v2_papers)
    # Build the LitQA2 task dataset by name
    dataset = TaskDataset.from_name("litqa2", settings=settings)
    # Collect mean metrics over the evaluation run
    metrics_callback = MeanMetricsCallback(eval_dataset=dataset)

    evaluator = Evaluator(
        config=EvaluatorConfig(batch_size=3),
        agent=SimpleAgent(),
        dataset=dataset,
        callbacks=[metrics_callback],
    )
    await evaluator.evaluate()
    print(metrics_callback.eval_means)
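
Because `evaluate` is a coroutine, it has to be driven by an event loop when called from a regular script. The following is a minimal sketch of that pattern using `asyncio.run` from the standard library. So that it runs without papers or API keys, it uses a hypothetical stand-in coroutine (`evaluate_stub`) in place of the real `evaluate`; the `main` helper is likewise an illustrative name.

```python
import asyncio
from pathlib import Path


async def evaluate_stub(folder_of_litqa_v2_papers: Path) -> str:
    """Hypothetical stand-in for the `evaluate` coroutine shown above."""
    await asyncio.sleep(0)  # Yield to the event loop once, as real async work would
    return f"evaluated papers in {folder_of_litqa_v2_papers}"


def main(folder: str) -> str:
    # asyncio.run creates an event loop, runs the coroutine to completion,
    # and closes the loop afterwards
    return asyncio.run(evaluate_stub(Path(folder)))
```

In a real script, the call inside `main` would be `asyncio.run(evaluate(folder))`, with `folder` pointing at the directory of LitQA2 papers.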

Image Question-Answer

This is an environment/dataset for giving PaperQA a Docs object with the image(s) for one LAB-Bench question. It's designed as a comparison with zero-shotting the question to an LLM: instead of a single prompt, the image is put through the PaperQA agent loop.

from typing import cast

import litellm
import pytest
from ldp.agent import Agent
from ldp.alg import (
    Evaluator,
    EvaluatorConfig,
    MeanMetricsCallback,
    StoreTrajectoriesCallback,
)
from paperqa.settings import AgentSettings, IndexSettings

from aviary.envs.labbench import (
    ImageQAEnvironment,
    ImageQATaskDataset,
    LABBenchDatasets,
)


@pytest.mark.asyncio
async def test_image_qa(tmp_path) -> None:
    litellm.num_retries = 8  # Mitigate connection-related failures
    settings = ImageQAEnvironment.make_base_settings()
    settings.agent = AgentSettings(
        agent_type="ldp.agent.SimpleAgent",
        index=IndexSettings(paper_directory=tmp_path),
        # TODO: add image support for paper_search
        tool_names={"gather_evidence", "gen_answer", "complete", "reset"},
        agent_evidence_n=3,  # Bumped up to collect several perspectives
    )
    dataset = ImageQATaskDataset(dataset=LABBenchDatasets.TABLE_QA, settings=settings)
    t_cb = StoreTrajectoriesCallback()
    m_cb = MeanMetricsCallback(eval_dataset=dataset, track_tool_usage=True)
    evaluator = Evaluator(
        config=EvaluatorConfig(
            batch_size=256,  # Use batch size greater than FigQA size and TableQA size
            max_rollout_steps=18,  # Match aviary paper's PaperQA setting
        ),
        agent=cast(Agent, await settings.make_ldp_agent(settings.agent.agent_type)),
        dataset=dataset,
        callbacks=[t_cb, m_cb],
    )
    await evaluator.evaluate()
    print(m_cb.eval_means)
