Skip to main content

LAB-Bench environments implemented with aviary

Project description

aviary.labbench

LAB-Bench environments implemented with aviary, allowing agents to perform question answering on scientific tasks.

Installation

To install the LAB-Bench environment, run:

pip install 'fhaviary[labbench]'

Usage

In labbench/env.py, you will find:

  • GradablePaperQAEnvironment: an PaperQA-backed environment that can grade answers given an evaluation function.
  • ImageQAEnvironment: an GradablePaperQAEnvironment subclass for QA where image(s) are pre-added.

And in labbench/task.py, you will find:

  • TextQATaskDataset: a task dataset designed to pull down FigQA, LitQA2, or TableQA from Hugging Face, and create one GradablePaperQAEnvironment per question.
  • ImageQATaskDataset: a task dataset that pairs with ImageQAEnvironment for FigQA or TableQA.

Here is an example of how to use them:

import os

from ldp.agent import SimpleAgent
from ldp.alg import Evaluator, EvaluatorConfig, MeanMetricsCallback
from paperqa import Settings

from aviary.env import TaskDataset


async def evaluate(folder_of_litqa_v2_papers: str | os.PathLike) -> None:
    settings = Settings(paper_directory=folder_of_litqa_v2_papers)
    dataset = TaskDataset.from_name("litqa2", settings=settings)
    metrics_callback = MeanMetricsCallback(eval_dataset=dataset)

    evaluator = Evaluator(
        config=EvaluatorConfig(batch_size=3),
        agent=SimpleAgent(),
        dataset=dataset,
        callbacks=[metrics_callback],
    )
    await evaluator.evaluate()
    print(metrics_callback.eval_means)

Image Question-Answer

This is an environment/dataset for giving PaperQA a Docs object with the image(s) for one LAB-Bench question. It's designed to be a comparison with zero-shotting the question to a LLM, but instead of a singular prompt the image is put through the PaperQA agent loop.

from typing import cast

import litellm
import pytest
from ldp.agent import Agent
from ldp.alg import (
    Evaluator,
    EvaluatorConfig,
    MeanMetricsCallback,
    StoreTrajectoriesCallback,
)
from paperqa.settings import AgentSettings, IndexSettings

from aviary.envs.labbench import (
    ImageQAEnvironment,
    ImageQATaskDataset,
    LABBenchDatasets,
)


@pytest.mark.asyncio
async def test_image_qa(tmp_path) -> None:
    litellm.num_retries = 8  # Mitigate connection-related failures
    settings = ImageQAEnvironment.make_base_settings()
    settings.agent = AgentSettings(
        agent_type="ldp.agent.SimpleAgent",
        index=IndexSettings(paper_directory=tmp_path),
        # TODO: add image support for paper_search
        tool_names={"gather_evidence", "gen_answer", "complete", "reset"},
        agent_evidence_n=3,  # Bumped up to collect several perspectives
    )
    dataset = ImageQATaskDataset(dataset=LABBenchDatasets.TABLE_QA, settings=settings)
    t_cb = StoreTrajectoriesCallback()
    m_cb = MeanMetricsCallback(eval_dataset=dataset, track_tool_usage=True)
    evaluator = Evaluator(
        config=EvaluatorConfig(
            batch_size=256,  # Use batch size greater than FigQA size and TableQA size
            max_rollout_steps=18,  # Match aviary paper's PaperQA setting
        ),
        agent=cast(Agent, await settings.make_ldp_agent(settings.agent.agent_type)),
        dataset=dataset,
        callbacks=[t_cb, m_cb],
    )
    await evaluator.evaluate()
    print(m_cb.eval_means)

References

[1] Skarlinski et al. Language agents achieve superhuman synthesis of scientific knowledge. ArXiv:2409.13740, 2024.

[2] Laurent et al. LAB-Bench: Measuring Capabilities of Language Models for Biology Research. ArXiv:2407.10362, 2024.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aviary_labbench-0.33.0.tar.gz (1.5 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

aviary_labbench-0.33.0-py3-none-any.whl (13.8 kB view details)

Uploaded Python 3

File details

Details for the file aviary_labbench-0.33.0.tar.gz.

File metadata

  • Download URL: aviary_labbench-0.33.0.tar.gz
  • Upload date:
  • Size: 1.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for aviary_labbench-0.33.0.tar.gz
Algorithm Hash digest
SHA256 8ac52d76aed818e1ad36b71fda29678297285c9c80d8c939ac42e4cc7c3c9430
MD5 01a64c902b7541754633da1d1fbe69bc
BLAKE2b-256 b485b3447562e5853477e8110489ad16754375377bb45d57843a8ccd6a803b51

See more details on using hashes here.

File details

Details for the file aviary_labbench-0.33.0-py3-none-any.whl.

File metadata

File hashes

Hashes for aviary_labbench-0.33.0-py3-none-any.whl
Algorithm Hash digest
SHA256 803d6f99b13bb3598e2db65866611ab42dfe8e14701ebe4525ae0951bf69a019
MD5 a0465f178c4556ae4f7074aa89b299a7
BLAKE2b-256 9c78c8b299c7a51d523e3eb31b43565ada1bd1a995e053cec5da6495019f6f27

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page