Agentic framework for automated academic illustration generation

PaperBanana

Automated Academic Illustration for AI Scientists

Disclaimer: This is an unofficial, community-driven open-source implementation of the paper "PaperBanana: Automating Academic Illustration for AI Scientists" by Dawei Zhu, Rui Meng, Yale Song, Xiyu Wei, Sujian Li, Tomas Pfister, and Jinsung Yoon (arXiv:2601.23265). This project is not affiliated with or endorsed by the original authors or Google Research. The implementation is based on the publicly available paper and may differ from the original system.

An agentic framework for generating publication-quality academic diagrams and statistical plots from text descriptions. Uses Google Gemini for both VLM and image generation.

  • Two-phase multi-agent pipeline with iterative refinement
  • Gemini-based VLM planning and image generation
  • CLI, Python API, and MCP server for IDE integration
  • Claude Code skills for /generate-diagram, /generate-plot, and /evaluate-diagram

PaperBanana takes a paper as input and produces a diagram as output.


Quick Start

Prerequisites

  • Python 3.10+
  • A Google Gemini API key (available at no cost from Google AI Studio)

Step 1: Install

pip install paperbanana

Or install from source for development:

git clone https://github.com/llmsresearch/paperbanana.git
cd paperbanana
pip install -e ".[dev,google]"

Step 2: Get Your API Key

Run the interactive setup wizard:

paperbanana setup

This opens your browser so you can obtain a Google Gemini API key from Google AI Studio, then saves the key to .env.

Or set it up manually:

cp .env.example .env
# Edit .env and add: GOOGLE_API_KEY=your-key-here
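
Before running the CLI, it can help to confirm that the key is actually present in .env. The following is a minimal sketch and not part of PaperBanana: read_env_file and has_api_key are hypothetical helpers, and the parser only understands plain KEY=value lines (no quoting rules or export prefixes).

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def has_api_key(path=".env", name="GOOGLE_API_KEY"):
    """Return True if the file exists and defines a non-placeholder value."""
    env_path = Path(path)
    if not env_path.is_file():
        return False
    value = read_env_file(env_path).get(name, "")
    # "your-key-here" is the placeholder used in the manual setup step above.
    return bool(value) and value != "your-key-here"


if __name__ == "__main__":
    print("GOOGLE_API_KEY configured:", has_api_key())
```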

Step 3: Generate a Diagram

# Using the included sample input
paperbanana generate \
  --input examples/sample_inputs/transformer_method.txt \
  --caption "Overview of our encoder-decoder architecture with sparse routing"

Or write your own methodology text:

cat > my_method.txt << 'EOF'
Our framework consists of an encoder that processes input sequences
through multi-head self-attention layers, followed by a decoder that
generates output tokens auto-regressively using cross-attention to
the encoder representations. We add a novel routing mechanism that
selects relevant encoder states for each decoder step.
EOF

paperbanana generate \
  --input my_method.txt \
  --caption "Overview of our encoder-decoder framework"

Output is saved to outputs/run_<timestamp>/final_output.png along with all intermediate iterations and metadata.
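
When several runs accumulate, a small helper can locate the newest result. This is a sketch under stated assumptions: latest_final_output is a hypothetical name, and it picks the run directory by modification time rather than by parsing the timestamp in the directory name, whose exact format is not specified here.

```python
from pathlib import Path


def latest_final_output(outputs_dir="outputs"):
    """Return final_output.png from the most recently modified run_* directory, or None."""
    root = Path(outputs_dir)
    if not root.is_dir():
        return None
    runs = [p for p in root.glob("run_*") if p.is_dir()]
    if not runs:
        return None
    newest = max(runs, key=lambda p: p.stat().st_mtime)
    candidate = newest / "final_output.png"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    print(latest_final_output())
```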


How It Works

PaperBanana implements a two-phase multi-agent pipeline with 5 specialized agents:

Phase 1 -- Linear Planning:

  1. Retriever selects the most relevant reference examples from a curated set of 13 methodology diagrams spanning agent/reasoning, vision/perception, generative/learning, and science/applications domains
  2. Planner generates a detailed textual description of the target diagram via in-context learning from the retrieved examples
  3. Stylist refines the description for visual aesthetics using NeurIPS-style guidelines (color palette, layout, typography)

Phase 2 -- Iterative Refinement (3 rounds):

  4. Visualizer renders the description into an image (Gemini 3 Pro for diagrams, Matplotlib code for plots)
  5. Critic evaluates the generated image against the source context and provides a revised description addressing any issues
  6. Steps 4-5 repeat for up to 3 iterations
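
The control flow of the two phases can be sketched with plain callables standing in for the five agents. This is an illustration only, not the package's actual orchestration code: run_pipeline, RunTrace, and the callable signatures are hypothetical, and stopping early when the critic returns an unchanged description is an assumption about what "up to 3 iterations" means.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunTrace:
    """Records the latest description and every rendered image."""
    description: str = ""
    images: List[str] = field(default_factory=list)


def run_pipeline(
    source_context: str,
    caption: str,
    retrieve: Callable[[str, str], List[str]],
    plan: Callable[[str, str, List[str]], str],
    style: Callable[[str], str],
    visualize: Callable[[str], str],
    critique: Callable[[str, str, str, str], str],
    iterations: int = 3,
) -> RunTrace:
    # Phase 1: linear planning (Retriever -> Planner -> Stylist).
    examples = retrieve(source_context, caption)
    description = style(plan(source_context, caption, examples))

    # Phase 2: iterative refinement (Visualizer <-> Critic).
    trace = RunTrace(description=description)
    for _ in range(iterations):
        image = visualize(trace.description)
        trace.images.append(image)
        revised = critique(image, source_context, caption, trace.description)
        if revised == trace.description:
            break  # assumption: an unchanged description means nothing left to fix
        trace.description = revised
    return trace


if __name__ == "__main__":
    # Stub agents that only manipulate strings, to show the call order.
    trace = run_pipeline(
        "encoder-decoder with routing",
        "Overview",
        retrieve=lambda ctx, cap: ["example-1"],
        plan=lambda ctx, cap, ex: f"diagram of {ctx}",
        style=lambda d: d + " [styled]",
        visualize=lambda d: f"image({d})",
        critique=lambda img, ctx, cap, d: d + " [revised]",
    )
    print(len(trace.images), trace.description)
```

In this sketch each round renders the current description before the critic revises it, so the last stored image always corresponds to the description that preceded the final critique.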

Providers

| Component | Provider | Model |
| --- | --- | --- |
| VLM (planning, critique) | Google Gemini | gemini-2.0-flash |
| Image Generation | Google Gemini | gemini-3-pro-image-preview |

CLI Reference

paperbanana generate -- Methodology Diagrams

paperbanana generate \
  --input method.txt \
  --caption "Overview of our framework" \
  --output diagram.png \
  --iterations 3

| Flag | Short | Description |
| --- | --- | --- |
| --input | -i | Path to methodology text file (required) |
| --caption | -c | Figure caption / communicative intent (required) |
| --output | -o | Output image path (default: auto-generated in outputs/) |
| --iterations | -n | Number of Visualizer-Critic refinement rounds |
| --vlm-provider | | VLM provider name (default: gemini) |
| --vlm-model | | VLM model name (default: gemini-2.0-flash) |
| --image-provider | | Image gen provider (default: google_imagen) |
| --image-model | | Image gen model (default: gemini-3-pro-image-preview) |
| --config | | Path to YAML config file (see configs/config.yaml) |

paperbanana plot -- Statistical Plots

paperbanana plot \
  --data results.csv \
  --intent "Bar chart comparing model accuracy across benchmarks"

| Flag | Short | Description |
| --- | --- | --- |
| --data | -d | Path to data file, CSV or JSON (required) |
| --intent | | Communicative intent for the plot (required) |
| --output | -o | Output image path |
| --iterations | -n | Refinement iterations (default: 3) |
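
To check a data file before passing it to --data, something like the loader below can be used. It is a sketch, not the package's own reader: load_table is a hypothetical helper, and treating JSON input as an array of row objects is an assumption about the expected layout.

```python
import csv
import json
from pathlib import Path


def load_table(path):
    """Load a CSV or JSON file into a list of row dicts, chosen by file suffix."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".csv":
        with p.open(newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh))
    if suffix == ".json":
        data = json.loads(p.read_text(encoding="utf-8"))
        # Assumption: JSON data is an array of objects, one per row.
        if not isinstance(data, list):
            raise ValueError("expected a JSON array of row objects")
        return data
    raise ValueError(f"unsupported data format: {suffix!r}")
```

Note that csv.DictReader returns every value as a string, so numeric columns still need converting before any arithmetic.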

paperbanana evaluate -- Quality Assessment

Comparative evaluation of a generated diagram against a human reference using VLM-as-a-Judge:

paperbanana evaluate \
  --generated diagram.png \
  --reference human_diagram.png \
  --context method.txt \
  --caption "Overview of our framework"

| Flag | Short | Description |
| --- | --- | --- |
| --generated | -g | Path to generated image (required) |
| --reference | -r | Path to human reference image (required) |
| --context | | Path to source context text file (required) |
| --caption | -c | Figure caption (required) |

Scores on 4 dimensions (hierarchical aggregation per the paper):

  • Primary: Faithfulness, Readability
  • Secondary: Conciseness, Aesthetics
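
One plausible reading of hierarchical aggregation is that the primary dimensions decide the overall verdict and the secondary dimensions only break a primary tie. The sketch below encodes that reading as an assumption; the paper's exact rule and this package's implementation may differ. The verdict labels ("generated", "reference", "tie") and the function names are hypothetical.

```python
from collections import Counter

PRIMARY = ("faithfulness", "readability")
SECONDARY = ("conciseness", "aesthetics")


def _majority(verdicts, dims):
    """Return 'generated', 'reference', or 'tie' over the given dimensions."""
    counts = Counter(verdicts[d] for d in dims)
    if counts["generated"] > counts["reference"]:
        return "generated"
    if counts["reference"] > counts["generated"]:
        return "reference"
    return "tie"


def aggregate(verdicts):
    """Primary dimensions decide; secondary dimensions only break a primary tie."""
    primary = _majority(verdicts, PRIMARY)
    if primary != "tie":
        return primary
    return _majority(verdicts, SECONDARY)


if __name__ == "__main__":
    print(aggregate({
        "faithfulness": "generated",
        "readability": "reference",
        "conciseness": "generated",
        "aesthetics": "tie",
    }))
```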

paperbanana setup -- First-Time Configuration

paperbanana setup

Interactive wizard that walks you through obtaining a Google Gemini API key and saving it to .env.


Python API

import asyncio
from paperbanana import PaperBananaPipeline, GenerationInput, DiagramType
from paperbanana.core.config import Settings

settings = Settings(
    vlm_provider="gemini",
    image_provider="google_imagen",
    refinement_iterations=3,
)

pipeline = PaperBananaPipeline(settings=settings)

result = asyncio.run(pipeline.generate(
    GenerationInput(
        source_context="Our framework consists of...",
        communicative_intent="Overview of the proposed method.",
        diagram_type=DiagramType.METHODOLOGY,
    )
))

print(f"Output: {result.image_path}")

See examples/generate_diagram.py and examples/generate_plot.py for complete working examples.


MCP Server

PaperBanana includes an MCP server for use with Claude Code, Cursor, or any MCP-compatible client. Add the following config to use it via uvx without a local clone:

{
  "mcpServers": {
    "paperbanana": {
      "command": "uvx",
      "args": ["--from", "paperbanana[mcp]", "paperbanana-mcp"],
      "env": { "GOOGLE_API_KEY": "your-google-api-key" }
    }
  }
}
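
To avoid pasting the key by hand, the same config can be generated from an environment variable. This is a convenience sketch: build_mcp_config is a hypothetical helper, the structure simply mirrors the JSON above, and where the output should be written depends on the MCP client you use.

```python
import json
import os


def build_mcp_config(api_key):
    """Return the client config shown above as a dict, with the given key filled in."""
    return {
        "mcpServers": {
            "paperbanana": {
                "command": "uvx",
                "args": ["--from", "paperbanana[mcp]", "paperbanana-mcp"],
                "env": {"GOOGLE_API_KEY": api_key},
            }
        }
    }


if __name__ == "__main__":
    # Falls back to the placeholder when the variable is not set.
    key = os.environ.get("GOOGLE_API_KEY", "your-google-api-key")
    print(json.dumps(build_mcp_config(key), indent=2))
```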

Three MCP tools are exposed: generate_diagram, generate_plot, and evaluate_diagram.

The repo also ships with 3 Claude Code skills:

  • /generate-diagram <file> [caption] - generate a methodology diagram from a text file
  • /generate-plot <data-file> [intent] - generate a statistical plot from CSV/JSON data
  • /evaluate-diagram <generated> <reference> - evaluate a diagram against a human reference

See mcp_server/README.md for full setup details (Claude Code, Cursor, local development).


Configuration

Default settings are in configs/config.yaml. Override them via CLI flags or a custom YAML file:

paperbanana generate \
  --input method.txt \
  --caption "Overview" \
  --config my_config.yaml

Key settings:

vlm:
  provider: gemini
  model: gemini-2.0-flash

image:
  provider: google_imagen
  model: gemini-3-pro-image-preview

pipeline:
  num_retrieval_examples: 10
  refinement_iterations: 3
  output_resolution: "2k"

reference:
  path: data/reference_sets

output:
  dir: outputs
  save_iterations: true
  save_metadata: true
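
Layered overrides of nested settings like these are commonly resolved with a recursive merge. The sketch below shows the idea on plain dicts; it is not the package's config loader. The names deep_merge and resolve are hypothetical, and the precedence order (defaults, then custom YAML, then CLI flags) is an assumption.

```python
import copy

# A subset of the defaults listed above, as a plain dict.
DEFAULTS = {
    "vlm": {"provider": "gemini", "model": "gemini-2.0-flash"},
    "pipeline": {"num_retrieval_examples": 10, "refinement_iterations": 3},
}


def deep_merge(base, override):
    """Return a new dict in which nested keys from override replace those in base."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve(defaults, yaml_overrides=None, cli_overrides=None):
    """Apply layers in order: defaults, custom YAML, CLI flags (assumed precedence)."""
    settings = deep_merge(defaults, yaml_overrides or {})
    return deep_merge(settings, cli_overrides or {})


if __name__ == "__main__":
    print(resolve(DEFAULTS, {"pipeline": {"refinement_iterations": 5}}))
```

Merging per key means a custom file only needs to list the settings it changes; sibling keys keep their default values.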

Project Structure

paperbanana/
├── paperbanana/
│   ├── core/          # Pipeline orchestration, types, config, utilities
│   ├── agents/        # Retriever, Planner, Stylist, Visualizer, Critic
│   ├── providers/     # VLM and image gen provider implementations
│   │   ├── vlm/       # Gemini VLM provider
│   │   └── image_gen/ # Gemini 3 Pro Image provider
│   ├── reference/     # Reference set management (13 curated examples)
│   ├── guidelines/    # Style guidelines loader
│   └── evaluation/    # VLM-as-Judge evaluation system
├── configs/           # YAML configuration files
├── prompts/           # Prompt templates for all 5 agents + evaluation
│   ├── diagram/       # retriever, planner, stylist, visualizer, critic
│   ├── plot/          # plot-specific prompt variants
│   └── evaluation/    # faithfulness, conciseness, readability, aesthetics
├── data/
│   ├── reference_sets/  # 13 verified methodology diagrams
│   └── guidelines/      # NeurIPS-style aesthetic guidelines
├── examples/          # Working example scripts + sample inputs
├── scripts/           # Data curation and build scripts
├── tests/             # Test suite (34 tests)
├── mcp_server/        # MCP server for IDE integration
└── .claude/skills/    # Claude Code skills (generate-diagram, generate-plot, evaluate-diagram)

Development

# Install with dev dependencies
pip install -e ".[dev,google]"

# Run tests
pytest tests/ -v

# Lint
ruff check paperbanana/ mcp_server/ tests/ scripts/

# Format
ruff format paperbanana/ mcp_server/ tests/ scripts/

Citation

This is an unofficial implementation. If you use this work, please cite the original paper:

@article{zhu2026paperbanana,
  title={PaperBanana: Automating Academic Illustration for AI Scientists},
  author={Zhu, Dawei and Meng, Rui and Song, Yale and Wei, Xiyu
          and Li, Sujian and Pfister, Tomas and Yoon, Jinsung},
  journal={arXiv preprint arXiv:2601.23265},
  year={2026}
}

Original paper: https://arxiv.org/abs/2601.23265

Disclaimer

This project is an independent open-source reimplementation based on the publicly available paper. It is not affiliated with, endorsed by, or connected to the original authors, Google Research, or Peking University in any way. The implementation may differ from the original system described in the paper. Use at your own discretion.

License

MIT
