
MetaBeeAI LLM Pipeline for PDF processing and data extraction


MetaBeeAI Literature Review Pipeline

A comprehensive pipeline for extracting, analyzing, and benchmarking structured information from scientific literature using Large Language Models and Vision AI.


Required API Accounts

Before starting, you need to set up the following API accounts:

Service         | Purpose                             | Sign Up              | Cost
OpenAI          | LLM processing and evaluation       | platform.openai.com  | Pay-per-use (model dependent)
LandingLens API | PDF text extraction with vision AI  | landing.ai           | US $0.03 per page

Setting Up API Keys

Create a .env file in the project root:

# Copy the example file
cp env.example .env

# Edit .env and add your keys:
OPENAI_API_KEY=sk-proj-...your_key_here
LANDING_AI_API_KEY=...your_key_here

The .env file is automatically excluded from git for security.
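
To confirm the keys are actually visible to Python before running anything expensive, a minimal sketch (assuming the python-dotenv package is installed) is:

# Minimal check that the values from .env are loaded into the environment.
# Assumes python-dotenv is installed: pip install python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

for key in ("OPENAI_API_KEY", "LANDING_AI_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")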


Quick Start

1. Install Dependencies

Choose the option that best fits your workflow.

Option A – Install from PyPI (recommended for using the CLI/package directly):

# Create and activate a virtual environment (recommended)
python -m venv venv
source venv/bin/activate        # Mac/Linux
# Or: venv\Scripts\activate     # Windows

# Install the published package
pip install metabeeai

# Verify the CLI is available
metabeeai --help

Option B – Install for development (editable install from the repo):

# Clone the repository if you have not already
git clone https://github.com/MetaBeeAI/MetaBeeAI.git
cd MetaBeeAI

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate        # Mac/Linux
# Or: venv\Scripts\activate     # Windows

# Install in editable mode with project dependencies
pip install -e .
# (Alternatively, pip install -r requirements.txt if you prefer)

2. Prepare Your PDFs

Organize papers under a directory that contains a papers/ subfolder, e.g. YOURDATABASE/papers/ (ideally located outside the repo checkout):

YOURDATABASE/papers/
├── 4YD2Y4J8/
│   └── 4YD2Y4J8_main.pdf
├── 76DQP2DC/
│   └── 76DQP2DC_main.pdf
└── ...

Each paper should be in its own folder with a unique alphanumeric ID.
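
If your PDFs currently sit in one flat folder, a short helper along these lines can create that layout; the source path and the use of each file's name as its paper ID are assumptions to adapt to your own naming:

# Hypothetical helper: copy loose PDFs into YOURDATABASE/papers/{ID}/{ID}_main.pdf,
# using each file's stem as the paper ID (adjust to your own ID scheme).
import shutil
from pathlib import Path

src = Path("/path/to/loose_pdfs")
papers = Path("/path/to/YOURDATABASE/papers")

for pdf in src.glob("*.pdf"):
    paper_id = pdf.stem  # e.g. 4YD2Y4J8
    dest = papers / paper_id / f"{paper_id}_main.pdf"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(pdf, dest)
    print(f"{pdf.name} -> {dest}")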

3. Run the Pipeline

See the Complete Workflow section below for the full step-by-step process.


Pipeline Overview

The repository is now published as the metabeeai Python package. Every stage can be run directly from the command line via the metabeeai CLI (installed automatically with the package). The end-to-end flow is:

PDFs → Document Processing → LLM Extraction → Human Review → Benchmarking → Analysis

Stage | Purpose | CLI command(s) | Python entry point
1. Document processing | Convert PDFs into structured JSON chunks using Vision AI + merging + deduplication | metabeeai process-pdfs | metabeeai.process_pdfs.process_all
2. LLM question answering | Ask the configured questions over the processed chunks and store answers | metabeeai llm | metabeeai.metabeeai_llm.llm_pipeline
3. Human review | Launch the BeeGUI application to validate and rate answers | metabeeai review | metabeeai.llm_review_software.beegui
4. Benchmarking & QA | Prepare benchmark data, run DeepEval, plot metrics, inspect edge cases | metabeeai prep-benchmark, metabeeai benchmark, metabeeai plot-metrics, metabeeai edge-cases (or metabeeai benchmark-all for the whole sequence) | metabeeai.llm_benchmarking.*
5. Downstream analysis | Run domain-specific analyses (trends, networks, investigations) | Python scripts in metabeeai.query_database (see docs) | metabeeai.query_database.*

Each command accepts additional flags (see metabeeai --help). You can also run the underlying Python modules directly if you prefer to script the workflow, but the CLI provides the recommended interface for both production use and local experimentation.
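
If you would rather drive the stages from a script than from an interactive shell, one simple option is to invoke the CLI itself with subprocess; a minimal sketch using only the documented commands and flags:

# Minimal sketch: run the documented CLI stages from Python.
import subprocess

papers_dir = "/path/to/YOURDATABASE/papers"

for args in (
    ["metabeeai", "process-pdfs", "--dir", papers_dir],
    ["metabeeai", "llm", "--dir", papers_dir],
):
    subprocess.run(args, check=True)  # check=True stops on the first failing stage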

For more detailed walkthroughs, refer to the documentation in docs/guide/ (particularly pipeline_overview.rst, workflow.rst, and benchmarking.rst).


Complete Workflow (CLI-first)

The metabeeai CLI exposes every stage. All commands automatically load configuration from config.py / environment variables unless you override them with flags.

1. Process PDFs to JSON

# Basic usage – processes every folder under your data directory
metabeeai process-pdfs

# Only merge/deduplicate previously processed chunks (skip expensive steps)
metabeeai process-pdfs --merge-only

# Process a subset of folders (alphabetical range) and skip API calls
metabeeai process-pdfs --start af20101 --end b2050e6 --skip-api

# Change root directory
metabeeai process-pdfs --dir /path/to/YOURDATABASE/papers

Purpose: Split PDFs, send pages to Vision AI, merge/deduplicate chunks
Output: YOURDATABASE/papers/{paper_id}/pages/merged_v2.json
Key options: --dir, --start/--end, --merge-only, --skip-split, --skip-api, --skip-merge, --skip-deduplicate, --filter-chunk-type, --pages
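
After a run, it can be worth confirming which paper folders actually produced the merged output before moving on to the LLM stage; a quick sketch using the documented output path:

# Quick check: list paper folders that are still missing pages/merged_v2.json.
from pathlib import Path

papers = Path("/path/to/YOURDATABASE/papers")
missing = [p.name for p in sorted(papers.iterdir())
           if p.is_dir() and not (p / "pages" / "merged_v2.json").exists()]

print(f"{len(missing)} paper(s) missing merged_v2.json: {missing}")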


2. Extract Information with the LLM

# Default run (balanced settings from pipeline_config.py)
metabeeai llm

# Process specific paper folders
metabeeai llm --papers 4YD2Y4J8 76DQP2DC

# Override model choices
metabeeai llm --relevance-model openai/gpt-4o-mini --answer-model openai/gpt-4o

# Point to a different data directory
metabeeai llm --dir /path/to/YOURDATABASE/papers

Purpose: Run questions.yml against processed chunks and store answers
Output: YOURDATABASE/papers/{paper_id}/answers.json
Key options: --dir, --papers, --overwrite, --relevance-model, --answer-model, --config
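
The exact schema of answers.json is documented with the LLM pipeline; for a quick look at what was written for one paper, a structure-agnostic sketch (assuming the file is a JSON object keyed by question) is:

# Peek at answers.json for a single paper; adjust the path and iteration to your schema.
import json
from pathlib import Path

path = Path("/path/to/YOURDATABASE/papers/4YD2Y4J8/answers.json")
answers = json.loads(path.read_text(encoding="utf-8"))

for question_key, value in answers.items():
    preview = json.dumps(value)[:120]  # short preview of each answer entry
    print(f"{question_key}: {preview}")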


3. Human Review

# Launch BeeGUI (requires desktop/Qt)
metabeeai review

Purpose: Inspect LLM answers alongside PDFs, capture reviewer edits/ratings
Output: YOURDATABASE/papers/{paper_id}/answers_extended.json
Notes: GUI allows filtering, editing, rating, adding reviewer notes


4. Benchmarking & Evaluation

This stage has individual commands plus a “run everything” helper.

4.1 Prepare benchmarking data

# Generate benchmark_data_gui.json (default paths)
metabeeai prep-benchmark

# Custom locations
metabeeai prep-benchmark --papers-dir /path/to/YOURDATABASE/papers \
                         --questions-yml /path/to/questions.yml \
                         --output /path/to/benchmark_data_gui.json

Purpose: Collate GUI-reviewed answers + LLM answers + retrieval context
Output: YOURDATABASE/benchmark_data_gui.json (nested structure)
Key options: --papers-dir, --questions-yml, --output

4.2 Run DeepEval benchmarking

# Basic run over all questions
metabeeai benchmark

# Filter by question and sample size, list available keys
metabeeai benchmark --list-questions
metabeeai benchmark --question bee_species --limit 5

# Adjust runtime/cost knobs
metabeeai benchmark --batch-size 10 --max-retries 3
metabeeai benchmark --use-retrieval-only
metabeeai benchmark --model gpt-4o-mini --max-context-length 150000

Purpose: Evaluate LLM answers vs reviewer answers using 5 metrics
Output: YOURDATABASE/deepeval_results/combined_results_{question}_{timestamp}.json(.jsonl)
Key options: --question, --input, --limit, --batch-size, --max-retries, --model, --max-context-length, --use-retrieval-only, --list-questions

4.3 Visualize metrics

# Create per-metric plots + summary plot
metabeeai plot-metrics

# Alternate locations
metabeeai plot-metrics --results-dir /custom/results --output-dir /custom/results

Purpose: Generate bar charts (mean ± standard error) per metric and a summary plot
Output: YOURDATABASE/deepeval_results/plots/{metric}.png and summary_metrics.png
Key options: --results-dir, --output-dir

4.4 Identify edge cases

# Default: bottom 3 per question
metabeeai edge-cases --num-cases 3

# Contextual-only run with custom directories and OpenAI key override
metabeeai edge-cases --contextual-only \
    --results-dir /custom/deepeval_results \
    --output-dir /custom/edge_cases \
    --openai-api-key sk-... \
    --model gpt-4o-mini

Purpose: Surface lowest-scoring papers, generate reports and optional LLM summaries
Output: YOURDATABASE/edge_cases/ (combined JSON, markdown report, summaries)
Key options: --num-cases, --results-dir, --output-dir, --openai-api-key, --model, --generate-summaries-only, --contextual-only, --generate-contextual-summaries-only

4.5 Run the entire benchmarking pipeline

# Prep + benchmark + plot + edge cases (default settings)
metabeeai benchmark-all

# Skip certain stages or pass through flags
metabeeai benchmark-all --skip-prep --skip-edge-cases --question bee_species --limit 5

This wrapper simply forwards the relevant options to the commands above. Use it when you want the full workflow in one go; use the individual commands for finer control.


5. Data Analysis

cd query_database

# Analyze trends
python trend_analysis.py

# Network analysis
python network_analysis.py

# Investigate specific topics
python investigate_bee_species.py
python investigate_pesticides.py

Output: query_database/output/ (plots, reports, JSON data)


Project Structure

The repository is packaged under src/metabeeai. Key directories:

metabeeai/
├── pyproject.toml              # Packaging, dependencies, CLI entrypoints
├── README.md                   # This file
├── docs/                       # Sphinx documentation (see docs/guide/*)
├── examples/                   # Sample configs / usage snippets
├── tests/                      # CLI and unit tests
├── src/
│   └── metabeeai/
│       ├── __init__.py
│       ├── cli.py              # `metabeeai` console entrypoint
│       ├── config.py           # Shared helpers for locating data dirs
│       ├── process_pdfs/       # Stage 1 modules (process_all, split_pdf, etc.)
│       ├── metabeeai_llm/      # Stage 2 modules (llm_pipeline, questions.yml, …)
│       ├── llm_review_software/# Stage 3 GUI (beegui, annotator)
│       ├── llm_benchmarking/   # Stage 4 tools (prep_benchmark_data, deepeval, plots, edge cases, run_benchmarking)
│       └── query_database/     # Stage 5 analysis scripts (trend_analysis, network_analysis, investigations)
└── data/ (optional)            # Local data tree (see below)

Where to store your data

We recommend keeping paper data outside the repo checkout, e.g.:

YOURDATABASE/
└── papers/
    ├── 4YD2Y4J8/
    │   ├── 4YD2Y4J8_main.pdf
    │   ├── pages/merged_v2.json
    │   ├── answers.json
    │   └── answers_extended.json
    └── ...

Point the CLI at this directory via config.py, environment variables, or per-command --dir/--papers-dir flags.


Common Use Cases

Use Case 1: Process New Papers

# 1. Add PDFs to YOURDATABASE/papers/{paper_id}/
# 2. Run the document pipeline
metabeeai process-pdfs --dir /path/to/YOURDATABASE/papers

# 3. Extract answers with the balanced preset
metabeeai llm --dir /path/to/YOURDATABASE/papers --config balanced

Result: Each paper folder now contains an answers.json file with structured outputs


Use Case 2: Review LLM Answers

metabeeai review

Features:

  • View PDF alongside LLM answers
  • Edit and validate answers
  • Rate answer quality
  • Navigate between papers
  • Saves reviewer responses side-by-side with the original LLM output in YOURDATABASE/papers/{paper_id}/answers_extended.json

Use Case 3: Benchmark LLM Performance

# 1. Ensure reviewer confirmations exist (answers_extended.json per paper from BeeGUI)

# 2. Create benchmark dataset from GUI reviewer answers
metabeeai prep-benchmark \
  --papers-dir /path/to/YOURDATABASE/papers \
  --output /path/to/YOURDATABASE/benchmark_data_gui.json

# 3. Run evaluation (all questions or filtered)
metabeeai benchmark \
  --input /path/to/YOURDATABASE/benchmark_data_gui.json \
  --question bee_species

# 4. Visualize metrics
metabeeai plot-metrics \
  --results-dir /path/to/YOURDATABASE/deepeval_results

# 5. Find problem papers (edge cases - can specify how many to include)
metabeeai edge-cases --num-cases 5 \
  --results-dir /path/to/YOURDATABASE/deepeval_results \
  --output-dir /path/to/YOURDATABASE/edge_cases

Result:

  • Performance metrics across 5 dimensions
  • Comparison plots
  • Edge case analysis

Use Case 4: Analyze Extracted Data

cd query_database

# Analyze trends
python trend_analysis.py

# Analyze relationships between variables
python network_analysis.py

Result: Plots and reports in query_database/output/


Question Definitions (questions.yml)

All question logic lives in src/metabeeai/metabeeai_llm/questions.yml. Each entry under the top-level QUESTIONS: key defines how the LLM should extract a specific piece of information. A typical block looks like this:

QUESTIONS:
  some_question_id:
    question: "Natural language prompt to send to the model"
    instructions:
      - "Step-by-step guidance on what to include/exclude"
      - "Each bullet is enforced before the answer."
    output_format: "Human-readable description of the expected formatting"
    example_output:
      - "Example answer 1"
      - "Example answer 2"
    bad_example_output:
      - "Examples of what NOT to return"
    no_info_response: "Fallback text when nothing is found"
    max_chunks: 5                 # (optional) throttle retrieval depth per question
    description: "Short note about retrieval threshold/purpose"

Field descriptions

Field | Purpose
question | The actual prompt sent to the LLM. Treats retrieved chunks as context.
instructions | Ordered list of constraints/checklists. The LLM sees these before answering.
output_format | Plain-language description of the formatting you expect (e.g., numbered list, JSON-like bullets).
example_output | One or more positive examples showing ideal answers.
bad_example_output | (Optional) Counter-examples to discourage common mistakes.
no_info_response | Exact string returned when the pipeline cannot find relevant information.
max_chunks | (Optional) The maximum number of retrieval chunks passed to the LLM for this question.
description | (Optional) Human-readable comment about retrieval strictness, priority, etc.

You can add, remove, or edit question blocks to suit new projects (e.g., different species, stressors, experimental outputs). The LLM pipeline will automatically pick up any question_key listed under QUESTIONS as long as it has the required fields above. After editing questions.yml, rerun metabeeai llm (and downstream benchmarking if needed) to populate the new fields in each answers.json.
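
When editing questions.yml it is easy to drop a required field; a small validation sketch (assuming PyYAML is installed) that checks for the non-optional fields listed in the table above:

# Sanity-check questions.yml: every question block should carry the required fields.
# Assumes PyYAML is installed: pip install pyyaml
import yaml

REQUIRED = ("question", "instructions", "output_format", "example_output", "no_info_response")

with open("src/metabeeai/metabeeai_llm/questions.yml", encoding="utf-8") as fh:
    questions = yaml.safe_load(fh)["QUESTIONS"]

for question_id, block in questions.items():
    missing = [field for field in REQUIRED if field not in block]
    print(f"{question_id}: {'ok' if not missing else 'missing ' + ', '.join(missing)}")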


Model Selection

The LLM pipeline exposes model selection through the metabeeai llm CLI. Choose from presets or override models directly.

Preset configurations (recommended)

metabeeai llm --config fast      # gpt-4o-mini for relevance + answers
metabeeai llm --config balanced  # gpt-4o-mini for relevance, gpt-4o for answers
metabeeai llm --config quality   # gpt-4o for relevance + answers

Custom model override

metabeeai llm \
  --relevance-model openai/gpt-4o-mini \
  --answer-model openai/gpt-4o

Configuration | Relevance model | Answer model | Primary goal
fast | openai/gpt-4o-mini | openai/gpt-4o-mini | High throughput / low cost
balanced | openai/gpt-4o-mini | openai/gpt-4o | Default mix of speed + accuracy
quality | openai/gpt-4o | openai/gpt-4o | Maximum fidelity, slower

Configuration

Global Configuration (config.py)

Centralized configuration for all pipeline components:

from config import get_papers_dir, get_data_dir

# Get configured directories
papers_dir = get_papers_dir()  # Default: data/papers
data_dir = get_data_dir()      # Default: data

Environment Variables (set in .env):

  • METABEEAI_DATA_DIR - Base data directory (default: data)
  • OPENAI_API_KEY - OpenAI API key
  • LANDING_AI_API_KEY - LandingLens API key

Question Configuration (metabeeai_llm/questions.yml)

Define questions with:

  • Question text
  • Instructions for LLM
  • Expected output format
  • Examples (good and bad)
  • Retrieval parameters (max_chunks, min_score)

Benchmarking Metrics

The pipeline evaluates LLM performance using 5 metrics:

Standard DeepEval Metrics (3)

  1. Faithfulness (0-1, higher is better)

    • Measures if LLM answer contradicts source text
    • Perfect score: No hallucinations or contradictions
  2. Contextual Precision (0-1, higher is better)

    • Evaluates if relevant chunks are ranked highly
    • Perfect score: Most relevant chunks retrieved first
  3. Contextual Recall (0-1, higher is better)

    • Checks if expected answer is supported by retrieval
    • Perfect score: All key points have source support

G-Eval Metrics (2)

  1. Completeness (0-1, threshold: 0.5)

    • Assesses if answer covers all key points
    • Uses GPT-4o to evaluate against reviewer answer
  2. Accuracy (0-1, threshold: 0.5)

    • Evaluates information accuracy
    • Uses GPT-4o to compare LLM vs reviewer answers

Typical Performance (based on 10 primate welfare papers):

  • Standard metrics: 0.7-1.0 (good)
  • G-Eval metrics: 0.4-0.5 (moderate)

Cost Estimates

Based on typical usage with GPT-4o:

Task | Papers | Questions | Cost
LLM Extraction | 10 | 3 per paper | ~$2-3
Benchmarking | 10 | 3 questions | ~$0.95
Edge Case Analysis | 3 bottom papers | All questions | ~$0.05
TOTAL | 10 papers | Full pipeline | ~$3-4
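
For budgeting a larger run, the LandingLens page cost is fixed at US $0.03 per page and the LLM figures above scale roughly with paper count; a back-of-envelope sketch (the page count per paper is an assumption to adjust for your corpus):

# Rough cost projection scaled from the per-10-paper figures above; treat as an estimate only.
n_papers = 50
avg_pages_per_paper = 12                               # assumption: adjust for your PDFs
landinglens = n_papers * avg_pages_per_paper * 0.03    # US $0.03 per page
llm_extraction = n_papers * (2.5 / 10)                 # ~$2-3 per 10 papers with GPT-4o
benchmarking = n_papers * (0.95 / 10)                  # ~$0.95 per 10 papers

print(f"LandingLens:    ${landinglens:.2f}")
print(f"LLM extraction: ${llm_extraction:.2f}")
print(f"Benchmarking:   ${benchmarking:.2f}")
print(f"Total estimate: ${landinglens + llm_extraction + benchmarking:.2f}")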

Cost Reduction Options:

  • Use --config fast instead of --config quality (3-5x cheaper)
  • Use --config balanced for optimal cost/quality trade-off
  • Process fewer papers initially for testing

Detailed Documentation

Each component has detailed documentation:

Component | Documentation
PDF Processing | process_pdfs/README.md
LLM Pipeline | metabeeai_llm/README.md
Benchmarking | llm_benchmarking/README.md
Data Analysis | query_database/README.md

Tutorial: Process Your First 3 Papers

Complete Example

# 1. Set up environment (one-time)
python -m venv venv
source venv/bin/activate
pip install metabeeai
cp env.example .env  # fill in API keys

# 2. Add 3 PDFs under YOURDATABASE/papers/
cp your_paper.pdf /path/to/YOURDATABASE/papers/PAPER001/PAPER001_main.pdf
# Repeat for PAPER002, PAPER003

# 3. Process PDFs → merged_v2.json
metabeeai process-pdfs --dir /path/to/YOURDATABASE/papers

# 4. Run LLM extraction (balanced preset recommended)
metabeeai llm --dir /path/to/YOURDATABASE/papers --config balanced
# Output: answers.json per paper

# 5. Review answers (saves answers_extended.json)
metabeeai review

# 6. Create benchmark dataset from GUI reviews
metabeeai prep-benchmark \
  --papers-dir /path/to/YOURDATABASE/papers \
  --output /path/to/YOURDATABASE/benchmark_data_gui.json

# 7. Run evaluation (choose a question or all)
metabeeai benchmark \
  --input /path/to/YOURDATABASE/benchmark_data_gui.json \
  --question bee_species

# 8. Visualize metrics
metabeeai plot-metrics \
  --results-dir /path/to/YOURDATABASE/deepeval_results

# 9. Find problem papers
metabeeai edge-cases --num-cases 5 \
  --results-dir /path/to/YOURDATABASE/deepeval_results \
  --output-dir /path/to/YOURDATABASE/edge_cases

Expected time:

  • PDF processing: ~1-5 min per paper
  • LLM extraction: ~2-3 min per paper
  • Evaluation: <1 min per question

Troubleshooting

Common Issues

Issue: Module not found errors

# Solution: Activate virtual environment
source venv/bin/activate

Issue: API key errors

# Solution: Check .env file exists and has valid keys
cat .env

Issue: "Context too long" warnings

# Solution: Use faster models or reduce batch size
metabeeai llm --config fast

Issue: Empty GUI window

# Solution: Check folder names are alphanumeric (not just numeric)
# The GUI now accepts folders like: 4YD2Y4J8, 76DQP2DC, etc.

Issue: UTF-8 BOM in CSV

# Solution: Scripts automatically handle BOM with utf-8-sig encoding
# If you see '\ufeff' in column names, the script handles this
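
If you hit the BOM issue in your own analysis code, opening the file with the utf-8-sig codec strips it; for example (results.csv is a placeholder name):

# Reading with utf-8-sig drops a leading BOM so the first column name is clean.
import csv

with open("results.csv", newline="", encoding="utf-8-sig") as fh:
    reader = csv.DictReader(fh)
    print(reader.fieldnames)  # no '\ufeff' prefix on the first header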

Best Practices

1. Start Small

  • Test with 3-5 papers first
  • Use --limit flags to test scripts
  • Verify outputs before scaling up

2. Version Control

  • Results are timestamped (no overwrites)
  • Keep original answers.json files unchanged
  • Reviewer answers go in separate files

3. Cost Management

  • Use --config fast for initial testing
  • Use --config balanced for production runs
  • Test with specific papers using --papers before full runs

4. Quality Assurance

  • Review edge cases to identify patterns
  • Check low-scoring papers manually
  • Validate LLM answers with GUI tool

Additional Resources

Documentation

  • Full docs (installation, pipeline, API): https://metabeeai.readthedocs.io
  • Module-specific references remain in docs/ and the Read the Docs site (LLM benchmarking, PDF processing, LLM pipeline, review software, query database)


Project: MetaBeeAI
Last Updated: November 14, 2025
Written by: Rachel Parkinson, Shuxiang Cao, Mikael Mieskolainen, Alasdair Wilson
Contact: Rachel Parkinson (r.parkinson@qmul.ac.uk)
