
MetaBeeAI Literature Review Pipeline

A comprehensive pipeline for extracting, analyzing, and benchmarking structured information from scientific literature using Large Language Models and Vision AI.


Required API Accounts

Before starting, you need to set up the following API accounts:

  • OpenAI: LLM processing and evaluation. Sign up: platform.openai.com. Cost: pay-per-use (~$1-5 per 10 papers)
  • LandingLens API: PDF text extraction with vision AI. Sign up: landing.ai. Cost: contact for pricing

Setting Up API Keys

Create a .env file in the project root:

# Copy the example file
cp env.example .env

# Edit .env and add your keys:
OPENAI_API_KEY=sk-proj-...your_key_here
LANDING_AI_API_KEY=...your_key_here

The .env file is automatically excluded from git for security.
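
If your own scripts need the keys directly, a minimal sketch using python-dotenv (an assumption here; use whatever loader the project's requirements provide) looks like this:

# minimal sketch: load API keys from .env (assumes python-dotenv is installed)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root / current working directory

openai_key = os.getenv("OPENAI_API_KEY")
landing_key = os.getenv("LANDING_AI_API_KEY")
if not openai_key or not landing_key:
    raise RuntimeError("Missing API keys: check your .env file")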


Quick Start

1. Install Dependencies

# Create virtual environment
python -m venv venv

# Activate environment
source venv/bin/activate  # Mac/Linux
# Or: venv\Scripts\activate  # Windows

# Install packages
pip install -r requirements.txt

2. Prepare Your PDFs

Organize papers in data/papers/:

data/papers/
├── 4YD2Y4J8/
│   └── 4YD2Y4J8_main.pdf
├── 76DQP2DC/
│   └── 76DQP2DC_main.pdf
└── ...

Each paper should be in its own folder with a unique alphanumeric ID.
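
To sanity-check the layout before processing, a hypothetical helper like the following lists any folder that is missing its expected {paper_id}_main.pdf:

# sketch: verify each paper folder contains {paper_id}_main.pdf (not part of the pipeline)
from pathlib import Path

papers_dir = Path("data/papers")
for folder in sorted(p for p in papers_dir.iterdir() if p.is_dir()):
    expected = folder / f"{folder.name}_main.pdf"
    if not expected.exists():
        print(f"Missing PDF: {expected}")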

3. Run the Pipeline

See the Complete Workflow section below for the full step-by-step process.


Pipeline Overview

The pipeline consists of 5 main stages:

PDFs → Vision AI Processing → LLM Analysis → Human Review → Benchmarking → Analysis

Stage 1: PDF Processing → Structured JSON

Folder: process_pdfs/
Input: PDF files
Output: JSON chunks with text and coordinates
Details: See process_pdfs/README.md

Stage 2: LLM Question Answering → Extracted Information

Folder: metabeeai_llm/
Input: JSON chunks
Output: Structured answers with citations
Details: See metabeeai_llm/README.md

Stage 3: Human Review & Annotation → Validated Answers

Folder: llm_review_software/
Input: LLM answers
Output: Human-validated answers
Details: GUI-based review interface

Stage 4: Benchmarking → Performance Metrics

Folder: llm_benchmarking/
Input: LLM + reviewer answers
Output: Evaluation metrics and comparisons
Details: See llm_benchmarking/README.md

Stage 5: Data Analysis → Insights

Folder: query_database/
Input: Structured answers across papers
Output: Trend analysis, network plots, summaries
Details: Query and aggregate data


Complete Workflow

Step 1: Process PDFs to JSON

cd process_pdfs
python process_all.py

What it does: Converts PDFs → structured JSON chunks
Output: data/papers/{paper_id}/pages/merged_v2.json
For details: process_pdfs/README.md


Step 2: Extract Information with LLM

cd metabeeai_llm

# Process all papers (uses default configuration)
python llm_pipeline.py

# Use a predefined configuration (recommended); see metabeeai_llm/pipeline_config.py for details
python llm_pipeline.py --config balanced  # Fast relevance + high-quality answers
python llm_pipeline.py --config fast      # Fast & cheap processing
python llm_pipeline.py --config quality   # High quality for critical analysis

# Process specific papers
python llm_pipeline.py --folders 4YD2Y4J8 76DQP2DC

# Custom model selection
python llm_pipeline.py --relevance-model "openai/gpt-4o-mini" --answer-model "openai/gpt-4o"

What it does: LLM answers questions from questions.yml
Output: data/papers/{paper_id}/answers.json
Questions: Defined in metabeeai_llm/questions.yml
For details: metabeeai_llm/README.md


Step 3: Human Review (Optional)

cd llm_review_software
python beegui.py

What it does: GUI interface for reviewing and annotating LLM answers
Output: data/papers/{paper_id}/answers_extended.json
Features: View PDFs, edit answers, rate quality


Step 4: Benchmarking & Evaluation

4a. Prepare Reviewer Data

If you have CSV golden answers:

cd metabeeai_llm
python convert_goldens.py

Output: data/papers/{paper_id}/rev1_answers.json

If you used the GUI review tool, the data is already ready in answers_extended.json.

4b. Create Benchmark Dataset

For CSV reviewer answers:

cd llm_benchmarking
python prep_benchmark_data.py

For GUI reviewer answers:

python prep_benchmark_data_from_GUI_answers.py

Output: data/benchmark_data.json or data/benchmark_data_gui.json

4c. Run Evaluation

# Evaluate all questions
python deepeval_benchmarking.py --question design
python deepeval_benchmarking.py --question population
python deepeval_benchmarking.py --question welfare

# Or evaluate all at once
python deepeval_benchmarking.py

Output: deepeval_results/combined_results_{question}_{timestamp}.json
Cost: ~$0.95 for 10 papers × 3 questions
For details: llm_benchmarking/README.md

4d. Visualize Results

python plot_metrics_comparison.py

Output: deepeval_results/plots/metrics_comparison.png

4e. Identify Problem Papers (Optional)

# Get bottom 3 papers
python edge_cases.py --num-cases 3

Output: edge_cases/edge-case-report.md


Step 5: Data Analysis

cd query_database

# Analyze trends
python trend_analysis.py

# Network analysis
python network_analysis.py

# Investigate specific topics
python investigate_bee_species.py
python investigate_pesticides.py

Output: query_database/output/ (plots, reports, JSON data)


Project Structure

primate-welfare/
├── .env                        # API keys (create from env.example)
├── config.py                   # Centralized configuration
├── requirements.txt            # Python dependencies
│
├── data/                       # Data directory
│   ├── papers/                 # Paper-specific data
│   │   └── {paper_id}/
│   │       ├── {paper_id}_main.pdf          # Original PDF
│   │       ├── pages/
│   │       │   ├── main_p01.pdf.json        # Page JSONs
│   │       │   └── merged_v2.json           # Merged & deduplicated
│   │       ├── answers.json                 # LLM answers
│   │       ├── rev1_answers.json            # With CSV reviewer answers
│   │       └── answers_extended.json        # GUI reviewer answers
│   ├── golden_answers.csv      # CSV reviewer answers (input)
│   ├── benchmark_data.json     # Benchmark dataset
│   └── benchmark_data_gui.json # Benchmark dataset (GUI)
│
├── process_pdfs/               # Stage 1: PDF Processing
│   ├── README.md              # Detailed documentation
│   ├── process_all.py         # Main processing script
│   ├── split_pdf.py           # PDF splitting
│   ├── va_process_papers.py   # Vision AI extraction
│   ├── merger.py              # JSON merging
│   └── deduplicate_chunks.py  # Deduplication
│
├── metabeeai_llm/             # Stage 2: LLM Q&A
│   ├── README.md              # Detailed documentation
│   ├── llm_pipeline.py        # Main LLM pipeline
│   ├── questions.yml          # Question definitions
│   ├── convert_goldens.py     # CSV → JSON converter
│   └── json_multistage_qa.py  # Core LLM functions
│
├── llm_review_software/       # Stage 3: Human Review
│   ├── beegui.py              # GUI review interface
│   └── annotator.py           # Annotation logic
│
├── llm_benchmarking/          # Stage 4: Evaluation
│   ├── README.md              # Detailed documentation
│   ├── prep_benchmark_data.py # Prepare from CSV
│   ├── prep_benchmark_data_from_GUI_answers.py # Prepare from GUI
│   ├── deepeval_benchmarking.py # Run evaluation
│   ├── plot_metrics_comparison.py # Visualize results
│   ├── edge_cases.py          # Find problem papers
│   └── deepeval_results/      # Evaluation outputs
│       ├── combined_results_*.json
│       └── plots/
│
└── query_database/            # Stage 5: Data Analysis
    ├── README.md              # Analysis documentation
    ├── trend_analysis.py      # Temporal trends
    ├── network_analysis.py    # Relationship networks
    └── output/                # Analysis outputs

Common Use Cases

Use Case 1: Process New Papers

# 1. Add PDFs to data/papers/{paper_id}/
# 2. Process PDFs
cd process_pdfs
python process_all.py

# 3. Extract information (recommended: use balanced config)
cd ../metabeeai_llm
python llm_pipeline.py --config balanced

Result: Structured answers in answers.json for each paper


Use Case 2: Review LLM Answers

cd llm_review_software
python beegui.py

Features:

  • View PDF alongside LLM answers
  • Edit and validate answers
  • Rate answer quality
  • Navigate between papers

Use Case 3: Benchmark LLM Performance

# 1. Prepare reviewer answers (if from CSV)
cd metabeeai_llm
python convert_goldens.py

# 2. Create benchmark dataset
cd ../llm_benchmarking
python prep_benchmark_data.py

# 3. Run evaluation
python deepeval_benchmarking.py --question welfare

# 4. Visualize
python plot_metrics_comparison.py

# 5. Find problem papers
python edge_cases.py --num-cases 3

Result:

  • Performance metrics across 5 dimensions
  • Comparison plots
  • Edge case analysis

Use Case 4: Analyze Extracted Data

cd query_database

# Analyze welfare measure trends
python trend_analysis.py

# Analyze relationships between variables
python network_analysis.py

Result: Plots and reports in query_database/output/


Question Types

The pipeline currently handles three question types for primate welfare:

1. Design

Question: What is the overview of the study, the number of groups being monitored and the sample size?

Example Answer:

1. Overview: Compares wounding rates between groups, looking at impacts 
of age, group composition, and presence of young silverbacks; 
Groups: 45; n = 180

2. Population

Question: What species, sex, age range, mean age and SD, are studied? At what location and were they pair or group housed, and what was the social group composition?

Example Answer:

Species 1: western lowland Gorilla; sex: M and F; age range: 1-55 years; 
mean age: NA; location: USA (across 28 AZA accredited zoos); 
social group: Group; composition: Mixed-sex groups (n = 26; 41 males, 
91 females) and bachelor groups (n = 19; 48 males)

3. Welfare

Question: What are the measures of welfare used in the study, and has the link between the measure and welfare, wellbeing, or chronic stress been made?

Example Answer:

1. Measure: Wounding rates; Link made: Y; Welfare measure description: 
Rates of wounding over period of many years; Units: Wounds per gorilla 
per month; Collection method: Animal care staff recorded all wounds that 
occurred within groups using a standardized data sheet

Questions are fully defined in metabeeai_llm/questions.yml with instructions, examples, and configuration.


Model Selection

The LLM pipeline supports different model configurations for optimal performance:

Predefined Configurations (Recommended)

# Fast & cheap processing
python llm_pipeline.py --config fast

# Balanced speed and quality (recommended)
python llm_pipeline.py --config balanced

# High quality for critical analysis
python llm_pipeline.py --config quality

Custom Model Selection

# Specify individual models
python llm_pipeline.py --relevance-model "openai/gpt-4o-mini" --answer-model "openai/gpt-4o"

  • Fast: relevance model gpt-4o-mini, answer model gpt-4o-mini. Use case: high-volume, cost-sensitive processing
  • Balanced: relevance model gpt-4o-mini, answer model gpt-4o. Recommended for most use cases
  • Quality: relevance model gpt-4o, answer model gpt-4o. Critical analysis, maximum accuracy
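
For orientation, the presets presumably map to model pairs along these lines (an illustrative sketch based on the table above; the actual definitions live in metabeeai_llm/pipeline_config.py):

# illustrative sketch only; see metabeeai_llm/pipeline_config.py for the real preset definitions
PRESETS = {
    "fast": {"relevance_model": "openai/gpt-4o-mini", "answer_model": "openai/gpt-4o-mini"},
    "balanced": {"relevance_model": "openai/gpt-4o-mini", "answer_model": "openai/gpt-4o"},
    "quality": {"relevance_model": "openai/gpt-4o", "answer_model": "openai/gpt-4o"},
}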

Configuration

Global Configuration (config.py)

Centralized configuration for all pipeline components:

from config import get_papers_dir, get_data_dir

# Get configured directories
papers_dir = get_papers_dir()  # Default: data/papers
data_dir = get_data_dir()      # Default: data

Environment Variables (set in .env):

  • METABEEAI_DATA_DIR - Base data directory (default: data)
  • OPENAI_API_KEY - OpenAI API key
  • LANDING_AI_API_KEY - LandingLens API key
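
A minimal sketch of how this resolution might look internally (an assumption; the actual config.py may differ):

# sketch: resolve data directories from METABEEAI_DATA_DIR with a default of data/
import os
from pathlib import Path

def get_data_dir() -> Path:
    return Path(os.environ.get("METABEEAI_DATA_DIR", "data"))

def get_papers_dir() -> Path:
    return get_data_dir() / "papers"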

Question Configuration (metabeeai_llm/questions.yml)

Define questions with:

  • Question text
  • Instructions for LLM
  • Expected output format
  • Examples (good and bad)
  • Retrieval parameters (max_chunks, min_score)
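
A short sketch for loading and inspecting these definitions with PyYAML (it assumes a top-level mapping of question name to definition; check questions.yml for the exact nesting):

# sketch: list configured questions and their retrieval parameters
import yaml

with open("metabeeai_llm/questions.yml", encoding="utf-8") as f:
    questions = yaml.safe_load(f)

for name, spec in questions.items():
    print(name, "->", spec.get("question", ""), "| max_chunks:", spec.get("max_chunks"))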

Benchmarking Metrics

The pipeline evaluates LLM performance using 5 metrics:

Standard DeepEval Metrics (3)

  1. Faithfulness (0-1, higher is better)

    • Measures if LLM answer contradicts source text
    • Perfect score: No hallucinations or contradictions
  2. Contextual Precision (0-1, higher is better)

    • Evaluates if relevant chunks are ranked highly
    • Perfect score: Most relevant chunks retrieved first
  3. Contextual Recall (0-1, higher is better)

    • Checks if expected answer is supported by retrieval
    • Perfect score: All key points have source support

G-Eval Metrics (2)

  1. Completeness (0-1, threshold: 0.5)

    • Assesses if answer covers all key points
    • Uses GPT-4o to evaluate against reviewer answer
  2. Accuracy (0-1, threshold: 0.5)

    • Evaluates information accuracy
    • Uses GPT-4o to compare LLM vs reviewer answers
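
For orientation, a minimal sketch of how such metrics are typically wired up with DeepEval (not the exact code in deepeval_benchmarking.py; the field values are placeholders):

# sketch: one test case evaluated with a standard metric and a G-Eval metric
from deepeval.metrics import FaithfulnessMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

test_case = LLMTestCase(
    input="What are the measures of welfare used in the study?",
    actual_output="1. Measure: Wounding rates; Link made: Y; ...",   # LLM answer
    expected_output="Wounding rates recorded by animal care staff",  # reviewer answer
    retrieval_context=["...source text chunks from merged_v2.json..."],
)

faithfulness = FaithfulnessMetric(threshold=0.5)
faithfulness.measure(test_case)
print("Faithfulness:", faithfulness.score)

completeness = GEval(
    name="Completeness",
    criteria="Does the actual output cover all key points in the expected output?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.5,
)
completeness.measure(test_case)
print("Completeness:", completeness.score)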

Typical Performance (based on 10 primate welfare papers):

  • Standard metrics: 0.7-1.0 (good)
  • G-Eval metrics: 0.4-0.5 (moderate)

Cost Estimates

Based on typical usage with GPT-4o:

  • LLM extraction: 10 papers, 3 questions per paper, ~$2-3
  • Benchmarking: 10 papers, 3 questions, ~$0.95
  • Edge case analysis: bottom 3 papers, all questions, ~$0.05
  • Total: 10 papers, full pipeline, ~$3-4

Cost Reduction Options:

  • Use --config fast instead of --config quality (3-5x cheaper)
  • Use --config balanced for optimal cost/quality trade-off
  • Process fewer papers initially for testing

Detailed Documentation

Each component has detailed documentation:

  • PDF Processing: process_pdfs/README.md
  • LLM Pipeline: metabeeai_llm/README.md
  • Benchmarking: llm_benchmarking/README.md
  • Data Analysis: query_database/README.md

Tutorial: Process Your First 3 Papers

Complete Example

# 1. Set up environment
source venv/bin/activate
cp env.example .env
# Edit .env with your API keys

# 2. Add 3 PDFs to data/papers/
mkdir -p data/papers/PAPER001
cp your_paper.pdf data/papers/PAPER001/PAPER001_main.pdf
# Repeat for PAPER002, PAPER003

# 3. Process PDFs
cd process_pdfs
python process_all.py
# Output: merged_v2.json for each paper

# 4. Run LLM extraction (recommended: balanced config)
cd ../metabeeai_llm
python llm_pipeline.py --config balanced
# Output: answers.json for each paper

# 5. Review answers (optional)
cd ../llm_review_software
python beegui.py
# Manually review and validate

# 6. If you have reviewer answers in CSV:
cd ../metabeeai_llm
python convert_goldens.py

# 7. Create benchmark dataset
cd ../llm_benchmarking
python prep_benchmark_data.py
# Output: data/benchmark_data.json

# 8. Run evaluation
python deepeval_benchmarking.py --question welfare
# Output: deepeval_results/combined_results_welfare_*.json

# 9. Visualize results
python plot_metrics_comparison.py
# Output: deepeval_results/plots/metrics_comparison.png

# 10. Find problem papers
python edge_cases.py --num-cases 2
# Output: edge_cases/edge-case-report.md

Expected time:

  • PDF processing: ~5-10 min per paper
  • LLM extraction: ~2-3 min per paper
  • Evaluation: ~1-2 min per question

Understanding the Output

LLM Answers (answers.json)

{
  "QUESTIONS": {
    "welfare": {
      "answer": "1. Measure: Wounding rates; Link made: Y; ...",
      "reason": "The study provides detailed information...",
      "chunk_ids": ["uuid1", "uuid2"]
    }
  }
}
  • answer: LLM's structured response
  • reason: Why this answer was generated
  • chunk_ids: Source text chunks used
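
A quick way to pull one of these answers back out in Python (paths follow the project structure above):

# sketch: print the welfare answer for one paper
import json
from pathlib import Path

answers_path = Path("data/papers/4YD2Y4J8/answers.json")
answers = json.loads(answers_path.read_text(encoding="utf-8"))

welfare = answers["QUESTIONS"]["welfare"]
print(welfare["answer"])
print("Source chunks:", welfare["chunk_ids"])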

Benchmark Results

{
  "paper_id": "4YD2Y4J8",
  "question_key": "welfare",
  "actual_output": "LLM answer",
  "expected_output": "Reviewer answer",
  "success": true/false,
  "metrics_data": [
    {
      "name": "Faithfulness",
      "score": 0.85,
      "success": true,
      "reason": "Explanation..."
    }
  ]
}
  • success: True if all metrics passed thresholds
  • metrics_data: Detailed results for each metric
  • Score interpretation: See llm_benchmarking/README.md
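
To get a quick per-metric average from a results file, a small sketch (assuming the file holds a list of records shaped like the example above):

# sketch: average each metric's score across one combined results file
import json
from collections import defaultdict
from pathlib import Path

results_file = next(Path("deepeval_results").glob("combined_results_welfare_*.json"))
records = json.loads(results_file.read_text(encoding="utf-8"))

scores = defaultdict(list)
for record in records:
    for metric in record["metrics_data"]:
        scores[metric["name"]].append(metric["score"])

for name, values in scores.items():
    print(f"{name}: {sum(values) / len(values):.2f}")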

Troubleshooting

Common Issues

Issue: Module not found errors

# Solution: Activate virtual environment
source venv/bin/activate

Issue: API key errors

# Solution: Check .env file exists and has valid keys
cat .env

Issue: "Context too long" warnings

# Solution: Use faster models or reduce batch size
python llm_pipeline.py --config fast

Issue: Empty GUI window

# Solution: Check folder names are alphanumeric (not just numeric)
# The GUI now accepts folders like: 4YD2Y4J8, 76DQP2DC, etc.

Issue: UTF-8 BOM in CSV

# Solution: Scripts automatically handle BOM with utf-8-sig encoding
# If you see '\ufeff' in column names, the script handles this
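
If you read the CSV in your own code, the same fix is one argument:

# sketch: read golden answers with the BOM-tolerant utf-8-sig codec
import csv

with open("data/golden_answers.csv", newline="", encoding="utf-8-sig") as f:
    rows = list(csv.DictReader(f))
print(rows[0].keys())  # column names come through without a leading '\ufeff'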

Current Dataset

Primate Welfare Literature Review

  • Total Papers: 41 papers in data/papers/
  • With Golden Answers: 10 papers in data/golden_answers.csv
  • With GUI Answers: 1 paper with answers_extended.json
  • Questions: 3 per paper (design, population, welfare)
  • Species Covered: Gorillas, macaques, chimpanzees, bonobos, orangutans, lemurs, marmosets, slow lorises

Sample Papers:

  • 4YD2Y4J8: Western lowland gorilla wounding rates
  • 76DQP2DC: Rhesus macaque welfare and personality
  • WIZ9MV3T: Chimpanzee locomotion as wellbeing indicator
  • V7984AAU: Body condition score in slow lorises
  • 8BV8BLU8: Orangutan subjective wellbeing

Key Scripts Reference

PDF Processing

  • process_pdfs/process_all.py - Main processor

LLM Extraction

  • metabeeai_llm/llm_pipeline.py - Extract information from papers
    • --config {fast,balanced,quality} - Use predefined configurations
    • --relevance-model - Specify chunk selection model
    • --answer-model - Specify answer generation model
  • metabeeai_llm/convert_goldens.py - Convert CSV → JSON reviewer answers

Benchmarking

  • llm_benchmarking/prep_benchmark_data.py - Prepare benchmark dataset
  • llm_benchmarking/deepeval_benchmarking.py - Run evaluation (5 metrics)
  • llm_benchmarking/plot_metrics_comparison.py - Visualize results
  • llm_benchmarking/edge_cases.py - Find lowest-scoring papers

Review Interface

  • llm_review_software/beegui.py - GUI for reviewing answers

Best Practices

1. Start Small

  • Test with 3-5 papers first
  • Use --limit flags to test scripts
  • Verify outputs before scaling up

2. Version Control

  • Results are timestamped (no overwrites)
  • Keep original answers.json files unchanged
  • Reviewer answers go in separate files

3. Cost Management

  • Use --config fast for initial testing
  • Use --config balanced for production runs
  • Test with specific papers using --folders before full runs

4. Quality Assurance

  • Review edge cases to identify patterns
  • Check low-scoring papers manually
  • Validate LLM answers with GUI tool

Additional Resources

Documentation

  • LLM Benchmarking: llm_benchmarking/README.md (comprehensive guide)
  • PDF Processing: process_pdfs/README.md
  • LLM Pipeline: metabeeai_llm/README.md


Contributing

When adding new question types:

  1. Define in questions.yml:

    new_question:
      question: "Your question here?"
      instructions: [...]
      output_format: "..."
      example_output: [...]
      max_chunks: 6
      min_score: 0.4
    
  2. Update CSV template (if using CSV reviewers):

    • Add column for new question
    • Update convert_goldens.py to handle it
  3. Update question lists:

    • llm_benchmarking/llm_questions.txt
    • llm_benchmarking/edge_cases.py (question_types list)
  4. Re-run pipeline from Step 2


Support

For issues:

  1. Check relevant README in component folder
  2. Review error messages carefully
  3. Verify all input files exist
  4. Check API keys and credits
  5. Consult script-specific documentation

Project: MetaBeeAI - Bees & Pesticides
Version: 2.0
Last Updated: October 8, 2025
Written by: Rachel Parkinson, Shuxiang Cao, Mikael Mieskolainen
Contact: See project documentation
