MetaBeeAI LLM Pipeline for PDF processing and data extraction
MetaBeeAI Literature Review Pipeline
A comprehensive pipeline for extracting, analyzing, and benchmarking structured information from scientific literature using Large Language Models and Vision AI.
Required API Accounts
Before starting, you need to set up the following API accounts:
| Service | Purpose | Sign Up | Cost |
|---|---|---|---|
| OpenAI | LLM processing and evaluation | platform.openai.com | Pay-per-use (~$1-5 per 10 papers) |
| LandingLens API | PDF text extraction with vision AI | landing.ai | Contact for pricing |
Setting Up API Keys
Create a .env file in the project root:
# Copy the example file
cp env.example .env
# Edit .env and add your keys:
OPENAI_API_KEY=sk-proj-...your_key_here
LANDING_AI_API_KEY=...your_key_here
The .env file is automatically excluded from git for security.
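To confirm the keys are actually picked up, here is a minimal sketch using python-dotenv (assuming it is installed, e.g. via requirements.txt):

```python
# Minimal check that API keys load from .env (assumes python-dotenv is installed)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

for key in ("OPENAI_API_KEY", "LANDING_AI_API_KEY"):
    # Only report presence; never print the secret itself
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```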
Quick Start
1. Install Dependencies
# Create virtual environment
python -m venv venv
# Activate environment
source venv/bin/activate # Mac/Linux
# Or: venv\Scripts\activate # Windows
# Install packages
pip install -r requirements.txt
2. Prepare Your PDFs
Organize papers in data/papers/:
data/papers/
├── 4YD2Y4J8/
│ └── 4YD2Y4J8_main.pdf
├── 76DQP2DC/
│ └── 76DQP2DC_main.pdf
└── ...
Each paper should be in its own folder with a unique alphanumeric ID.
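If your PDFs start out in a flat folder, a small helper can move them into this layout. This is an illustrative sketch, not part of the pipeline; the `incoming_pdfs` folder and the ID scheme (sanitized, truncated file stems) are assumptions, and any unique alphanumeric ID works:

```python
# Illustrative helper: copy loose PDFs into data/papers/{paper_id}/{paper_id}_main.pdf
# (not part of the pipeline; paper IDs here are just sanitized file stems)
from pathlib import Path
import shutil

src = Path("incoming_pdfs")  # hypothetical folder of loose PDFs
papers = Path("data/papers")

for pdf in src.glob("*.pdf"):
    paper_id = "".join(c for c in pdf.stem.upper() if c.isalnum())[:8]
    dest = papers / paper_id
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copy2(pdf, dest / f"{paper_id}_main.pdf")
    print(f"{pdf.name} -> {dest / (paper_id + '_main.pdf')}")
```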
3. Run the Pipeline
See the Complete Workflow section below for the full step-by-step process.
Pipeline Overview
The pipeline consists of 5 main stages:
PDFs → Vision AI Processing → LLM Analysis → Human Review → Benchmarking → Analysis
Stage 1: PDF Processing → Structured JSON
Folder: process_pdfs/
Input: PDF files
Output: JSON chunks with text and coordinates
Details: See process_pdfs/README.md
Stage 2: LLM Question Answering → Extracted Information
Folder: metabeeai_llm/
Input: JSON chunks
Output: Structured answers with citations
Details: See metabeeai_llm/README.md
Stage 3: Human Review & Annotation → Validated Answers
Folder: llm_review_software/
Input: LLM answers
Output: Human-validated answers
Details: GUI-based review interface
Stage 4: Benchmarking → Performance Metrics
Folder: llm_benchmarking/
Input: LLM + reviewer answers
Output: Evaluation metrics and comparisons
Details: See llm_benchmarking/README.md
Stage 5: Data Analysis → Insights
Folder: query_database/
Input: Structured answers across papers
Output: Trend analysis, network plots, summaries
Details: Query and aggregate data
Complete Workflow
Step 1: Process PDFs to JSON
cd process_pdfs
python process_all.py
What it does: Converts PDFs → structured JSON chunks
Output: data/papers/{paper_id}/pages/merged_v2.json
For details: process_pdfs/README.md
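To sanity-check Stage 1 output before moving on, you can peek at one merged file. This is only a sketch; the exact chunk schema is documented in process_pdfs/README.md:

```python
# Quick look at a Stage 1 output file (schema details: process_pdfs/README.md)
import json
from pathlib import Path

path = Path("data/papers/4YD2Y4J8/pages/merged_v2.json")
with path.open() as f:
    merged = json.load(f)

# The file holds text chunks with page coordinates; print a rough size summary
n = len(merged) if isinstance(merged, (list, dict)) else 0
print(f"{path.name}: {n} top-level entries")
```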
Step 2: Extract Information with LLM
cd metabeeai_llm
# Process all papers (uses default configuration)
python llm_pipeline.py
# Use a predefined configuration (recommended); see metabeeai_llm/pipeline_config.py for details
python llm_pipeline.py --config balanced # Fast relevance + high-quality answers
python llm_pipeline.py --config fast # Fast & cheap processing
python llm_pipeline.py --config quality # High quality for critical analysis
# Process specific papers
python llm_pipeline.py --folders 4YD2Y4J8 76DQP2DC
# Custom model selection
python llm_pipeline.py --relevance-model "openai/gpt-4o-mini" --answer-model "openai/gpt-4o"
What it does: LLM answers questions from questions.yml
Output: data/papers/{paper_id}/answers.json
Questions: Defined in metabeeai_llm/questions.yml
For details: metabeeai_llm/README.md
Step 3: Human Review (Optional)
cd llm_review_software
python beegui.py
What it does: GUI interface for reviewing and annotating LLM answers
Output: data/papers/{paper_id}/answers_extended.json
Features: View PDFs, edit answers, rate quality
Step 4: Benchmarking & Evaluation
4a. Prepare Reviewer Data
If you have CSV golden answers:
cd metabeeai_llm
python convert_goldens.py
Output: data/papers/{paper_id}/rev1_answers.json
If you used the GUI review tool, the data is already ready in answers_extended.json.
4b. Create Benchmark Dataset
For CSV reviewer answers:
cd llm_benchmarking
python prep_benchmark_data.py
For GUI reviewer answers:
python prep_benchmark_data_from_GUI_answers.py
Output: data/benchmark_data.json or data/benchmark_data_gui.json
4c. Run Evaluation
# Evaluate all questions
python deepeval_benchmarking.py --question design
python deepeval_benchmarking.py --question population
python deepeval_benchmarking.py --question welfare
# Or evaluate all at once
python deepeval_benchmarking.py
Output: deepeval_results/combined_results_{question}_{timestamp}.json
Cost: ~$0.95 for 10 papers × 3 questions
For details: llm_benchmarking/README.md
4d. Visualize Results
python plot_metrics_comparison.py
Output: deepeval_results/plots/metrics_comparison.png
4e. Identify Problem Papers (Optional)
# Get bottom 3 papers
python edge_cases.py --num-cases 3
Output: edge_cases/edge-case-report.md
Step 5: Data Analysis
cd query_database
# Analyze trends
python trend_analysis.py
# Network analysis
python network_analysis.py
# Investigate specific topics
python investigate_bee_species.py
python investigate_pesticides.py
Output: query_database/output/ (plots, reports, JSON data)
Project Structure
primate-welfare/
├── .env # API keys (create from env.example)
├── config.py # Centralized configuration
├── requirements.txt # Python dependencies
│
├── data/ # Data directory
│ ├── papers/ # Paper-specific data
│ │ └── {paper_id}/
│ │ ├── {paper_id}_main.pdf # Original PDF
│ │ ├── pages/
│ │ │ ├── main_p01.pdf.json # Page JSONs
│ │ │ └── merged_v2.json # Merged & deduplicated
│ │ ├── answers.json # LLM answers
│ │ ├── rev1_answers.json # With CSV reviewer answers
│ │ └── answers_extended.json # GUI reviewer answers
│ ├── golden_answers.csv # CSV reviewer answers (input)
│ ├── benchmark_data.json # Benchmark dataset
│ └── benchmark_data_gui.json # Benchmark dataset (GUI)
│
├── process_pdfs/ # Stage 1: PDF Processing
│ ├── README.md # Detailed documentation
│ ├── process_all.py # Main processing script
│ ├── split_pdf.py # PDF splitting
│ ├── va_process_papers.py # Vision AI extraction
│ ├── merger.py # JSON merging
│ └── deduplicate_chunks.py # Deduplication
│
├── metabeeai_llm/ # Stage 2: LLM Q&A
│ ├── README.md # Detailed documentation
│ ├── llm_pipeline.py # Main LLM pipeline
│ ├── questions.yml # Question definitions
│ ├── convert_goldens.py # CSV → JSON converter
│ └── json_multistage_qa.py # Core LLM functions
│
├── llm_review_software/ # Stage 3: Human Review
│ ├── beegui.py # GUI review interface
│ └── annotator.py # Annotation logic
│
├── llm_benchmarking/ # Stage 4: Evaluation
│ ├── README.md # Detailed documentation
│ ├── prep_benchmark_data.py # Prepare from CSV
│ ├── prep_benchmark_data_from_GUI_answers.py # Prepare from GUI
│ ├── deepeval_benchmarking.py # Run evaluation
│ ├── plot_metrics_comparison.py # Visualize results
│ ├── edge_cases.py # Find problem papers
│ └── deepeval_results/ # Evaluation outputs
│ ├── combined_results_*.json
│ └── plots/
│
└── query_database/ # Stage 5: Data Analysis
├── README.md # Analysis documentation
├── trend_analysis.py # Temporal trends
├── network_analysis.py # Relationship networks
└── output/ # Analysis outputs
Common Use Cases
Use Case 1: Process New Papers
# 1. Add PDFs to data/papers/{paper_id}/
# 2. Process PDFs
cd process_pdfs
python process_all.py
# 3. Extract information (recommended: use balanced config)
cd ../metabeeai_llm
python llm_pipeline.py --config balanced
Result: Structured answers in answers.json for each paper
Use Case 2: Review LLM Answers
cd llm_review_software
python beegui.py
Features:
- View PDF alongside LLM answers
- Edit and validate answers
- Rate answer quality
- Navigate between papers
Use Case 3: Benchmark LLM Performance
# 1. Prepare reviewer answers (if from CSV)
cd metabeeai_llm
python convert_goldens.py
# 2. Create benchmark dataset
cd ../llm_benchmarking
python prep_benchmark_data.py
# 3. Run evaluation
python deepeval_benchmarking.py --question welfare
# 4. Visualize
python plot_metrics_comparison.py
# 5. Find problem papers
python edge_cases.py --num-cases 3
Result:
- Performance metrics across 5 dimensions
- Comparison plots
- Edge case analysis
Use Case 4: Analyze Extracted Data
cd query_database
# Analyze welfare measure trends
python trend_analysis.py
# Analyze relationships between variables
python network_analysis.py
Result: Plots and reports in query_database/output/
Question Types
The pipeline currently handles three question types for primate welfare:
1. Design
Question: What is the overview of the study, the number of groups being monitored and the sample size?
Example Answer:
1. Overview: Compares wounding rates between groups, looking at impacts
of age, group composition, and presence of young silverbacks;
Groups: 45; n = 180
2. Population
Question: What species, sex, age range, mean age and SD, are studied? At what location and were they pair or group housed, and what was the social group composition?
Example Answer:
Species 1: western lowland Gorilla; sex: M and F; age range: 1-55 years;
mean age: NA; location: USA (across 28 AZA accredited zoos);
social group: Group; composition: Mixed-sex groups (n = 26; 41 males,
91 females) and bachelor groups (n = 19; 48 males)
3. Welfare
Question: What are the measures of welfare used in the study, and has the link between the measure and welfare, wellbeing, or chronic stress been made?
Example Answer:
1. Measure: Wounding rates; Link made: Y; Welfare measure description:
Rates of wounding over period of many years; Units: Wounds per gorilla
per month; Collection method: Animal care staff recorded all wounds that
occurred within groups using a standardized data sheet
Questions are fully defined in metabeeai_llm/questions.yml with instructions, examples, and configuration.
Model Selection
The LLM pipeline supports different model configurations for optimal performance:
Predefined Configurations (Recommended)
# Fast & cheap processing
python llm_pipeline.py --config fast
# Balanced speed and quality (recommended)
python llm_pipeline.py --config balanced
# High quality for critical analysis
python llm_pipeline.py --config quality
Custom Model Selection
# Specify individual models
python llm_pipeline.py --relevance-model "openai/gpt-4o-mini" --answer-model "openai/gpt-4o"
| Configuration | Relevance Model | Answer Model | Use Case |
|---|---|---|---|
| Fast | gpt-4o-mini | gpt-4o-mini | High-volume processing, cost-sensitive |
| Balanced | gpt-4o-mini | gpt-4o | Recommended for most use cases |
| Quality | gpt-4o | gpt-4o | Critical analysis, maximum accuracy |
Configuration
Global Configuration (config.py)
Centralized configuration for all pipeline components:
from config import get_papers_dir, get_data_dir
# Get configured directories
papers_dir = get_papers_dir() # Default: data/papers
data_dir = get_data_dir() # Default: data
Environment Variables (set in .env):
- METABEEAI_DATA_DIR - Base data directory (default: data)
- OPENAI_API_KEY - OpenAI API key
- LANDING_AI_API_KEY - LandingLens API key
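For reference, the directory helpers can be thought of as thin wrappers over these variables. This is a sketch only; the real config.py may differ:

```python
# Sketch of how the directory helpers resolve paths from the environment
# (illustrative only; see config.py for the actual implementation)
import os
from pathlib import Path

def get_data_dir() -> Path:
    # METABEEAI_DATA_DIR overrides the default "data" directory
    return Path(os.environ.get("METABEEAI_DATA_DIR", "data"))

def get_papers_dir() -> Path:
    return get_data_dir() / "papers"
```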
Question Configuration (metabeeai_llm/questions.yml)
Define questions with:
- Question text
- Instructions for LLM
- Expected output format
- Examples (good and bad)
- Retrieval parameters (max_chunks, min_score)
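A short sketch of reading these definitions with PyYAML. The field names follow the list above and a top-level mapping of question names is assumed; questions.yml is the authoritative schema:

```python
# Inspect question definitions and their retrieval parameters
# (assumes a top-level mapping of question names; see questions.yml for the schema)
import yaml  # PyYAML

with open("metabeeai_llm/questions.yml") as f:
    questions = yaml.safe_load(f)

for name, spec in questions.items():
    print(name, "| max_chunks:", spec.get("max_chunks"),
          "| min_score:", spec.get("min_score"))
```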
Benchmarking Metrics
The pipeline evaluates LLM performance using 5 metrics:
Standard DeepEval Metrics (3)
1. Faithfulness (0-1, higher is better)
   - Measures if the LLM answer contradicts the source text
   - Perfect score: no hallucinations or contradictions
2. Contextual Precision (0-1, higher is better)
   - Evaluates if relevant chunks are ranked highly
   - Perfect score: most relevant chunks retrieved first
3. Contextual Recall (0-1, higher is better)
   - Checks if the expected answer is supported by the retrieved chunks
   - Perfect score: all key points have source support
G-Eval Metrics (2)
1. Completeness (0-1, threshold: 0.5)
   - Assesses if the answer covers all key points
   - Uses GPT-4o to evaluate against the reviewer answer
2. Accuracy (0-1, threshold: 0.5)
   - Evaluates information accuracy
   - Uses GPT-4o to compare LLM vs. reviewer answers
Typical Performance (based on 10 primate welfare papers):
- Standard metrics: 0.7-1.0 (good)
- G-Eval metrics: 0.4-0.5 (moderate)
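To illustrate how one of these scores is produced, here is a minimal sketch using DeepEval's standard FaithfulnessMetric API; the pipeline's own implementation lives in llm_benchmarking/deepeval_benchmarking.py, and the inputs below are placeholders:

```python
# Minimal illustration of scoring one answer with DeepEval's FaithfulnessMetric
# (placeholder inputs; the pipeline's real script is deepeval_benchmarking.py)
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What are the measures of welfare used in the study?",
    actual_output="1. Measure: Wounding rates; Link made: Y; ...",
    retrieval_context=["Animal care staff recorded all wounds ..."],
)

metric = FaithfulnessMetric(threshold=0.5)  # uses an LLM judge under the hood
metric.measure(test_case)
print(metric.score, metric.reason)
```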
Cost Estimates
Based on typical usage with GPT-4o:
| Task | Papers | Questions | Cost |
|---|---|---|---|
| LLM Extraction | 10 | 3 per paper | ~$2-3 |
| Benchmarking | 10 | 3 questions | ~$0.95 |
| Edge Case Analysis | 3 bottom papers | All questions | ~$0.05 |
| TOTAL | 10 papers | Full pipeline | ~$3-4 |
Cost Reduction Options:
- Use --config fast instead of --config quality (3-5x cheaper)
- Use --config balanced for an optimal cost/quality trade-off
- Process fewer papers initially for testing
Detailed Documentation
Each component has detailed documentation:
| Component | Documentation |
|---|---|
| PDF Processing | process_pdfs/README.md |
| LLM Pipeline | metabeeai_llm/README.md |
| Benchmarking | llm_benchmarking/README.md |
| Data Analysis | query_database/README.md |
Tutorial: Process Your First 3 Papers
Complete Example
# 1. Set up environment
source venv/bin/activate
cp env.example .env
# Edit .env with your API keys
# 2. Add 3 PDFs to data/papers/
mkdir -p data/papers/PAPER001
cp your_paper.pdf data/papers/PAPER001/PAPER001_main.pdf
# Repeat for PAPER002, PAPER003
# 3. Process PDFs
cd process_pdfs
python process_all.py
# Output: merged_v2.json for each paper
# 4. Run LLM extraction (recommended: balanced config)
cd ../metabeeai_llm
python llm_pipeline.py --config balanced
# Output: answers.json for each paper
# 5. Review answers (optional)
cd ../llm_review_software
python beegui.py
# Manually review and validate
# 6. If you have reviewer answers in CSV:
cd ../metabeeai_llm
python convert_goldens.py
# 7. Create benchmark dataset
cd ../llm_benchmarking
python prep_benchmark_data.py
# Output: data/benchmark_data.json
# 8. Run evaluation
python deepeval_benchmarking.py --question welfare
# Output: deepeval_results/combined_results_welfare_*.json
# 9. Visualize results
python plot_metrics_comparison.py
# Output: deepeval_results/plots/metrics_comparison.png
# 10. Find problem papers
python edge_cases.py --num-cases 2
# Output: edge_cases/edge-case-report.md
Expected time:
- PDF processing: ~5-10 min per paper
- LLM extraction: ~2-3 min per paper
- Evaluation: ~1-2 min per question
Understanding the Output
LLM Answers (answers.json)
{
"QUESTIONS": {
"welfare": {
"answer": "1. Measure: Wounding rates; Link made: Y; ...",
"reason": "The study provides detailed information...",
"chunk_ids": ["uuid1", "uuid2"]
}
}
}
- answer: LLM's structured response
- reason: Why this answer was generated
- chunk_ids: Source text chunks used
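A small sketch for collecting one question's answers across all papers; the paths and JSON keys follow the structure shown above:

```python
# Collect the "welfare" answers from every paper's answers.json
# (paths and keys follow the structure shown above)
import json
from pathlib import Path

for answers_file in sorted(Path("data/papers").glob("*/answers.json")):
    with answers_file.open() as f:
        data = json.load(f)
    welfare = data.get("QUESTIONS", {}).get("welfare", {})
    print(answers_file.parent.name, "->", welfare.get("answer", "")[:60])
```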
Benchmark Results
{
"paper_id": "4YD2Y4J8",
"question_key": "welfare",
"actual_output": "LLM answer",
"expected_output": "Reviewer answer",
"success": true/false,
"metrics_data": [
{
"name": "Faithfulness",
"score": 0.85,
"success": true,
"reason": "Explanation..."
}
]
}
- success: True if all metrics passed thresholds
- metrics_data: Detailed results for each metric
- Score interpretation: See llm_benchmarking/README.md
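For a quick summary outside the plotting script, here is a hedged sketch that averages metric scores from one results file; it assumes the file holds a JSON list of records shaped like the example above, and the filename is hypothetical:

```python
# Average each metric's score across papers in one combined results file
# (assumes a JSON list of records shaped like the example above)
import json
from collections import defaultdict

# Hypothetical filename; use your actual timestamped results file
with open("deepeval_results/combined_results_welfare_20250101.json") as f:
    results = json.load(f)

totals = defaultdict(list)
for record in results:
    for m in record.get("metrics_data", []):
        totals[m["name"]].append(m["score"])

for name, scores in totals.items():
    print(f"{name}: mean={sum(scores) / len(scores):.2f} (n={len(scores)})")
```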
Troubleshooting
Common Issues
Issue: Module not found errors
# Solution: Activate virtual environment
source venv/bin/activate
Issue: API key errors
# Solution: Check .env file exists and has valid keys
cat .env
Issue: "Context too long" warnings
# Solution: Use faster models or reduce batch size
python llm_pipeline.py --config fast
Issue: Empty GUI window
# Solution: Check folder names are alphanumeric (not just numeric)
# The GUI now accepts folders like: 4YD2Y4J8, 76DQP2DC, etc.
Issue: UTF-8 BOM in CSV
# Solution: Scripts read CSVs with utf-8-sig encoding, which strips the BOM
# (a stray '\ufeff' prefix in column names is handled automatically)
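For context, this is the standard-library idiom the scripts rely on, shown here as a sketch:

```python
# Reading a CSV with utf-8-sig strips a leading BOM from the first header
import csv

with open("data/golden_answers.csv", newline="", encoding="utf-8-sig") as f:
    reader = csv.DictReader(f)
    print(reader.fieldnames)  # no '\ufeff' prefix on the first column
```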
Current Dataset
Primate Welfare Literature Review
- Total Papers: 41 papers in data/papers/
- With Golden Answers: 10 papers in data/golden_answers.csv
- With GUI Answers: 1 paper with answers_extended.json
- Questions: 3 per paper (design, population, welfare)
- Species Covered: Gorillas, macaques, chimpanzees, bonobos, orangutans, lemurs, marmosets, slow lorises
Sample Papers:
- 4YD2Y4J8: Western lowland gorilla wounding rates
- 76DQP2DC: Rhesus macaque welfare and personality
- WIZ9MV3T: Chimpanzee locomotion as wellbeing indicator
- V7984AAU: Body condition score in slow lorises
- 8BV8BLU8: Orangutan subjective wellbeing
Key Scripts Reference
PDF Processing
- process_pdfs/process_all.py - Main processor
LLM Extraction
- metabeeai_llm/llm_pipeline.py - Extract information from papers
  - --config {fast,balanced,quality} - Use predefined configurations
  - --relevance-model - Specify chunk selection model
  - --answer-model - Specify answer generation model
- metabeeai_llm/convert_goldens.py - Convert CSV → JSON reviewer answers
Benchmarking
- llm_benchmarking/prep_benchmark_data.py - Prepare benchmark dataset
- llm_benchmarking/deepeval_benchmarking.py - Run evaluation (5 metrics)
- llm_benchmarking/plot_metrics_comparison.py - Visualize results
- llm_benchmarking/edge_cases.py - Find lowest-scoring papers
Review Interface
- llm_review_software/beegui.py - GUI for reviewing answers
Best Practices
1. Start Small
- Test with 3-5 papers first
- Use --limit flags to test scripts
- Verify outputs before scaling up
2. Version Control
- Results are timestamped (no overwrites)
- Keep original answers.json files unchanged
- Reviewer answers go in separate files
3. Cost Management
- Use --config fast for initial testing
- Use --config balanced for production runs
- Test with specific papers using --folders before full runs
4. Quality Assurance
- Review edge cases to identify patterns
- Check low-scoring papers manually
- Validate LLM answers with GUI tool
Additional Resources
Documentation
- LLM Benchmarking: llm_benchmarking/README.md (comprehensive guide)
- PDF Processing: process_pdfs/README.md
- LLM Pipeline: metabeeai_llm/README.md
External Links
- DeepEval Docs: https://docs.confident-ai.com/
- OpenAI API: https://platform.openai.com/docs
- Landing AI: https://landing.ai/
Contributing
When adding new question types:
1. Define the question in questions.yml:

   new_question:
     question: "Your question here?"
     instructions: [...]
     output_format: "..."
     example_output: [...]
     max_chunks: 6
     min_score: 0.4

2. Update the CSV template (if using CSV reviewers):
   - Add a column for the new question
   - Update convert_goldens.py to handle it

3. Update the question lists:
   - llm_benchmarking/llm_questions.txt
   - llm_benchmarking/edge_cases.py (question_types list)

4. Re-run the pipeline from Step 2
Support
For issues:
- Check relevant README in component folder
- Review error messages carefully
- Verify all input files exist
- Check API keys and credits
- Consult script-specific documentation
Project: MetaBeeAI - Bees & Pesticides
Version: 2.0
Last Updated: October 8, 2025
Written by: Rachel Parkinson, Shuxiang Cao, Mikael Mieskolainen
Contact: See project documentation