PERLA Extract

PERLA Extract is an automated data extraction tool that uses large language models (LLMs) to identify and structure key information on perovskite solar cells from scientific papers. This includes device parameters, material compositions, and performance metrics, all of which are collected and stored in the PERLA database.

Features

🔬 Intelligent Extraction: Automatically extracts structured data about perovskite solar cells from scientific papers
📄 Multiple PDF Processors: Supports PyMuPDF, Nougat, and Marker for PDF preprocessing
🤖 LLM Integration: Works with various LLM providers via LiteLLM (Claude, GPT-4, GPT-5, and more)
✅ Structured Output: Validates and structures data using Pydantic models
🔄 Post-processing: Automatic unit normalization and data validation
📊 Evaluation Metrics: Built-in precision and recall evaluation against ground truth
📤 Export Formats: Export to JSON or NOMAD archive format
🤖 Automated Discovery: Papersbot integration for automated paper discovery and processing
📦 Evaluation Dataset: Includes ground truth data and extractions from multiple LLM models and human annotators for benchmarking
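
To make the post-processing step more concrete, here is a minimal sketch of what unit normalization can look like. It is not the package's implementation: the quantity names, the supported units, and the choice of canonical units are all assumptions made for illustration.

```python
# Hypothetical conversion table: canonical unit plus multiplicative factors.
# Quantity names and canonical units are illustrative assumptions,
# not taken from the PERLA Extract source.
TO_CANONICAL = {
    "voltage": ("V", {"V": 1.0, "mV": 1e-3}),
    "current_density": ("mA/cm2", {"mA/cm2": 1.0, "A/cm2": 1e3, "A/m2": 0.1}),
    "efficiency": ("%", {"%": 1.0, "fraction": 100.0}),
}


def normalize(quantity: str, value: float, unit: str) -> tuple[float, str]:
    """Convert a reported value to the canonical unit for its quantity."""
    canonical_unit, factors = TO_CANONICAL[quantity]
    if unit not in factors:
        # Fail loudly rather than silently storing a value in the wrong unit.
        raise ValueError(f"Unsupported unit {unit!r} for {quantity}")
    return value * factors[unit], canonical_unit


# An open-circuit voltage reported in millivolts is stored in volts.
print(normalize("voltage", 1100, "mV"))
```

Keeping the factors in a table rather than in branching code makes it easy to add a unit without touching the conversion logic.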

Installation

Prerequisites

Python 3.10 or higher
pip

Basic Installation

pip install perla-extract

Optional Dependencies

For specific PDF processors:

# For Nougat OCR processing
pip install perla-extract[nougat]

# For Marker PDF processing
pip install perla-extract[marker]

# For Redis-based caching (requires Redis server)
pip install perla-extract[cache]

# For development dependencies
pip install perla-extract[dev]

Note on Caching: By default, PERLA Extract uses disk-based caching for LLM calls. If a Redis server is available, install the cache extra and configure Redis through the environment variables REDIS_HOST, REDIS_PORT, REDIS_PASSWORD, and REDIS_TTL to get persistent caching across sessions with better performance.
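
The sketch below shows one way the four Redis variables could be read into a settings dictionary. It is illustrative only: the function name, the fallback to disk caching when REDIS_HOST is unset, and the interpretation of REDIS_TTL as seconds are assumptions, not documented behavior of the package.

```python
import os
from typing import Mapping, Optional


def redis_settings(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return Redis settings from the environment, or None if REDIS_HOST is unset.

    Returning None is an assumed convention meaning "use the disk cache".
    """
    host = env.get("REDIS_HOST")
    if not host:
        return None
    ttl = env.get("REDIS_TTL")
    return {
        "host": host,
        # 6379 is the standard Redis port; using it as a default is an assumption.
        "port": int(env.get("REDIS_PORT", "6379")),
        "password": env.get("REDIS_PASSWORD"),
        # Assumed to be a number of seconds; None means no expiry configured.
        "ttl": int(ttl) if ttl else None,
    }


print(redis_settings({"REDIS_HOST": "cache.local", "REDIS_TTL": "3600"}))
```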

Data Directory

The data directory (src/perla_extract/data/) contains:

Extractions: Results from multiple LLM models and human annotators (including consensus annotations)
Ground Truth: Manually checked and corrected datasets (dev set for optimization, test set for evaluation)

See src/perla_extract/data/README.md for detailed information about the data structure and organization.

Quick Start

Setup

Set up the required environment variables for LLM API access and paper downloading:

# For Claude models (default)
export ANTHROPIC_API_KEY="your-anthropic-api-key"

# For OpenAI models (alternative)
export OPENAI_API_KEY="your-openai-api-key"

# For downloading papers via Papersbot
export UNPAYWALL_EMAIL="your-email@example.com"

LiteLLM supports many providers. Set the appropriate API key environment variable for your chosen model:

ANTHROPIC_API_KEY for Claude models
OPENAI_API_KEY for GPT models
GOOGLE_API_KEY for Gemini models
See LiteLLM documentation for other providers
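
A small preflight check can catch a missing key before any paper is processed. The following is a sketch under stated assumptions: the helper names are hypothetical, and matching on the start of the model name is a simplification of how LiteLLM actually routes models to providers.

```python
import os
from typing import Mapping, Optional

# Mapping taken from the list above; prefix matching is an assumption.
KEY_BY_PREFIX = {
    "claude": "ANTHROPIC_API_KEY",
    "gpt": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}


def required_key(model_name: str) -> Optional[str]:
    """Return the environment variable a model is expected to need, if known."""
    for prefix, variable in KEY_BY_PREFIX.items():
        if model_name.startswith(prefix):
            return variable
    return None  # Unknown provider: consult the LiteLLM documentation.


def missing_key(model_name: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the required variable if it is unset or empty."""
    variable = required_key(model_name)
    if variable and not env.get(variable):
        return variable
    return None


problem = missing_key("gpt-4o-mini", env={})
if problem:
    print(f"Set {problem} before running the extraction.")
```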

Run the Default Pipeline

The simplest way to see PERLA Extract in action:

perla-extract

This will:

Download papers using Papersbot
Extract data from all PDFs using the default model
Clean up downloaded files

Extract Data from a PDF

# Single PDF
perla-extract extract pdfs/paper.pdf

# With specific model
perla-extract extract --model_name=gpt-4o-mini pdfs/paper.pdf --output results/

# Directory of PDFs
perla-extract extract pdfs/ --output extractions/
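
To compare several models on the same PDFs, the commands above can be generated from Python. This sketch only builds and prints the command lines; it uses the `--model_name` and `--output` options shown above, while the helper name and the one-subdirectory-per-model layout are assumptions made for illustration.

```python
import shlex
from pathlib import PurePosixPath


def extract_command(pdf_path: str, model_name: str, output_root: str) -> list[str]:
    """Build an argument list for `perla-extract extract` (hypothetical helper)."""
    # One output subdirectory per model is an illustrative layout choice.
    output_dir = PurePosixPath(output_root) / model_name
    return [
        "perla-extract", "extract",
        f"--model_name={model_name}",
        pdf_path,
        "--output", str(output_dir),
    ]


models = ["claude-sonnet-4-20250514", "gpt-4o-mini"]
for model in models:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(extract_command("pdfs/", model, "extractions")))
```

To execute a command instead of printing it, pass the list to `subprocess.run(command, check=True)`.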

Evaluate Extractions

# Evaluate model against ground truth
perla-extract evaluate src/perla_extract/data/extractions/claude-opus-4-1-20250805/ src/perla_extract/data/ground_truth/test/

# Evaluate human performance
perla-extract evaluate src/perla_extract/data/extractions/humans/Consensus/ src/perla_extract/data/ground_truth/test/
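
For readers unfamiliar with the metrics, here is a minimal sketch of precision and recall computed over extracted fields. It assumes flat dictionaries and exact-match comparison; the field names are hypothetical, and the package's own evaluation may match fields and tolerate numeric differences in other ways.

```python
def precision_recall(extracted: dict, truth: dict) -> tuple[float, float]:
    """Score a flat field -> value extraction against ground truth by exact match."""
    correct = sum(
        1 for field, value in extracted.items()
        if field in truth and truth[field] == value
    )
    # Precision: share of extracted fields that are right.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: share of ground-truth fields that were recovered.
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical field names and values, for illustration only.
extracted = {"pce": 21.3, "voc": 1.12, "jsc": 24.0}
truth = {"pce": 21.3, "voc": 1.12, "jsc": 23.5, "ff": 80.1}

precision, recall = precision_recall(extracted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example two of the three extracted fields match, and two of the four ground-truth fields are recovered, so precision is 2/3 and recall is 1/2.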

Command Reference

`perla-extract extract`

Extract data from PDF files.

perla-extract extract <filepath> [--model_name=MODEL] [--preprocessor=PROCESSOR] [--output=DIR] [--use_cache] [--nomad] [--nomad_upload_id=ID]

Key options:

--model_name: LLM model (default: claude-sonnet-4-20250514). Supports any LiteLLM model (e.g., gpt-4o-mini, claude-3-5-sonnet-20240620)
--preprocessor: PDF processor - pymupdf, nougat, or marker (default: pymupdf)
--output: Output directory (default: ./extractions)
--use_cache: Enable API call caching
--nomad: Upload to NOMAD repository
--nomad_upload_id: ID of an existing NOMAD upload to append to (used with --nomad)

`perla-extract evaluate`

Evaluate extraction results against ground truth.

perla-extract evaluate <extraction_dir> <truth_dir>

`perla-extract papersbot`

Download papers automatically. Requires the UNPAYWALL_EMAIL environment variable (see Quick Start for setup).

`perla-extract optimizer`

Run the prompt optimization pipeline.

Uploading to NOMAD

PERLA Extract can automatically upload extraction results to NOMAD, a materials science data repository.

Setup:

export NOMAD_USERNAME="your-username"
export NOMAD_PASSWORD="your-password"
export NOMAD_URL="https://nomad-lab.eu/prod/v1/"  # Optional

Usage:

# Create a new upload
perla-extract extract --nomad pdfs/paper.pdf

# Append to existing upload
perla-extract extract --nomad --nomad_upload_id="upload-id" pdfs/paper.pdf

Each device/cell is uploaded as a separate NOMAD entry with automatic format conversion.

Authors

Sherjeel Shabih - sherjeel.shabih@hu-berlin.de
Pepe Marquez - jose.marquez@physik.hu-berlin.de
Kevin Jablonka - mail@kjablonka.com
Sharat Patil - sharat.patil@physik.hu-berlin.de

Citation

If you use PERLA Extract in your research, please cite:

TODO:
