Convert scientific posters (PDF/images) to structured JSON metadata using Large Language Models
poster2json
Documentation · Changelog · Report Bug · Request Feature
Description
poster2json extracts structured metadata from scientific conference posters (PDF or image format) into machine-actionable JSON conforming to the poster-json-schema.
The pipeline uses:
- Llama-3.1-8B-Instruct (a verbatim mirror of Meta's release; swap with any HuggingFace instruct model via --model) for JSON structuring
- Qwen2-VL-7B for vision-based OCR of image posters
- pdfalto for layout-aware PDF text extraction
- lingua-language-detector for ISO 639-1 language detection on body text (overrides any value the model emits; body text beats metadata-fragment guessing)
- ROR (https://api.ror.org) for affiliation and publisher canonicalisation; matched names get a ROR identifier attached
- SPDX matching (with integer-exact version handling) for license normalisation in rightsList
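To illustrate what integer-exact version handling can mean for license strings, here is a minimal sketch. It is not poster2json's actual matcher: the table SPDX_TABLE, the function match_license, and the regular expression (which only covers a few Creative Commons Attribution variants) are assumptions made for this example. The version is parsed into integers and looked up exactly, so 4.1 or 4.10 never falls back to 4.0.

```python
import re

# Hypothetical lookup table: (family, major, minor) -> SPDX identifier.
# A real table would cover far more licenses.
SPDX_TABLE = {
    ("cc-by", 3, 0): "CC-BY-3.0",
    ("cc-by", 4, 0): "CC-BY-4.0",
    ("cc-by-sa", 4, 0): "CC-BY-SA-4.0",
}

# Matches strings such as "CC BY 4.0", "CC-BY-4.0" or "cc by-sa 4.0".
_PATTERN = re.compile(
    r"\b(cc[- ]by(?:[- ]sa)?)[- ]?(\d+)\.(\d+)\b", re.IGNORECASE
)


def match_license(text):
    """Return an SPDX identifier, or None when no exact version match exists."""
    found = _PATTERN.search(text)
    if found is None:
        return None
    family = found.group(1).lower().replace(" ", "-")
    # Compare versions as integers, never as string prefixes.
    key = (family, int(found.group(2)), int(found.group(3)))
    return SPDX_TABLE.get(key)
```

Returning None for an unknown version keeps the original string unnormalised rather than attaching a wrong identifier, which is the conservative behaviour the list above describes.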
Quick Start
Installation
pip install poster2json
CLI Usage
# Extract metadata from a poster (default: Llama-3.1-8B-Instruct @ 4bit)
poster2json extract poster.pdf -o result.json
# Use a different instruct model (any HuggingFace repo id works)
poster2json extract poster.pdf --model google/gemma-2-9b-it --quantization 4bit
# Trade VRAM for quality
poster2json extract poster.pdf --quantization 8bit
poster2json extract poster.pdf --quantization fp16
# Validate extracted JSON
poster2json validate result.json
# Process multiple posters
poster2json batch ./posters/ -o ./output/
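When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below uses only the subcommand and flags shown above (extract, -o, --model, --quantization). The helper names build_extract_command and run_extract are hypothetical, and treating 4bit, 8bit, and fp16 as the only accepted quantization values is an assumption based on the examples above.

```python
import subprocess

# Assumption: only the quantization values shown in the CLI examples are valid.
KNOWN_QUANTIZATION = {"4bit", "8bit", "fp16"}


def build_extract_command(poster, output=None, model=None, quantization=None):
    """Build the argv list for a `poster2json extract` call."""
    cmd = ["poster2json", "extract", str(poster)]
    if output is not None:
        cmd += ["-o", str(output)]
    if model is not None:
        cmd += ["--model", model]
    if quantization is not None:
        if quantization not in KNOWN_QUANTIZATION:
            raise ValueError(f"unexpected quantization: {quantization!r}")
        cmd += ["--quantization", quantization]
    return cmd


def run_extract(poster, **options):
    """Run the CLI and raise CalledProcessError if it exits non-zero."""
    return subprocess.run(build_extract_command(poster, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with poster file names that contain spaces.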
Python API
from poster2json import extract_poster, validate_poster
# Extract metadata
result = extract_poster("poster.pdf")
print(result["titles"][0]["title"])
# Validate the result
is_valid = validate_poster(result)
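The two functions above can be combined to process a folder from Python. The following is a sketch under stated assumptions: extract_directory is a hypothetical helper, the suffix list is a guess at which image formats are relevant, the extraction result is assumed to be JSON-serialisable, and the validation result is assumed to be truthy on success. The extract and validate callables are passed in as parameters so the loop can be tried with stubs before loading a model.

```python
import json
from pathlib import Path


def extract_directory(poster_dir, out_dir, extract, validate,
                      suffixes=(".pdf", ".png", ".jpg", ".jpeg")):
    """Extract every poster in poster_dir and write one JSON file per poster.

    Returns a dict mapping each poster file name to its validation result.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    report = {}
    for poster in sorted(Path(poster_dir).iterdir()):
        if poster.suffix.lower() not in suffixes:
            continue  # skip files that are not posters
        result = extract(str(poster))
        target = out_path / f"{poster.stem}.json"
        target.write_text(
            json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        report[poster.name] = bool(validate(result))
    return report
```

With the package installed, a call would look like extract_directory("./posters", "./output", extract_poster, validate_poster).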
Output Format
Output conforms to the poster-json-schema (DataCite 4.7):
{
  "$schema": "https://posters.science/schema/v0.2/poster_schema.json",
  "creators": [
    {
      "name": "Garcia, Sofia",
      "givenName": "Sofia",
      "familyName": "Garcia",
      "affiliation": [
        {
          "name": "Stanford University",
          "affiliationIdentifier": "https://ror.org/00f54p054",
          "affiliationIdentifierScheme": "ROR",
          "schemeUri": "https://ror.org/"
        }
      ]
    }
  ],
  "titles": [
    { "title": "Machine Learning Approaches to Diabetic Retinopathy Detection" }
  ],
  "publicationYear": 2025,
  "language": "en",
  "researchField": "Health Sciences",
  "subjects": [
    { "subject": "Machine Learning" },
    { "subject": "Diabetic Retinopathy" }
  ],
  "descriptions": [
    { "description": "We present a deep learning model...", "descriptionType": "Abstract" }
  ],
  "publisher": { "name": "Zenodo" },
  "rightsList": [
    {
      "rights": "Creative Commons Attribution 4.0 International",
      "rightsIdentifier": "CC-BY-4.0",
      "rightsIdentifierScheme": "SPDX",
      "schemeUri": "https://spdx.org/licenses/",
      "rightsUri": "https://creativecommons.org/licenses/by/4.0/"
    }
  ],
  "content": {
    "sections": [
      { "sectionTitle": "Abstract", "sectionContent": "..." },
      { "sectionTitle": "Methods", "sectionContent": "..." },
      { "sectionTitle": "Results", "sectionContent": "..." }
    ]
  },
  "imageCaptions": [{ "id": "fig1", "caption": "Figure 1. ROC curves showing..." }],
  "tableCaptions": [{ "id": "table1", "caption": "Table 1. Performance metrics" }]
}
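Because the output is plain JSON, downstream code can read it with the standard library alone. The sketch below pulls a few fields out of a record shaped like the example above. The function name summarize is hypothetical, and treating every key as optional is a defensive assumption, not a statement about what the schema requires.

```python
def summarize(record):
    """Collect the title, creator names, ROR ids, and section titles."""
    titles = record.get("titles") or []
    creators = record.get("creators") or []
    # Gather unique ROR identifiers across all creator affiliations.
    ror_ids = sorted({
        aff["affiliationIdentifier"]
        for creator in creators
        for aff in creator.get("affiliation", [])
        if aff.get("affiliationIdentifierScheme") == "ROR"
        and "affiliationIdentifier" in aff
    })
    sections = (record.get("content") or {}).get("sections", [])
    return {
        "title": titles[0]["title"] if titles else None,
        "creators": [creator.get("name") for creator in creators],
        "ror_ids": ror_ids,
        "sections": [section["sectionTitle"] for section in sections],
    }
```

A file written by the CLI can be loaded with json.load and passed straight to this function.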
Notes on the auto-populated fields:
- language is detected from the raw body text (lingua heuristic). Returns null when text is too short (<200 chars / <50 non-ASCII codepoints) or the detector is unsure.
- researchField must be one of the four OpenAlex top-level domains: Health Sciences, Life Sciences, Physical Sciences, Social Sciences. Null when the model can't pick one confidently.
- affiliation and publisher get ROR enrichment when the matcher returns a high-confidence chosen result. Strings without a confident match pass through unchanged. Set POSTER2JSON_ROR=0 to disable.
- rightsList entries are matched against an SPDX table; the matcher is conservative on version numbers (e.g. CC-BY-4.0 and CC-BY-4.1 are never confused).
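The constraints in these notes can be checked on any output record with a few lines of Python. This is a consumer-side sketch, not part of poster2json: check_auto_fields is a hypothetical name, and the language test only checks the two-letter lowercase shape of an ISO 639-1 code, not membership in the official code list.

```python
# The four OpenAlex top-level domains named in the notes above.
RESEARCH_FIELDS = {
    "Health Sciences",
    "Life Sciences",
    "Physical Sciences",
    "Social Sciences",
}


def check_auto_fields(record):
    """Return a list of human-readable problems; empty when all checks pass."""
    problems = []

    language = record.get("language")
    if language is not None:
        looks_like_code = (
            isinstance(language, str)
            and len(language) == 2
            and language.isascii()
            and language.isalpha()
            and language.islower()
        )
        if not looks_like_code:
            problems.append(f"language is not a two-letter code: {language!r}")

    field = record.get("researchField")
    if field is not None and field not in RESEARCH_FIELDS:
        problems.append(f"unknown researchField: {field!r}")

    for creator in record.get("creators", []):
        for aff in creator.get("affiliation", []):
            if aff.get("affiliationIdentifierScheme") != "ROR":
                continue  # affiliation passed through without enrichment
            identifier = str(aff.get("affiliationIdentifier", ""))
            if not identifier.startswith("https://ror.org/"):
                problems.append(f"malformed ROR identifier: {identifier!r}")

    return problems
```

A null language or researchField is treated as valid here, since the notes describe null as the expected result when detection is not confident.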
System Requirements
| Requirement | Specification |
|---|---|
| GPU | NVIDIA CUDA-capable, ≥8GB VRAM (default 4bit); ≥16GB for --quantization fp16 or image/OCR posters |
| RAM | ≥32GB recommended |
| Python | 3.10+ |
| OS | Linux, macOS, Windows (via WSL2) |
Performance
Validated on 10 manually annotated scientific posters:
| Metric | Score | Threshold |
|---|---|---|
| Word Capture | 0.96 | ≥0.75 |
| ROUGE-L | 0.89 | ≥0.75 |
| Number Capture | 0.93 | ≥0.75 |
| Field Proportion | 0.99 | 0.50–2.00 |
Pass Rate: 10/10 (100%)
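The thresholds in the table translate directly into a pass/fail check. The sketch below assumes that a poster passes only when all four metrics are within range; the source reports the pass rate but does not spell out that rule. The dictionary keys and the function name passes are also hypothetical.

```python
# (lower bound, upper bound) per metric; None means no upper bound.
THRESHOLDS = {
    "word_capture": (0.75, None),
    "rouge_l": (0.75, None),
    "number_capture": (0.75, None),
    "field_proportion": (0.50, 2.00),
}


def passes(scores):
    """Return True when every metric falls inside its threshold range."""
    for metric, (low, high) in THRESHOLDS.items():
        value = scores[metric]
        if value < low:
            return False
        if high is not None and value > high:
            return False
    return True
```

Field Proportion is the only metric with an upper bound, which is why the ranges are stored as pairs.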
Documentation
| Document | Description |
|---|---|
| Architecture | Technical details & methodology |
| Evaluation | Validation metrics & results |
Development Setup
# Clone the repository
git clone https://github.com/fairdataihub/poster2json.git
cd poster2json
# Create a virtual environment
python -m venv .venv
# Activate the virtual environment
source .venv/bin/activate # On Linux/macOS
.venv\Scripts\activate # On Windows
# Install poetry
pip install poetry
# Install dependencies
poetry install
# Run tests
poe test
# Format code
poe format
If you are on Windows and have multiple Python versions installed, you can use the following commands:
py -0p # list all python versions
py -3.12 -m venv .venv
License
MIT License - see LICENSE for details.
Citation
@software{poster2json2026,
title = {poster2json: Scientific Poster to JSON Metadata Extraction},
author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
year = {2026},
version = {0.4.3},
url = {https://github.com/fairdataihub/poster2json},
doi = {10.5281/zenodo.18320010}
}
Funding
This project is funded by The Navigation Fund (10.71707/rk36-9x79).
Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
File details
Details for the file poster2json-0.5.3.tar.gz.
File metadata
- Download URL: poster2json-0.5.3.tar.gz
- Upload date:
- Size: 54.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.4.0 CPython/3.12.13 Linux/6.17.0-1010-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 34bf05fa42f355aea1236420e873c774d4f4b88674e063bc92b1f59129720ca1 |
| MD5 | 442af13dbaabbbc1b95d7fae57fdbd50 |
| BLAKE2b-256 | af527508cdc2b8f1456a8aae1fd7cbadaf4885321b31f8e2d2d9323cf6918342 |
File details
Details for the file poster2json-0.5.3-py3-none-any.whl.
File metadata
- Download URL: poster2json-0.5.3-py3-none-any.whl
- Upload date:
- Size: 57.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.4.0 CPython/3.12.13 Linux/6.17.0-1010-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | afec47702820551511aeb24d5912b9746c0987664acac722b6611ead41358941 |
| MD5 | 4191b95ab0dcf3b792d3543283d0a735 |
| BLAKE2b-256 | 593524fd0734aa261a90290f0979420f90be61b1c8a32085548c98746855418b |