poster2json
Convert scientific posters (PDF/images) to structured JSON metadata using Large Language Models.
Documentation · Changelog · Report Bug · Request Feature
Description
poster2json extracts structured metadata from scientific conference posters (PDF or image format) into machine-actionable JSON conforming to the poster-json-schema.
The pipeline uses:
- Llama-3.1-8B-Poster-Extraction for JSON structuring
- Qwen2-VL-7B for vision-based OCR of image posters
- pdfalto for layout-aware PDF text extraction
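The split between the two text-extraction stages can be pictured as a dispatch on input type. The following is only an illustrative sketch, not poster2json's actual code: the function name `choose_text_extractor` and the suffix sets are hypothetical, and the formats the package really accepts may differ.

```python
from pathlib import Path

# Hypothetical suffix sets; the formats poster2json actually accepts may differ.
PDF_SUFFIXES = {".pdf"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def choose_text_extractor(path: str) -> str:
    """Return the name of the stage that would extract text from this poster."""
    suffix = Path(path).suffix.lower()
    if suffix in PDF_SUFFIXES:
        return "pdfalto"      # layout-aware PDF text extraction
    if suffix in IMAGE_SUFFIXES:
        return "Qwen2-VL-7B"  # vision-based OCR for image posters
    raise ValueError(f"Unsupported poster format: {path!r}")
```

In either case the extracted text would then be passed to Llama-3.1-8B-Poster-Extraction for JSON structuring, as listed above.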
Quick Start
Installation
pip install poster2json
CLI Usage
# Extract metadata from a poster
poster2json extract poster.pdf -o result.json
# Validate extracted JSON
poster2json validate result.json
# Process multiple posters
poster2json batch ./posters/ -o ./output/
Python API
from poster2json import extract_poster, validate_poster
# Extract metadata
result = extract_poster("poster.pdf")
print(result["titles"][0]["title"])
# Validate the result
is_valid = validate_poster(result)
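For scripted runs over a folder, the two API calls above could be wrapped in a small loop. This is a minimal sketch under assumptions: `extract_directory` is a hypothetical helper (not part of poster2json), it only picks up `*.pdf` files, and it assumes the validator returns a truthy value for valid results. The extractor and validator are passed in as arguments, so the helper itself has no dependency on the package.

```python
import json
from pathlib import Path
from typing import Callable


def extract_directory(
    poster_dir,
    output_dir,
    extract: Callable[[str], dict],
    validate: Callable[[dict], bool],
) -> dict:
    """Run `extract` on every PDF in poster_dir and write one JSON file per poster.

    Returns a mapping of poster file name to its validation outcome.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    status = {}
    for pdf in sorted(Path(poster_dir).glob("*.pdf")):
        result = extract(str(pdf))
        target = out / f"{pdf.stem}.json"
        target.write_text(json.dumps(result, indent=2), encoding="utf-8")
        status[pdf.name] = bool(validate(result))
    return status
```

With the package installed, a call might look like `extract_directory("./posters", "./output", extract_poster, validate_poster)`.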
Output Format
Output conforms to the poster-json-schema (DataCite-based):
{
  "$schema": "https://posters.science/schema/v0.1/poster_schema.json",
  "creators": [
    {
      "name": "Garcia, Sofia",
      "givenName": "Sofia",
      "familyName": "Garcia",
      "affiliation": ["University"]
    }
  ],
  "titles": [
    { "title": "Machine Learning Approaches to Diabetic Retinopathy Detection" }
  ],
  "posterContent": {
    "sections": [
      { "sectionTitle": "Abstract", "sectionContent": "..." },
      { "sectionTitle": "Methods", "sectionContent": "..." },
      { "sectionTitle": "Results", "sectionContent": "..." }
    ]
  },
  "imageCaptions": [{ "captions": ["Figure 1.", "ROC curves showing..."] }],
  "tableCaptions": [{ "captions": ["Table 1.", "Performance metrics"] }]
}
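Because the output is plain JSON, downstream code can read it with the standard library alone. The sketch below uses a trimmed copy of the example above; `summarize` is a hypothetical helper, and it assumes only the field names shown in that example.

```python
import json

# Trimmed copy of the example output shown above.
sample = json.loads("""
{
  "creators": [{"name": "Garcia, Sofia", "affiliation": ["University"]}],
  "titles": [{"title": "Machine Learning Approaches to Diabetic Retinopathy Detection"}],
  "posterContent": {
    "sections": [
      {"sectionTitle": "Abstract", "sectionContent": "..."},
      {"sectionTitle": "Methods", "sectionContent": "..."},
      {"sectionTitle": "Results", "sectionContent": "..."}
    ]
  }
}
""")


def summarize(poster: dict) -> dict:
    """Collect the main title, author names, and section headings."""
    sections = poster.get("posterContent", {}).get("sections", [])
    return {
        "title": poster["titles"][0]["title"],
        "authors": [creator["name"] for creator in poster.get("creators", [])],
        "sections": [section["sectionTitle"] for section in sections],
    }


print(summarize(sample))
```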
System Requirements
| Requirement | Specification |
|---|---|
| GPU | NVIDIA CUDA-capable, ≥16 GB VRAM |
| RAM | ≥32 GB recommended |
| Python | 3.10+ |
| OS | Linux, macOS, Windows (via WSL2) |
Performance
Validated on 10 manually annotated scientific posters:
| Metric | Score | Threshold |
|---|---|---|
| Word Capture | 0.96 | ≥0.75 |
| ROUGE-L | 0.89 | ≥0.75 |
| Number Capture | 0.93 | ≥0.75 |
| Field Proportion | 0.99 | 0.50–2.00 |
Pass Rate: 10/10 (100%)
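The thresholds in the table can be read as a simple pass/fail check. The sketch below only encodes the threshold comparison; how each metric is computed is not shown here, and the assumption that a poster passes when all four metrics are within range is mine. The dictionary keys and the `passes` function are hypothetical names.

```python
# Threshold ranges from the table above, as (minimum, maximum or None).
THRESHOLDS = {
    "word_capture": (0.75, None),
    "rouge_l": (0.75, None),
    "number_capture": (0.75, None),
    "field_proportion": (0.50, 2.00),
}


def passes(metrics: dict) -> bool:
    """Return True if every metric lies within its threshold range.

    Assumption: a poster passes only when all four metrics are in range.
    """
    for name, (low, high) in THRESHOLDS.items():
        value = metrics[name]
        if value < low:
            return False
        if high is not None and value > high:
            return False
    return True


# Scores reported in the table above.
reported = {
    "word_capture": 0.96,
    "rouge_l": 0.89,
    "number_capture": 0.93,
    "field_proportion": 0.99,
}
print(passes(reported))
```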
Documentation
| Document | Description |
|---|---|
| Architecture | Technical details & methodology |
| Evaluation | Validation metrics & results |
Development Setup
# Clone the repository
git clone https://github.com/fairdataihub/poster2json.git
cd poster2json
# Create a virtual environment
python -m venv .venv
# Activate the virtual environment
source .venv/bin/activate # On Linux/macOS
.venv\Scripts\activate # On Windows
# Install poetry
pip install poetry
# Install dependencies
poetry install
# Run tests
poe test
# Format code
poe format
If you are on Windows and have multiple Python versions installed, you can use the following commands:
py -0p # list all python versions
py -3.12 -m venv .venv
License
MIT License - see LICENSE for details.
Citation
@software{poster2json2026,
title = {poster2json: Scientific Poster to JSON Metadata Extraction},
author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
year = {2026},
url = {https://github.com/fairdataihub/poster2json},
doi = {10.5281/zenodo.18320010}
}
Acknowledgements
- FAIR Data Innovations Hub
- Meta AI for Llama 3.1
- Alibaba Cloud for Qwen2-VL
- Part of the posters.science platform
Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.