Convert scientific posters (PDF/images) to structured JSON metadata using Large Language Models

poster2json

Description

poster2json extracts structured metadata from scientific conference posters (PDF or image format) into machine-actionable JSON conforming to the poster-json-schema.

The pipeline uses:

  • Llama 3.1 8B (fine-tuned) for JSON structuring
  • Qwen2-VL-7B for vision-based OCR of image posters
  • pdfalto for layout-aware PDF text extraction
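As a rough illustration of how the two input paths might diverge before the JSON structuring step, the sketch below picks a text-extraction route from the file extension. The function name `choose_text_route`, the route labels, and the set of image extensions are assumptions made for this example; they are not part of the poster2json API, and the real pipeline may detect input types differently.

```python
from pathlib import Path

# Hypothetical set of image extensions; the real tool may accept others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def choose_text_route(poster_path: str) -> str:
    """Return a label for the text-extraction stage a poster would go through."""
    suffix = Path(poster_path).suffix.lower()
    if suffix == ".pdf":
        return "pdfalto"    # layout-aware PDF text extraction
    if suffix in IMAGE_SUFFIXES:
        return "qwen2-vl"   # vision-based OCR for image posters
    raise ValueError(f"Unsupported poster format: {suffix or '(none)'}")


print(choose_text_route("poster.pdf"))
print(choose_text_route("poster.PNG"))
```

In either case the extracted text would then be passed to the fine-tuned Llama 3.1 8B model for JSON structuring.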

Quick Start

Installation

pip install poster2json

CLI Usage

# Extract metadata from a poster
poster2json extract poster.pdf -o result.json

# Validate extracted JSON
poster2json validate result.json

# Process multiple posters
poster2json batch ./posters/ -o ./output/
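If you need per-file control that `batch` does not give you (for example, skipping some files), one option is to drive the `extract` command from a script. The sketch below only builds the argument list. The helper name `build_extract_command` and the `<stem>.json` output naming are assumptions for illustration, while the command shape follows the `extract` example above.

```python
from pathlib import Path


def build_extract_command(poster: Path, out_dir: Path) -> list[str]:
    """Build the argv for `poster2json extract <poster> -o <out_file>`."""
    out_file = out_dir / f"{poster.stem}.json"  # assumed naming scheme
    return ["poster2json", "extract", str(poster), "-o", str(out_file)]


cmd = build_extract_command(Path("posters/talk.pdf"), Path("output"))
print(cmd)
# To actually run it: subprocess.run(cmd, check=True)
```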

Python API

from poster2json import extract_poster, validate_poster

# Extract metadata
result = extract_poster("poster.pdf")
print(result["titles"][0]["title"])

# Validate the result
is_valid = validate_poster(result)
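A common follow-up is to write the result to disk only when validation succeeds. The helper below is a minimal sketch: `save_if_valid` is a hypothetical name, and it assumes that the extracted result is a JSON-serializable dict and that `validate_poster` returns a truthy value on success, as the snippet above suggests. The demo uses a stand-in dict so it runs without a GPU or the package installed.

```python
import json
import tempfile
from pathlib import Path
from typing import Union


def save_if_valid(result: dict, is_valid: bool, out_path: Union[str, Path]) -> bool:
    """Write `result` as UTF-8 JSON when `is_valid` is truthy; report whether it was saved."""
    if not is_valid:
        return False
    text = json.dumps(result, indent=2, ensure_ascii=False)
    Path(out_path).write_text(text, encoding="utf-8")
    return True


# Demo with a stand-in result instead of a real extraction.
sample = {"titles": [{"title": "Example poster"}]}
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "result.json"
    saved = save_if_valid(sample, True, target)
    round_trip = json.loads(target.read_text(encoding="utf-8"))
    skipped = save_if_valid(sample, False, Path(tmp) / "invalid.json")

print(saved, skipped)
```

With the real API this would be called as `save_if_valid(result, validate_poster(result), "result.json")`.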

Output Format

Output conforms to the poster-json-schema (DataCite-based):

{
  "$schema": "https://posters.science/schema/v0.1/poster_schema.json",
  "creators": [
    {
      "name": "Garcia, Sofia",
      "givenName": "Sofia",
      "familyName": "Garcia",
      "affiliation": ["University"]
    }
  ],
  "titles": [
    { "title": "Machine Learning Approaches to Diabetic Retinopathy Detection" }
  ],
  "posterContent": {
    "sections": [
      { "sectionTitle": "Abstract", "sectionContent": "..." },
      { "sectionTitle": "Methods", "sectionContent": "..." },
      { "sectionTitle": "Results", "sectionContent": "..." }
    ]
  },
  "imageCaptions": [{ "captions": ["Figure 1.", "ROC curves showing..."] }],
  "tableCaptions": [{ "captions": ["Table 1.", "Performance metrics"] }]
}
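To show how this structure can be consumed downstream, here is a small sketch that pulls the main fields out of a record shaped like the example above. The function name `summarize_poster` is hypothetical, and the `.get` defaults assume that any of these keys may be absent from a given record, which the schema may or may not allow.

```python
def summarize_poster(record: dict) -> dict:
    """Collect the title, author names, section titles, and caption counts."""
    titles = record.get("titles", [])
    sections = record.get("posterContent", {}).get("sections", [])
    return {
        "title": titles[0]["title"] if titles else None,
        "authors": [c.get("name", "") for c in record.get("creators", [])],
        "sections": [s.get("sectionTitle", "") for s in sections],
        "figures": len(record.get("imageCaptions", [])),
        "tables": len(record.get("tableCaptions", [])),
    }


# Record trimmed from the example output above.
record = {
    "creators": [{"name": "Garcia, Sofia", "affiliation": ["University"]}],
    "titles": [
        {"title": "Machine Learning Approaches to Diabetic Retinopathy Detection"}
    ],
    "posterContent": {
        "sections": [
            {"sectionTitle": "Abstract", "sectionContent": "..."},
            {"sectionTitle": "Methods", "sectionContent": "..."},
            {"sectionTitle": "Results", "sectionContent": "..."},
        ]
    },
    "imageCaptions": [{"captions": ["Figure 1.", "ROC curves showing..."]}],
    "tableCaptions": [{"captions": ["Table 1.", "Performance metrics"]}],
}

summary = summarize_poster(record)
print(summary)
```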

System Requirements

Requirement   Specification
GPU           NVIDIA CUDA-capable, ≥16GB VRAM
RAM           ≥32GB recommended
Python        3.10+
OS            Linux, macOS, Windows (via WSL2)

Performance

Validated on 10 manually annotated scientific posters:

Metric             Score   Threshold
Word Capture       0.96    ≥0.75
ROUGE-L            0.89    ≥0.75
Number Capture     0.93    ≥0.75
Field Proportion   0.99    0.30–2.50

Pass Rate: 10/10 (100%)
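The thresholds in the table can be expressed as a simple bounds check. The sketch below assumes that a poster passes only when all four metrics fall within their bounds; the source does not spell out the exact pass rule, and the names `THRESHOLDS` and `passes` are illustrative. The scores are the reported values from the table, used here as sample input.

```python
# (lower bound, upper bound); None means no upper bound.
THRESHOLDS = {
    "Word Capture": (0.75, None),
    "ROUGE-L": (0.75, None),
    "Number Capture": (0.75, None),
    "Field Proportion": (0.30, 2.50),
}


def passes(scores: dict, thresholds: dict = THRESHOLDS) -> bool:
    """Return True when every metric lies within its bounds (assumed pass rule)."""
    for metric, (low, high) in thresholds.items():
        value = scores[metric]
        if value < low:
            return False
        if high is not None and value > high:
            return False
    return True


reported = {
    "Word Capture": 0.96,
    "ROUGE-L": 0.89,
    "Number Capture": 0.93,
    "Field Proportion": 0.99,
}
print(passes(reported))
```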

Documentation

Document       Description
Architecture   Technical details & methodology
Evaluation     Validation metrics & results

Development Setup

# Clone the repository
git clone https://github.com/fairdataihub/poster2json.git
cd poster2json

# Create a virtual environment
python -m venv .venv

# Activate the virtual environment
source .venv/bin/activate # On Windows: .venv\Scripts\activate

# Install poetry
pip install poetry

# Install dependencies
poetry install

# Run tests
poe test

# Format code
poe format

If you are on Windows and have multiple Python versions installed, you can list them and create the virtual environment with a specific one:

py -0p # list all python versions

py -3.12 -m venv .venv

License

MIT License - see LICENSE for details.

Citation

@software{poster2json2026,
  title = {poster2json: Scientific Poster to JSON Metadata Extraction},
  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
  year = {2026},
  url = {https://github.com/fairdataihub/poster2json},
  doi = {10.5281/zenodo.18320010}
}

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.
