Convert scientific posters (PDF/images) to structured JSON metadata using Large Language Models

poster2json


Description

poster2json extracts structured metadata from scientific conference posters (PDF or image format) into machine-actionable JSON conforming to the poster-json-schema.

The pipeline uses:

  • Llama-3.1-8B-Instruct (a verbatim mirror of Meta's release; swap with any HuggingFace instruct model via --model) for JSON structuring
  • Qwen2-VL-7B for vision-based OCR of image posters
  • pdfplumber for layout-aware PDF text extraction
  • lingua-language-detector for ISO 639-1 language detection on body text (the detected value overrides whatever the model emits, because full body text is a more reliable signal than metadata fragments)
  • ROR (https://api.ror.org) for affiliation and publisher canonicalisation; matched names get a ROR identifier attached
  • SPDX matching (with integer-exact version handling) for license normalisation in rightsList
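
The split between PDF and image inputs can be pictured as a dispatch on file type. The sketch below is illustrative only: `choose_extractor`, the route labels, and the suffix sets are hypothetical names rather than part of poster2json, and the real pipeline may detect input types differently.

```python
from pathlib import Path

# Hypothetical suffix sets; the formats poster2json actually accepts may differ.
PDF_SUFFIXES = {".pdf"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def choose_extractor(path: str) -> str:
    """Return a label for the text-extraction route a poster file would take."""
    suffix = Path(path).suffix.lower()
    if suffix in PDF_SUFFIXES:
        return "pdfplumber"      # layout-aware PDF text extraction
    if suffix in IMAGE_SUFFIXES:
        return "qwen2-vl-ocr"    # vision-based OCR for image posters
    raise ValueError(f"unsupported poster format: {suffix or path}")
```

Either route yields raw text, which is then handed to the instruct model for JSON structuring.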

Quick Start

Installation

pip install poster2json

CLI Usage

# Extract metadata from a poster (default: Llama-3.1-8B-Instruct @ 4bit)
poster2json extract poster.pdf -o result.json

# Use a different instruct model (any HuggingFace repo id works)
poster2json extract poster.pdf --model google/gemma-2-9b-it --quantization 4bit

# Trade VRAM for quality
poster2json extract poster.pdf --quantization 8bit
poster2json extract poster.pdf --quantization fp16

# Validate extracted JSON
poster2json validate result.json

# Process multiple posters
poster2json batch ./posters/ -o ./output/
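
When the CLI is driven from a script, it can help to build the argument list in one place. The following is a minimal sketch that uses only the `extract` flags shown above; `build_extract_command` and `run_extract` are hypothetical helper names, and combining `-o` with `--model` in a single call is an assumption.

```python
import subprocess
from typing import List, Optional


def build_extract_command(
    poster: str,
    output: str,
    model: Optional[str] = None,
    quantization: Optional[str] = None,
) -> List[str]:
    """Assemble the argv for `poster2json extract` from the documented flags."""
    cmd = ["poster2json", "extract", poster, "-o", output]
    if model is not None:
        cmd += ["--model", model]
    if quantization is not None:
        cmd += ["--quantization", quantization]
    return cmd


def run_extract(poster: str, output: str, **options: Optional[str]) -> None:
    """Run the CLI; subprocess raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_extract_command(poster, output, **options), check=True)
```

Passing the command as a list avoids shell quoting problems with poster paths that contain spaces.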

Python API

from poster2json import extract_poster, validate_poster

# Extract metadata
result = extract_poster("poster.pdf")
print(result["titles"][0]["title"])

# Validate the result
is_valid = validate_poster(result)
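
A directory of posters could be processed from Python along the following lines. This is a sketch under stated assumptions: `extract_directory` is a hypothetical helper, the extraction and validation functions are passed in as parameters so that nothing beyond the two calls shown above is assumed about the API, the result is assumed to be a JSON-serialisable dict, and only `*.pdf` files are picked up.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, List


def extract_directory(
    poster_dir: str,
    output_dir: str,
    extract: Callable[[str], Dict[str, Any]],
    validate: Callable[[Dict[str, Any]], bool],
) -> List[str]:
    """Extract every PDF in poster_dir; return names of posters that failed validation."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    invalid: List[str] = []
    for pdf in sorted(Path(poster_dir).glob("*.pdf")):
        result = extract(str(pdf))
        if not validate(result):
            invalid.append(pdf.name)
        # Write the record either way so failures can be inspected later.
        target = out / f"{pdf.stem}.json"
        target.write_text(
            json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8"
        )
    return invalid
```

In practice the call would look like `extract_directory("./posters", "./output", extract_poster, validate_poster)`, assuming `validate_poster` returns a truthy value for valid records.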

Output Format

Output conforms to the poster-json-schema (DataCite 4.7):

{
  "$schema": "https://posters.science/schema/v0.2/poster_schema.json",
  "creators": [
    {
      "name": "Garcia, Sofia",
      "givenName": "Sofia",
      "familyName": "Garcia",
      "affiliation": [
        {
          "name": "Stanford University",
          "affiliationIdentifier": "https://ror.org/00f54p054",
          "affiliationIdentifierScheme": "ROR",
          "schemeUri": "https://ror.org/"
        }
      ]
    }
  ],
  "titles": [
    { "title": "Machine Learning Approaches to Diabetic Retinopathy Detection" }
  ],
  "publicationYear": 2025,
  "language": "en",
  "researchField": "Health Sciences",
  "subjects": [
    { "subject": "Machine Learning" },
    { "subject": "Diabetic Retinopathy" }
  ],
  "descriptions": [
    { "description": "We present a deep learning model...", "descriptionType": "Abstract" }
  ],
  "publisher": { "name": "Zenodo" },
  "rightsList": [
    {
      "rights": "Creative Commons Attribution 4.0 International",
      "rightsIdentifier": "CC-BY-4.0",
      "rightsIdentifierScheme": "SPDX",
      "schemeUri": "https://spdx.org/licenses/",
      "rightsUri": "https://creativecommons.org/licenses/by/4.0/"
    }
  ],
  "content": {
    "sections": [
      { "sectionTitle": "Abstract", "sectionContent": "..." },
      { "sectionTitle": "Methods", "sectionContent": "..." },
      { "sectionTitle": "Results", "sectionContent": "..." }
    ]
  },
  "imageCaptions": [{ "id": "fig1", "caption": "Figure 1. ROC curves showing..." }],
  "tableCaptions": [{ "id": "table1", "caption": "Table 1. Performance metrics" }]
}
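
Because most fields are nested lists of objects, reading a record defensively is worthwhile. The sketch below pulls out the first title and the ROR identifiers from a record shaped like the example above; `first_title` and `ror_ids` are hypothetical helpers, not functions provided by poster2json.

```python
from typing import Any, Dict, List, Optional


def first_title(record: Dict[str, Any]) -> Optional[str]:
    """Return the first title, or None when the record has no titles."""
    titles = record.get("titles") or []
    return titles[0].get("title") if titles else None


def ror_ids(record: Dict[str, Any]) -> List[str]:
    """Collect ROR identifiers from creator affiliations, in order, without duplicates."""
    seen: List[str] = []
    for creator in record.get("creators") or []:
        for aff in creator.get("affiliation") or []:
            if aff.get("affiliationIdentifierScheme") != "ROR":
                continue
            ident = aff.get("affiliationIdentifier")
            if ident and ident not in seen:
                seen.append(ident)
    return seen
```

Affiliations that were not matched carry no identifier and are simply skipped.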

Notes on the auto-populated fields:

  • language is detected from the raw body text (lingua heuristic). Returns null when text is too short (<200 chars / <50 non-ASCII codepoints) or the detector is unsure.
  • researchField must be one of the four OpenAlex top-level domains: Health Sciences, Life Sciences, Physical Sciences, Social Sciences. Null when the model can't pick one confidently.
  • affiliation and publisher get ROR enrichment when the matcher returns a high-confidence chosen result. Strings without a confident match pass through unchanged. Set POSTER2JSON_ROR=0 to disable.
  • rightsList entries are matched against an SPDX table; the matcher is conservative on version numbers (e.g. CC-BY-4.0 and CC-BY-4.1 are never confused).
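
Two of these rules are easy to illustrate in isolation. The sketch below shows one way to restrict researchField to the four listed domains and one way to compare licence versions as integers, so that 4.0 and 4.1 can never be treated as equal. It is not the project's implementation: all function names are hypothetical, the case-insensitive matching is an assumption, and identifiers with a suffix after the version (such as `-only`) are deliberately left unparsed.

```python
import re
from typing import Optional, Tuple

RESEARCH_FIELDS = (
    "Health Sciences",
    "Life Sciences",
    "Physical Sciences",
    "Social Sciences",
)


def normalise_research_field(value: Optional[str]) -> Optional[str]:
    """Return the canonical domain name, or None if value is not one of the four."""
    if not value:
        return None
    wanted = value.strip().casefold()
    for field in RESEARCH_FIELDS:
        if field.casefold() == wanted:
            return field
    return None


def version_tuple(identifier: str) -> Optional[Tuple[int, ...]]:
    """Parse a trailing dotted version, e.g. 'CC-BY-4.0' -> (4, 0); None if absent."""
    match = re.search(r"-(\d+(?:\.\d+)*)$", identifier)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def same_version(a: str, b: str) -> bool:
    """True only when both ids carry a version and every integer component matches."""
    va, vb = version_tuple(a), version_tuple(b)
    return va is not None and va == vb
```

Comparing tuples of integers rather than strings or floats avoids prefix matches and rounding surprises.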

System Requirements

| Requirement | Specification |
|-------------|---------------|
| GPU | NVIDIA CUDA-capable, ≥8 GB VRAM (default 4bit); ≥16 GB for --quantization fp16 or image/OCR posters |
| RAM | ≥32 GB recommended |
| Python | 3.10+ |
| OS | Linux, macOS, Windows (via WSL2) |

Performance

Validated on 20 manually annotated scientific posters (19 PDF via pdfplumber, 1 image via vision OCR):

| Metric | Score | Threshold |
|--------|-------|-----------|
| Word Capture | 0.92 | ≥0.75 |
| ROUGE-L | 0.85 | ≥0.75 |
| Number Capture | 0.97 | ≥0.75 |
| Field Proportion | 0.88 | 0.50–1.50 |

Pass Rate: 19/20 (95%). The single failure is a dense table/flowchart poster whose reference annotation splits one visual region into many fine-grained sections.
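
The thresholds can be applied per poster with a check like the one below. This is a sketch: the metric key names are hypothetical, and it assumes a poster passes only when all four metrics meet their thresholds, which the table implies but does not state outright.

```python
from typing import Dict

# Thresholds copied from the table above; key names are illustrative.
MIN_SCORES = {"word_capture": 0.75, "rouge_l": 0.75, "number_capture": 0.75}
FIELD_PROPORTION_RANGE = (0.50, 1.50)


def passes(metrics: Dict[str, float]) -> bool:
    """Return True when every metric is within its threshold."""
    low, high = FIELD_PROPORTION_RANGE
    meets_minimums = all(
        metrics[name] >= floor for name, floor in MIN_SCORES.items()
    )
    return meets_minimums and low <= metrics["field_proportion"] <= high
```

Field Proportion is the only two-sided check: a value far above 1.0 is treated as a failure just like a value far below it.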

Documentation

| Document | Description |
|----------|-------------|
| Architecture | Technical details & methodology |
| Evaluation | Validation metrics & results |

Development Setup

# Clone the repository
git clone https://github.com/fairdataihub/poster2json.git
cd poster2json

# Create a virtual environment
python -m venv .venv

# Activate the virtual environment
source .venv/bin/activate # On Linux/macOS
.venv\Scripts\activate # On Windows

# Install poetry
pip install poetry

# Install dependencies
poetry install

# Run tests
poe test

# Format code
poe format

If you are on Windows and have multiple Python versions installed, you can list them and choose one for the virtual environment:

py -0p # list all python versions

py -3.12 -m venv .venv

License

MIT License - see LICENSE for details.

Citation

@software{poster2json2026,
  title = {poster2json: Scientific Poster to JSON Metadata Extraction},
  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
  year = {2026},
  version = {0.8.0},
  url = {https://github.com/fairdataihub/poster2json},
  doi = {10.5281/zenodo.18320010}
}

Funding

This project is funded by The Navigation Fund (10.71707/rk36-9x79).

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.
