
Scientific paper acquisition, extraction, preprocessing, detection and captioning pipeline


MatmmExtract

MatmmExtract is an end-to-end pipeline for building multimodal materials-science datasets from scientific literature.

Starting from OpenAlex or Scopus metadata, MatmmExtract retrieves papers, extracts figures and captions, downloads images, detects sub-panels, generates fine-grained captions with an LLM (Gemini or Azure OpenAI), and links everything into a machine-learning-ready dataset.


Features

Literature Acquisition

  • OpenAlex search integration

  • Elsevier full-text retrieval

  • Springer full-text retrieval

  • Scopus CSV workflow support

  • Open-access filtering

  • DOI deduplication
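To illustrate what DOI deduplication involves, here is a minimal standard-library sketch. It is not MatmmExtract's implementation: the names `normalize_doi` and `dedupe_by_doi`, the `"doi"` record key, and the choice to drop records without a DOI are all assumptions made for the example.

```python
def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common resolver/scheme prefixes."""
    doi = (doi or "").strip().lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def dedupe_by_doi(records):
    """Keep the first record for each normalized DOI.

    Hypothetical helper: `records` is an iterable of dicts with a "doi" key.
    Records with an empty DOI are dropped, which is one possible policy.
    """
    seen = set()
    unique = []
    for record in records:
        key = normalize_doi(record.get("doi", ""))
        if not key or key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique
```

DOIs are case-insensitive, so comparing the lowercased, prefix-free form catches duplicates that arrive once as a bare DOI and once as a resolver URL.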

Figure Extraction

  • Parse Elsevier XML articles

  • Parse Springer XML articles

  • Extract figure metadata

  • Extract captions

  • Extract figure reference sentences

  • Preserve paper-level metadata

License Processing

  • Detect Creative Commons licenses

  • Filter CC-BY content

  • Generate license audit reports

  • Support large-scale corpus filtering
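One way license detection and auditing could work is sketched below, using only the standard library. The function names and the regular expression are assumptions for illustration; the package's own detection logic may differ.

```python
import re
from collections import Counter

# Matches either a Creative Commons URL or a short "CC BY-..." label and
# captures the license code, e.g. "by", "by-nc", "by-nc-nd".
_CC_RE = re.compile(
    r"(?:creativecommons\.org/licenses/|cc[\s_-]?)(by(?:-(?:nc|nd|sa))*)",
    re.IGNORECASE,
)


def cc_license_code(text):
    """Return the CC license code found in `text`, or None if there is none."""
    match = _CC_RE.search(text or "")
    return match.group(1).lower() if match else None


def is_cc_by(text):
    """True only for plain CC-BY, not for the NC/ND/SA variants."""
    return cc_license_code(text) == "by"


def license_audit(license_strings):
    """Count how often each license code occurs (hypothetical audit report)."""
    return Counter(cc_license_code(text) or "unknown" for text in license_strings)
```

Filtering on the exact code `"by"` keeps restrictive variants such as CC BY-NC out of the corpus, while the counter gives a quick summary of what was excluded.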

Image Downloading

  • Download publisher-hosted figures

  • Retry and resume support

  • Download logging

  • Filename normalization
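Retry and resume behaviour can be sketched as follows. This is an illustrative pattern rather than the package's downloader: `download_with_retry`, `fetch_bytes`, and their parameters are hypothetical names, and the backoff schedule is an assumption.

```python
import time
import urllib.request
from pathlib import Path


def fetch_bytes(url, timeout=30.0):
    """Default fetcher: read the whole response body with urllib."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_with_retry(url, dest, fetch=fetch_bytes, max_attempts=3, base_delay=1.0):
    """Download `url` to `dest`; return True on success, False otherwise.

    Resume: a non-empty file already at `dest` is treated as done.
    Retry: network errors (OSError, which includes urllib's URLError)
    trigger exponential backoff between attempts.
    """
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return True
    dest.parent.mkdir(parents=True, exist_ok=True)
    for attempt in range(1, max_attempts + 1):
        try:
            data = fetch(url)
        except OSError:
            if attempt == max_attempts:
                return False
            time.sleep(base_delay * 2 ** (attempt - 1))
            continue
        # Write to a temporary name first so an interrupted run never
        # leaves a truncated file that the resume check would accept.
        partial = dest.with_name(dest.name + ".part")
        partial.write_bytes(data)
        partial.replace(dest)
        return True
    return False
```

Passing the fetcher as an argument keeps the retry logic testable without network access.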

Vision Pipeline

  • Scientific figure panel detection

  • Automatic crop generation

  • Crop-to-figure linking

  • Dataset preparation utilities
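The sketch below shows one way crop generation and crop-to-figure linking could fit together: detected boxes are clamped to the image bounds, and each crop gets a filename derived from its parent figure. The `CropRecord` type, the `_panelNN` naming scheme, and the `(left, top, right, bottom)` pixel box layout are assumptions for illustration, not the package's actual conventions.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CropRecord:
    figure_file: str   # parent figure the crop was cut from
    crop_file: str     # hypothetical crop filename derived from the parent
    box: tuple         # (left, top, right, bottom) in pixels


def clamp_box(box, width, height):
    """Clip a box to the image bounds and round coordinates to ints."""
    left, top, right, bottom = box
    left = max(0, min(int(left), width))
    right = max(0, min(int(right), width))
    top = max(0, min(int(top), height))
    bottom = max(0, min(int(bottom), height))
    return left, top, right, bottom


def plan_crops(figure_file, boxes, width, height, min_size=1):
    """Turn detection boxes into crop records linked to their figure."""
    path = Path(figure_file)
    records = []
    for box in boxes:
        left, top, right, bottom = clamp_box(box, width, height)
        if right - left < min_size or bottom - top < min_size:
            continue  # skip degenerate boxes
        crop_file = f"{path.stem}_panel{len(records) + 1:02d}{path.suffix}"
        records.append(CropRecord(path.name, crop_file, (left, top, right, bottom)))
    return records
```

The box tuple follows the same ordering that Pillow's `Image.crop` expects, so each record could be passed to an image library to produce the actual crop.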

Caption Generation

  • Gemini support

  • Azure OpenAI support

  • Batch captioning workflows

  • Rate-limited API execution

  • Structured JSON outputs
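Rate-limited API execution can be approximated with a minimum-interval limiter, sketched below. The `RateLimiter` class is hypothetical and is not taken from MatmmExtract; real quotas for Gemini or Azure OpenAI depend on the account and model, so `calls_per_minute` is a value the caller must choose.

```python
import time


class RateLimiter:
    """Space out calls so that at most `calls_per_minute` start per minute."""

    def __init__(self, calls_per_minute, clock=time.monotonic, sleep=time.sleep):
        if calls_per_minute <= 0:
            raise ValueError("calls_per_minute must be positive")
        self.min_interval = 60.0 / calls_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # first call never waits

    def wait(self):
        """Block until the next call is allowed, then reserve a slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
```

In a batch captioning loop, calling `limiter.wait()` immediately before each API request keeps the request rate under the chosen limit. The clock and sleep functions are injectable so the behaviour can be checked without real delays.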

Dataset Construction

  • Link crops, captions, figures, and metadata

  • Generate training-ready CSV datasets

  • Create multimodal instruction-tuning corpora
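As a rough picture of the linking step, the sketch below pairs each crop image with a caption JSON file that shares its filename stem and writes the pairs to a CSV. The function name `link_by_stem`, the `"caption"` JSON key, and the two-column output are assumptions; the package's `build` function may use a different schema and carry more metadata.

```python
import csv
import json
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def link_by_stem(images_dir, json_dir, output_csv, caption_key="caption"):
    """Write one CSV row per image that has a matching caption JSON.

    Returns the number of linked rows. Images without a JSON file are skipped.
    """
    images_dir, json_dir = Path(images_dir), Path(json_dir)
    rows = []
    for image_path in sorted(images_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        json_path = json_dir / (image_path.stem + ".json")
        if not json_path.is_file():
            continue
        payload = json.loads(json_path.read_text(encoding="utf-8"))
        rows.append({"image": image_path.name, "caption": payload.get(caption_key, "")})
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["image", "caption"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Matching on the filename stem is a simple convention that works when the captioning step names its JSON outputs after the crops it describes.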


Installation

pip install matmmextract

Or install from source:

git clone https://github.com/<your-org>/matmmextract.git
cd matmmextract

pip install -e .

Quick Start

Search OpenAlex

from matmmextract.openalex import fetch_elsevier

fetch_elsevier(
    keywords=["titanium alloy", "microstructure"],
    license_="cc-by",
    from_year=2020,
    to_year=2024,
    max_results=100,
    output_csv="papers.csv",
)

Fetch Elsevier XMLs

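import pandas as pd

from matmmextract.elsevier import fetch_all

# Load the paper list written by the OpenAlex search step above.
papers_df = pd.read_csv("papers.csv")

fetch_all(
    df=papers_df,
    api_key="YOUR_API_KEY",
    inst_token="YOUR_INST_TOKEN",
    output_dir="xmls",
)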

Detect Panels

from matmmextract.inference import detect

detect(
    image_dir="images",
    output_dir="detections",
    checkpoint="best.pt",
)

Generate Captions

from matmmextract.inference import gemini_captioner

gemini_captioner(
    csv_path="crops.csv",
    output_dir="subcaptions",
    api_key="YOUR_API_KEY",
)

Build Final Dataset

from matmmextract.inference import build

build(
    images_dir="crops",
    json_dir="subcaptions",
    output_csv="linked_dataset.csv",
)

Example Pipelines

MatmmExtract ships with complete examples:

examples/
├── elsevier_full.py
├── elsevier_scopus.py
├── springer_full.py
└── springer_scopus.py

OpenAlex → Elsevier → Azure → Dataset

python examples/elsevier_full.py

Scopus → Elsevier → Azure → Dataset

python examples/elsevier_scopus.py

OpenAlex → Springer → Gemini → Dataset

python examples/springer_full.py

Scopus → Springer → Azure → Dataset

python examples/springer_scopus.py

Package Structure

matmmextract
├── openalex
├── elsevier
├── springer
├── preprocess
├── inference
└── shared

OpenAlex

Paper discovery and metadata retrieval.

Elsevier

Full-text retrieval, figure extraction, and image downloading.

Springer

Full-text retrieval, figure extraction, and image downloading.

Preprocess

Dataset filtering, DOI processing, publisher filtering, and license analysis.

Inference

Detection, cropping, caption generation, and dataset construction.

Shared

Common utilities used throughout the pipeline.


Documentation

Build locally:

sphinx-build -b html docs docs/_build

Generated documentation:

docs/_build/index.html

Citation

If you use MatmmExtract in academic work, please cite:

MatmmExtract: A Pipeline for Constructing Multimodal Materials-Science Datasets
from Scientific Literature.

License

GNU General Public License v3.0 (GPL-3.0).

See the LICENSE file for details.


Authors

  • Subham Ghosh

  • Abhishek Tewari

  • Mohammad Ibrahim
