Scientific paper acquisition, extraction, preprocessing, detection and captioning pipeline

These details have not been verified by PyPI

Project description

MatmmExtract

MatmmExtract is an end-to-end pipeline for building multimodal materials-science datasets from scientific literature.

Starting from OpenAlex or Scopus metadata, MatmmExtract retrieves papers, extracts figures and captions, downloads images, detects sub-panels, generates fine-grained captions with an LLM (Gemini or Azure OpenAI), and links the results into a machine-learning-ready dataset.

Features

Literature Acquisition

OpenAlex search integration
Elsevier full-text retrieval
Springer full-text retrieval
Scopus CSV workflow support
Open-access filtering
DOI deduplication
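
As an illustration of what DOI deduplication involves, here is a minimal standalone sketch. It is not MatmmExtract's own implementation; the helper names `normalize_doi` and `dedupe_by_doi` and the sample DOIs are hypothetical.

```python
import re

# Matches the common ways a DOI is prefixed in metadata exports.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)\s*", re.IGNORECASE)


def normalize_doi(raw: str) -> str:
    """Strip whitespace and URL / 'doi:' prefixes, then lower-case the DOI."""
    return _DOI_PREFIX.sub("", raw.strip()).lower()


def dedupe_by_doi(records: list) -> list:
    """Keep the first record per normalized DOI; records without a DOI are dropped."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_doi(record.get("doi", ""))
        if not key or key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique


# Made-up example records: the first two refer to the same DOI.
papers = [
    {"doi": "https://doi.org/10.1234/EXAMPLE.001", "title": "A"},
    {"doi": "10.1234/example.001", "title": "A (duplicate)"},
    {"doi": "doi:10.1234/example.002", "title": "B"},
]
unique_papers = dedupe_by_doi(papers)
```

DOIs are case-insensitive, which is why the sketch lower-cases them before comparing.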

Figure Extraction

Parse Elsevier XML articles
Parse Springer XML articles
Extract figure metadata
Extract captions
Extract figure reference sentences
Preserve paper-level metadata

License Processing

Detect Creative Commons licenses
Filter CC-BY content
Generate license audit reports
Support large-scale corpus filtering
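
To make the license step concrete, the following sketch classifies license URLs and tallies them into a simple audit summary. It is an assumption about how such a filter could work, not the package's actual code; `classify_license` and the label strings are hypothetical.

```python
import re
from collections import Counter
from typing import Optional

# Captures the license family and code from a Creative Commons URL.
_CC_PATTERN = re.compile(
    r"creativecommons\.org/(licenses|publicdomain)/([a-z\-]+)/", re.IGNORECASE
)


def classify_license(url: Optional[str]) -> str:
    """Map a license URL to a short label such as 'cc-by' or 'cc-by-nc-nd'."""
    if not url:
        return "unknown"
    match = _CC_PATTERN.search(url)
    if match is None:
        return "other"
    family, code = match.group(1).lower(), match.group(2).lower()
    if family == "publicdomain":
        return "cc0" if code == "zero" else "public-domain"
    return f"cc-{code}"


license_urls = [
    "https://creativecommons.org/licenses/by/4.0/",
    "http://creativecommons.org/licenses/by-nc-nd/4.0/",
    "https://creativecommons.org/publicdomain/zero/1.0/",
    None,
]
labels = [classify_license(url) for url in license_urls]
audit = Counter(labels)                       # label -> number of papers
cc_by_only = [u for u in license_urls if classify_license(u) == "cc-by"]
```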

Image Downloading

Download publisher-hosted figures
Retry and resume support
Download logging
Filename normalization
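
The retry, resume, and filename-normalization ideas above can be sketched as follows. This is a generic illustration under stated assumptions, not the downloader shipped with MatmmExtract: `normalize_filename`, `download_with_retry`, and the injected `fetch` callable are all hypothetical names.

```python
import re
import time
from pathlib import Path
from typing import Callable


def normalize_filename(doi: str, figure_id: str, ext: str = ".jpg") -> str:
    """Build a filesystem-safe name by collapsing non-alphanumeric runs to '_'."""
    stem = re.sub(r"[^A-Za-z0-9]+", "_", f"{doi}_{figure_id}").strip("_")
    return f"{stem}{ext}"


def download_with_retry(
    fetch: Callable[[], bytes], dest: Path, retries: int = 3, backoff_s: float = 1.0
) -> bool:
    """Return True if the file was written, False if it already existed (resume)."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    if dest.exists() and dest.stat().st_size > 0:
        return False  # already downloaded in an earlier run
    for attempt in range(1, retries + 1):
        try:
            data = fetch()
        except OSError:  # urllib.error.URLError is a subclass of OSError
            if attempt == retries:
                raise
            time.sleep(backoff_s * 2 ** (attempt - 1))  # exponential backoff
            continue
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(data)
        return True
    raise AssertionError("unreachable")
```

In practice `fetch` could wrap something like `urllib.request.urlopen(url).read()`; passing it in as a callable keeps the retry logic independent of the HTTP client.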

Vision Pipeline

Scientific figure panel detection
Automatic crop generation
Crop-to-figure linking
Dataset preparation utilities

Caption Generation

Gemini support
Azure OpenAI support
Batch captioning workflows
Rate-limited API execution
Structured JSON outputs
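
Rate-limited API execution usually means spacing requests so a per-minute quota is not exceeded. The sketch below shows one simple way to do that; it is an illustrative assumption, not the limiter used inside MatmmExtract, and the `RateLimiter` class is hypothetical.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls."""

    def __init__(self, calls_per_minute: float, clock=time.monotonic, sleep=time.sleep):
        if calls_per_minute <= 0:
            raise ValueError("calls_per_minute must be positive")
        self.min_interval = 60.0 / calls_per_minute
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first call

    def wait(self) -> None:
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A captioning loop would call `limiter.wait()` immediately before each API request.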

Dataset Construction

Link crops, captions, figures, and metadata
Generate training-ready CSV datasets
Create multimodal instruction-tuning corpora
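
Conceptually, the linking step joins each crop image with its caption record and writes one row per pair. The sketch below assumes a hypothetical layout in which a crop `name.png` has a caption file `name.json` containing `figure_id` and `caption` keys; the real file naming and CSV columns used by MatmmExtract may differ.

```python
import csv
import json
from pathlib import Path

FIELDS = ["crop_path", "figure_id", "caption"]  # assumed column names


def link_crops_to_captions(crops_dir: Path, json_dir: Path, output_csv: Path) -> int:
    """Write one CSV row per crop that has a caption JSON with the same file stem."""
    rows = []
    for crop in sorted(crops_dir.glob("*.png")):
        caption_file = json_dir / f"{crop.stem}.json"
        if not caption_file.exists():
            continue  # skip crops that were never captioned
        record = json.loads(caption_file.read_text(encoding="utf-8"))
        rows.append(
            {
                "crop_path": str(crop),
                "figure_id": record.get("figure_id", ""),
                "caption": record.get("caption", ""),
            }
        )
    with output_csv.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```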

Installation

pip install matmmextract

Or install from source:

git clone https://github.com/<your-org>/matmmextract.git
cd matmmextract

pip install -e .

Quick Start

Search OpenAlex

from matmmextract.openalex import fetch_elsevier

fetch_elsevier(
    keywords=["titanium alloy", "microstructure"],
    license_="cc-by",
    from_year=2020,
    to_year=2024,
    max_results=100,
    output_csv="papers.csv",
)

Fetch Elsevier XMLs

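import pandas as pd

from matmmextract.elsevier import fetch_all

# Load the metadata from the search step (assumes it was saved as papers.csv).
papers_df = pd.read_csv("papers.csv")

fetch_all(
    df=papers_df,
    api_key="YOUR_API_KEY",
    inst_token="YOUR_INST_TOKEN",
    output_dir="xmls",
)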
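
Rather than pasting keys into scripts, credentials can be read from the environment. This is a small optional sketch; the helper `require_env` and the variable names `ELSEVIER_API_KEY` and `ELSEVIER_INST_TOKEN` are assumptions, not names the package defines.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the pipeline.")
    return value


# Example usage (variable names are hypothetical):
# api_key = require_env("ELSEVIER_API_KEY")
# inst_token = require_env("ELSEVIER_INST_TOKEN")
```

The returned values can then be passed as the `api_key` and `inst_token` arguments shown above.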

Detect Panels

from matmmextract.inference import detect

detect(
    image_dir="images",
    output_dir="detections",
    checkpoint="best.pt",
)

Generate Captions

from matmmextract.inference import gemini_captioner

gemini_captioner(
    csv_path="crops.csv",
    output_dir="subcaptions",
    api_key="YOUR_API_KEY",
)

Build Final Dataset

from matmmextract.inference import build

build(
    images_dir="crops",
    json_dir="subcaptions",
    output_csv="linked_dataset.csv",
)

Example Pipelines

MatmmExtract ships with complete examples:

examples/
├── elsevier_full.py
├── elsevier_scopus.py
├── springer_full.py
└── springer_scopus.py

OpenAlex → Elsevier → Azure → Dataset

python examples/elsevier_full.py

Scopus → Elsevier → Azure → Dataset

python examples/elsevier_scopus.py

OpenAlex → Springer → Gemini → Dataset

python examples/springer_full.py

Scopus → Springer → Azure → Dataset

python examples/springer_scopus.py

Package Structure

matmmextract
├── openalex
├── elsevier
├── springer
├── preprocess
├── inference
└── shared

OpenAlex

Paper discovery and metadata retrieval.

Elsevier

Full-text retrieval, figure extraction, and image downloading.

Springer

Full-text retrieval, figure extraction, and image downloading.

Preprocess

Dataset filtering, DOI processing, publisher filtering, and license analysis.

Inference

Detection, cropping, caption generation, and dataset construction.

Shared

Common utilities used throughout the pipeline.

Documentation

Build locally:

sphinx-build -b html docs docs/_build

Generated documentation:

docs/_build/index.html

Citation

If you use MatmmExtract in academic work, please cite:

MatmmExtract: A Pipeline for Constructing Multimodal Materials-Science Datasets
from Scientific Literature.

License

GNU General Public License v3.0 (GPL-3.0).

See the LICENSE file for details.

Authors

Subham Ghosh
Abhishek Tewari
Mohammad Ibrahim

Project details

Release history

0.1.2 (Jun 30, 2026)
0.1.1 (Jun 30, 2026), this version

Download files

Source Distribution

matmmextract-0.1.1.tar.gz (65.5 kB view details)

Uploaded Jun 30, 2026 Source

Built Distribution

matmmextract-0.1.1-py3-none-any.whl (78.6 kB view details)

Uploaded Jun 30, 2026 Python 3

File details

Details for the file matmmextract-0.1.1.tar.gz.

File metadata

Download URL: matmmextract-0.1.1.tar.gz
Upload date: Jun 30, 2026
Size: 65.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for matmmextract-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`d60826d982196adcb5b2d9f8247f4be1ed124cbc05a782cc4411500d51146530`
MD5	`7c845efc9b991e70db68526b789dc269`
BLAKE2b-256	`b85fc04cef23b9555bfebd0f0a8d0ab95cdd3e4355db70873e60b5d68117b6a7`

File details

Details for the file matmmextract-0.1.1-py3-none-any.whl.

File metadata

Download URL: matmmextract-0.1.1-py3-none-any.whl
Upload date: Jun 30, 2026
Size: 78.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for matmmextract-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c27e75d0d914d4968bef94293a7dacb3dfdb46e80d5f565734a7512858dbaf8f`
MD5	`f2c058ccd59076f90a9a89ae36cf94de`
BLAKE2b-256	`20e9a565dbd3ef696c3e28012f2a9ebd5dab74d4ff6053111534c5f22e816e5f`
