Scientific paper acquisition, extraction, preprocessing, detection and captioning pipeline
MatmmExtract
MatmmExtract is an end-to-end pipeline for building multimodal materials-science datasets from scientific literature.
Starting from OpenAlex or Scopus metadata, MatmmExtract retrieves papers, extracts figures and captions, downloads images, detects sub-panels, generates fine-grained captions with an LLM (Gemini or Azure OpenAI), and links the results into a machine-learning-ready dataset.
Features
Literature Acquisition
- OpenAlex search integration
- Elsevier full-text retrieval
- Springer full-text retrieval
- Scopus CSV workflow support
- Open-access filtering
- DOI deduplication
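As an illustration of the DOI deduplication step, here is a minimal sketch. The helper names `normalize_doi` and `dedupe_by_doi` and the `doi` record key are hypothetical and are not taken from the MatmmExtract API; the package's own logic may differ.

```python
import re

# Hypothetical helpers for illustration; not part of the MatmmExtract API.
# Matches the common "https://doi.org/", "http://dx.doi.org/" and "doi:" prefixes.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(raw):
    """Strip whitespace and known prefixes, then lowercase (DOIs are case-insensitive)."""
    return _DOI_PREFIX.sub("", raw.strip()).lower()


def dedupe_by_doi(records):
    """Keep the first record seen for each normalized DOI; drop records without a DOI."""
    seen = set()
    unique = []
    for record in records:
        doi = normalize_doi(record.get("doi") or "")
        if not doi or doi in seen:
            continue
        seen.add(doi)
        unique.append(record)
    return unique
```

With this sketch, `https://doi.org/10.1000/ABC` and `10.1000/abc` collapse to a single record because both normalize to the same string.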
Figure Extraction
- Parse Elsevier XML articles
- Parse Springer XML articles
- Extract figure metadata
- Extract captions
- Extract figure reference sentences
- Preserve paper-level metadata
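To show what extracting figure reference sentences can look like, here is a rough sketch that works on plain body text. The function name `figure_reference_sentences` is hypothetical, and the sentence splitter is deliberately naive; nothing here is taken from the package, which works on publisher XML.

```python
import re

# Naive splitter (an assumption of this sketch): break after ., ! or ? when the
# next token starts with a capital letter. "Fig. 2" is not split because a digit follows.
_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def figure_reference_sentences(text, figure_number):
    """Return sentences mentioning 'Fig. N', 'Fig N' or 'Figure N' for the given N."""
    # (?!\d) stops figure 2 from matching inside "Figure 12" or "Fig. 20".
    pattern = re.compile(rf"\bFig(?:ure|\.)?\s*{int(figure_number)}(?!\d)", re.IGNORECASE)
    return [s.strip() for s in _SENTENCE_SPLIT.split(text) if pattern.search(s)]
```

Sub-panel mentions such as "Fig. 2a" are matched as references to figure 2. Supplementary labels such as "Fig. S2" are not handled by this sketch.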
License Processing
- Detect Creative Commons licenses
- Filter CC-BY content
- Generate license audit reports
- Support large-scale corpus filtering
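One simple way to detect Creative Commons licenses and filter CC-BY content is to classify the license URL found in the article metadata. The sketch below assumes the license is available as a URL string; `classify_license` and `is_cc_by` are hypothetical names, not MatmmExtract functions.

```python
import re

# Hypothetical helpers for illustration; not part of the MatmmExtract API.
# Captures the license code from URLs such as
# https://creativecommons.org/licenses/by-nc-nd/4.0/
_CC_URL = re.compile(
    r"creativecommons\.org/(licenses|publicdomain)/([a-z-]+)", re.IGNORECASE
)


def classify_license(url):
    """Map a Creative Commons URL to a short code like 'cc-by'; return None otherwise."""
    if not url:
        return None
    match = _CC_URL.search(url)
    if match is None:
        return None
    kind, code = match.group(1).lower(), match.group(2).lower()
    if kind == "publicdomain":
        # Only the CC0 dedication is recognized in this sketch.
        return "cc0" if code == "zero" else None
    return f"cc-{code}"


def is_cc_by(url):
    """True only for plain CC BY, excluding variants such as BY-NC or BY-NC-ND."""
    return classify_license(url) == "cc-by"
```

The license version (for example 4.0) is ignored here; an audit report would likely want to record it as well.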
Image Downloading
- Download publisher-hosted figures
- Retry and resume support
- Download logging
- Filename normalization
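The retry, resume, and filename normalization ideas can be sketched as follows. This is an illustration under stated assumptions, not the package's downloader: `normalize_filename` and `download_with_retry` are hypothetical names, and `fetch` stands for any caller-supplied function that returns the image bytes for a URL and raises `OSError` on failure.

```python
import re
import time
from pathlib import Path


def normalize_filename(doi, figure_id, ext=".jpg"):
    """Build a filesystem-safe name by collapsing non-alphanumeric runs to '_'."""
    stem = re.sub(r"[^A-Za-z0-9]+", "_", f"{doi}_{figure_id}").strip("_").lower()
    return stem + ext


def download_with_retry(fetch, url, dest, retries=3, backoff=1.0, sleep=time.sleep):
    """Fetch url into dest, skipping finished files and retrying with exponential backoff.

    Assumes retries >= 1. `fetch` is a hypothetical callable: url -> bytes.
    """
    dest = Path(dest)
    # Resume support: a non-empty file from an earlier run is not downloaded again.
    if dest.exists() and dest.stat().st_size > 0:
        return "skipped"
    for attempt in range(retries):
        try:
            data = fetch(url)
        except OSError:
            if attempt == retries - 1:
                raise  # out of attempts, surface the last error
            sleep(backoff * 2 ** attempt)  # wait 1x, 2x, 4x ... the base delay
        else:
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(data)
            return "downloaded"
```

Passing `fetch` and `sleep` as arguments keeps the sketch independent of any HTTP library and makes it easy to exercise without network access.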
Vision Pipeline
- Scientific figure panel detection
- Automatic crop generation
- Crop-to-figure linking
- Dataset preparation utilities
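Crop generation and crop-to-figure linking can be pictured with the sketch below. It assumes the detector yields pixel boxes as `(x1, y1, x2, y2)` tuples; the `Crop` record, the `plan_crops` function, and the `_crop<N>` naming scheme are hypothetical and not taken from the package.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Crop:
    """Hypothetical record linking one crop back to its parent figure."""
    figure_file: str
    crop_file: str
    box: tuple  # (left, top, right, bottom) in pixels


def plan_crops(figure_file, boxes, width, height):
    """Clamp detected boxes to the image bounds and name one crop per valid box."""
    path = Path(figure_file)
    crops = []
    for index, (x1, y1, x2, y2) in enumerate(boxes):
        left, top = max(0, int(x1)), max(0, int(y1))
        right, bottom = min(width, int(x2)), min(height, int(y2))
        if right <= left or bottom <= top:
            continue  # degenerate box after clamping, nothing to crop
        crop_name = f"{path.stem}_crop{index}{path.suffix}"
        crops.append(Crop(path.name, crop_name, (left, top, right, bottom)))
    return crops
```

Each `box` is ordered (left, top, right, bottom), which is the order Pillow's `Image.crop` expects, so the records can drive the actual cropping step directly.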
Caption Generation
- Gemini support
- Azure OpenAI support
- Batch captioning workflows
- Rate-limited API execution
- Structured JSON outputs
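Rate-limited API execution can be approximated by spacing requests at a fixed minimum interval. The sketch below is one possible approach, not the package's implementation: `RateLimiter` and `caption_all` are hypothetical names, and `caption_fn` stands for whatever function calls the captioning API.

```python
import time


class RateLimiter:
    """Hypothetical limiter that enforces a minimum gap between consecutive calls."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


def caption_all(items, caption_fn, limiter):
    """Caption items one at a time, never exceeding the limiter's request rate."""
    results = {}
    for item in items:
        limiter.wait()
        results[item] = caption_fn(item)
    return results
```

A fixed interval is simpler than a token bucket and avoids bursts, at the cost of not using idle capacity. The clock and sleep functions are injectable so the behaviour can be checked without real waiting.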
Dataset Construction
- Link crops, captions, figures, and metadata
- Generate training-ready CSV datasets
- Create multimodal instruction-tuning corpora
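The linking step can be pictured as a join between crop images and caption files. The sketch below assumes one JSON file per crop, sharing the crop's file stem and holding a `caption` key; that layout, the column names, and the function name `link_dataset` are assumptions made for illustration, not the format the package actually writes.

```python
import csv
import json
from pathlib import Path


def link_dataset(images_dir, json_dir, output_csv):
    """Join crops and captions on file stem and write one CSV row per matched pair."""
    # Index images by stem so "fig1_crop0.jpg" pairs with "fig1_crop0.json".
    images = {p.stem: p for p in Path(images_dir).iterdir() if p.is_file()}
    rows = []
    for json_path in sorted(Path(json_dir).glob("*.json")):
        image = images.get(json_path.stem)
        if image is None:
            continue  # caption without a matching crop is skipped
        record = json.loads(json_path.read_text(encoding="utf-8"))
        # The "caption" key is an assumed layout for this sketch.
        rows.append({"image": image.name, "caption": record.get("caption", "")})
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["image", "caption"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The return value is the number of linked pairs, which gives a quick check against the number of crops on disk.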
Installation
pip install matmmextract
Or install from source:
git clone https://github.com/<your-org>/matmmextract.git
cd matmmextract
pip install -e .
Quick Start
Search OpenAlex
from matmmextract.openalex import fetch_elsevier

fetch_elsevier(
    keywords=["titanium alloy", "microstructure"],
    license_="cc-by",
    from_year=2020,
    to_year=2024,
    max_results=100,
    output_csv="papers.csv",
)
Fetch Elsevier XMLs
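import pandas as pd

from matmmextract.elsevier import fetch_all

# Assumption: papers_df is the search result saved as papers.csv in the previous step.
papers_df = pd.read_csv("papers.csv")

fetch_all(
    df=papers_df,
    api_key="YOUR_API_KEY",
    inst_token="YOUR_INST_TOKEN",
    output_dir="xmls",
)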
Detect Panels
from matmmextract.inference import detect

detect(
    image_dir="images",
    output_dir="detections",
    checkpoint="best.pt",
)
Generate Captions
from matmmextract.inference import gemini_captioner

gemini_captioner(
    csv_path="crops.csv",
    output_dir="subcaptions",
    api_key="YOUR_API_KEY",
)
Build Final Dataset
from matmmextract.inference import build

build(
    images_dir="crops",
    json_dir="subcaptions",
    output_csv="linked_dataset.csv",
)
Example Pipelines
MatmmExtract ships with complete examples:
examples/
├── elsevier_full.py
├── elsevier_scopus.py
├── springer_full.py
└── springer_scopus.py
OpenAlex → Elsevier → Azure → Dataset
python examples/elsevier_full.py
Scopus → Elsevier → Azure → Dataset
python examples/elsevier_scopus.py
OpenAlex → Springer → Gemini → Dataset
python examples/springer_full.py
Scopus → Springer → Azure → Dataset
python examples/springer_scopus.py
Package Structure
matmmextract
├── openalex
├── elsevier
├── springer
├── preprocess
├── inference
└── shared
OpenAlex
Paper discovery and metadata retrieval.
Elsevier
Full-text retrieval, figure extraction, and image downloading.
Springer
Full-text retrieval, figure extraction, and image downloading.
Preprocess
Dataset filtering, DOI processing, publisher filtering, and license analysis.
Inference
Detection, cropping, caption generation, and dataset construction.
Shared
Common utilities used throughout the pipeline.
Documentation
Build locally:
sphinx-build -b html docs docs/_build
Generated documentation:
docs/_build/index.html
Citation
If you use MatmmExtract in academic work, please cite:
MatmmExtract: A Pipeline for Constructing Multimodal Materials-Science Datasets
from Scientific Literature.
License
GNU General Public License v3.0 (GPL-3.0).
See the LICENSE file for details.
Authors
- Subham Ghosh
- Abhishek Tewari
- Mohammad Ibrahim