A Python tool for extracting synthesis conditions of MOFs directly from journal articles in several file formats (HTML, XML and PDF).

mofsyncondition

mofsyncondition is a Python module for automatically extracting synthesis conditions of metal–organic frameworks (MOFs) from scientific journal articles.

The module reads HTML files or PDF-derived text files, uses machine learning models to identify paragraphs describing synthetic protocols and then extracts relevant synthesis conditions. In its current state, the extraction of synthesis conditions is primarily performed using intelligent regular expressions. The resulting dataset is being used to fine-tune a large language model (LLM) for MOFs.

Overview

Extracting synthesis conditions from MOF literature is a key challenge in data-driven materials discovery. mofsyncondition addresses this problem by:

Reading journal articles in HTML, PDF or XML format
Identifying synthesis-related paragraphs using ML-based classification
Extracting structured synthesis conditions from unstructured text
Generating datasets suitable for machine learning and LLM training

Key Features

Support for HTML and PDF-derived text inputs
ML-based identification of synthesis protocols
Regex-driven extraction of synthesis conditions
Modular and extensible Python design
Scalable for large literature datasets

Extracted Synthesis Information

The module aims to extract synthesis parameters such as:

Metal precursors
Organic linkers
Solvents
Additives / modulators
Reaction temperature
Reaction time
pH (when available)
Synthetic methods (e.g. solvothermal, hydrothermal)
Pressure and humidity (when available)
MOF name or formula (when provided)
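
As a rough illustration of what regex-driven extraction of two of these parameters can look like, here is a minimal sketch for reaction temperature and reaction time. The patterns, the `extract_conditions` helper and the example paragraph are assumptions made for illustration; they are not the expressions used inside mofsyncondition.

```python
import re

# Hypothetical patterns for illustration only; not the library's own expressions.
TEMPERATURE = re.compile(r"(\d+(?:\.\d+)?)\s*(°\s*C|K)\b")
DURATION = re.compile(
    r"(\d+(?:\.\d+)?)\s*(hours?|hrs?|h|days?|minutes?|min)\b", re.IGNORECASE
)


def extract_conditions(paragraph: str) -> dict:
    """Return every temperature and duration mentioned in a paragraph."""
    temperatures = [
        f"{value} {unit.replace(' ', '')}"
        for value, unit in TEMPERATURE.findall(paragraph)
    ]
    durations = [f"{value} {unit}" for value, unit in DURATION.findall(paragraph)]
    return {"temperature": temperatures, "time": durations}


# Made-up example paragraph in the style of a solvothermal protocol.
paragraph = (
    "A mixture of Zn(NO3)2·6H2O (0.30 g) and H2BDC (0.10 g) in DMF (10 mL) "
    "was heated at 120 °C for 24 h."
)
conditions = extract_conditions(paragraph)
print(conditions)
```

The word boundary after each unit keeps formula fragments such as `6H2O` from being read as a duration, which is the kind of false positive that simple patterns tend to produce in chemical text.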

Named Entity Recognition for Chemical Reagents

In addition to intelligent regular expressions, mofsyncondition uses a trained spaCy Named Entity Recognizer (NER) to identify chemical reagents and synthesis-related entities directly from raw text and paragraph inputs.

The model, en_mof_chem_ner, is specialized for MOF literature and recognizes the following domain-specific entity types:

Component	Labels
`ner`	`ATMOSPHERE`, `METAL_SALT`, `MODULATOR`, `MOF`, `ORGANIC_LIGAND`, `SOLVENT`, `SYNTH_METHOD`

This NER layer enables reliable extraction of:

Metal precursors and salts
Organic ligands / linkers
Solvents and modulators
Synthetic methods (e.g., solvothermal, hydrothermal)
Reaction atmosphere (e.g., air, nitrogen, argon)
MOF names (when explicitly stated)

These structured entities are then combined with regex-based extraction to produce high-quality synthesis-condition datasets for machine learning and LLM fine-tuning.
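
The combination step described above can be sketched as follows. This is an assumption about how such a merge might look, not the package's internal code: `group_entities` and `merge_record` are hypothetical helpers, and the `(text, label)` pairs are written by hand here. With spaCy, pairs of this shape are usually built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# Entity labels listed for en_mof_chem_ner in this README.
LABELS = {
    "ATMOSPHERE", "METAL_SALT", "MODULATOR", "MOF",
    "ORGANIC_LIGAND", "SOLVENT", "SYNTH_METHOD",
}


def group_entities(entities):
    """Group (text, label) pairs by label, dropping unknown labels and duplicates."""
    grouped = defaultdict(list)
    for text, label in entities:
        if label in LABELS and text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


def merge_record(entities, regex_fields):
    """Combine grouped NER entities with fields obtained from regex extraction."""
    record = group_entities(entities)
    record.update(regex_fields)
    return record


# Hand-written example input (hypothetical, not model output).
example_entities = [
    ("Zn(NO3)2·6H2O", "METAL_SALT"),
    ("H2BDC", "ORGANIC_LIGAND"),
    ("DMF", "SOLVENT"),
    ("DMF", "SOLVENT"),
    ("solvothermal", "SYNTH_METHOD"),
]
record = merge_record(example_entities, {"temperature": ["120 °C"], "time": ["24 h"]})
print(record)
```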

NER Model Performance

Overall evaluation scores on held-out data:

Metric	Score
`ENTS_F`	91.66
`ENTS_P`	92.78
`ENTS_R`	90.56
`TOK2VEC_LOSS`	26365.16
`NER_LOSS`	78555.25

Per-Entity Performance

Entity Type	Precision (P)	Recall (R)	F1-score (F)
METAL_SALT	0.9292	0.9082	0.9186
ORGANIC_LIGAND	0.7600	0.7157	0.7372
SOLVENT	0.9815	0.9900	0.9857
MODULATOR	0.9722	0.9560	0.9640
ATMOSPHERE	0.9715	0.9662	0.9689
SYNTH_METHOD	0.9970	0.9941	0.9955
MOF	0.6797	0.4973	0.5744
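
The F-scores in both tables are the harmonic mean of precision and recall, so they can be recomputed from the other two columns. The short check below uses only numbers from the tables above; `f1_score` is a helper defined here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall scores (percent scale) and the MOF row (fraction scale) from the tables.
overall_f = f1_score(92.78, 90.56)
mof_f = f1_score(0.6797, 0.4973)

print(f"{overall_f:.2f}")  # 91.66, matching ENTS_F
print(f"{mof_f:.4f}")      # 0.5744, matching the MOF row
```

The low MOF score is driven mostly by recall (0.4973): roughly half of the MOF names in the held-out data are missed, so MOF names extracted this way are best treated as incomplete.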

Installation

Clone the repository and install the package locally:

git clone https://github.com/bafgreat/mofsyncondition.git
cd mofsyncondition
pip install .

PyPI

The module can also be installed from PyPI:

   pip install mofsyncondition
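
To confirm that the installation worked, the installed version can be read from the package metadata with the standard library. This is a generic check rather than something provided by mofsyncondition; `installed_version` is a helper defined here for illustration.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("mofsyncondition"))
```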

Usage

1. Extract synthesis paragraphs from a file

If you have files in different formats and want the list of paragraphs that describe a synthesis, run the following code.

    from mofsyncondition.synthesis_conditions import extractor

    # filepaths
    pdf_file_path = '../filename.pdf'
    html_file_path = '../filename.html'
    xml_file_path = '../filename.xml'

    # declare extractor class
    text_extractor = extractor.MOFSynConditionExtractor()

    # PDF extraction

    list_of_paragraphs = text_extractor.read_file(pdf_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)


    # html extraction

    list_of_paragraphs = text_extractor.read_file(html_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)


    # xml extraction

    list_of_paragraphs = text_extractor.read_file(xml_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)
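
Since the same two calls are repeated for every format, they can be wrapped in a small loop. The sketch below assumes only the `read_file` and `get_synthetic_paragraph` methods used above; `collect_synthetic_paragraphs` is a hypothetical helper, and the extension filter is an assumption based on the formats this README mentions.

```python
from pathlib import Path

# File extensions mentioned in this README; the filter itself is an assumption.
SUPPORTED_SUFFIXES = {".pdf", ".html", ".xml"}


def collect_synthetic_paragraphs(text_extractor, file_paths):
    """Map each supported file path to the synthesis paragraphs found in it."""
    results = {}
    for file_path in file_paths:
        if Path(file_path).suffix.lower() not in SUPPORTED_SUFFIXES:
            continue  # skip formats the README does not mention
        paragraphs = text_extractor.read_file(file_path)
        results[file_path] = text_extractor.get_synthetic_paragraph(paragraphs)
    return results
```

It would be called as `collect_synthetic_paragraphs(text_extractor, [pdf_file_path, html_file_path, xml_file_path])`, with `text_extractor` being the object created above.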

By default, the paragraph classifier uses the NN_tfv model. The other available models are listed below.

ML Model Performance (5-Fold Cross-Validation Averages)

Rank	Model	Avg Accuracy	Avg Precision	Notes
1	SVM_tfv	0.9905	0.8163	Default model
2	NN_tfv	0.9903	0.8143
3	RF_tfv	0.9904	0.7730	High accuracy, lower precision
4	RF_CV	0.9902	0.7692	Stable but conservative
5	NN_CV	0.9889	0.8240	High precision
6	LR_tfv	0.9895	0.7853	Fast baseline
7	LR_CV	0.9885	0.8040	Balanced baseline
8	SVM_CV	0.9885	0.8124	Robust alternative
9	DT_CV	0.9865	0.7795	Interpretable
10	DT_tfv	0.9851	0.7692	Simple model
11	NB_CV	0.9837	0.8337	Highest precision
12	NB_tfv	0.9657	0.0232	Not recommended
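
When choosing among these models, it can help to filter on one metric and rank on the other. The sketch below copies the accuracy and precision columns from the table above; `best_model` is a hypothetical helper, not part of the package.

```python
# (average accuracy, average precision) per model, copied from the table above.
MODEL_SCORES = {
    "SVM_tfv": (0.9905, 0.8163),
    "NN_tfv": (0.9903, 0.8143),
    "RF_tfv": (0.9904, 0.7730),
    "RF_CV": (0.9902, 0.7692),
    "NN_CV": (0.9889, 0.8240),
    "LR_tfv": (0.9895, 0.7853),
    "LR_CV": (0.9885, 0.8040),
    "SVM_CV": (0.9885, 0.8124),
    "DT_CV": (0.9865, 0.7795),
    "DT_tfv": (0.9851, 0.7692),
    "NB_CV": (0.9837, 0.8337),
    "NB_tfv": (0.9657, 0.0232),
}


def best_model(scores, metric="accuracy", min_accuracy=0.0):
    """Return the model with the highest metric among those above min_accuracy."""
    index = {"accuracy": 0, "precision": 1}[metric]
    candidates = {
        name: values for name, values in scores.items() if values[0] >= min_accuracy
    }
    return max(candidates, key=lambda name: candidates[name][index])


print(best_model(MODEL_SCORES))                                  # SVM_tfv
print(best_model(MODEL_SCORES, metric="precision"))              # NB_CV
print(best_model(MODEL_SCORES, "precision", min_accuracy=0.99))  # SVM_tfv
```

Accuracy differs by less than one percentage point across all models except NB_tfv, so precision is the more informative column when false positives are costly.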

To use a different model, pass its name through the model argument, e.g.:

   list_of_paragraphs = text_extractor.read_file(xml_file_path)
   synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs, model="NN_CV")

2. Extract paragraph-level synthesis conditions from a file

Suppose you have a document (PDF, HTML or XML) and wish to extract all of its synthesis conditions. The code below is the faster way to do so. This approach is faster than using transformer models, handles large documents and can parse thousands of files.

    from mofsyncondition.synthesis_conditions.mof_synthesis_conditions import MOFSynConditionExtractor
    from mofsyncondition.io import filetyper

    data_extractor = MOFSynConditionExtractor()

    # one list per output style returned by syn_data_from_document
    transformer_dataset = []
    standard_dataset = []

    # documents to process; PDF, XML and HTML inputs can be mixed
    all_files = ["./data_test/Test2.pdf", "./data_test/ABAFUH.xml", "./data_test/Test3.html"]
    for file_path in all_files:
        syn_data = data_extractor.syn_data_from_document(file_path)
        # each item holds a paragraph and its conditions in two styles
        for paragraph, data_style_1, data_style_2 in syn_data:
            transformer_dataset.append({'paragraph': paragraph, "condition": data_style_1})
            standard_dataset.append({'paragraph': paragraph, "condition": data_style_2})

    # write both datasets to disk
    filetyper.list_2_json(transformer_dataset, 'transformer_dataset.jsonl')
    filetyper.list_2_json(standard_dataset, 'standard_dataset.json')
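
The example above saves one file with a `.jsonl` extension. This README does not say what layout `filetyper.list_2_json` writes, so if a strict JSON Lines file (one record per line) is needed for LLM fine-tuning, a standard-library version can be sketched as below. `write_jsonl`, `read_jsonl` and the sample record are assumptions for illustration, not part of mofsyncondition.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Made-up record in the same shape as the dictionaries built above.
records = [
    {"paragraph": "Heated at 120 °C for 24 h in DMF.",
     "condition": {"temperature": "120 °C", "time": "24 h"}},
]

# Round trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "transformer_dataset.jsonl"
    write_jsonl(records, target)
    restored = read_jsonl(target)

print(restored == records)
```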

LICENSE

MIT license

Release history

0.1.3 (this version), Feb 3, 2026
0.1.2, Feb 3, 2026
0.1.1, Jan 16, 2026

Download files

Source Distribution

mofsyncondition-0.1.3.tar.gz (38.4 MB)

Uploaded Feb 3, 2026 Source

Built Distribution

mofsyncondition-0.1.3-py3-none-any.whl (38.8 MB)

Uploaded Feb 3, 2026 Python 3

File details

Details for the file mofsyncondition-0.1.3.tar.gz.

File metadata

Download URL: mofsyncondition-0.1.3.tar.gz
Upload date: Feb 3, 2026
Size: 38.4 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.2.1 CPython/3.10.8 Darwin/24.3.0

File hashes

Hashes for mofsyncondition-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`5b5abe6d8f619f413e02d72e42f161c8b8f5158d5530a936ac8cbaefe402e2f8`
MD5	`293ee1598c4c36108da305bb7a0b69db`
BLAKE2b-256	`19835946c62cce5d75053fd9be3d0db0b01aeb622164a5e8cefe7abb6d8e6202`

File details

Details for the file mofsyncondition-0.1.3-py3-none-any.whl.

File metadata

Download URL: mofsyncondition-0.1.3-py3-none-any.whl
Upload date: Feb 3, 2026
Size: 38.8 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.2.1 CPython/3.10.8 Darwin/24.3.0

File hashes

Hashes for mofsyncondition-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b7797650044e7822582060f75143c78bc00c9fe4877061130782d1b746894892`
MD5	`8b748ba2ba8e29ad5d910090b79512ba`
BLAKE2b-256	`f608e382c725703fa2329464f38645f75b8541fe47785839494e856b1cd68276`
