A Python tool for extracting synthesis conditions of MOFs directly from journal articles in any of the supported file formats (HTML, XML, and PDF).

mofsyncondition

mofsyncondition is a Python module for automatically extracting synthesis conditions of metal–organic frameworks (MOFs) from scientific journal articles.

The module reads HTML files or PDF-derived text files, uses machine learning models to identify paragraphs describing synthetic protocols and then extracts relevant synthesis conditions. In its current state, the extraction of synthesis conditions is primarily performed using intelligent regular expressions. The resulting dataset is being used to fine-tune a large language model (LLM) for MOFs.


Overview

Extracting synthesis conditions from MOF literature is a key challenge in data-driven materials discovery. mofsyncondition addresses this problem by:

  • Reading journal articles in HTML, PDF, or XML format
  • Identifying synthesis-related paragraphs using ML-based classification
  • Extracting structured synthesis conditions from unstructured text
  • Generating datasets suitable for machine learning and LLM training

Key Features

  • Support for HTML and PDF-derived text inputs
  • ML-based identification of synthesis protocols
  • Regex-driven extraction of synthesis conditions
  • Modular and extensible Python design
  • Scalable for large literature datasets

Extracted Synthesis Information

The module aims to extract synthesis parameters such as:

  • Metal precursors
  • Organic linkers
  • Solvents
  • Additives / modulators
  • Reaction temperature
  • Reaction time
  • pH (when available)
  • Synthetic methods (e.g. solvothermal, hydrothermal)
  • Pressure and humidity (when available)
  • Name or formula of the MOF (when provided)
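
As a rough illustration of what regex-driven extraction of two of these parameters (reaction temperature and reaction time) can look like, here is a minimal sketch. The patterns and the function name `extract_temperature_and_time` are assumptions made for this example; they are deliberately simplified and are not the expressions used inside mofsyncondition.

```python
import re

# Simplified, illustrative patterns; NOT the expressions used by mofsyncondition.
TEMPERATURE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(°\s*C|K)\b")
TIME_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(h|hours?|days?|min|minutes?)\b", re.IGNORECASE
)


def extract_temperature_and_time(paragraph):
    """Return every temperature and duration mentioned in a paragraph."""
    temperatures = [
        (float(value), unit.replace(" ", ""))
        for value, unit in TEMPERATURE_RE.findall(paragraph)
    ]
    durations = [
        (float(value), unit.lower())
        for value, unit in TIME_RE.findall(paragraph)
    ]
    return {"temperature": temperatures, "time": durations}


# Hand-written example sentence, not taken from a real article.
example = (
    "Zn(NO3)2·6H2O (0.5 mmol) and H2BDC (0.25 mmol) were dissolved in "
    "DMF (10 mL) and heated at 120 °C for 24 h."
)
print(extract_temperature_and_time(example))
```

The word boundary after each unit keeps the time pattern from matching inside formulas such as 6H2O. Real articles use many more unit spellings and ranges than these two patterns cover.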

Named Entity Recognition for Chemical Reagents

In addition to intelligent regular expressions, mofsyncondition uses a trained spaCy Named Entity Recognizer (NER) to identify chemical reagents and synthesis-related entities directly from raw text and paragraph inputs.

The model, en_mof_chem_ner, is specialized for MOF literature and recognizes the following domain-specific entity types:

Component Labels
ner ATMOSPHERE, METAL_SALT, MODULATOR, MOF, ORGANIC_LIGAND, SOLVENT, SYNTH_METHOD

This NER layer enables reliable extraction of:

  • Metal precursors and salts
  • Organic ligands / linkers
  • Solvents and modulators
  • Synthetic methods (e.g., solvothermal, hydrothermal)
  • Reaction atmosphere (e.g., air, nitrogen, argon)
  • MOF names (when explicitly stated)

These structured entities are then combined with regex-based extraction to produce high-quality synthesis-condition datasets for machine learning and LLM fine-tuning.
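
One convenient way to consume NER output is to group the recognized entity texts by label. The sketch below works on plain (text, label) pairs so that it runs without the trained model; with spaCy, such pairs can be built as `(ent.text, ent.label_)` for each `ent` in `doc.ents`. The function name `group_entities` and the sample pairs are illustrative assumptions and are not part of mofsyncondition.

```python
from collections import defaultdict

# Entity labels listed for en_mof_chem_ner in the text above.
LABELS = (
    "ATMOSPHERE", "METAL_SALT", "MODULATOR", "MOF",
    "ORGANIC_LIGAND", "SOLVENT", "SYNTH_METHOD",
)


def group_entities(entities):
    """Group (text, label) pairs by label, dropping duplicates but keeping order."""
    grouped = defaultdict(list)
    for text, label in entities:
        # Ignore labels outside the documented set and repeated mentions.
        if label in LABELS and text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# Hand-written sample pairs standing in for model output.
sample = [
    ("Zn(NO3)2·6H2O", "METAL_SALT"),
    ("H2BDC", "ORGANIC_LIGAND"),
    ("DMF", "SOLVENT"),
    ("DMF", "SOLVENT"),
    ("solvothermal", "SYNTH_METHOD"),
]
print(group_entities(sample))
```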


NER Model Performance

Overall evaluation scores on held-out data:

Metric Score
ENTS_F 91.66
ENTS_P 92.78
ENTS_R 90.56
TOK2VEC_LOSS 26365.16
NER_LOSS 78555.25

Per-Entity Performance

Entity Type Precision (P) Recall (R) F1-score (F)
METAL_SALT 0.9292 0.9082 0.9186
ORGANIC_LIGAND 0.7600 0.7157 0.7372
SOLVENT 0.9815 0.9900 0.9857
MODULATOR 0.9722 0.9560 0.9640
ATMOSPHERE 0.9715 0.9662 0.9689
SYNTH_METHOD 0.9970 0.9941 0.9955
MOF 0.6797 0.4973 0.5744
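
The F1-score in each row is the harmonic mean of precision and recall, F = 2PR / (P + R). The short check below recomputes it for three rows of the table; the function name `f1_score` is chosen for this example only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "METAL_SALT": (0.9292, 0.9082, 0.9186),
    "ORGANIC_LIGAND": (0.7600, 0.7157, 0.7372),
    "MOF": (0.6797, 0.4973, 0.5744),
}

for label, (p, r, reported) in rows.items():
    print(f"{label}: computed {f1_score(p, r):.4f}, reported {reported:.4f}")
```

The MOF label has the lowest recall in the table (0.4973), which is what pulls its F1-score down to 0.5744.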

Installation

Clone the repository and install the package locally:

git clone https://github.com/bafgreat/mofsyncondition.git
cd mofsyncondition
pip install .

PyPI

The module can also be installed from PyPI:

   pip install mofsyncondition

Usage

1. Extract synthesis paragraphs from a file

To extract the list of paragraphs that describe a synthesis from files in different formats, run the following code.

    from mofsyncondition.synthesis_conditions import extractor

    # filepaths
    pdf_file_path = '../filename.pdf'
    html_file_path = '../filename.html'
    xml_file_path = '../filename.xml'

    # declare extractor class
    text_extractor = extractor.MOFSynConditionExtractor()

    # PDF extraction

    list_of_paragraphs = text_extractor.read_file(pdf_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)


    # html extraction

    list_of_paragraphs = text_extractor.read_file(html_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)


    # xml extraction

    list_of_paragraphs = text_extractor.read_file(xml_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)

By default, the paragraph classification model is NN_tfv. The table below lists all available models.

ML Model Performance (5-Fold Cross-Validation Averages)

Rank Model Avg Accuracy Avg Precision Notes
1 SVM_tfv 0.9905 0.8163 Best overall accuracy
2 NN_tfv 0.9903 0.8143 Default model
3 RF_tfv 0.9904 0.7730 High accuracy, lower precision
4 RF_CV 0.9902 0.7692 Stable but conservative
5 NN_CV 0.9889 0.8240 High precision
6 LR_tfv 0.9895 0.7853 Fast baseline
7 LR_CV 0.9885 0.8040 Balanced baseline
8 SVM_CV 0.9885 0.8124 Robust alternative
9 DT_CV 0.9865 0.7795 Interpretable
10 DT_tfv 0.9851 0.7692 Simple model
11 NB_CV 0.9837 0.8337 Highest precision
12 NB_tfv 0.9657 0.0232 Not recommended
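
When choosing a model from this table, one option is to fix a minimum accuracy and then take the most precise model that reaches it. The sketch below does that with the values copied from the table; the function name `most_precise_model` is an assumption for this example and is not part of mofsyncondition.

```python
# (average accuracy, average precision) copied from the table above.
SCORES = {
    "SVM_tfv": (0.9905, 0.8163),
    "NN_tfv": (0.9903, 0.8143),
    "RF_tfv": (0.9904, 0.7730),
    "RF_CV": (0.9902, 0.7692),
    "NN_CV": (0.9889, 0.8240),
    "LR_tfv": (0.9895, 0.7853),
    "LR_CV": (0.9885, 0.8040),
    "SVM_CV": (0.9885, 0.8124),
    "DT_CV": (0.9865, 0.7795),
    "DT_tfv": (0.9851, 0.7692),
    "NB_CV": (0.9837, 0.8337),
    "NB_tfv": (0.9657, 0.0232),
}


def most_precise_model(min_accuracy=0.0):
    """Name of the most precise model whose accuracy is at least min_accuracy."""
    eligible = {name: s for name, s in SCORES.items() if s[0] >= min_accuracy}
    if not eligible:
        raise ValueError("no model reaches the requested accuracy")
    return max(eligible, key=lambda name: eligible[name][1])


print(most_precise_model())       # no accuracy constraint
print(most_precise_model(0.99))   # accuracy of at least 0.99
```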

To use a different model, pass its name through the model argument, e.g.

   list_of_paragraphs = text_extractor.read_file(xml_file_path)
   synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs, model="NN_CV")
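
For a folder of articles, the two calls above can be wrapped in a small loop. The following is a minimal sketch: the helper name `collect_synthesis_paragraphs` and the suffix filter are assumptions for this example, and the extractor is passed in as an argument, so the helper relies only on the `read_file` and `get_synthetic_paragraph` methods shown above.

```python
from pathlib import Path

# File types named in this README; the filter itself is an assumption.
SUPPORTED_SUFFIXES = {".pdf", ".html", ".xml"}


def collect_synthesis_paragraphs(text_extractor, folder, model=None):
    """Map each supported file in folder to its synthesis paragraphs.

    text_extractor is expected to offer read_file and
    get_synthetic_paragraph, as in the usage shown above.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue
        paragraphs = text_extractor.read_file(str(path))
        if model is None:
            # Fall back to the extractor's default model.
            selected = text_extractor.get_synthetic_paragraph(paragraphs)
        else:
            selected = text_extractor.get_synthetic_paragraph(paragraphs, model=model)
        results[path.name] = selected
    return results
```

With the package installed, a call could look like `collect_synthesis_paragraphs(text_extractor, "../articles", model="NN_CV")`, where `"../articles"` is a placeholder folder.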

2. Extract paragraph-level synthesis conditions from a file

Suppose you have a document (PDF, HTML, or XML) and wish to extract all synthesis conditions. The code below is a fast way to do so. It is faster than using transformer models, and it can handle large documents and parse thousands of files.

import spacy
from mofsyncondition.synthesis_conditions.mof_synthesis_conditions import MOFSynConditionExtractor
from mofsyncondition.io import filetyper

data_extractor = MOFSynConditionExtractor()

# Two views of the same results, one record per synthesis paragraph
transformer_dataset = []
standard_dataset = []

all_files = ["./data_test/Test2.pdf", "./data_test/ABAFUH.xml", "./data_test/Test3.html"]
for file_path in all_files:
    # Each item holds a paragraph and its conditions in two output styles
    syn_data = data_extractor.syn_data_from_document(file_path)
    for paragraph, data_style_1, data_style_2 in syn_data:
        transformer_dataset.append({"paragraph": paragraph, "condition": data_style_1})
        standard_dataset.append({"paragraph": paragraph, "condition": data_style_2})
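
Lists of records like these are commonly stored as JSON Lines (one JSON object per line) for later model training. The sketch below uses only the standard library. The helper name `write_jsonl`, the output file name, and the sample record are assumptions for this example. Because the structure of the "condition" values is not documented here, `default=str` is used as a fallback for values that `json` cannot serialize directly.

```python
import json


def write_jsonl(records, path):
    """Write one JSON object per line; return the number of records written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            # default=str is a fallback for values json cannot serialize natively.
            handle.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")
            count += 1
    return count


# Hypothetical record with the same keys as the datasets built above.
sample_records = [
    {"paragraph": "Heated at 120 °C for 24 h.", "condition": {"temperature": "120 °C"}},
]
written = write_jsonl(sample_records, "standard_dataset.jsonl")
print(f"wrote {written} record(s)")
```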

LICENSE

MIT license
