A Python tool for extracting synthesis conditions of MOFs directly from journal articles in any file format (HTML, XML and PDF)

mofsyncondition

mofsyncondition is a Python module for automatically extracting synthesis conditions of metal–organic frameworks (MOFs) from scientific journal articles.

The module reads journal articles in HTML, XML or PDF format, uses machine learning models to identify paragraphs describing synthetic protocols, and then extracts the relevant synthesis conditions. In its current state, the extraction of synthesis conditions is performed primarily with regular expressions. The resulting dataset is being used to fine-tune a large language model (LLM) for MOFs.
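As a rough illustration of regex-driven extraction, the sketch below pulls temperature and time values out of a synthesis paragraph. The patterns, the function name `extract_conditions` and the example paragraph are assumptions made for illustration; they are not the expressions used inside mofsyncondition.

```python
import re

# Illustrative patterns only (assumptions, not the package's own expressions).
# A number followed by a temperature unit, e.g. "120 °C" or "393 K".
TEMPERATURE = re.compile(r"(\d+(?:\.\d+)?)\s*(°\s?C|K)\b")
# A number followed by a time unit, e.g. "48 h", "30 min", "3 days".
TIME = re.compile(r"(\d+(?:\.\d+)?)\s*(h|hours?|hrs?|min|minutes?|days?|d)\b")


def extract_conditions(paragraph):
    """Return temperature and time mentions as (value, unit) tuples."""
    temperatures = [
        (float(value), unit.replace(" ", ""))
        for value, unit in TEMPERATURE.findall(paragraph)
    ]
    times = [(float(value), unit) for value, unit in TIME.findall(paragraph)]
    return {"temperature": temperatures, "time": times}


# Hypothetical paragraph written for this example.
example = (
    "Zn(NO3)2·6H2O (0.5 mmol) and H2BDC (0.5 mmol) were dissolved in "
    "DMF (10 mL) and heated at 120 °C for 48 h."
)
conditions = extract_conditions(example)
```

Real synthesis text is far messier than this example (ranges, "room temperature", "overnight"), which is why a production extractor needs many more patterns than the two shown here.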


Overview

Extracting synthesis conditions from MOF literature is a key challenge in data-driven materials discovery. mofsyncondition addresses this problem by:

  • Reading journal articles in HTML, PDF or XML format
  • Identifying synthesis-related paragraphs using ML-based classification
  • Extracting structured synthesis conditions from unstructured text
  • Generating datasets suitable for machine learning and LLM training

Key Features

  • Support for HTML, XML and PDF inputs
  • ML-based identification of synthesis protocols
  • Regex-driven extraction of synthesis conditions
  • Modular and extensible Python design
  • Scalable for large literature datasets

Extracted Synthesis Information

The module aims to extract synthesis parameters such as:

  • Metal precursors
  • Organic linkers
  • Solvents
  • Additives / modulators
  • Reaction temperature
  • Reaction time
  • pH (when available)
  • Synthetic methods (e.g. solvothermal, hydrothermal)
  • Pressure and humidity (when available)
  • Name or formula of the MOF (when provided)
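One way to hold the parameters listed above is a small record type with optional fields, since most paragraphs report only some of them. The class `SynthesisRecord` and its field names are hypothetical and chosen for this sketch; the package's own output format may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SynthesisRecord:
    """Hypothetical container for one paragraph's synthesis conditions."""

    mof_name: Optional[str] = None
    metal_precursors: List[str] = field(default_factory=list)
    organic_linkers: List[str] = field(default_factory=list)
    solvents: List[str] = field(default_factory=list)
    additives: List[str] = field(default_factory=list)
    temperature: Optional[str] = None
    time: Optional[str] = None
    ph: Optional[float] = None
    method: Optional[str] = None
    pressure: Optional[str] = None
    humidity: Optional[str] = None


# Values invented for illustration; unreported fields stay None or empty.
record = SynthesisRecord(
    mof_name="MOF-5",
    metal_precursors=["Zn(NO3)2·6H2O"],
    organic_linkers=["H2BDC"],
    solvents=["DMF"],
    temperature="120 °C",
    time="48 h",
    method="solvothermal",
)
record_as_dict = asdict(record)
```

Keeping missing values as `None` rather than guessing makes it clear downstream which conditions were actually stated in the text.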

Installation

Clone the repository and install the package locally:

git clone https://github.com/bafgreat/mofsyncondition.git
cd mofsyncondition
pip install .

PyPI

The module can also be installed from PyPI:

   pip install mofsyncondition

Usage

1. Extract synthesis paragraphs from a file

If you have several files and wish to extract the list of paragraphs describing synthesis, run the following code.

    from mofsyncondition.synthesis_conditions import extractor

    # filepaths
    pdf_file_path = '../filename.pdf'
    html_file_path = '../filename.html'
    xml_file_path = '../filename.xml'

    # declare extractor class
    text_extractor = extractor.MOFSynConditionExtractor()

    # PDF extraction

    list_of_paragraphs = text_extractor.read_file(pdf_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)


    # html extraction

    list_of_paragraphs = text_extractor.read_file(html_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)


    # xml extraction

    list_of_paragraphs = text_extractor.read_file(xml_file_path)
    synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)
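Because the same two calls are repeated for every file type, they can be wrapped in a loop. The helper below is a minimal sketch: `collect_synthetic_paragraphs` is a hypothetical name, and it relies only on the `read_file` and `get_synthetic_paragraph` methods shown above, which are passed in through the `text_extractor` argument.

```python
from pathlib import Path
from typing import Dict, Iterable


def collect_synthetic_paragraphs(text_extractor, file_paths: Iterable[str]) -> Dict[str, list]:
    """Map each existing file path to its synthesis paragraphs.

    `text_extractor` is expected to provide `read_file` and
    `get_synthetic_paragraph`, as in the example above.
    """
    results = {}
    for file_path in file_paths:
        if not Path(file_path).is_file():
            # Skip missing files instead of stopping the whole batch.
            continue
        paragraphs = text_extractor.read_file(file_path)
        results[str(file_path)] = text_extractor.get_synthetic_paragraph(paragraphs)
    return results
```

With the extractor declared as above, a call would look like `collect_synthetic_paragraphs(text_extractor, [pdf_file_path, html_file_path, xml_file_path])`.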

By default, paragraph classification uses the NN_tfv model. The other available models are listed below.

ML Model Performance (5-Fold Cross-Validation Averages)

Rank  Model    Avg Accuracy  Avg Precision  Notes
1     SVM_tfv  0.9905        0.8163         Best overall accuracy
2     NN_tfv   0.9903        0.8143         Default model
3     RF_tfv   0.9904        0.7730         High accuracy, lower precision
4     RF_CV    0.9902        0.7692         Stable but conservative
5     NN_CV    0.9889        0.8240         High precision
6     LR_tfv   0.9895        0.7853         Fast baseline
7     LR_CV    0.9885        0.8040         Balanced baseline
8     SVM_CV   0.9885        0.8124         Robust alternative
9     DT_CV    0.9865        0.7795         Interpretable
10    DT_tfv   0.9851        0.7692         Simple model
11    NB_CV    0.9837        0.8337         Highest precision
12    NB_tfv   0.9657        0.0232         Not recommended
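Accuracy and precision can diverge sharply when one class is rare, which is worth keeping in mind when reading the table. The worked example below uses hypothetical counts (1000 paragraphs, of which 20 describe a synthesis); the numbers are invented for illustration and are not taken from the package's evaluation data.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all paragraphs that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Fraction of paragraphs flagged as synthesis that really are synthesis."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


# Hypothetical confusion-matrix counts for 1000 paragraphs, 20 of them synthesis.
tp, fp, tn, fn = 16, 4, 976, 4

example_accuracy = accuracy(tp, fp, tn, fn)   # (16 + 976) / 1000
example_precision = precision(tp, fp)         # 16 / (16 + 4)
```

Under these assumed counts, only 8 of 1000 paragraphs are misclassified, yet one in five flagged paragraphs is a false positive. When few paragraphs belong to the positive class, precision is therefore the more informative column for choosing a model.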

To use a different model, pass its name through the model argument, e.g.:

   list_of_paragraphs = text_extractor.read_file(xml_file_path)
   synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs, model="NN_CV")

2. Extract paragraph-level synthesis conditions from a file

Suppose you have a document (PDF, HTML or XML) and wish to extract all synthesis conditions. The lines of code below are the fastest way to do so. This approach is faster than using transformer models, handles large documents, and can parse thousands of files.

    from mofsyncondition.synthesis_conditions.extractor import MOFSynConditionExtractor

    data_extractor = MOFSynConditionExtractor()

    # One list per output style returned by syn_data_from_document
    transformer_dataset = []
    standard_dataset = []

    all_files = ["./data_test/Test2.pdf", "./data_test/ABAFUH.xml", "./data_test/Test3.html"]
    for file_path in all_files:
        # Each item pairs a synthesis paragraph with its conditions in two styles
        syn_data = data_extractor.syn_data_from_document(file_path)
        for paragraph, data_style_1, data_style_2 in syn_data:
            transformer_dataset.append({"paragraph": paragraph, "condition": data_style_1})
            standard_dataset.append({"paragraph": paragraph, "condition": data_style_2})
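The two lists built above can be written to disk for later model training. The sketch below stores one JSON object per line (JSON Lines); the helper name `write_jsonl` and the output file names are assumptions, and `default=str` is used as a fallback because the exact type of the extracted conditions is not documented here.

```python
import json
from typing import Iterable, Mapping


def write_jsonl(records: Iterable[Mapping], output_path: str) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as handle:
        for record in records:
            # default=str converts values that json cannot serialise natively.
            line = json.dumps(record, ensure_ascii=False, default=str)
            handle.write(line + "\n")
            count += 1
    return count
```

For example, `write_jsonl(standard_dataset, "standard_dataset.jsonl")` and `write_jsonl(transformer_dataset, "transformer_dataset.jsonl")` would save the two datasets side by side.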

LICENSE

MIT license
