A Python tool for extracting synthesis conditions of MOFs directly from journal articles in any supported file format (HTML, XML, and PDF).
Project description
mofsyncondition
mofsyncondition is a Python module for automatically extracting synthesis conditions of metal–organic frameworks (MOFs) from scientific journal articles.
The module reads HTML files or PDF-derived text files, uses machine learning models to identify paragraphs describing synthetic protocols and then extracts relevant synthesis conditions. In its current state, the extraction of synthesis conditions is primarily performed using intelligent regular expressions. The resulting dataset is being used to fine-tune a large language model (LLM) for MOFs.
Overview
Extracting synthesis conditions from MOF literature is a key challenge in data-driven materials discovery.
mofsyncondition addresses this problem by:
- Reading journal articles in HTML, PDF, or XML format
- Identifying synthesis-related paragraphs using ML-based classification
- Extracting structured synthesis conditions from unstructured text
- Generating datasets suitable for machine learning and LLM training
Key Features
- Support for HTML and PDF-derived text inputs
- ML-based identification of synthesis protocols
- Regex-driven extraction of synthesis conditions
- Modular and extensible Python design
- Scalable for large literature datasets
Extracted Synthesis Information
The module aims to extract synthesis parameters such as:
- Metal precursors
- Organic linkers
- Solvents
- Additives / modulators
- Reaction temperature
- Reaction time
- pH (when available)
- Synthetic methods (e.g. solvothermal, hydrothermal)
- Pressure and humidity (when available)
- Name or formula of the MOF (when provided)
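To make the regex-driven part concrete, here is a minimal sketch of how reaction temperature and time could be pulled from a synthesis paragraph. The patterns and the `extract_conditions` helper are illustrative assumptions written for this example; they are not the patterns shipped with mofsyncondition, which are likely to cover many more cases.

```python
import re

# Illustrative patterns only; NOT the patterns used inside mofsyncondition.
TEMPERATURE = re.compile(r"(\d+(?:\.\d+)?)\s*(°C|K)\b")
DURATION = re.compile(r"(\d+(?:\.\d+)?)\s*(h|hours?|days?|min|minutes?)\b")


def extract_conditions(paragraph: str) -> dict:
    """Return every temperature and duration mention found in a paragraph."""
    temperatures = [
        {"value": float(value), "unit": unit}
        for value, unit in TEMPERATURE.findall(paragraph)
    ]
    durations = [
        {"value": float(value), "unit": unit}
        for value, unit in DURATION.findall(paragraph)
    ]
    return {"temperature": temperatures, "time": durations}


# Hypothetical example paragraph, written for this sketch.
paragraph = (
    "Zn(NO3)2·6H2O and H2BDC were dissolved in DMF, "
    "and the mixture was heated at 120 °C for 24 h."
)
conditions = extract_conditions(paragraph)
print(conditions)
```

Keeping the numeric value and the unit as separate fields makes later unit conversion straightforward.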
Named Entity Recognition for Chemical Reagents
In addition to intelligent regular expressions, mofsyncondition uses a trained spaCy Named Entity Recognizer (NER) to identify chemical reagents and synthesis-related entities directly from raw text and paragraph inputs.
The model, en_mof_chem_ner, is specialized for MOF literature and recognizes the following domain-specific entity types:
| Component | Labels |
|---|---|
| ner | ATMOSPHERE, METAL_SALT, MODULATOR, MOF, ORGANIC_LIGAND, SOLVENT, SYNTH_METHOD |
This NER layer enables reliable extraction of:
- Metal precursors and salts
- Organic ligands / linkers
- Solvents and modulators
- Synthetic methods (e.g., solvothermal, hydrothermal)
- Reaction atmosphere (e.g., air, nitrogen, argon)
- MOF names (when explicitly stated)
These structured entities are then combined with regex-based extraction to produce high-quality synthesis-condition datasets for machine learning and LLM fine-tuning.
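The combination step can be pictured as grouping labelled entities into fields and then merging in the regex-derived values. The sketch below is an assumption about how such a merge might look, not the package's internal code: `LABEL_TO_FIELD`, `build_record`, and the output field names are hypothetical. With spaCy, the `(text, label)` pairs could be obtained as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# Entity labels listed for en_mof_chem_ner, mapped to hypothetical field names.
LABEL_TO_FIELD = {
    "METAL_SALT": "metal_precursors",
    "ORGANIC_LIGAND": "organic_linkers",
    "SOLVENT": "solvents",
    "MODULATOR": "modulators",
    "SYNTH_METHOD": "methods",
    "ATMOSPHERE": "atmosphere",
    "MOF": "mof_names",
}


def build_record(entities, regex_conditions):
    """Merge (text, label) entity pairs with regex-derived values into one dict."""
    grouped = defaultdict(list)
    for text, label in entities:
        field = LABEL_TO_FIELD.get(label)
        # Skip unknown labels and avoid listing the same reagent twice.
        if field is not None and text not in grouped[field]:
            grouped[field].append(text)
    record = dict(grouped)
    record.update(regex_conditions)
    return record


# Hand-written entities standing in for NER output on one paragraph.
entities = [
    ("Zn(NO3)2·6H2O", "METAL_SALT"),
    ("H2BDC", "ORGANIC_LIGAND"),
    ("DMF", "SOLVENT"),
    ("DMF", "SOLVENT"),
    ("solvothermal", "SYNTH_METHOD"),
]
record = build_record(entities, {"temperature": "120 °C", "time": "24 h"})
print(record)
```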
NER Model Performance
Overall evaluation scores on held-out data:
| Metric | Score |
|---|---|
| ENTS_F | 91.66 |
| ENTS_P | 92.78 |
| ENTS_R | 90.56 |
| TOK2VEC_LOSS | 26365.16 |
| NER_LOSS | 78555.25 |
Per-Entity Performance
| Entity Type | Precision (P) | Recall (R) | F1-score (F) |
|---|---|---|---|
| METAL_SALT | 0.9292 | 0.9082 | 0.9186 |
| ORGANIC_LIGAND | 0.7600 | 0.7157 | 0.7372 |
| SOLVENT | 0.9815 | 0.9900 | 0.9857 |
| MODULATOR | 0.9722 | 0.9560 | 0.9640 |
| ATMOSPHERE | 0.9715 | 0.9662 | 0.9689 |
| SYNTH_METHOD | 0.9970 | 0.9941 | 0.9955 |
| MOF | 0.6797 | 0.4973 | 0.5744 |
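The F1-score column is the harmonic mean of precision and recall, so it can be recomputed from the other two columns as a sanity check. The snippet below does this for three rows copied from the table; small differences in the last digit are possible because the published precision and recall are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the per-entity table.
reported = {
    "METAL_SALT": (0.9292, 0.9082, 0.9186),
    "ORGANIC_LIGAND": (0.7600, 0.7157, 0.7372),
    "MOF": (0.6797, 0.4973, 0.5744),
}

for label, (p, r, f) in reported.items():
    print(f"{label}: recomputed F1 = {f1_score(p, r):.4f}, reported = {f:.4f}")
```

The low F1 for MOF is driven mainly by recall (0.4973): roughly half of the MOF names in the held-out data are missed.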
Installation
Clone the repository and install the package locally:
git clone https://github.com/bafgreat/mofsyncondition.git
cd mofsyncondition
pip install .
PyPI
The module can also be installed from PyPI:
pip install mofsyncondition
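After installing, one way to confirm that the package is visible to the current interpreter is to query its distribution metadata with the standard library. The `installed_version` helper is a name introduced for this sketch.

```python
from importlib import metadata


def installed_version(distribution: str):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("mofsyncondition")
print(version if version else "mofsyncondition is not installed in this environment")
```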
Usage
1. Extract synthesis paragraphs from a file
If you have one or more files and want the list of paragraphs that describe a synthesis, run the following code.
from mofsyncondition.synthesis_conditions import extractor
# filepaths
pdf_file_path = '../filename.pdf'
html_file_path = '../filename.html'
xml_file_path = '../filename.xml'
# declare extractor class
text_extractor = extractor.MOFSynConditionExtractor()
# PDF extraction
list_of_paragraphs = text_extractor.read_file(pdf_file_path)
synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)
# html extraction
list_of_paragraphs = text_extractor.read_file(html_file_path)
synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)
# xml extraction
list_of_paragraphs = text_extractor.read_file(xml_file_path)
synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs)
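The three cases above repeat the same two calls, so they can be folded into a loop. The helper below is a sketch that assumes only the `read_file` and `get_synthetic_paragraph` methods shown above; the function name and the choice to collect failures instead of stopping are assumptions made for this example.

```python
def synthesis_paragraphs_by_file(text_extractor, file_paths):
    """Map each path to its synthesis paragraphs; collect failures separately."""
    results, failures = {}, {}
    for path in file_paths:
        try:
            paragraphs = text_extractor.read_file(path)
            results[path] = text_extractor.get_synthetic_paragraph(paragraphs)
        except Exception as error:  # broad on purpose: one bad file should not stop the batch
            failures[path] = repr(error)
    return results, failures


# Usage with the extractor declared above (requires mofsyncondition):
# results, failures = synthesis_paragraphs_by_file(
#     text_extractor, [pdf_file_path, html_file_path, xml_file_path]
# )
```

Returning the failures alongside the results makes it easy to retry or inspect unreadable files after a long run.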
By default, the paragraph classification model is NN_tfv. The other available models are listed below.
ML Model Performance (5-Fold Cross-Validation Averages)
| Rank | Model | Avg Accuracy | Avg Precision | Notes |
|---|---|---|---|---|
| 1 | SVM_tfv | 0.9905 | 0.8163 | Default model |
| 2 | NN_tfv | 0.9903 | 0.8143 | |
| 3 | RF_tfv | 0.9904 | 0.7730 | High accuracy, lower precision |
| 4 | RF_CV | 0.9902 | 0.7692 | Stable but conservative |
| 5 | NN_CV | 0.9889 | 0.8240 | High precision |
| 6 | LR_tfv | 0.9895 | 0.7853 | Fast baseline |
| 7 | LR_CV | 0.9885 | 0.8040 | Balanced baseline |
| 8 | SVM_CV | 0.9885 | 0.8124 | Robust alternative |
| 9 | DT_CV | 0.9865 | 0.7795 | Interpretable |
| 10 | DT_tfv | 0.9851 | 0.7692 | Simple model |
| 11 | NB_CV | 0.9837 | 0.8337 | Highest precision |
| 12 | NB_tfv | 0.9657 | 0.0232 | Not recommended |
To use a different model, pass its name through the model argument, e.g.:
list_of_paragraphs = text_extractor.read_file(xml_file_path)
synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs, model="NN_CV")
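Which model to pass depends on whether accuracy or precision matters more for your corpus. As a sketch, the cross-validation averages from the table can be put in a dictionary and queried; `CV_SCORES` and `best_model` are names introduced here for illustration and are not part of the package.

```python
# (average accuracy, average precision) copied from the cross-validation table.
CV_SCORES = {
    "SVM_tfv": (0.9905, 0.8163),
    "NN_tfv": (0.9903, 0.8143),
    "RF_tfv": (0.9904, 0.7730),
    "RF_CV": (0.9902, 0.7692),
    "NN_CV": (0.9889, 0.8240),
    "LR_tfv": (0.9895, 0.7853),
    "LR_CV": (0.9885, 0.8040),
    "SVM_CV": (0.9885, 0.8124),
    "DT_CV": (0.9865, 0.7795),
    "DT_tfv": (0.9851, 0.7692),
    "NB_CV": (0.9837, 0.8337),
    "NB_tfv": (0.9657, 0.0232),
}


def best_model(metric: str = "precision", min_accuracy: float = 0.0) -> str:
    """Return the model with the highest metric among those meeting min_accuracy."""
    index = {"accuracy": 0, "precision": 1}[metric]
    candidates = {
        name: scores for name, scores in CV_SCORES.items() if scores[0] >= min_accuracy
    }
    return max(candidates, key=lambda name: candidates[name][index])


model_name = best_model("precision", min_accuracy=0.985)
print(model_name)

# The chosen name can then be passed to the extractor (requires mofsyncondition):
# synthetic_paragraphs = text_extractor.get_synthetic_paragraph(list_of_paragraphs, model=model_name)
```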
2. Extract paragraph-level synthesis conditions from a file
Suppose you have a document (PDF, HTML, or XML) and want to extract all of its synthesis conditions. The code below is the fastest way to do so. This approach is faster than using transformer models, handles large documents, and can parse thousands of files.
import spacy
from mofsyncondition.synthesis_conditions.mof_synthesis_conditions import MOFSynConditionExtractor
from mofsyncondition.io import filetyper

data_extractor = MOFSynConditionExtractor()
transformer_dataset = []
standard_dataset = []
all_files = ["./data_test/Test2.pdf", "./data_test/ABAFUH.xml", "./data_test/Test3.html"]

# Collect both output styles for every synthesis paragraph in every file
for file_path in all_files:
    syn_data = data_extractor.syn_data_from_document(file_path)
    for paragraph, data_style_1, data_style_2 in syn_data:
        transformer_dataset.append({'paragraph': paragraph, 'condition': data_style_1})
        standard_dataset.append({'paragraph': paragraph, 'condition': data_style_2})

# Write each dataset to disk
filetyper.list_2_json(transformer_dataset, 'transformer_dataset.jsonl')
filetyper.list_2_json(standard_dataset, 'standard_dataset.json')
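For LLM fine-tuning, records of the form `{'paragraph': ..., 'condition': ...}` are commonly stored as JSON Lines, one object per line. The sketch below writes and reads such a file with the standard library only. It is an independent illustration and makes no claim about the exact format that `filetyper.list_2_json` produces; `write_jsonl` and `read_jsonl` are names introduced here.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line (JSON Lines)."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical record with the same keys as the datasets built above.
example_records = [
    {"paragraph": "The mixture was heated at 120 °C for 24 h.",
     "condition": {"temperature": "120 °C", "time": "24 h"}},
]
```

Because each line is a complete JSON object, large datasets can be streamed line by line instead of being loaded into memory at once.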
LICENSE
MIT license
File details
Details for the file mofsyncondition-0.1.3.tar.gz.
File metadata
- Download URL: mofsyncondition-0.1.3.tar.gz
- Upload date:
- Size: 38.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.2.1 CPython/3.10.8 Darwin/24.3.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5b5abe6d8f619f413e02d72e42f161c8b8f5158d5530a936ac8cbaefe402e2f8 |
| MD5 | 293ee1598c4c36108da305bb7a0b69db |
| BLAKE2b-256 | 19835946c62cce5d75053fd9be3d0db0b01aeb622164a5e8cefe7abb6d8e6202 |
File details
Details for the file mofsyncondition-0.1.3-py3-none-any.whl.
File metadata
- Download URL: mofsyncondition-0.1.3-py3-none-any.whl
- Upload date:
- Size: 38.8 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.2.1 CPython/3.10.8 Darwin/24.3.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b7797650044e7822582060f75143c78bc00c9fe4877061130782d1b746894892 |
| MD5 | 8b748ba2ba8e29ad5d910090b79512ba |
| BLAKE2b-256 | f608e382c725703fa2329464f38645f75b8541fe47785839494e856b1cd68276 |