Convert chemical molecule data CSV files to structured data formats
Molstruct is a lightweight Python CLI tool that converts CSV (comma-separated values) files with chemical molecule data to structured data formats: JSON-LD, RDFa, and Microdata. Molstruct offers many customization options, all of which are optional. Python 3.2+ is supported and no dependencies are required. Sounds good so far? A really tiny Molstruct Docker image is available too. Just try Molstruct!
What is structured data
Structured data is additional data placed on websites. It is not visible to ordinary internet users but can be easily processed by machines. Three formats can be used to embed structured data: JSON-LD, RDFa, and Microdata. Molstruct supports them all and uses the MolecularEntity profile.
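To make the idea concrete, here is a minimal sketch of what a JSON-LD description of a single molecule could look like, built with Python's standard json module. The `@context` value, the helper name `to_jsonld_script`, and the overall shape are assumptions made for illustration; the markup that Molstruct actually generates may differ.

```python
import json

# Illustrative molecule record. The property names come from the
# MolecularEntity properties listed in this README; the "@context"
# value is an assumption, not taken from Molstruct's output.
molecule = {
    "@context": "https://schema.org",
    "@type": "MolecularEntity",
    "name": "Water",
    "molecularFormula": "H2O",
    "inChIKey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "smiles": "O",
}


def to_jsonld_script(data):
    """Wrap a JSON-LD object in the script tag used to embed it in HTML."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


snippet = to_jsonld_script(molecule)
print(snippet)
```

A browser does not render this block, which is why visitors never see it, while a crawler can parse it as ordinary JSON.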
Where to find a CSV file with molecule data
There are many possibilities. The easiest way is to download a CSV file from one of the chemical databases, e.g. DrugBank. You can also create the CSV file yourself.
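If you create the CSV file yourself, a short script is often enough. The sketch below writes two molecules using some of Molstruct's default column names; the helper `write_molecules` and the choice of columns are illustrative assumptions, not part of Molstruct.

```python
import csv
import io

# A subset of the default column names that Molstruct recognizes.
FIELDS = ["identifier", "name", "inChIKey", "smiles", "molecularFormula"]

# Illustrative rows; the identifier column holds CAS numbers here.
ROWS = [
    {"identifier": "7732-18-5", "name": "Water",
     "inChIKey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
     "smiles": "O", "molecularFormula": "H2O"},
    {"identifier": "50-78-2", "name": "Aspirin",
     "inChIKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
     "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O", "molecularFormula": "C9H8O4"},
]


def write_molecules(handle, rows, fields=FIELDS):
    """Write a header line followed by one CSV line per molecule."""
    writer = csv.DictWriter(handle, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)


# Written to an in-memory buffer here so the result can be printed.
buffer = io.StringIO()
write_molecules(buffer, ROWS)
text = buffer.getvalue()
print(text)
```

To produce a real file, pass a handle opened with `open("molecules.csv", "w", newline="", encoding="utf-8")` instead of the in-memory buffer.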
Quick start
Use Molstruct in 3 easy steps. In this example, we will use the DrugBank open dataset. You need Python 3.2+ and pip installed.
- Open a terminal and install Molstruct
You can install Molstruct from PyPI:
pip install molstruct
Molstruct is also available as a Docker image. In most cases, installing Molstruct from PyPI or using Docker should be sufficient and convenient, but you may want to run Molstruct from sources or build a Docker image yourself.
- Download the DrugBank open dataset in CSV format and unzip the downloaded archive.
- Molstruct has a predefined preset for this dataset. You just need to select the output format and enter the path to the CSV file. Assuming the drugbank vocabulary.csv file is in the current directory and the output format you're interested in is RDFa, the command will be as follows:
molstruct -p drugbank-open -f rdfa "drugbank vocabulary.csv" > drugbank_cc0_rdfa.html
That's all. Now you have the RDFa file ready in the current directory. You can try other output formats and options as described below. You can also use Molstruct to convert other data in CSV format.
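If you prefer to drive the conversion from a script, the same command can be launched with Python's subprocess module. This is only a sketch: `build_command` and `convert` are hypothetical helper names, and it assumes the `molstruct` executable is on your PATH. Only the `-f` and `-p` options documented in this README are used.

```python
import subprocess


def build_command(csv_path, output_format, preset=None):
    """Assemble the molstruct argument list from documented options."""
    command = ["molstruct", "-f", output_format]
    if preset is not None:
        command += ["-p", preset]
    command.append(csv_path)
    return command


def convert(csv_path, output_path, output_format="rdfa", preset=None):
    """Run molstruct and save its standard output to output_path."""
    command = build_command(csv_path, output_format, preset)
    with open(output_path, "w", encoding="utf-8") as out:
        # check=True raises CalledProcessError if molstruct exits with an error.
        subprocess.run(command, stdout=out, check=True)
```

Calling `convert("drugbank vocabulary.csv", "drugbank_cc0_rdfa.html", preset="drugbank-open")` mirrors the shell command from step 3. Because the arguments are passed as a list, the space in the file name needs no extra quoting.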
Docker image
If you have Docker installed, you can use a tiny Molstruct image from Docker Hub.
Because the tool runs inside the container, you have to mount the local directory that contains your input file. The default working directory of the image is /app. You need to mount your local directory inside it (e.g. /app/input):
docker run -it --rm --name molstruct-app --mount type=bind,source=/home/user/input,target=/app/input,readonly lszeremeta/molstruct:latest
In this case, the local directory /home/user/input has been mounted under /app/input.
You can also simply mount the current working directory using the $(pwd) command substitution:
docker run -it --rm --name molstruct-app --mount type=bind,source="$(pwd)",target=/app/input,readonly lszeremeta/molstruct:latest
Usage
usage: molstruct [-h] [--version] -f {jsonldhtml,jsonld,rdfa,microdata} [-i IDENTIFIER]
[-n NAME] [-ink INCHIKEY] [-in INCHI] [-sm SMILES] [-u URL]
[-iu IUPACNAME] [-mf MOLECULARFORMULA] [-w MOLECULARWEIGHT]
[-mw MONOISOTOPICMOLECULARWEIGHT] [-d DESCRIPTION]
[-dd DISAMBIGUATINGDESCRIPTION] [-img IMAGE] [-an ALTERNATENAME]
[-sa SAMEAS] [-p {drugbank-open} | -c] [-s {iri,uuid,bnode}] [-b BASE]
[-vd VALUE_DELIMITER] [-l LIMIT]
file
Supported MolecularEntity properties that correspond to default CSV column names: identifier, name, inChIKey, inChI, smiles, url, iupacName, molecularFormula, molecularWeight, monoisotopicMolecularWeight, description, disambiguatingDescription, image, alternateName and sameAs. You can rename the columns if needed (see Column name change arguments below). You can also use a preset with the appropriate settings for your dataset.
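Before converting a file, it can help to check which of its columns already use the default names. The sketch below does this with the standard csv module; `check_header` is a hypothetical helper, and the exact, case-sensitive comparison is an assumption about how column names are matched.

```python
import csv
import io

# Default column names, as listed above.
SUPPORTED = [
    "identifier", "name", "inChIKey", "inChI", "smiles", "url",
    "iupacName", "molecularFormula", "molecularWeight",
    "monoisotopicMolecularWeight", "description",
    "disambiguatingDescription", "image", "alternateName", "sameAs",
]


def check_header(handle):
    """Split the CSV header into default-named columns and all others."""
    header = next(csv.reader(handle), [])
    recognized = [column for column in header if column in SUPPORTED]
    other = [column for column in header if column not in SUPPORTED]
    return recognized, other


# Small in-memory sample; "Formula" is not a default column name.
sample = io.StringIO("name,smiles,Formula\nWater,O,H2O\n")
recognized, other = check_header(sample)
print("default names:", recognized)
print("need renaming or a preset:", other)
```

Columns reported in the second list are candidates for the column name change arguments described below, for example `-mf "Formula"`.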
Informative arguments
-h, --help: show help message and exit
--version: show program version and exit
Required arguments
-f {jsonldhtml,jsonld,rdfa,microdata}, --format {jsonldhtml,jsonld,rdfa,microdata}: output format
file: CSV file path with molecule data to convert
Remember to use the appropriate file path when using the Docker image. Suppose you mounted your local directory /home/user/input under /app/input and the path to the CSV file you want to use in Molstruct is /home/user/input/file.csv. In this case, enter the path /app/input/file.csv or input/file.csv as the file argument value.
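The path translation described above can be sketched with pathlib. `to_container_path` is a hypothetical helper written for this illustration; it assumes POSIX-style paths and the /app/input mount point used in the examples.

```python
from pathlib import PurePosixPath


def to_container_path(host_file, host_dir, mount_point="/app/input"):
    """Translate a host file path into the matching path inside the container.

    Raises ValueError if host_file is not located under host_dir.
    """
    relative = PurePosixPath(host_file).relative_to(host_dir)
    return str(PurePosixPath(mount_point) / relative)


print(to_container_path("/home/user/input/file.csv", "/home/user/input"))
```

Since the image's working directory is /app, the relative form input/file.csv refers to the same file.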
Column name change arguments
Arguments for changing the default column names
-i IDENTIFIER, --identifier IDENTIFIER: identifier column name ('identifier' by default), Text
-n NAME, --name NAME: name column name ('name' by default), Text
-ink INCHIKEY, --inChIKey INCHIKEY: inChIKey column name ('inChIKey' by default), Text
-in INCHI, --inChI INCHI: inChI column name ('inChI' by default), Text
-sm SMILES, --smiles SMILES: smiles column name ('smiles' by default), Text
-u URL, --url URL: url column name ('url' by default), URL
-iu IUPACNAME, --iupacName IUPACNAME: iupacName column name ('iupacName' by default), Text
-mf MOLECULARFORMULA, --molecularFormula MOLECULARFORMULA: molecularFormula column name ('molecularFormula' by default), Text
-w MOLECULARWEIGHT, --molecularWeight MOLECULARWEIGHT: molecularWeight column name ('molecularWeight' by default), Mass (e.g. 0.01 mg)
-mw MONOISOTOPICMOLECULARWEIGHT, --monoisotopicMolecularWeight MONOISOTOPICMOLECULARWEIGHT: monoisotopicMolecularWeight column name ('monoisotopicMolecularWeight' by default), Mass (e.g. 0.01 mg)
-d DESCRIPTION, --description DESCRIPTION: description column name ('description' by default), Text
-dd DISAMBIGUATINGDESCRIPTION, --disambiguatingDescription DISAMBIGUATINGDESCRIPTION: disambiguatingDescription column name ('disambiguatingDescription' by default), Text
-img IMAGE, --image IMAGE: image column name ('image' by default), URL
-an ALTERNATENAME, --alternateName ALTERNATENAME: alternateName column name ('alternateName' by default), Text
-sa SAMEAS, --sameAs SAMEAS: sameAs column name ('sameAs' by default), URL
Additional settings arguments
-p {drugbank-open}, --preset {drugbank-open}: apply presets for individual CSV sources to avoid setting individual options manually ('drugbank-open')
-c, --columns: use only columns with renamed names; not available when using a preset
-s {iri,uuid,bnode}, --subject {iri,uuid,bnode}: molecule subject type ('iri' by default)
-b BASE, --base BASE: molecule subject base for 'iri' subject type ('http://example.com/molecule#entity' by default)
-vd VALUE_DELIMITER, --value-delimiter VALUE_DELIMITER: value delimiter (' | ' by default)
-l LIMIT, --limit LIMIT: maximum number of results (unlimited by default)
Available options may vary depending on the version. To display all available options with their descriptions, use molstruct -h.
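The value delimiter and limit settings are easiest to picture on sample data. The sketch below shows, under stated assumptions, how a cell holding several values separated by ' | ' could be split and how a row limit could be applied; `read_rows` is a hypothetical helper and does not claim to reproduce Molstruct's internals.

```python
import csv
import io
from itertools import islice


def read_rows(handle, value_delimiter=" | ", limit=None):
    """Yield at most `limit` rows, each cell split into a list of values."""
    reader = csv.DictReader(handle)
    # islice with limit=None passes every row through (no limit).
    for row in islice(reader, limit):
        yield {
            column: [v for v in (cell or "").split(value_delimiter) if v]
            for column, cell in row.items()
        }


# Illustrative in-memory CSV with a multi-valued alternateName cell.
sample = io.StringIO(
    "name,alternateName\n"
    "Aspirin,Acetylsalicylic acid | ASA\n"
    "Water,Oxidane\n"
    "Ethanol,Alcohol\n"
)
rows = list(read_rows(sample, limit=2))
for row in rows:
    print(row)
```

With `limit=2` the third molecule is never read, and the first row's alternateName cell becomes two separate values.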
Predefined presets
To make your work easier, Molstruct has built-in preset support. Instead of setting everything manually, you select the appropriate preset and you are ready to go. The presets are flexible: if you want to change, for example, the column names selected by a preset, you can do so. At the moment, you can use the DrugBank open preset. There are plans to add more in the future. Any suggestions are welcome!
drugbank-open
Settings for the open DrugBank dataset CSV file:
--value-delimiter is set to ' | '
--identifier is set to 'CAS'
--name is set to 'Common name'
--inChIKey is set to 'Standard InChI Key'
--alternateName is set to 'Synonyms'
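Conceptually, a preset is a mapping from property names to the column names used by a particular dataset. The sketch below expresses the drugbank-open column settings as a Python dictionary and applies it to one row; `DRUGBANK_OPEN`, `apply_preset`, the sample row, and its "Notes" column are illustrative assumptions rather than Molstruct's own code or data.

```python
# Column settings of the drugbank-open preset, as listed above.
DRUGBANK_OPEN = {
    "identifier": "CAS",
    "name": "Common name",
    "inChIKey": "Standard InChI Key",
    "alternateName": "Synonyms",
}


def apply_preset(row, preset):
    """Pick the preset's columns from a CSV row, keyed by property name."""
    return {
        prop: row[column]
        for prop, column in preset.items()
        if column in row
    }


# Illustrative row; "Notes" is a made-up column that the mapping ignores.
row = {
    "CAS": "50-78-2",
    "Common name": "Acetylsalicylic acid",
    "Standard InChI Key": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "Synonyms": "Aspirin | ASA",
    "Notes": "example only",
}
mapped = apply_preset(row, DRUGBANK_OPEN)
print(mapped)
```

Overriding a single entry of the dictionary corresponds to passing one column name change argument together with the preset.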
Additional examples
molstruct -f jsonldhtml data.csv
Returns simple HTML with added JSON-LD. Assumes that the column names in the CSV file are the default ones.
molstruct -f microdata -mf "formula" data.csv
Returns simple HTML with added Microdata. Assumes that the column names in the CSV file are the default ones, except that the default molecularFormula column name is replaced with formula.
molstruct -f microdata --columns --id "CAS" --name "Common name" --inChIKey "Standard InChI Key" --limit 50 "drugbank vocabulary.csv"
Returns simple HTML with added Microdata. When generating a file, only selected columns will be taken into account. A limit of 50 molecules has been specified.
molstruct -f microdata --columns --id "CAS" --name "Common name" --inChIKey "Standard InChI Key" --limit 50 "drugbank vocabulary.csv" > output.html
Does the same as the example above but saves results to output.html.
docker run -it --rm --name molstruct-app --mount type=bind,source=/home/user/input,target=/app/input,readonly lszeremeta/molstruct:latest -f microdata --columns --id "CAS" --name "Common name" --inChIKey "Standard InChI Key" --limit 50 "input/drugbank vocabulary.csv" > output.html
Does the same as the example above (run from pre-built Docker image).
Returns simple HTML with added Microdata and redirects output to the molecules.html file. Run from the pre-built Docker image.
Contribution
Would you like to improve this project? Great! We are waiting for your help and suggestions. If you are new to open source contributions, read How to Contribute to Open Source.
License
Distributed under the MIT License.