Convert chemical molecule data CSV files to structured data formats

Molstruct converts CSV (comma-separated values) files of chemical molecule data to structured data formats: JSON-LD, RDFa and Microdata. Supported CSV columns: identifier, name, inChIKey, inChI, smiles, url, iupacName, molecularFormula, molecularWeight, monoisotopicMolecularWeight, description, disambiguatingDescription, image, additionalType, alternateName and sameAs. It runs from the command line on Python 3.2 and above. Molstruct is lightweight and requires no additional dependencies.

What are structured data

Structured data are additional data placed on websites. They are not visible to ordinary internet users but can be easily processed by machines. There are three formats for saving structured data: JSON-LD, RDFa and Microdata. Molstruct supports all three and uses the MolecularEntity type.
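To make the idea concrete, here is a minimal sketch of what a JSON-LD description of one molecule could look like. It is not Molstruct's own code or its exact output; the helper name molecule_to_jsonld, the https://schema.org context and the sample row are assumptions chosen for illustration.

```python
import json


def molecule_to_jsonld(row):
    """Build a JSON-LD dict for one molecule (illustrative helper, not Molstruct's API)."""
    # Assumption: the MolecularEntity type is resolved against the schema.org context.
    doc = {"@context": "https://schema.org", "@type": "MolecularEntity"}
    # Copy only non-empty values so the output carries no blank properties.
    for key, value in row.items():
        if value:
            doc[key] = value
    return doc


# Hypothetical sample row describing water.
row = {
    "name": "Water",
    "molecularFormula": "H2O",
    "inChIKey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "description": "",
}

jsonld = molecule_to_jsonld(row)
# JSON-LD is typically embedded in a page inside a script element of this type.
html_snippet = '<script type="application/ld+json">\n{}\n</script>'.format(
    json.dumps(jsonld, indent=2)
)
print(html_snippet)
```

The script element is ignored by browsers when rendering the page, which is why this kind of data stays invisible to ordinary visitors while remaining easy for machines to parse.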

Where to find a CSV file with molecule data

There are many possibilities. The easiest way is to download a CSV file from one of the chemical databases, e.g. DrugBank. You can also create the CSV file yourself.
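If you create the CSV file yourself, something like the following sketch may help. It writes a header with the default column names listed in the description and fills in one hypothetical row. Whether Molstruct needs every column to be present is not stated here, so treat the full header as an assumption; write_molecules is an illustrative helper, not part of Molstruct.

```python
import csv
import io

# Default column names as listed in the project description.
DEFAULT_COLUMNS = [
    "identifier", "name", "inChIKey", "inChI", "smiles", "url", "iupacName",
    "molecularFormula", "molecularWeight", "monoisotopicMolecularWeight",
    "description", "disambiguatingDescription", "image", "additionalType",
    "alternateName", "sameAs",
]


def write_molecules(rows, handle):
    """Write molecule dicts as CSV; columns missing from a row are left empty."""
    writer = csv.DictWriter(handle, fieldnames=DEFAULT_COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample data: one row describing water.
rows = [
    {"identifier": "7732-18-5", "name": "Water", "smiles": "O", "molecularFormula": "H2O"},
]

# An in-memory buffer keeps the sketch self-contained.
buffer = io.StringIO()
write_molecules(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To produce a real file, replace the buffer with open("data.csv", "w", newline="") and pass that handle to write_molecules.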

Installation

You can install Molstruct from PyPI:

pip install molstruct

Python 3.2 and above are supported. No additional dependencies are required.

Usage

usage: molstruct [-h] (-jh | -j | -r | -m) [-i IDENTIFIER] [-n NAME] [-ink INCHIKEY]
                 [-in INCHI] [-s SMILES] [-u URL] [-iu IUPACNAME]
                 [-f MOLECULARFORMULA] [-w MOLECULARWEIGHT]
                 [-mw MONOISOTOPICMOLECULARWEIGHT] [-d DESCRIPTION]
                 [-dd DISAMBIGUATINGDESCRIPTION] [-img IMAGE] [-at ADDITIONALTYPE]
                 [-an ALTERNATENAME] [-sa SAMEAS] [-c] [-l LIMIT]
                 file

Positional arguments

file                  CSV file with molecule data to convert

Optional arguments

  -h, --help            show this help message and exit
  -jh, --jsonldhtml     JSON-LD with HTML output
  -j, --jsonld          JSON-LD output
  -r, --rdfa            RDFa output
  -m, --microdata       Microdata output
  -i IDENTIFIER, --identifier IDENTIFIER
                        identifier column name (identifier by default), Text
  -n NAME, --name NAME  name column name (name by default), Text
  -ink INCHIKEY, --inChIKey INCHIKEY
                        inChIKey column name (inChIKey by default), Text
  -in INCHI, --inChI INCHI
                        inChI column name (inChI by default), Text
  -s SMILES, --smiles SMILES
                        smiles column name (smiles by default), Text
  -u URL, --url URL     url column name (url by default), URL type
  -iu IUPACNAME, --iupacName IUPACNAME
                        iupacName column name (iupacName by default), Text
  -f MOLECULARFORMULA, --molecularFormula MOLECULARFORMULA
                        molecularFormula column name (molecularFormula by
                        default), Text
  -w MOLECULARWEIGHT, --molecularWeight MOLECULARWEIGHT
                        molecularWeight column name (molecularWeight by
                        default), Mass e.g. 0.01 mg
  -mw MONOISOTOPICMOLECULARWEIGHT, --monoisotopicMolecularWeight MONOISOTOPICMOLECULARWEIGHT
                        monoisotopicMolecularWeight column name
                        (monoisotopicMolecularWeight by default), Mass e.g.
                        0.01 mg
  -d DESCRIPTION, --description DESCRIPTION
                        description column name (description by default), Text
  -dd DISAMBIGUATINGDESCRIPTION, --disambiguatingDescription DISAMBIGUATINGDESCRIPTION
                        disambiguatingDescription column name
                        (disambiguatingDescription by default), Text
  -img IMAGE, --image IMAGE
                        image column name (image by default), URL
  -at ADDITIONALTYPE, --additionalType ADDITIONALTYPE
                        additionalType column name (additionalType by
                        default), URL
  -an ALTERNATENAME, --alternateName ALTERNATENAME
                        alternateName column name (alternateName by default),
                        Text
  -sa SAMEAS, --sameAs SAMEAS
                        sameAs column name (sameAs by default), URL
  -c, --columns         Use only columns with renamed names
  -l LIMIT, --limit LIMIT
                        Maximum number of results

Available options may vary depending on the version. To display all available options with their descriptions, use molstruct -h.

Examples

molstruct --rdfa data.csv

Returns simple HTML with added RDFa. Assumes that the column names in the CSV file are the default ones.

molstruct --microdata -f "formula" data.csv

Returns simple HTML with added Microdata. Assumes that the column names in the CSV file are the default ones, except that the molecularFormula column is named formula.
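For readers unfamiliar with Microdata, the sketch below shows roughly how one molecule might be marked up. The exact HTML that Molstruct emits is not documented here, so the div/span layout, the https://schema.org/MolecularEntity type URL and the helper name molecule_to_microdata are all assumptions.

```python
from html import escape


def molecule_to_microdata(row):
    """Render one molecule as Microdata markup (illustrative, not Molstruct's output)."""
    # itemscope opens a new item; itemtype names its type (assumed URL).
    lines = ['<div itemscope itemtype="https://schema.org/MolecularEntity">']
    for prop, value in row.items():
        if value:
            # Escape both the property name and the value before placing them in HTML.
            lines.append(
                '  <span itemprop="{}">{}</span>'.format(escape(prop), escape(value))
            )
    lines.append("</div>")
    return "\n".join(lines)


# Hypothetical sample row.
markup = molecule_to_microdata({"name": "Water", "molecularFormula": "H2O"})
print(markup)
```

RDFa follows the same pattern with different attribute names (typeof and property instead of itemtype and itemprop).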

molstruct --microdata --columns --identifier "CAS" --name "Common name" --inChIKey "Standard InChI Key" --limit 50 "drugbank vocabulary.csv"

Returns simple HTML with added Microdata. Only the columns that were explicitly renamed on the command line are used when generating the output, and the output is limited to 50 molecules.

molstruct --microdata --columns --identifier "CAS" --name "Common name" --inChIKey "Standard InChI Key" --limit 50 "drugbank vocabulary.csv" > output.html

Does the same as the example above but saves the results to output.html.
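The column renaming, --columns and --limit behaviour described in these examples can be pictured with the following sketch. It is one interpretation of the documented options, not Molstruct's actual implementation; map_rows, the shortened PROPERTIES tuple and the sample CSV text are hypothetical.

```python
import csv
import io

# A subset of the supported properties, shortened for brevity.
PROPERTIES = ("identifier", "name", "inChIKey", "molecularFormula")


def map_rows(csv_text, renamed, only_renamed=False, limit=None):
    """Map CSV columns to property names.

    renamed maps a property name to the CSV header that holds it.
    only_renamed mimics --columns: ignore properties that were not renamed.
    limit mimics --limit: stop after that many molecules.
    """
    mapping = {} if only_renamed else {prop: prop for prop in PROPERTIES}
    mapping.update(renamed)
    molecules = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if limit is not None and len(molecules) >= limit:
            break
        # Keep only properties whose column exists and is non-empty.
        molecules.append(
            {prop: row[col] for prop, col in mapping.items() if row.get(col)}
        )
    return molecules


# Hypothetical sample data using non-default headers.
sample = (
    "CAS,Common name,Standard InChI Key,molecularFormula\n"
    "7732-18-5,Water,XLYOFNOQVPJJNP-UHFFFAOYSA-N,H2O\n"
    "64-17-5,Ethanol,LFQSCWFLJHTTHZ-UHFFFAOYSA-N,C2H6O\n"
)
renamed = {"identifier": "CAS", "name": "Common name", "inChIKey": "Standard InChI Key"}

molecules = map_rows(sample, renamed, only_renamed=True, limit=1)
print(molecules)
```

With only_renamed=True the molecularFormula column is skipped even though it uses the default header, which matches the description of --columns as using only the renamed columns.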

Contribution

Would you like to improve this project? Great! We are waiting for your help and suggestions. If you are new to open source contributions, read How to Contribute to Open Source.

License

Distributed under the MIT license.

See also

These projects can also be useful:

  • SDFEater - Always hungry SDF chemical file format parser with many output formats
  • MEgen - Convenient online form to generate structured data about molecules
