A Python package for chemical identifier resolution and experimental property extraction

PROVESID

Documentation Status | Tests | Python 3.8+ | License: MIT

PROVESID is a member of PROVES, a family of tools for the prePROcessing and VErification of Substance data. PROVESID provides Pythonic access to online services for chemical identifiers and data. The goal is a clean, simple, intuitive, documented, up-to-date, and extensible interface to the most important online databases. We offer interfaces to PubChem, the NCI Chemical Identifier Resolver, CAS Common Chemistry, IUPAC OPSIN, ChEBI, and ClassyFire. We highly recommend that new users jump head-first into the examples folder and get started by playing with the code. We also keep documenting old and new functionality here. The package also aims to provide an offline platform when data files are available from the online tools mentioned above.

Installation

The package can be installed from PyPI by running

pip install provesid

To install the latest development version (for developers and enthusiasts, and for the latest features), clone or download this repository, go to the root folder, and install it with

pip install -e .

We very strongly recommend using uv. PROVESID has a small Python codebase, but its data files, when fully downloaded at the user's request, can occupy more than 30 GB of disk space! uv makes sure that the package is installed only once and linked into other virtual environments. It barely changes your pip workflow, and is much faster (and more pleasant) to use. After installing uv, simply type:

uv pip install provesid

or for the development version (recommended for now):

uv pip install git+https://github.com/USEtox/PROVESID

Examples

PubChem

from provesid.pubchem import PubChemAPI
pc = PubChemAPI()  # Now with unlimited caching!
cids_aspirin = pc.get_cids_by_name('aspirin')
res_basic = pc.get_basic_compound_info(cids_aspirin[0])

which returns

{
  "CID": 2244,
  "MolecularFormula": "C9H8O4",
  "MolecularWeight": "180.16",
  "SMILES": "CC(=O)OC1=CC=CC=C1C(=O)O",
  "InChI": "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
  "InChIKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
  "IUPACName": "2-acetyloxybenzoic acid",
  "success": true,
  "cid": 2244,
  "error": null
}
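The returned dictionary carries its own status fields, so a caller can branch on `success` before reading any identifiers. The following is a minimal sketch of that pattern. It uses the dictionary shown above as a literal so that it runs without network access, and `describe_compound` is a hypothetical helper for illustration, not part of PROVESID.

```python
# Result copied from the example output above (JSON true/null become True/None).
res_basic = {
    "CID": 2244,
    "MolecularFormula": "C9H8O4",
    "MolecularWeight": "180.16",
    "SMILES": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "InChI": "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
    "InChIKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "IUPACName": "2-acetyloxybenzoic acid",
    "success": True,
    "cid": 2244,
    "error": None,
}


def describe_compound(info: dict) -> str:
    """Hypothetical helper: one-line summary, or the error if the lookup failed."""
    if not info.get("success"):
        return f"lookup failed: {info.get('error')}"
    # MolecularWeight is a string in the output above, so convert before formatting.
    weight = float(info["MolecularWeight"])
    return (
        f"CID {info['CID']}: {info['IUPACName']} "
        f"({info['MolecularFormula']}, {weight:.2f} g/mol)"
    )


print(describe_compound(res_basic))
```

Checking `success` first avoids a `KeyError` on results that contain only the error fields.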

PubChem View for data

from provesid import PubChemView, get_property_table
# cids_aspirin comes from the PubChem example above
logp_table = get_property_table(cids_aspirin[0], "LogP")
logp_table

which returns a table with the reported values of logP for aspirin (including the references for each data point).

Chemical Identifier Resolver

from provesid import NCIChemicalIdentifierResolver
resolver = NCIChemicalIdentifierResolver()
# SMILES for formaldehyde
smiles = resolver.resolve("50-00-0", "smiles")
print(f"SMILES for CASRN 50-00-0 is {smiles}")  # SMILES for CASRN 50-00-0 is C=O
# InChI for aspirin
inchi = resolver.resolve("50-78-2", "stdinchi")
print(f"InChI for 50-78-2 is {inchi}")  # InChI for 50-78-2 is InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
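Because the resolver is a remote service, it can be worth rejecting malformed CAS Registry Numbers locally before sending a request. The sketch below applies the standard CASRN check-digit rule: the digits before the check digit are weighted 1, 2, 3, ... from right to left, and the sum modulo 10 must equal the check digit. `is_valid_casrn` is a hypothetical helper written for this illustration and is not claimed to be part of PROVESID.

```python
import re

# Two to seven digits, two digits, one check digit, separated by hyphens.
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_casrn(casrn: str) -> bool:
    """Return True if the string has CASRN format and a correct check digit."""
    match = _CAS_PATTERN.match(casrn.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the one just left of the check digit.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


for candidate in ("50-00-0", "50-78-2", "50-78-3"):
    print(candidate, is_valid_casrn(candidate))
```

The first two values are the formaldehyde and aspirin numbers used above and pass the check. The third has an altered check digit and fails.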

OPSIN

This is the online OPSIN interface. A local interface also exists that uses the py2opsin Python package and the Java executables of the OPSIN library. You can use the local version (recommended) by loading the PYOPSIN class instead of OPSIN.

from provesid import OPSIN
opsin = OPSIN()
methane_result = opsin.get_id("methane")

which returns:

{'status': 'SUCCESS',
 'message': '',
 'inchi': 'InChI=1/CH4/h1H4',
 'stdinchi': 'InChI=1S/CH4/h1H4',
 'stdinchikey': 'VNWKTOKETHGBQD-UHFFFAOYSA-N',
 'smiles': 'C'}

CAS Common Chemistry

# One-time API key setup
from provesid import set_cas_api_key
set_cas_api_key("your-cas-api-key")  # Configure once

# Then use anywhere without specifying API key
from provesid import CASCommonChem
ccc = CASCommonChem()  # Automatically uses stored API key
water_info = ccc.cas_to_detail("7732-18-5")
print("Water (7732-18-5):")
print(f"  Name: {water_info.get('name')}")
print(f"  Molecular Formula: {water_info.get('molecularFormula')}")
print(f"  Molecular Mass: {water_info.get('molecularMass')}")
print(f"  SMILES: {water_info.get('smile')}")
print(f"  InChI: {water_info.get('inchi')}")
print(f"  Status: {water_info.get('status')}")

which returns

Water (7732-18-5):
  Name: Water
  Molecular Formula: H<sub>2</sub>O
  Molecular Mass: 18.02
  SMILES: O
  InChI: InChI=1S/H2O/h1H2
  Status: Success
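Note that the Molecular Formula line in the output above contains HTML subscript tags. If a plain-text formula is needed, for example to compare against a PubChem `MolecularFormula`, the tags can be stripped with a regular expression. This is a minimal sketch. `plain_formula` is a hypothetical helper, and the assumption that only `<sub>` and `<sup>` tags occur in this field is ours, not something the source states.

```python
import re


def plain_formula(formula: str) -> str:
    """Hypothetical helper: drop <sub>/<sup> tags from a formula string."""
    return re.sub(r"</?su[bp]>", "", formula)


print(plain_formula("H<sub>2</sub>O"))
```

Strings without markup pass through unchanged, so the helper is safe to apply to every formula.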

ChEBI

Access to the European Bioinformatics Institute ChEBI (Chemical Entities of Biological Interest) database. See the tutorial notebook.

ZeroPM Global Chemical Inventory

PROVESID now includes access to the ZeroPM global chemical inventory database, which provides information about chemicals listed in regulatory inventories worldwide. The database is automatically downloaded on first use:

from provesid.zeropm import ZeroPM

# Initialize - database downloads automatically if not present
zpm = ZeroPM()

# Query by CAS number
query_id = zpm.query_cas("50-00-0")  # Formaldehyde

# Get SMILES from CAS
smiles = zpm.get_smiles_from_cas("50-00-0")

# Search by chemical name
results = zpm.query_similar_name("formaldehyde", threshold=80)

# Query by regulatory inventory
eu_chemicals = zpm.query_by_inventory(inventory_name="REACH")

# Query by country
us_chemicals = zpm.query_by_country(country_name="United States")

# Get all available inventories
inventories = zpm.get_all_inventories()

# Get database statistics
stats = zpm.get_database_stats()
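The `threshold=80` argument in the name search above suggests a similarity score on a 0 to 100 scale. How ZeroPM computes that score is not stated here, so the following is only an illustration of the idea using `difflib` from the standard library. The metric, the `similar_names` helper, and the candidate list are all assumptions and may differ from what PROVESID does internally.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similar_names(query: str, names: List[str], threshold: float = 80) -> List[Tuple[str, float]]:
    """Return (name, score) pairs whose 0-100 similarity to query meets the threshold."""
    scored = []
    for name in names:
        # ratio() is in [0, 1]; scale to 0-100 and compare case-insensitively.
        score = 100 * SequenceMatcher(None, query.lower(), name.lower()).ratio()
        if score >= threshold:
            scored.append((name, round(score, 1)))
    # Best matches first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


names = ["Formaldehyde", "Paraformaldehyde", "Water"]
print(similar_names("formaldehyde", names, threshold=80))
```

With this metric, a higher threshold returns fewer but closer matches. A lower threshold tolerates prefixes, suffixes, and spelling variants.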

The database file (~400 MB) is downloaded automatically from GitHub on first use and cached locally. You can also download it manually:

# Manual download (only needed if auto-download fails)
zpm = ZeroPM(auto_download=False)  # Skip auto-download
zpm.download_database()  # Manually trigger download

See the ZeroPM tutorial notebook for more examples.

ClassyFire

See the tutorial notebook.

Other tools

Several other packages and code samples, in Python and other languages, are available. We were inspired by them and have tried to improve upon them based on our own experience working with chemical identifiers and data.

TODO list

We will provide Python interfaces to more online services. Please open an issue and let us know what else you would like to have included.

Add data and tools for the ChEBI ontology using pronto
Add an interface to the ChEMBL standardization pipeline using its Python package; this feature may be added to IMPROVES.
Add the UniChem API
