Skip to main content

Data Integration and Processing for Computational Toxicology

Project description

DiPTox - Data Integration and Processing for Computational Toxicology

DiPTox is a Python toolkit designed for the robust preprocessing, standardization, and multi-source data integration of molecular datasets, with a focus on computational toxicology workflows.

Official Release v1.0

We are excited to announce the first official stable release of DiPTox on PyPI! This milestone brings production-ready stability and performance enhancements:

  • Multi-Process Acceleration:
    • Accelerate chemical preprocessing tasks by 10x or more using the n_jobs parameter.
    • Intelligent task distribution across CPU cores for heavy datasets.
  • Cross-Platform Robustness:
    • Implemented a specialized "Guard Mechanism" for Windows multiprocessing to prevent memory explosion and recursive process spawning issues.
    • Verified stability across Windows, Linux, and macOS environments.
  • Enhanced Data Loading:
    • Switched to binary stream parsing for .sdf and .mol files to resolve encoding crashes (e.g., utf-8 vs latin-1).
    • Auto-parsing of molecular structures to generate SMILES even when properties are missing.

DiPTox Community Check-in (Optional)

To help us understand our user base and improve the software, DiPTox includes a one-time, optional survey on first use.

  • Completely Optional: You can skip it with a single click.
  • Privacy-Focused: The information helps us with academic impact assessment. It will not be shared.

Core Features

1. Graphical User Interface (GUI)

Powered by Streamlit, the GUI allows users to perform all workflows visually without writing code.

  • Visual Operation: Complete workflow control via a web browser.
  • Real-time Preview: Instantly view data changes after applying rules.
  • Rule Management: Add/Remove valid atoms, salts, solvents, and unit conversion formulas interactively.
  • Smart Column Mapping: Intelligent detection of headers and binary file structures.

2. Chemical Preprocessing & Standardization

A configurable pipeline to clean and normalize chemical structures.

  • Strict Inorganic Filtering: Updated SMARTS matching to accurately identify complex inorganic species (e.g., ionic cyanides) without misclassifying organic nitriles.
  • Pipeline Steps:
    • Remove salts & solvents
    • Handle mixtures (keep largest fragment)
    • Remove inorganic molecules
    • Neutralize charges & Validate atomic composition
    • Remove explicit hydrogens, stereochemistry, and isotopes
    • Reject Radical Species: Automatically discard molecules containing free radical atoms.
    • Standardize to canonical SMILES
    • Filter by atom count

3. Unit Standardization

Normalize heterogeneous target data into a single unit effortlessly.

  • Automatic Conversion: Built-in rules for Concentration, Time, Pressure, and Temperature.
  • Custom Formulas: Define mathematical rules (e.g., x * 1000 or 10**(-x)) interactively via GUI or script.
  • Unified Output: Standardize diverse units (e.g., ug/mL, g/L, M) to a single target (e.g., mg/L).

4. Data Deduplication

Flexible strategies for handling duplicate entries with advanced controls.

  • Data Types: Supports continuous (e.g., IC50) and discrete (e.g., Active/Inactive) targets.
  • Methods: auto, IQR, 3sigma, vote, or custom priority rules.
  • Log Transformation: Optional -log10 transformation (e.g., IC50 $\to$ pIC50) applied before deduplication logic to handle bioactivity data correctly.
  • Flexible NaN Handling: Option to retain rows with missing conditions (treating NaN as a valid group) instead of dropping them.

5. Comprehensive History Tracking (Audit Log)

  • Records every operation (Loading, Preprocessing, Filtering, etc.) in an Audit Log.
  • Tracks timestamps, operation details, and row count changes (Delta) step-by-step.
  • Available via API (get_history()) and visualized in the GUI.

6. Identifier & Property Integration

  • Fetch and interconvert identifiers (CAS, SMILES, IUPAC, MW) from multiple sources (PubChem, ChemSpider, CompTox, Cactus, CAS Common Chemistry, ChEMBL).
  • High-performance concurrent requests with automatic rate limiting and retries.

7. Utility Tools

  • Perform substructure searches using SMILES or SMARTS patterns.
  • Customize chemical processing rules for neutralization reactions, salt/solvent lists, and valid atoms.
  • Display a summary of all currently active processing rules.

Installation

Install the official stable version from PyPI:

pip install diptox

GUI

After installation, you can launch the graphical interface directly from your terminal:

diptox-gui

This command will automatically open the DiPTox interface in your default web browser.

Quick Start

from diptox import DiptoxPipeline

def main():
    # Initialize processor
    DP = DiptoxPipeline()

    # Load data
    DP.load_data(input_data='file_path/list/dataframe', smiles_col, target_col, cas_col, unit_col)

    # Customize Processing Rules (Optional)
    print("--- Default Rules ---")
    DP.display_processing_rules()

    DP.manage_atom_rules(atoms=['Si'], add=True)         # Add 'Si' to the list of valid atoms
    DP.manage_default_salt(salts=['[Na+]'], add=False)   # Example: remove sodium from the salt list
    DP.manage_default_solvent(solvents='Cl', add=False)  # Example: remove chlorine from the solvent list
    DP.add_neutralization_rule('[$([N-]C=O)]', 'N')      # Add a custom neutralization rule

    print("\n--- Customized Rules ---")
    DP.display_processing_rules()

    # Configure preprocessing
    DP.preprocess(
        remove_salts=True,              # Remove salt fragments. Default: True.
        remove_solvents=True,           # Remove solvent fragments. Default: True.
        remove_mixtures=True,           # Handle mixtures based on fragment size. Default: False.
        hac_threshold=3,                # Heavy atom count threshold for fragment removal. Default: 3.
        keep_largest_fragment=True,     # Keep the largest fragment in a mixture. Default: True.
        remove_inorganic=False,         # Remove common inorganic molecules. Default: True.
        neutralize=True,                # Neutralize charges on the molecule. Default: True.
        reject_non_neutral=False,       # Only retain the molecules whose formal charge is zero. Default: False.
        check_valid_atoms=True,         # Check if all atoms are in the valid list. Default: False.
        strict_atom_check=False,        # If True, discard molecules with invalid atoms. If False, try to remove them. Default: False.
        remove_stereo=False,            # Remove stereochemistry information. Default: False.
        remove_isotopes=True,           # Remove isotopic information. Default: True.
        remove_hs=True,                 # Remove explicit hydrogen atoms. Default: True.
        reject_radical_species=True,    # Molecules containing free radical atoms are directly rejected. Default: True.
        n_jobs=4                        # Accelerate using 4 CPU cores. Default: 1 
    )

    # Configure deduplication and unit standardization
    conversion_rules = {('g/L', 'mg/L'): 'x * 1000', 
                        ('ug/L', 'mg/L'): 'x / 1000',}
    DP.config_deduplicator(condition_cols, data_type, method, custom_method, priority, standard_unit, conversion_rules, log_transform, dropna_conditions)
    DP.dataset_deduplicate()

    # Configure web queries
    DP.config_web_request(sources=['pubchem/chemspider/comptox/cactus/cas'], max_workers, ...)
    DP.web_request(send='cas', request=['smiles', 'iupac'])

    # Substructure search
    DP.substructure_search(query_pattern, is_smarts=True)

    # Save results
    DP.save_results(output_path='file_path')

    # View Processing History (Audit Log)
    print(DP.get_history())
    # Output Example:
    #               Step Timestamp  Rows Before  Rows After   Delta                               Details
    # 0     Data Loading  10:00:01            0        1000   +1000                   Source: dataset.csv
    # 1    Preprocessing  10:00:05         1000         950     -50  Valid: 950, Invalid: 50. Order: ...
    # 2    Deduplication  10:00:08          950         800    -150       Method: auto (Log10 Transformed)

# CRITICAL: This protection block is REQUIRED for Windows multiprocessing!
# It prevents infinite recursive loops and memory explosion when n_jobs > 1.
if __name__ == '__main__':
    main()

Advanced Configuration

Web Service Integration

DiPTox supports the following chemical databases:

Note: ChemSpider, CompTox and CAS require API keys. Provide them during configuration:

DP.config_web_request(
    sources=['chemspider/comptox/CAS'],
    chemspider_api_key='your_personal_key',
    comptox_api_key='your_personal_key',
    cas_api_key='your_personal_key'
)

Requirements

  • Python>=3.8
  • Core Dependencies:
    • requests
    • rdkit>=2023.3
    • tqdm
    • openpyxl
    • scipy
    • streamlit>=1.0.0 (Required for GUI)
  • Optional Dependencies (install as needed, if not installed, then send the request using requests.):
    • pubchempy>=1.0.5: For PubChem integration
    • chemspipy>=2.0.0: For ChemSpider (requires API key)
    • ctx-python>=0.0.1a10: For CompTox Dashboard (requires API key)

License

Apache License 2.0 - See LICENSE for details

Support

Report issues on GitHub Issues

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

diptox-1.0.2.tar.gz (68.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

diptox-1.0.2-py3-none-any.whl (68.4 kB view details)

Uploaded Python 3

File details

Details for the file diptox-1.0.2.tar.gz.

File metadata

  • Download URL: diptox-1.0.2.tar.gz
  • Upload date:
  • Size: 68.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.15

File hashes

Hashes for diptox-1.0.2.tar.gz
Algorithm Hash digest
SHA256 ec79a13aa7550ecaa29407141209e8d07b06c8985a2b157e52333da9c76d6def
MD5 13f9d3a84c39ef8f0567005ec7b29efd
BLAKE2b-256 2e8444bee9184384b16b969434e07034b2a4c024cae302740f8867b71ef7380c

See more details on using hashes here.

File details

Details for the file diptox-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: diptox-1.0.2-py3-none-any.whl
  • Upload date:
  • Size: 68.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.15

File hashes

Hashes for diptox-1.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 4eb9635fd06ff584327f667d945ae06d877d152aee5960ed5b1c5608450b1e94
MD5 fd4f32884a6e47d59bc468d09b052be3
BLAKE2b-256 28e9fb0f93145b6231cee7896d5e6665a54c83ea70b3803150ae2e5c7d8eb606

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page