Skip to main content

Data Integration and Processing for Computational Toxicology

Project description

DiPTox - Data Integration and Processing for Computational Toxicology

DiPTox is a Python toolkit designed for the robust preprocessing, standardization, and multi-source data integration of molecular datasets, with a focus on computational toxicology workflows.

Official Release v1.0

We are excited to announce the first official stable release of DiPTox on PyPI! This milestone brings production-ready stability and performance enhancements:

  • Multi-Process Acceleration:
    • Accelerate chemical preprocessing tasks by 10x or more using the n_jobs parameter.
    • Intelligent task distribution across CPU cores for heavy datasets.
  • Cross-Platform Robustness:
    • Implemented a specialized "Guard Mechanism" for Windows multiprocessing to prevent memory explosion and recursive process spawning issues.
    • Verified stability across Windows, Linux, and macOS environments.
  • Enhanced Data Loading:
    • Switched to binary stream parsing for .sdf and .mol files to resolve encoding crashes (e.g., utf-8 vs latin-1).
    • Auto-parsing of molecular structures to generate SMILES even when properties are missing.

Version Update Log (1.0.5)

  • Enhanced Unit Standardization: Added support for the standard math operator ^ (power) by automatically mapping it to **, and fixed a logic error that caused single-unit datasets to be skipped even when a different target unit was specified.
  • Deduplication Logic Upgrades: Introduced a log10 transformation mode alongside the existing -log10 option, enabling support for both toxicity data (pIC50) and physicochemical properties like water solubility (logS) or partition coefficients.
  • Robustness & Error Handling: Implemented strict numerical validation using errors='coerce' in standardization and deduplication modules to automatically filter out invalid strings (e.g., "N/A", ">100") with clear warning feedback in the GUI.
  • Critical State Management Fix: Resolved an issue where load_data failed to reset the _preprocess_key flag, ensuring that automatic column mapping logic for Web Requests (like auto-detecting smiles_from_web) functions correctly after a new dataset is loaded.

Version Update Log (1.0.4)

  • GUI State Management Fix: Resolved a StreamlitAPIException on the Export page that occurred when using the "Undo Last Step" feature. Implemented proper on_click callbacks to safely mutate the session_state (specifically for export_selected_cols) before the UI re-renders, ensuring a crash-free and seamless undo experience.
  • Refined Preprocessing Rules: Adjusted and optimized several default charge neutralization rules.

Version Update Log (1.0.3)

  • Enhanced Unit Standardization: Custom conversion formulas now fully support molecular weight (mw). You can seamlessly convert between molarity and mass concentrations (e.g., using formulas like x * mw * 1000).
  • GUI Interface Optimization: The Streamlit graphical interface has been beautifully redesigned for a more professional, clean, and logical scientific layout. We've reduced visual clutter, grouped configuration panels intuitively, and improved component alignment.
  • Comprehensive Audit Log (History): The processing history has been heavily upgraded. It now records granular parameters for every operation—including exactly which preprocessing rules were triggered, active deduplication conditions, web query statuses, and substructure search match counts.

DiPTox Community Check-in (Optional)

To help us understand our user base and improve the software, DiPTox includes a one-time, optional survey on first use.

  • Completely Optional: You can skip it with a single click.
  • Privacy-Focused: The information helps us with academic impact assessment. It will not be shared.

Core Features

1. Graphical User Interface (GUI)

Powered by Streamlit, the GUI allows users to perform all workflows visually without writing code.

  • Visual Operation: Complete workflow control via a web browser.
  • Real-time Preview: Instantly view data changes after applying rules.
  • Rule Management: Add/Remove valid atoms, salts, solvents, and unit conversion formulas interactively.
  • Smart Column Mapping: Intelligent detection of headers and binary file structures.

2. Chemical Preprocessing & Standardization

A configurable pipeline to clean and normalize chemical structures.

  • Strict Inorganic Filtering: Updated SMARTS matching to accurately identify complex inorganic species (e.g., ionic cyanides) without misclassifying organic nitriles.
  • Pipeline Steps:
    • Remove salts & solvents
    • Handle mixtures (keep largest fragment)
    • Remove inorganic molecules
    • Neutralize charges & Validate atomic composition
    • Remove explicit hydrogens, stereochemistry, and isotopes
    • Reject Radical Species: Automatically discard molecules containing free radical atoms.
    • Standardize to canonical SMILES
    • Filter by atom count

3. Unit Standardization

Normalize heterogeneous target data into a single unit effortlessly.

  • Automatic Conversion: Built-in rules for Concentration, Time, Pressure, and Temperature.
  • Custom Formulas: Define mathematical rules (e.g., x * 1000 or 10**(-x)) interactively via GUI or script.
  • Unified Output: Standardize diverse units (e.g., ug/mL, g/L, M) to a single target (e.g., mg/L).

4. Data Deduplication

Flexible strategies for handling duplicate entries with advanced controls.

  • Data Types: Supports continuous (e.g., IC50) and discrete (e.g., Active/Inactive) targets.
  • Methods: auto, IQR, 3sigma, vote, or custom priority rules.
  • Log Transformation: Optional -log10 transformation (e.g., IC50 $\to$ pIC50) applied before deduplication logic to handle bioactivity data correctly.
  • Flexible NaN Handling: Option to retain rows with missing conditions (treating NaN as a valid group) instead of dropping them.

5. Comprehensive History Tracking (Audit Log)

  • Records every operation (Loading, Preprocessing, Filtering, etc.) in an Audit Log.
  • Tracks timestamps, operation details, and row count changes (Delta) step-by-step.
  • Available via API (get_history()) and visualized in the GUI.

6. Identifier & Property Integration

  • Fetch and interconvert identifiers (CAS, SMILES, IUPAC, MW) from multiple sources (PubChem, ChemSpider, CompTox, Cactus, CAS Common Chemistry, ChEMBL).
  • High-performance concurrent requests with automatic rate limiting and retries.

7. Utility Tools

  • Perform substructure searches using SMILES or SMARTS patterns.
  • Customize chemical processing rules for neutralization reactions, salt/solvent lists, and valid atoms.
  • Display a summary of all currently active processing rules.

Installation

Install the official stable version from PyPI:

pip install diptox

GUI

After installation, you can launch the graphical interface directly from your terminal:

diptox-gui

This command will automatically open the DiPTox interface in your default web browser.

Quick Start

from diptox import DiptoxPipeline

def main():
    # Initialize processor
    DP = DiptoxPipeline()

    # Load data
    DP.load_data(input_data='file_path/list/dataframe', smiles_col, target_col, cas_col, unit_col)

    # Customize Processing Rules (Optional)
    print("--- Default Rules ---")
    DP.display_processing_rules()

    DP.manage_atom_rules(atoms=['Si'], add=True)         # Add 'Si' to the list of valid atoms
    DP.manage_default_salt(salts=['[Na+]'], add=False)   # Example: remove sodium from the salt list
    DP.manage_default_solvent(solvents='Cl', add=False)  # Example: remove chlorine from the solvent list
    DP.add_neutralization_rule('[$([N-]C=O)]', 'N')      # Add a custom neutralization rule

    print("\n--- Customized Rules ---")
    DP.display_processing_rules()

    # Configure preprocessing
    DP.preprocess(
        remove_salts=True,              # Remove salt fragments. Default: True.
        remove_solvents=True,           # Remove solvent fragments. Default: True.
        remove_mixtures=True,           # Handle mixtures based on fragment size. Default: False.
        hac_threshold=3,                # Heavy atom count threshold for fragment removal. Default: 3.
        keep_largest_fragment=True,     # Keep the largest fragment in a mixture. Default: True.
        remove_inorganic=False,         # Remove common inorganic molecules. Default: True.
        neutralize=True,                # Neutralize charges on the molecule. Default: True.
        reject_non_neutral=False,       # Only retain the molecules whose formal charge is zero. Default: False.
        check_valid_atoms=True,         # Check if all atoms are in the valid list. Default: False.
        strict_atom_check=False,        # If True, discard molecules with invalid atoms. If False, try to remove them. Default: False.
        remove_stereo=False,            # Remove stereochemistry information. Default: False.
        remove_isotopes=True,           # Remove isotopic information. Default: True.
        remove_hs=True,                 # Remove explicit hydrogen atoms. Default: True.
        reject_radical_species=True,    # Molecules containing free radical atoms are directly rejected. Default: True.
        n_jobs=4                        # Accelerate using 4 CPU cores. Default: 1 
    )

    # Configure deduplication and unit standardization
    conversion_rules = {('g/L', 'mg/L'): 'x * 1000', 
                        ('ug/L', 'mg/L'): 'x / 1000',}
    DP.config_deduplicator(condition_cols, data_type, method, custom_method, priority, standard_unit, conversion_rules, log_transform, dropna_conditions)
    DP.dataset_deduplicate()

    # Configure web queries
    DP.config_web_request(sources=['pubchem/chemspider/comptox/cactus/cas'], max_workers, ...)
    DP.web_request(send='cas', request=['smiles', 'iupac'])

    # Substructure search
    DP.substructure_search(query_pattern, is_smarts=True)

    # Save results
    DP.save_results(output_path='file_path')

    # View Processing History (Audit Log)
    print(DP.get_history())
    # Output Example:
    #               Step Timestamp  Rows Before  Rows After   Delta                               Details
    # 0     Data Loading  10:00:01            0        1000   +1000                   Source: dataset.csv
    # 1    Preprocessing  10:00:05         1000         950     -50  Valid: 950, Invalid: 50. Order: ...
    # 2    Deduplication  10:00:08          950         800    -150       Method: auto (Log10 Transformed)

# CRITICAL: This protection block is REQUIRED for Windows multiprocessing!
# It prevents infinite recursive loops and memory explosion when n_jobs > 1.
if __name__ == '__main__':
    main()

Advanced Configuration

Web Service Integration

DiPTox supports the following chemical databases:

Note: ChemSpider, CompTox and CAS require API keys. Provide them during configuration:

DP.config_web_request(
    sources=['chemspider/comptox/CAS'],
    chemspider_api_key='your_personal_key',
    comptox_api_key='your_personal_key',
    cas_api_key='your_personal_key'
)

Requirements

  • Python>=3.8
  • Core Dependencies:
    • requests
    • rdkit>=2023.3
    • tqdm
    • openpyxl
    • scipy
    • streamlit>=1.0.0 (Required for GUI)
  • Optional Dependencies (install as needed, if not installed, then send the request using requests.):
    • pubchempy>=1.0.5: For PubChem integration
    • chemspipy>=2.0.0: For ChemSpider (requires API key)
    • ctx-python>=0.0.1a10: For CompTox Dashboard (requires API key)

License

Apache License 2.0 - See LICENSE for details

Support

Report issues on GitHub Issues

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

diptox-1.0.5.tar.gz (72.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

diptox-1.0.5-py3-none-any.whl (71.7 kB view details)

Uploaded Python 3

File details

Details for the file diptox-1.0.5.tar.gz.

File metadata

  • Download URL: diptox-1.0.5.tar.gz
  • Upload date:
  • Size: 72.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.15

File hashes

Hashes for diptox-1.0.5.tar.gz
Algorithm Hash digest
SHA256 61ab7f2b7fae26ca0491d14b0eb9bf8ce2a8fad79231088e40d6093f629de793
MD5 2407de5134982bd7658f96b4e4f2cde6
BLAKE2b-256 a898b0434fdf58820e0b056154d21c5727eef4f0c4a6609d3e450ce316776d33

See more details on using hashes here.

File details

Details for the file diptox-1.0.5-py3-none-any.whl.

File metadata

  • Download URL: diptox-1.0.5-py3-none-any.whl
  • Upload date:
  • Size: 71.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.15

File hashes

Hashes for diptox-1.0.5-py3-none-any.whl
Algorithm Hash digest
SHA256 48a2a1d99cae750f0fe0a8934d984f3e1a0dc426bac9d1357f3381224a22c181
MD5 b3dfe244b121b67e66fdb1bc87d8851f
BLAKE2b-256 e03889644a7fd5a0b060a1ca2fa6924bae5f05cb751150e3232e2ba038ee816d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page