DiPTox - Data Integration and Processing for Computational Toxicology
DiPTox is a Python toolkit designed for the robust preprocessing, standardization, and multi-source data integration of molecular datasets, with a focus on computational toxicology workflows.
Official Release v1.0
We are excited to announce the first official stable release of DiPTox on PyPI! This milestone brings production-ready stability and performance enhancements:
- Multi-Process Acceleration:
  - Accelerate chemical preprocessing tasks by 10x or more using the `n_jobs` parameter.
  - Intelligent task distribution across CPU cores for heavy datasets.
- Cross-Platform Robustness:
  - Implemented a specialized "Guard Mechanism" for Windows multiprocessing to prevent memory explosion and recursive process spawning issues.
  - Verified stability across Windows, Linux, and macOS environments.
- Enhanced Data Loading:
  - Switched to binary stream parsing for `.sdf` and `.mol` files to resolve encoding crashes (e.g., `utf-8` vs `latin-1`).
  - Auto-parsing of molecular structures to generate SMILES even when properties are missing.
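The binary-stream approach to `.sdf` and `.mol` loading can be illustrated with a small sketch: read the file as bytes, try `utf-8` first, and fall back to `latin-1` when decoding fails. The helper name `decode_with_fallback` is hypothetical and is not taken from DiPTox; the library's actual parsing code may differ.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "latin-1")) -> str:
    """Decode raw bytes, trying each encoding in order (hypothetical helper)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 accepts every byte, so this line is reached only with a custom list.
    return raw.decode("utf-8", errors="replace")


# An SDF property block whose value contains a byte that is invalid as UTF-8
# but valid as Latin-1 (0xE9, the letter e with an acute accent).
raw_record = b"> <Supplier>\nSoci\xe9t\xe9 Chimique\n\n$$$$\n"
text = decode_with_fallback(raw_record)
print(text.splitlines()[1])
```

Reading in binary mode defers the encoding decision until the bytes are in hand, which is why a single bad character no longer aborts the whole load.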
Version Update Log (1.0.6)
- Data Loading Fixes: Fixed and optimized the native parsing logic for `.smi` (SMILES) files, resolving previous reading issues to ensure stable ingestion of large-scale chemical databases.
- Web Request Module Overhaul: Completely refactored the network request engine for stability and transparency. This update introduces a "Capability Map" and fast-fail logic to intelligently intercept unsupported queries and Auth/404 errors (eliminating infinite retry deadlocks). Furthermore, it eradicates "silent failures" by logging highly granular failure reasons (e.g., `Failed -> pubchem: Not Found | chemspider: Auth Error (401)`), and implements field-level data provenance to strictly record the exact source for each molecular property, drastically improving dataset auditability.
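As a rough illustration of the capability map and fast-fail idea, the sketch below skips sources that cannot return the requested field, stops retrying on 401/404, and collects a failure reason per source. Everything here (`CAPABILITIES`, `FAST_FAIL`, `query`, `fake_fetch`, and the contents of the map) is hypothetical and only mimics the log format quoted above; it is not DiPTox's internal code.

```python
# Hypothetical capability map: which fields each source can return.
CAPABILITIES = {
    "pubchem": {"smiles", "iupac", "mw"},
    "chemspider": {"smiles", "mw"},
    "cactus": {"smiles", "iupac"},
}

# Status codes for which a retry cannot help, so the query fails fast.
FAST_FAIL = {401: "Auth Error (401)", 404: "Not Found"}


def query(sources, field, fetch):
    """Return (value, source, failures) for the first source that answers."""
    failures = []
    for source in sources:
        if field not in CAPABILITIES.get(source, set()):
            failures.append(f"{source}: Unsupported")  # intercepted before any request
            continue
        status, value = fetch(source, field)
        if status == 200:
            return value, source, failures  # provenance: remember which source answered
        failures.append(f"{source}: {FAST_FAIL.get(status, f'HTTP {status}')}")
    return None, None, failures


def fake_fetch(source, field):
    """Stand-in for a network call; returns (status_code, value)."""
    responses = {"pubchem": (404, None), "chemspider": (401, None), "cactus": (200, "ethanol")}
    return responses[source]


value, source, failures = query(["pubchem", "chemspider"], "smiles", fake_fetch)
print("Failed -> " + " | ".join(failures))
```

Returning the answering source alongside the value is one simple way to keep per-field provenance.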
Version Update Log (1.0.5)
- Enhanced Unit Standardization: Added support for the standard math operator `^` (power) by automatically mapping it to `**`, and fixed a logic error that caused single-unit datasets to be skipped even when a different target unit was specified.
- Deduplication Logic Upgrades: Introduced a `log10` transformation mode alongside the existing `-log10` option, enabling support for both toxicity data (pIC50) and physicochemical properties like water solubility (logS) or partition coefficients.
- Robustness & Error Handling: Implemented strict numerical validation using `errors='coerce'` in standardization and deduplication modules to automatically filter out invalid strings (e.g., "N/A", ">100") with clear warning feedback in the GUI.
- Critical State Management Fix: Resolved an issue where `load_data` failed to reset the `_preprocess_key` flag, ensuring that automatic column mapping logic for Web Requests (like auto-detecting `smiles_from_web`) functions correctly after a new dataset is loaded.
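The `errors='coerce'` option mentioned above is presumably the pandas `to_numeric` convention, where values that cannot be parsed become NaN instead of raising. The sketch below reproduces that behaviour with the standard library only; `coerce_numeric` is a hypothetical name and the warning text is illustrative.

```python
import math


def coerce_numeric(values):
    """Convert each value to float; anything unparseable becomes NaN."""
    result = []
    for value in values:
        try:
            result.append(float(value))
        except (TypeError, ValueError):
            result.append(math.nan)
    return result


raw = ["12.5", "N/A", ">100", 3, None]
numeric = coerce_numeric(raw)
kept = [v for v in numeric if not math.isnan(v)]
dropped = len(numeric) - len(kept)
print(kept, f"({dropped} invalid values filtered)")
```

Censored values such as ">100" are dropped by this rule rather than reinterpreted, so they should be cleaned beforehand if they carry information worth keeping.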
Version Update Log (1.0.4)
- GUI State Management Fix: Resolved a `StreamlitAPIException` on the Export page that occurred when using the "Undo Last Step" feature. Implemented proper `on_click` callbacks to safely mutate the `session_state` (specifically for `export_selected_cols`) before the UI re-renders, ensuring a crash-free and seamless undo experience.
- Refined Preprocessing Rules: Adjusted and optimized several default charge neutralization rules.
Version Update Log (1.0.3)
- Enhanced Unit Standardization: Custom conversion formulas now fully support molecular weight (`mw`). You can seamlessly convert between molarity and mass concentrations (e.g., using formulas like `x * mw * 1000`).
- GUI Interface Optimization: The Streamlit graphical interface has been beautifully redesigned for a more professional, clean, and logical scientific layout. We've reduced visual clutter, grouped configuration panels intuitively, and improved component alignment.
- Comprehensive Audit Log (History): The processing history has been heavily upgraded. It now records granular parameters for every operation—including exactly which preprocessing rules were triggered, active deduplication conditions, web query statuses, and substructure search match counts.
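To make the `x * mw * 1000` formula concrete, here is a worked sketch. It assumes `x` is a molar concentration in mol/L and `mw` is in g/mol, so that mol/L times g/mol gives g/L and the factor 1000 converts to mg/L. The function names are illustrative, not part of DiPTox.

```python
def molar_to_mass(x_mol_per_l: float, mw_g_per_mol: float) -> float:
    """mol/L -> mg/L, i.e. the formula x * mw * 1000."""
    return x_mol_per_l * mw_g_per_mol * 1000


def mass_to_molar(x_mg_per_l: float, mw_g_per_mol: float) -> float:
    """mg/L -> mol/L, the inverse formula x / (mw * 1000)."""
    return x_mg_per_l / (mw_g_per_mol * 1000)


# Caffeine has a molecular weight of about 194.19 g/mol.
mass_conc = molar_to_mass(0.001, 194.19)  # 1 mmol/L expressed in mg/L
print(round(mass_conc, 2))
print(mass_to_molar(mass_conc, 194.19))
```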
DiPTox Community Check-in (Optional)
To help us understand our user base and improve the software, DiPTox includes a one-time, optional survey on first use.
- Completely Optional: You can skip it with a single click.
- Privacy-Focused: The information helps us with academic impact assessment. It will not be shared.
Core Features
1. Graphical User Interface (GUI)
Powered by Streamlit, the GUI allows users to perform all workflows visually without writing code.
- Visual Operation: Complete workflow control via a web browser.
- Real-time Preview: Instantly view data changes after applying rules.
- Rule Management: Add/Remove valid atoms, salts, solvents, and unit conversion formulas interactively.
- Smart Column Mapping: Intelligent detection of headers and binary file structures.
2. Chemical Preprocessing & Standardization
A configurable pipeline to clean and normalize chemical structures.
- Strict Inorganic Filtering: Updated SMARTS matching to accurately identify complex inorganic species (e.g., ionic cyanides) without misclassifying organic nitriles.
- Pipeline Steps:
  - Remove salts & solvents
  - Handle mixtures (keep largest fragment)
  - Remove inorganic molecules
  - Neutralize charges & Validate atomic composition
  - Remove explicit hydrogens, stereochemistry, and isotopes
  - Reject Radical Species: Automatically discard molecules containing free radical atoms.
  - Standardize to canonical SMILES
  - Filter by atom count
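Two of the steps above, keeping the largest fragment and writing canonical SMILES, can be sketched directly with RDKit (listed as a core dependency). This is a minimal illustration under the assumption that "largest" means the most heavy atoms; it is not DiPTox's implementation and it skips neutralization, stereo removal, and the other steps. The function name is hypothetical.

```python
from rdkit import Chem


def largest_fragment_smiles(smiles: str):
    """Keep the fragment with the most heavy atoms and return canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparseable input
        return None
    fragments = Chem.GetMolFrags(mol, asMols=True)
    largest = max(fragments, key=lambda frag: frag.GetNumHeavyAtoms())
    return Chem.MolToSmiles(largest)  # MolToSmiles is canonical by default


# Sodium acetate: the sodium counter-ion is dropped, the acetate anion is kept.
print(largest_fragment_smiles("CC(=O)[O-].[Na+]"))
```

The acetate comes back still charged, which is why a separate neutralization step follows in the pipeline.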
3. Unit Standardization
Normalize heterogeneous target data into a single unit effortlessly.
- Automatic Conversion: Built-in rules for Concentration, Time, Pressure, and Temperature.
- Custom Formulas: Define mathematical rules (e.g., `x * 1000` or `10**(-x)`) interactively via GUI or script.
- Unified Output: Standardize diverse units (e.g., `ug/mL`, `g/L`, `M`) to a single target (e.g., `mg/L`).
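A rule table keyed by `(source_unit, target_unit)` with a formula in `x`, as used in the Quick Start, could be applied along the following lines. This is a sketch under stated assumptions: `apply_rule` and `standardize` are hypothetical helpers, and evaluating formulas with `eval` is shown only for brevity, not as DiPTox's method.

```python
def apply_rule(value: float, formula: str) -> float:
    """Evaluate a conversion formula in which x is the input value.

    Sketch only: eval with empty builtins is not a sandbox for untrusted input.
    """
    expression = formula.replace("^", "**")  # accept ^ as the power operator
    return eval(expression, {"__builtins__": {}}, {"x": value})


conversion_rules = {
    ("g/L", "mg/L"): "x * 1000",
    ("ug/L", "mg/L"): "x / 1000",
}


def standardize(value: float, unit: str, target: str = "mg/L") -> float:
    """Convert value from unit to target using the rule table."""
    if unit == target:
        return value
    return apply_rule(value, conversion_rules[(unit, target)])


print(standardize(2.5, "g/L"))
print(standardize(500, "ug/L"))
print(apply_rule(3, "10^(-x)"))
```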
4. Data Deduplication
Flexible strategies for handling duplicate entries with advanced controls.
- Data Types: Supports `continuous` (e.g., IC50) and `discrete` (e.g., Active/Inactive) targets.
- Methods: `auto`, `IQR`, `3sigma`, `vote`, or custom priority rules.
- Log Transformation: Optional `-log10` transformation (e.g., IC50 → pIC50) applied before deduplication logic to handle bioactivity data correctly.
- Flexible NaN Handling: Option to retain rows with missing conditions (treating NaN as a valid group) instead of dropping them.
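The README does not spell out how the `IQR` method works internally, so the following is only a sketch of one common reading: apply `-log10` to each value, group by SMILES, discard values outside the 1.5 × IQR fences, and average what remains. The function names and the choice of the mean are assumptions.

```python
import math
import statistics
from collections import defaultdict


def iqr_filter(values):
    """Drop values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    if len(values) < 4:  # too few points to estimate quartiles
        return list(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if low <= v <= high]


def deduplicate(records, log_transform=True):
    """Group (smiles, value) pairs by SMILES and average the inliers."""
    groups = defaultdict(list)
    for smiles, value in records:
        groups[smiles].append(-math.log10(value) if log_transform else value)
    return {s: statistics.mean(iqr_filter(v)) for s, v in groups.items()}


# IC50 values in mol/L; the last "CCO" entry is an obvious outlier.
records = [
    ("CCO", 1.0e-6), ("CCO", 1.2e-6), ("CCO", 0.8e-6), ("CCO", 1.1e-6),
    ("CCO", 0.9e-6), ("CCO", 1.0e-6), ("CCO", 1.0e-2),
    ("CCN", 1.0e-5),
]
merged = deduplicate(records)
print({smiles: round(value, 2) for smiles, value in merged.items()})
```

Transforming before filtering matters: on the raw scale the spread of IC50 values is dominated by the largest entries, whereas on the log scale the fences are symmetric in orders of magnitude.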
5. Comprehensive History Tracking (Audit Log)
- Records every operation (Loading, Preprocessing, Filtering, etc.) in an Audit Log.
- Tracks timestamps, operation details, and row count changes (Delta) step-by-step.
- Available via API (`get_history()`) and visualized in the GUI.
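The kind of record described above (step name, timestamp, row counts, and delta) can be modelled with a very small structure. The `AuditLog` class below is a hypothetical illustration of the idea, not the object that `get_history()` returns.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class AuditLog:
    """Collects one entry per processing step (hypothetical structure)."""
    entries: list = field(default_factory=list)

    def record(self, step: str, rows_before: int, rows_after: int, details: str = ""):
        self.entries.append({
            "Step": step,
            "Timestamp": datetime.now().strftime("%H:%M:%S"),
            "Rows Before": rows_before,
            "Rows After": rows_after,
            "Delta": f"{rows_after - rows_before:+d}",  # signed row change
            "Details": details,
        })


log = AuditLog()
log.record("Data Loading", 0, 1000, "Source: dataset.csv")
log.record("Preprocessing", 1000, 950, "Valid: 950, Invalid: 50")
for entry in log.entries:
    print(entry["Step"], entry["Rows Before"], entry["Rows After"], entry["Delta"])
```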
6. Identifier & Property Integration
- Fetch and interconvert identifiers (CAS, SMILES, IUPAC, MW) from multiple sources (PubChem, ChemSpider, CompTox, Cactus, CAS Common Chemistry, ChEMBL).
- High-performance concurrent requests with automatic rate limiting and retries.
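Concurrent lookups with retries can be sketched with the standard library alone. In the example below a thread pool caps the number of simultaneous requests and each call is retried with an increasing delay; `with_retries` and `flaky_lookup` are hypothetical stand-ins, and a real rate limiter for a public API would need to be stricter than a simple worker cap.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retries(func, retries=3, delay=0.01):
    """Call func(); on an exception wait and retry, doubling the delay each time."""
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay * 2 ** attempt)


calls = {}


def flaky_lookup(cas):
    """Stand-in for a web request: fails once per CAS number, then succeeds."""
    calls[cas] = calls.get(cas, 0) + 1
    if calls[cas] == 1:
        raise ConnectionError("temporary failure")
    return f"SMILES-for-{cas}"


cas_numbers = ["64-17-5", "50-00-0", "67-56-1"]
# max_workers caps the number of simultaneous requests.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda cas: with_retries(lambda: flaky_lookup(cas)), cas_numbers))
print(results)
```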
7. Utility Tools
- Perform substructure searches using SMILES or SMARTS patterns.
- Customize chemical processing rules for neutralization reactions, salt/solvent lists, and valid atoms.
- Display a summary of all currently active processing rules.
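A substructure search of the kind listed above can be expressed in a few lines of RDKit. The sketch mirrors the `is_smarts` flag from the Quick Start call, but `substructure_hits` itself is a hypothetical helper and DiPTox's `substructure_search` may behave differently.

```python
from rdkit import Chem


def substructure_hits(smiles_list, pattern, is_smarts=True):
    """Return the SMILES strings that contain the query pattern."""
    query = Chem.MolFromSmarts(pattern) if is_smarts else Chem.MolFromSmiles(pattern)
    if query is None:
        raise ValueError(f"Could not parse query pattern: {pattern}")
    hits = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None and mol.HasSubstructMatch(query):
            hits.append(smiles)
    return hits


molecules = ["CCO", "CC(=O)O", "c1ccccc1O", "CCN"]
# [OX2H] matches an oxygen with two connections, one of which is a hydrogen.
print(substructure_hits(molecules, "[OX2H]"))
```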
Installation
Install the official stable version from PyPI:
pip install diptox
GUI
After installation, you can launch the graphical interface directly from your terminal:
diptox-gui
This command will automatically open the DiPTox interface in your default web browser.
Quick Start
from diptox import DiptoxPipeline


def main():
    # Initialize processor
    DP = DiptoxPipeline()

    # Load data (input_data may be a file path, a list, or a DataFrame).
    # Bare names in this call and in the config calls below are argument
    # placeholders: replace each with your own value before running.
    DP.load_data(input_data='file_path/list/dataframe', smiles_col, target_col, cas_col, unit_col)

    # Customize Processing Rules (Optional)
    print("--- Default Rules ---")
    DP.display_processing_rules()
    DP.manage_atom_rules(atoms=['Si'], add=True)         # Add 'Si' to the list of valid atoms
    DP.manage_default_salt(salts=['[Na+]'], add=False)   # Example: remove sodium from the salt list
    DP.manage_default_solvent(solvents='Cl', add=False)  # Example: remove chlorine from the solvent list
    DP.add_neutralization_rule('[$([N-]C=O)]', 'N')      # Add a custom neutralization rule
    print("\n--- Customized Rules ---")
    DP.display_processing_rules()

    # Configure preprocessing
    DP.preprocess(
        remove_salts=True,            # Remove salt fragments. Default: True.
        remove_solvents=True,         # Remove solvent fragments. Default: True.
        remove_mixtures=True,         # Handle mixtures based on fragment size. Default: False.
        hac_threshold=3,              # Heavy atom count threshold for fragment removal. Default: 3.
        keep_largest_fragment=True,   # Keep the largest fragment in a mixture. Default: True.
        remove_inorganic=False,       # Remove common inorganic molecules. Default: True.
        neutralize=True,              # Neutralize charges on the molecule. Default: True.
        reject_non_neutral=False,     # Only retain the molecules whose formal charge is zero. Default: False.
        check_valid_atoms=True,       # Check if all atoms are in the valid list. Default: False.
        strict_atom_check=False,      # If True, discard molecules with invalid atoms. If False, try to remove them. Default: False.
        remove_stereo=False,          # Remove stereochemistry information. Default: False.
        remove_isotopes=True,         # Remove isotopic information. Default: True.
        remove_hs=True,               # Remove explicit hydrogen atoms. Default: True.
        reject_radical_species=True,  # Molecules containing free radical atoms are directly rejected. Default: True.
        n_jobs=4                      # Accelerate using 4 CPU cores. Default: 1
    )

    # Configure deduplication and unit standardization
    conversion_rules = {
        ('g/L', 'mg/L'): 'x * 1000',
        ('ug/L', 'mg/L'): 'x / 1000',
    }
    DP.config_deduplicator(condition_cols, data_type, method, custom_method, priority, standard_unit, conversion_rules, log_transform, dropna_conditions)
    DP.dataset_deduplicate()

    # Configure web queries (choose one or more of the listed sources)
    DP.config_web_request(sources=['pubchem/chemspider/comptox/cactus/cas'], max_workers, ...)
    DP.web_request(send='cas', request=['smiles', 'iupac'])

    # Substructure search
    DP.substructure_search(query_pattern, is_smarts=True)

    # Save results
    DP.save_results(output_path='file_path')

    # View Processing History (Audit Log)
    print(DP.get_history())
    # Output Example:
    #    Step           Timestamp  Rows Before  Rows After  Delta  Details
    # 0  Data Loading   10:00:01   0            1000        +1000  Source: dataset.csv
    # 1  Preprocessing  10:00:05   1000         950         -50    Valid: 950, Invalid: 50. Order: ...
    # 2  Deduplication  10:00:08   950          800         -150   Method: auto (Log10 Transformed)


# CRITICAL: This protection block is REQUIRED for Windows multiprocessing!
# It prevents infinite recursive loops and memory explosion when n_jobs > 1.
if __name__ == '__main__':
    main()
Advanced Configuration
Web Service Integration
DiPTox supports the following chemical databases:
- PubChem: https://pubchem.ncbi.nlm.nih.gov/
- ChemSpider: https://www.chemspider.com/
- CompTox: https://comptox.epa.gov/dashboard/
- Cactus: https://cactus.nci.nih.gov/
- CAS: https://commonchemistry.cas.org/
- ChEMBL: https://www.ebi.ac.uk/chembl/
Note: ChemSpider, CompTox and CAS require API keys. Provide them during configuration:
DP.config_web_request(
    sources=['chemspider/comptox/CAS'],
    chemspider_api_key='your_personal_key',
    comptox_api_key='your_personal_key',
    cas_api_key='your_personal_key'
)
Requirements
- `Python>=3.8`
- Core Dependencies:
  - `requests`
  - `rdkit>=2023.3`
  - `tqdm`
  - `openpyxl`
  - `scipy`
  - `streamlit>=1.0.0` (Required for GUI)
- Optional Dependencies (install as needed; if a package is not installed, the request is sent using `requests` instead):
  - `pubchempy>=1.0.5`: For PubChem integration
  - `chemspipy>=2.0.0`: For ChemSpider (requires API key)
  - `ctx-python>=0.0.1a10`: For CompTox Dashboard (requires API key)
License
Apache License 2.0 - See LICENSE for details
Support
Report issues on GitHub Issues