
AssayInspector: A Python package for diagnostic assessment of data consistency in molecular datasets.

Data consistency assessment facilitates transfer learning in ADME modeling

Data heterogeneity and distributional misalignments pose critical challenges for machine learning models, often compromising predictive accuracy. These challenges are exemplified in preclinical safety modeling, a crucial step in early-stage drug discovery where limited data and experimental constraints exacerbate integration issues. Analyzing public ADME datasets, we uncovered significant misalignments between benchmark and gold-standard sources that degrade model performance. Our analyses further revealed that dataset discrepancies arise from differences in various factors, from experimental conditions in data collection to chemical space coverage. This highlights the importance of rigorous data consistency assessment (DCA) prior to modeling. To facilitate a systematic DCA across diverse datasets, we developed AssayInspector, a model-agnostic package that leverages statistics, visualizations, and diagnostic summaries to identify outliers, batch effects, and discrepancies. Beyond preclinical safety, DCA can play a crucial role in federated learning scenarios, enabling effective transfer learning across heterogeneous data sources and supporting reliable integration across diverse scientific domains.

Keywords: data reporting, molecular property, ADME, physicochemical, machine learning, data aggregation, predictive accuracy, benchmark

Installation

To install and use the package, first create the conda environment as follows:

conda env create -f AssayInspector_env.yml

Then, activate the environment:

conda activate assay_inspector

Finally, install the package from PyPI using pip:

pip install assay_inspector

Getting Started

To run AssayInspector, you first need to prepare your input data. The file should be in .tsv or .csv format and include the following required columns:

  • smiles: The SMILES string representation of each molecule in the dataset.
  • value: The annotated value for each molecule — use a numerical value for regression tasks or a binary label (0 or 1) for classification tasks.
  • ref: The reference source name from which each value-molecule annotation was obtained.
  • endpoint: The name of the endpoint to analyze.

You can find two example input files for the half-life and clearance datasets.
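As a minimal sketch of the input format described above, the following writes a tiny tab-separated file with the four required columns and checks a file's header for missing ones. The helper names (`write_example_input`, `missing_columns`), the example molecules, values, and source labels are hypothetical illustrations and are not part of AssayInspector.

```python
import csv
import tempfile
from pathlib import Path

# Required column names, as listed in the section above.
REQUIRED_COLUMNS = ("smiles", "value", "ref", "endpoint")


def write_example_input(path):
    """Write a two-row .tsv file with made-up illustrative values."""
    rows = [
        {"smiles": "CCO", "value": 1.2, "ref": "source_A", "endpoint": "half_life"},
        {"smiles": "c1ccccc1", "value": 0.7, "ref": "source_B", "endpoint": "half_life"},
    ]
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


def missing_columns(path):
    """Return the required columns that are absent from the file header."""
    # Assumption: .tsv files are tab-separated, anything else is comma-separated.
    delimiter = "\t" if str(path).endswith(".tsv") else ","
    with open(path, newline="") as handle:
        header = next(csv.reader(handle, delimiter=delimiter), [])
    return [column for column in REQUIRED_COLUMNS if column not in header]


example_path = Path(tempfile.mkdtemp()) / "example_input.tsv"
write_example_input(example_path)
print(missing_columns(example_path))  # an empty list means the header is complete
```

Running a check like this before calling AssayInspector can catch a misnamed column early, since the package expects these exact column names.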

Usage

Once the input data file has been prepared, you can run AssayInspector in the following way:

from assay_inspector import AssayInspector

# Prepare AssayInspector report
report = AssayInspector(
	data_path='path/to/dataset/file.tsv',
	endpoint_name='endpoint',
	task='regression',
	feature_type='ecfp4',
	reference_set='path/to/reference_set.tsv' # optional
)

# Run AssayInspector report
report.get_individual_reporting()
report.get_comparative_reporting()

AssayInspector arguments

| Argument | Type | Description |
| --- | --- | --- |
| data_path | str | Path to the input dataset file (.csv or .tsv format). |
| endpoint_name | str | Name of the endpoint to analyze. |
| task | str | Type of task: either 'regression' or 'classification'. |
| feature_type | str | Type of features to use: one of 'ecfp4', 'rdkit', or 'custom'. |
| outliers_method | str | (Optional) Method to detect outliers: 'zscore' (default) or 'iqr'. |
| distance_metric | str | (Optional) Distance metric for custom descriptors: 'euclidean' (default). |
| descriptors_df | pd.DataFrame | (Optional) DataFrame containing molecular descriptors for dataset molecules (required when feature_type='custom'). |
| reference_set | str | (Optional) Path to an additional dataset used for comparative analysis. |
| lower_bound | int or float | (Optional) Lower bound to define the endpoint applicability domain. |
| upper_bound | int or float | (Optional) Upper bound to define the endpoint applicability domain. |
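The `outliers_method` options 'zscore' and 'iqr' name two standard outlier-detection techniques. The sketch below shows the textbook form of each, using only the standard library. It is not AssayInspector's implementation: the function names and the thresholds (3 standard deviations, 1.5 times the interquartile range) are conventional assumptions, and the package may use different cut-offs or a different quantile definition.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing to flag
    return [v for v in values if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Made-up endpoint values with one obviously extreme measurement.
measurements = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 25.0]
print(iqr_outliers(measurements))
print(zscore_outliers(measurements, threshold=2.0))
```

The z-score rule depends on the mean and standard deviation, which the extreme value itself inflates, so it tends to be less sensitive on small or skewed samples. The IQR rule is based on quartiles and is more robust in those cases.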

The resulting output will be saved in a folder named AssayInspector_YYYYMMDD, which will contain:

  • A tabular file that summarizes key descriptive parameters for each data source.
  • A comprehensive set of visualization plots that facilitate the detection of inconsistencies across data sources.
  • An insight report containing multiple alerts and recommendations to guide data cleaning and preprocessing.
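To locate the output folder from a script, its name can be reconstructed from the `AssayInspector_YYYYMMDD` pattern. This is a sketch under two assumptions not stated above: that the date is the day the report was run, and that the folder is created relative to the current working directory. The helper name `expected_output_dir` is hypothetical.

```python
from datetime import date
from pathlib import Path


def expected_output_dir(run_date=None):
    """Build the folder name AssayInspector_YYYYMMDD for a given date (default: today)."""
    run_date = run_date or date.today()
    return f"AssayInspector_{run_date:%Y%m%d}"


output_dir = Path(expected_output_dir())
if output_dir.is_dir():
    # List whatever files the run produced; the file names are not assumed here.
    for item in sorted(output_dir.iterdir()):
        print(item.name)
else:
    print(f"No folder named {output_dir} in the current directory")
```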

Examples

Below are a few sample outputs generated by AssayInspector.

[Figures: Outlier Visualization and Endpoint Distribution Comparative Visualization for the half-life and clearance endpoints.]

License

AssayInspector is licensed under the MIT License. See the LICENSE file.
