A package for automating QSAR analysis on ChEMBL data.

`chembl_miner`: A Python Package for QSAR Analysis

chembl_miner is a Python package that streamlines the process of Quantitative Structure-Activity Relationship (QSAR) analysis using data from the ChEMBL database. It provides a complete and automated workflow with a suite of tools for data retrieval, preprocessing, feature engineering, machine learning, and model interpretation.

Motivation 🤔

This package was developed to organize and automate the QSAR analysis workflow commonly used in pharmaceutical sciences for drug discovery and repurposing. The goal is to provide a clean, reusable, and shareable tool for researchers in the field.

Target Audience 🎯

The primary audience for this package is drug discovery and repurposing researchers, including medicinal chemists, bioinformaticians, and data scientists.

Key Features 🔬

Data Retrieval: Fetch bioactivity data from the ChEMBL database using target ChEMBL IDs and specific activity types.
Data Preprocessing: A robust and customizable preprocessing pipeline to clean and prepare your data for modeling, which includes handling missing values, converting units, and treating duplicate entries.
Feature Engineering: Calculate various molecular fingerprints from PaDEL descriptors (e.g., PubChem, MACCS, EState).
Machine Learning: A flexible wrapper for scikit-learn-compatible regression models that handles hyperparameter optimization with a genetic algorithm, model evaluation with cross-validation, and fitting of the final model.
Dataset Management: Convenient wrappers to handle and split datasets for both training and deployment, with support for structural (or scaffold) splitting.
Model Explanation and Analysis: Tools for both exploratory data analysis and model interpretation.
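
Structural (scaffold) splitting keeps every molecule that shares a scaffold on the same side of the train/test boundary, so the test set probes chemotypes the model has not seen. The sketch below illustrates the idea in plain Python. It is not chembl_miner's implementation: the function name `scaffold_split` and the scaffold keys are hypothetical, and the keys are assumed to be precomputed (for example, as Murcko scaffold SMILES).

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split sample indices so that no scaffold appears in both sets.

    `scaffolds` holds one scaffold key per molecule (hypothetical input).
    """
    groups = defaultdict(list)
    for index, key in enumerate(scaffolds):
        groups[key].append(index)

    # Largest scaffold groups are placed first; rarer scaffolds
    # tend to end up in the test set.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_train_target = len(scaffolds) - int(round(test_fraction * len(scaffolds)))
    train, test = [], []
    for group in ordered:
        # A group is never divided: it goes to train only if it fits whole.
        if len(train) + len(group) <= n_train_target:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)


scaffolds = ["A", "A", "A", "B", "B", "C", "D", "A", "B", "E"]
train_idx, test_idx = scaffold_split(scaffolds, test_fraction=0.3)
print("train:", train_idx)
print("test:", test_idx)
```

Because whole groups are assigned at once, the realized test fraction can differ from the requested one when scaffold groups are large.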

Installation 💻

You can install chembl_miner using pip; no other dependencies need to be installed manually.

pip install chembl_miner
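
To confirm that the installation succeeded, the installed version can be queried with the standard library. This is a small sketch; the helper name `installed_version` is hypothetical and not part of chembl_miner.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("chembl_miner"))
```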

Typical Workflow 🧪

The following is a general workflow for using chembl_miner.

1. Data Retrieval and Preprocessing

First, fetch the activity data from ChEMBL for a specific target and preprocess it.

from chembl_miner import *
import pandas as pd

# Set verbosity level (0, 1, or 2)
set_verbosity(1)

# Fetch activity data from ChEMBL
target_chembl_id = "CHEMBL203"  # Example: Epidermal growth factor receptor (EGFR)
activity_df = get_activity_data(target_chembl_id, activity_type="IC50")

# Optionally, review and filter assays. For example, to exclude assays with certain keywords:
id_list = review_assays(activity_df, max_entries=5,
                        assay_keywords=['mutant', 'mutated'], exclude_keywords=True)

# Preprocess the data
activity_df = preprocess_data(activity_df, convert_units=True, assay_ids=id_list)

# Calculate molecular fingerprints
descriptors_df = calculate_fingerprint(activity_df, fingerprint="pubchem")

# Create a TrainingData object
# This will split the data into training and testing sets.
# You can use any TrainingData object for the subsequent steps.
dataset = TrainingData.from_dataframe(activity_df, descriptors_df, use_structural_split=True)

# Saving the dataset is optional but recommended for reproducibility
# and to avoid re-running the preprocessing steps.
dataset.to_path("my_qsar_dataset")
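
The preprocessing call above is described as handling missing values, converting units, and treating duplicate entries. The following plain-Python sketch shows what such steps typically look like for IC50 data. It is an illustration under stated assumptions, not chembl_miner's actual logic: the function name `preprocess`, the median rule for duplicates, and the `MOL_*` identifiers are hypothetical, and the record keys are assumed to follow ChEMBL's activity field names.

```python
import math
from collections import defaultdict
from statistics import median

# Conversion factors to nanomolar (nM).
TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}


def preprocess(records):
    """Return {molecule_id: pIC50} from raw activity records."""
    values_nm = defaultdict(list)
    for rec in records:
        value, units = rec.get("standard_value"), rec.get("standard_units")
        # Skip rows with missing, unconvertible, or non-positive values.
        if value is None or units not in TO_NM or float(value) <= 0:
            continue
        values_nm[rec["molecule_chembl_id"]].append(float(value) * TO_NM[units])

    # Collapse duplicate measurements with the median, then convert:
    # pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM).
    return {mol: 9.0 - math.log10(median(vals)) for mol, vals in values_nm.items()}


records = [
    {"molecule_chembl_id": "MOL_A", "standard_value": "10", "standard_units": "nM"},
    {"molecule_chembl_id": "MOL_A", "standard_value": "0.1", "standard_units": "uM"},
    {"molecule_chembl_id": "MOL_B", "standard_value": "1", "standard_units": "uM"},
    {"molecule_chembl_id": "MOL_C", "standard_value": None, "standard_units": "nM"},
]
pic50 = preprocess(records)
for molecule, value in pic50.items():
    print(f"{molecule}: pIC50 = {value:.2f}")
```

Working on the logarithmic pIC50 scale is a common choice in QSAR modeling because raw IC50 values span several orders of magnitude.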

Example Output of `review_assays`

Displaying 5 of 1894 total unique assays.
To see more, adjust the 'max_entries' parameter.

assay_chembl_id  assay_description
CHEMBL648388     In vitro antifungal activity against Aspergillus fumigatus                   189
CHEMBL647385     Minimum inhibitory concentration (MIC) against Aspergillus fumigatus         116
CHEMBL3266263    Antimicrobial activity against Aspergillus fumigatus 7544 after 48 hrs...    111
CHEMBL649169     Antifungal activity against Aspergillus fumigatus Saito strain               103
CHEMBL899327     Antifungal activity against Aspergillus fumigatus                             98
Name: count, dtype: int64
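
The output above is a frequency table of activities per assay. A minimal sketch of how counting assays and excluding those whose description matches a keyword could work is shown below. The function `filter_assays_by_keyword`, its parameters, and the `ASSAY_*` rows are hypothetical stand-ins, not the behavior of `review_assays` itself.

```python
from collections import Counter


def filter_assays_by_keyword(rows, keywords, exclude=True, max_entries=5):
    """Count activities per assay and filter assays by description keywords.

    `rows` is a list of (assay_id, description) pairs, one per activity.
    Returns (top_counts, kept_ids).
    """
    counts = Counter(rows)
    lowered = [k.lower() for k in keywords]

    kept_ids = []
    for (assay_id, description), _ in counts.most_common():
        has_keyword = any(k in description.lower() for k in lowered)
        # exclude=True keeps assays without a keyword;
        # exclude=False keeps only assays with one.
        if has_keyword != exclude:
            kept_ids.append(assay_id)
    return counts.most_common(max_entries), kept_ids


rows = (
    [("ASSAY_1", "Inhibition of wild-type enzyme")] * 3
    + [("ASSAY_2", "Inhibition of mutant enzyme")] * 2
    + [("ASSAY_3", "Binding affinity")]
)
top_counts, kept_ids = filter_assays_by_keyword(rows, ["mutant", "mutated"])
print(top_counts)
print(kept_ids)
```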

2. Hyperparameter Optimization

Next, optimize the hyperparameters of your machine learning model. This example uses XGBoost regression and optimizes for Mean Absolute Error (MAE).

# Load the dataset (if you saved it previously)
# dataset = TrainingData.from_path('my_qsar_dataset')

# Set up the model pipeline
ml = ModelPipeline.setup('xgboost_reg', scoring=['mae'])

# Optimize hyperparameters using a genetic algorithm
param_search = ml.optimize_hyperparameters(dataset=dataset, cv=5, refit='mae', population_size=40)

# Evaluate the model with 10-fold cross-validation
cv_results = ml.evaluate_model(dataset=dataset, cv=10)
print(ml.unpack_cv_results(cv_results))

# The best parameters are stored in ml.params
print('Best parameters found:', ml.params)
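
For readers unfamiliar with genetic algorithms, the sketch below shows the general search loop (selection, crossover, mutation) in plain Python. It is a toy illustration, not the optimizer used by `optimize_hyperparameters`: the search space, the `toy_score` objective (a stand-in for a cross-validated score), and all function names are hypothetical.

```python
import random

# Hypothetical search space: (low, high) bounds for two hyperparameters.
SPACE = {"learning_rate": (0.01, 0.5), "max_depth": (2, 10)}


def toy_score(params):
    """Stand-in for a cross-validated score (higher is better).

    Peaks at learning_rate=0.1 and max_depth=6; a real run would
    train and validate a model here instead.
    """
    return -((params["learning_rate"] - 0.1) ** 2) - 0.01 * (params["max_depth"] - 6) ** 2


def random_individual(rng):
    return {
        "learning_rate": rng.uniform(*SPACE["learning_rate"]),
        "max_depth": rng.randint(*SPACE["max_depth"]),
    }


def crossover(a, b, rng):
    # Each gene is inherited from one of the two parents at random.
    return {key: rng.choice((a[key], b[key])) for key in a}


def mutate(individual, rng, rate=0.2):
    fresh = random_individual(rng)
    # With probability `rate`, replace a gene with a freshly sampled value.
    return {key: fresh[key] if rng.random() < rate else individual[key] for key in individual}


def genetic_search(population_size=40, generations=15, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=toy_score, reverse=True)
        parents = population[: population_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children  # parents survive unchanged (elitism)
    return max(population, key=toy_score)


best = genetic_search()
print("Best parameters:", best)
print("Best score:", toy_score(best))
```

Because the best individuals are carried over unchanged, the best score in the population can never get worse from one generation to the next.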

Example Output of `unpack_cv_results`

  scorer dataset_type      mean        sd
0    mae         test -0.486349  0.063903
1    mae        train -0.103694  0.003171
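
The means in the output above are negative. This is consistent with scikit-learn's scoring convention, in which error metrics are negated (for example, `neg_mean_absolute_error`) so that larger values always mean better models; that chembl_miner reports such scores unchanged is an inference from the output, not something the text confirms. The sketch below reproduces the convention in plain Python with a baseline that predicts the training mean; the function names and data are hypothetical.

```python
from statistics import mean, stdev


def neg_mae(y_true, y_pred):
    """Negated mean absolute error, so that larger values mean better fits."""
    return -mean(abs(t - p) for t, p in zip(y_true, y_pred))


def cross_validate_mean_model(y, k=3):
    """k-fold CV of a baseline that always predicts the training mean."""
    folds = [y[i::k] for i in range(k)]  # interleaved folds
    scores = {"train": [], "test": []}
    for i, test_fold in enumerate(folds):
        train = [v for j, fold in enumerate(folds) if j != i for v in fold]
        prediction = mean(train)
        scores["train"].append(neg_mae(train, [prediction] * len(train)))
        scores["test"].append(neg_mae(test_fold, [prediction] * len(test_fold)))
    # Summarize each split by its mean and standard deviation across folds.
    return {name: (mean(vals), stdev(vals)) for name, vals in scores.items()}


y = [5.1, 6.3, 7.0, 4.8, 6.9, 5.5, 7.4, 6.1, 5.9]
summary = cross_validate_mean_model(y, k=3)
for name, (avg, sd) in summary.items():
    print(f"{name}: mean={avg:.3f} sd={sd:.3f}")
```

Under this convention, the MAE itself is the absolute value of the reported mean, and a large gap between the train and test rows suggests overfitting.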

3. Model Training and Analysis

With the best hyperparameters, train the final model and analyze its performance. For a detailed guide on interpreting the diagnostic plots, please refer to this resource by Przemyslaw Biecek and Tomasz Burzykowski (2020).

from chembl_miner import ModelAnalyzer
from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error

# If `refit` was used in `optimize_hyperparameters`, the model is already fitted.
# Otherwise, you can fit it with the best parameters:
# ml.fit(dataset, params=ml.params)

# Align test set features with the model's features
dataset.x_test = dataset.x_test[ml.fit_model.feature_names_in_]

# Make predictions on the test set
y_pred = ml.fit_model.predict(dataset.x_test)

# Calculate performance metrics
r2 = r2_score(y_true=dataset.y_test, y_pred=y_pred)
mae = mean_absolute_error(y_true=dataset.y_test, y_pred=y_pred)
rmse = root_mean_squared_error(y_true=dataset.y_test, y_pred=y_pred)

print(f"R^2 Score: {r2:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")

# Analyze the model's performance on the test set
explainer = ModelAnalyzer(dataset=dataset, fit_model=ml.fit_model, train_subset=False)
explainer.plot_actual_vs_predicted()
explainer.plot_residuals_vs_fitted()
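
As a reference for the three metrics computed above, the sketch below implements their standard definitions in plain Python on a small made-up example. The function name `regression_metrics` is hypothetical; in practice the scikit-learn functions shown earlier should be preferred.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R^2, MAE, RMSE) for paired observations and predictions."""
    n = len(y_true)
    # Residual = observed - predicted; these are the values shown in a
    # residuals-vs-fitted plot.
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n

    ss_res = sum(r ** 2 for r in residuals)             # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation

    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse


y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.0, 6.5, 8.0]
r2, mae, rmse = regression_metrics(y_true, y_pred)
print(f"R^2={r2:.2f}  MAE={mae:.2f}  RMSE={rmse:.2f}")
# prints: R^2=0.90  MAE=0.25  RMSE=0.35
```

RMSE penalizes large errors more heavily than MAE, so a model with RMSE much larger than MAE has a few large misses rather than uniformly moderate errors.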

Future Implementations 🚀

Expanded Descriptor Support: Support for additional molecular descriptors (e.g., RDKit fingerprints).
Advanced Hyperparameter Search: Implementation of additional methods like Grid Search and Random Search.
Enhanced CV Visualization: Improved visualization of cross-validation results.
Classification Models: Support for classification algorithms.
Model Explainability: Integration of SHAP for deeper model interpretation.
Similarity Filtering: Options for similarity-based filtering during data retrieval.
R Implementation: Potential for an R version of the package.

License 📄

This project is licensed under the MIT License - see the LICENSE file for details.

Issues and Contact 📬

Please report issues or send questions through the GitHub issues page or by e-mail: henrique.wuchryn@ufpr.br.

Release history

0.1.12: Mar 19, 2026
0.1.11: Nov 11, 2025
0.1.10: Oct 24, 2025
0.1.9: Oct 22, 2025
0.1.8: Oct 21, 2025
0.1.7: Oct 21, 2025
0.1.6.post0 (this version): Oct 20, 2025
0.1.6: Oct 20, 2025
0.1.5: Oct 14, 2025
0.1.4: Oct 13, 2025
0.1.3: Oct 1, 2025
0.1.2: Oct 1, 2025
0.1.1: Oct 1, 2025
0.1.post5: Oct 14, 2025
0.1.post0: Sep 29, 2025
0.1.0: Sep 29, 2025

Download files

Source Distribution

chembl_miner-0.1.6.post0.tar.gz (29.2 kB)
Upload date: Oct 20, 2025
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.2

Algorithm	Hash digest
SHA256	`a78235cd1840662a8e4f790fa0455346c679b256115c70e34d0620c843a32fb8`
MD5	`28e480f3ac0b29e21f93de693d5e6ccd`
BLAKE2b-256	`6723aaacc0a846cd9c2a1d47f0f336d0de844f2d67ae5a989446e41ea018ff0a`

Built Distribution

chembl_miner-0.1.6.post0-py3-none-any.whl (38.1 kB)
Upload date: Oct 20, 2025
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.2

Algorithm	Hash digest
SHA256	`5fd40879cea6d13803d1787f3619431b8611033c4f79e8cf7abbeb3ed3ace8e3`
MD5	`aaad7451ffada1a5f517be831be44379`
BLAKE2b-256	`cad730c4f6f596c6efbdbfd27241cda231492d69d800e5a2bb515e0876796310`
