
Library for effective molecular fingerprints calculation


scikit-fingerprints


scikit-fingerprints is a Python library for efficient computation of molecular fingerprints.

Description

Molecular fingerprints are crucial in various scientific fields, including drug discovery, materials science, and chemical analysis. However, existing Python libraries for computing molecular fingerprints often lack performance, user-friendliness, and support for modern programming standards. This project aims to address these shortcomings by creating an efficient and accessible Python library for molecular fingerprint computation.

See the documentation and API reference for details.

Main features:

  • scikit-learn compatible
  • feature-rich, with >30 fingerprints
  • parallelization
  • sparse matrix support
  • commercial-friendly MIT license

Supported platforms

Python versions: 3.10, 3.11, 3.12, 3.13
Operating systems: Linux, Windows, macOS

Python 3.9 was supported up to scikit-fingerprints 1.13.0.

Python 3.13 is officially supported, but underlying libraries may not be fully compatible yet.

Installation

You can install the library using pip:

pip install scikit-fingerprints

If you need bleeding-edge features and don't mind potentially unstable or undocumented functionalities, you can also install directly from GitHub:

pip install git+https://github.com/scikit-fingerprints/scikit-fingerprints.git

Quickstart

Most fingerprints are based on molecular graphs (topological, 2D-based), and you can use SMILES input directly:

from skfp.fingerprints import AtomPairFingerprint

smiles_list = ["O=S(=O)(O)CCS(=O)(=O)O", "O=C(O)c1ccccc1O"]

atom_pair_fingerprint = AtomPairFingerprint()

X = atom_pair_fingerprint.transform(smiles_list)
print(X)

Fingerprints that use conformers (conformational, 3D-based) require you to create molecules and compute their conformers first. Such fingerprints have the requires_conformers attribute set to True.

from skfp.preprocessing import ConformerGenerator, MolFromSmilesTransformer
from skfp.fingerprints import WHIMFingerprint

smiles_list = ["O=S(=O)(O)CCS(=O)(=O)O", "O=C(O)c1ccccc1O"]

mol_from_smiles = MolFromSmilesTransformer()
conf_gen = ConformerGenerator()
fp = WHIMFingerprint()
print(fp.requires_conformers)  # True

mols_list = mol_from_smiles.transform(smiles_list)
mols_list = conf_gen.transform(mols_list)

X = fp.transform(mols_list)
print(X)

You can also use scikit-learn functionality, such as pipelines and feature unions, to build complex workflows. Popular datasets, e.g. from the MoleculeNet benchmark, can be loaded directly.

from skfp.datasets.moleculenet import load_clintox
from skfp.metrics import multioutput_auroc_score, extract_pos_proba
from skfp.model_selection import scaffold_train_test_split
from skfp.fingerprints import ECFPFingerprint, MACCSFingerprint
from skfp.preprocessing import MolFromSmilesTransformer

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union


smiles, y = load_clintox()
smiles_train, smiles_test, y_train, y_test = scaffold_train_test_split(
    smiles, y, test_size=0.2
)

pipeline = make_pipeline(
    MolFromSmilesTransformer(),
    make_union(ECFPFingerprint(count=True), MACCSFingerprint()),
    RandomForestClassifier(random_state=0),
)
pipeline.fit(smiles_train, y_train)

y_pred_proba = pipeline.predict_proba(smiles_test)
y_pred_proba = extract_pos_proba(y_pred_proba)
auroc = multioutput_auroc_score(y_test, y_pred_proba)
print(f"AUROC: {auroc:.2%}")

Examples

You can find Jupyter notebooks with examples and tutorials in the documentation, as well as in the "examples" directory.

Examples and tutorials:

  1. Introduction to scikit-fingerprints
  2. Fingerprint types
  3. Molecular pipelines
  4. Conformers and conformational fingerprints
  5. Hyperparameter tuning
  6. Dataset splits
  7. Datasets and benchmarking
  8. Similarity and distance metrics
  9. Molecular filters

Project overview

scikit-fingerprints brings molecular fingerprints and related functionality into the scikit-learn ecosystem. With a familiar class-based design and the .transform() method, fingerprints can be computed from SMILES strings or RDKit Mol objects. The resulting NumPy arrays or SciPy sparse arrays can be used directly in ML pipelines.

Main features:

  1. Scikit-learn compatible: scikit-fingerprints uses the familiar scikit-learn interface and conforms to its API requirements. You can include molecular fingerprints in pipelines, concatenate them with feature unions, and process them with ML algorithms.

  2. Performance optimization: both speed and memory usage are optimized by using parallelism (with Joblib) and sparse CSR matrices (with SciPy). Heavy computation is typically delegated to RDKit's C++ code.

  3. Feature-rich: in addition to computing fingerprints, you can load popular benchmark datasets (e.g. from MoleculeNet), perform splitting (e.g. scaffold split), generate conformers, and tune hyperparameters with optimized cross-validation.

  4. Well-documented: each public function and class has extensive documentation, including relevant implementation details, caveats, and literature references.

  5. Extensibility: any functionality can be easily modified or extended by inheriting from existing classes.

  6. High code quality: pre-commit hooks scan each commit for code quality (e.g. black, flake8), typing (mypy), and security (e.g. bandit, pip-audit). The CI/CD process with GitHub Actions also includes over 250 unit and integration tests.

Citing

If you use scikit-fingerprints in your work, please cite our main publication, available on SoftwareX (open access):

@article{scikit_fingerprints,
   title = {Scikit-fingerprints: Easy and efficient computation of molecular fingerprints in Python},
   author = {Jakub Adamczyk and Piotr Ludynia},
   journal = {SoftwareX},
   volume = {28},
   pages = {101944},
   year = {2024},
   issn = {2352-7110},
   doi = {https://doi.org/10.1016/j.softx.2024.101944},
   url = {https://www.sciencedirect.com/science/article/pii/S2352711024003145},
   keywords = {Molecular fingerprints, Chemoinformatics, Molecular property prediction, Python, Machine learning, Scikit-learn},
}

Its preprint is also available on ArXiv.

Publications and usage

Publications using scikit-fingerprints:

  1. J. Adamczyk, W. Czech "Molecular Topological Profile (MOLTOP) - Simple and Strong Baseline for Molecular Graph Classification" ECAI 2024
  2. J. Adamczyk, P. Ludynia "Scikit-fingerprints: easy and efficient computation of molecular fingerprints in Python" SoftwareX
  3. J. Adamczyk, P. Ludynia, W. Czech "Molecular Fingerprints Are Strong Models for Peptide Function Prediction" ArXiv preprint
  4. J. Adamczyk "Towards Rational Pesticide Design with Graph Machine Learning Models for Ecotoxicology" CIKM 2025
  5. J. Adamczyk, J. Poziemski, F. Job, M. Król, M. Makowski "MolPILE - large-scale, diverse dataset for molecular representation learning" ArXiv preprint
  6. M. Fitzner et al. "BayBE: a Bayesian Back End for experimental planning in the low-to-no-data regime" RSC Digital Discovery
  7. J. Xiong et al. "Bridging 3D Molecular Structures and Artificial Intelligence by a Conformation Description Language"
  8. S. Mavlonazarova et al. "Untargeted Metabolomics Reveals Organ-Specific and Extraction-Dependent Metabolite Profiles in Endemic Tajik Species Ferula violacea Korovin" bioRxiv preprint

Contributing

Please read CONTRIBUTING.md and CODE_OF_CONDUCT.md for details on our code of conduct and the process for submitting pull requests.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.
