**LaplacianNB** is a Python module developed at **Novartis AG** that implements a Naive Bayes classifier for Laplacian-modified models, based on the scikit-learn Naive Bayes implementation.

LaplacianNB

Naive Bayes classifier for Laplacian-modified models
Efficient, scikit-learn compatible, and designed for binary/boolean data

This classifier is ideal for binary/boolean data, using only the indices of positive bits for efficient prediction. The algorithm was first implemented in Pipeline Pilot and KNIME.

The package includes both a modern sklearn-compatible implementation (recommended) and a legacy version for backward compatibility.

✨ Features

🔬 Core Algorithm

Laplacian-modified Naive Bayes with enhanced smoothing for sparse data
Optimized for binary/boolean features using bit index representation
Fast prediction leveraging only positive bit indices
Robust handling of unseen features and classes
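The core idea can be sketched in a few lines of plain Python. The sketch below follows the Laplacian-corrected estimator as described in the literature listed under Literature (Nidhi et al., 2006): each bit gets a weight log((A + 1) / (N * P + 1)), where N is the number of samples containing the bit, A is how many of those belong to the target class, and P is the overall fraction of target-class samples. The function names are hypothetical and the package's internals may differ; this is an illustration, not the library's implementation.

```python
import math
from collections import Counter


def laplacian_weights(samples, labels, target=1):
    """Return a per-bit Laplacian-corrected log weight for one target class.

    samples: iterable of sets holding the indices of the positive bits.
    labels:  class label of each sample.
    """
    n_total = len(samples)
    n_target = sum(1 for y in labels if y == target)
    base_rate = n_target / n_total  # P: overall fraction of target samples

    seen = Counter()  # N: samples that contain the bit
    hits = Counter()  # A: target-class samples that contain the bit
    for bits, y in zip(samples, labels):
        for b in bits:
            seen[b] += 1
            if y == target:
                hits[b] += 1

    return {b: math.log((hits[b] + 1) / (seen[b] * base_rate + 1)) for b in seen}


def score(bits, weights):
    """Sum the weights of the positive bits; unseen bits contribute 0."""
    return sum(weights.get(b, 0.0) for b in bits)


samples = [{1, 5}, {2, 6}, {1, 3}]
labels = [0, 1, 0]
weights = laplacian_weights(samples, labels, target=1)
```

Only the positive bits of a sample are visited when scoring, which is why prediction cost scales with the number of on-bits rather than with the size of the feature space.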

🚀 Performance & Scalability

Memory-efficient sparse matrix support for massive feature spaces (2^32 features)
Lossless RDKit fingerprint conversion with bit reinterpretation
Automatic sparsity detection and optimization
Parallel processing compatible with joblib
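The phrase "bit reinterpretation" is not defined further here. One plausible reading, and it is only an assumption, is that 32-bit feature identifiers which arrive as signed integers are mapped onto the unsigned range 0 to 2^32 - 1 so they can serve as column indices without collisions. A minimal sketch of such a lossless round trip, with hypothetical helper names:

```python
MASK_32 = 0xFFFFFFFF


def to_unsigned32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as unsigned (same bit pattern)."""
    return value & MASK_32


def to_signed32(value: int) -> int:
    """Inverse mapping: reinterpret an unsigned 32-bit integer as signed."""
    value &= MASK_32
    return value - (1 << 32) if value >= (1 << 31) else value


# Every signed identifier maps to exactly one column index and back again.
column = to_unsigned32(-1)
original = to_signed32(column)
```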

🔧 sklearn Integration

Full sklearn ecosystem compatibility (pipelines, cross-validation, grid search)
Drop-in replacement for other Naive Bayes classifiers
Consistent API with sklearn estimators
Custom transformers for molecular data preprocessing

🧪 Molecular Informatics

Direct RDKit integration for SMILES conversion
Morgan fingerprint support with configurable radius
Chemical space analysis capabilities
QSAR/SAR modeling optimized workflows

Installation

Stable Release

Install the latest stable release from PyPI:

pip install laplaciannb

Development Version

Get the latest features with development releases:

pip install --pre laplaciannb

From Source

For the latest development version with examples:

git clone https://github.com/rdkit/laplaciannb.git
cd laplaciannb
pip install -e ".[dev]"  # Includes development dependencies

Optional Dependencies

For molecular fingerprint functionality:

pip install rdkit  # For molecular fingerprint conversion

For full development environment:

pip install "laplaciannb[dev]"  # Includes testing, linting, and examples
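After installing, it can be useful to confirm which release ended up in the environment, since `--pre` may pull a development build. The following sketch uses only the standard library; the helper names are hypothetical, and the `.dev` check assumes the development releases follow the usual `X.Y.Z.devN` naming.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_dev_release(version: str) -> bool:
    """Rough check for a development release such as 0.8.0.dev202508211041."""
    return ".dev" in version


version = installed_version("laplaciannb")
if version is None:
    print("laplaciannb is not installed")
else:
    kind = "development" if is_dev_release(version) else "stable"
    print(f"laplaciannb {version} ({kind})")
```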

Quick Start

🚀 Try the Interactive Example

Run the comprehensive quickstart example to see all features in action:

cd examples
python quickstart_example.py

This script demonstrates:

RDKit molecular fingerprint conversion
Sparse matrix handling for memory efficiency
scikit-learn ecosystem integration
Performance comparisons with other classifiers
Memory efficiency demonstrations

Recommended Usage (Modern sklearn-compatible API)

For molecular data with RDKit:

from laplaciannb import LaplacianNB
from laplaciannb.fingerprint_utils import rdkit_to_csr

# Sample molecular data (SMILES strings)
smiles = [
    "CCO",                              # Ethanol
    "CC(=O)OC1=CC=CC=C1C(=O)O",        # Aspirin
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"    # Ibuprofen
]
y = [0, 1, 1]  # Activity labels

# Convert to sparse CSR matrix (memory efficient)
X = rdkit_to_csr(smiles, radius=2)
print(f"Matrix shape: {X.shape}")  # (3, 4294967296)
print(f"Sparsity: {1 - X.nnz / (X.shape[0] * X.shape[1]):.6f}")

# Train classifier
clf = LaplacianNB(alpha=1.0)
clf.fit(X, y)

# Make predictions
predictions = clf.predict(X)
probabilities = clf.predict_proba(X)
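scikit-learn's Naive Bayes estimators turn per-class log scores into probabilities by normalizing with log-sum-exp. Assuming LaplacianNB follows the same convention (it is described as based on that implementation, but this is not confirmed here), the step behind `predict_proba` looks roughly like this hypothetical helper:

```python
import math


def log_scores_to_proba(log_scores):
    """Normalize per-class log scores into probabilities that sum to 1.

    Subtracting the maximum first keeps exp() from underflowing when the
    scores are large negative numbers, as is typical for many features.
    """
    peak = max(log_scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in log_scores))
    return [math.exp(s - log_norm) for s in log_scores]


# Two classes whose raw log scores would underflow with a naive exp().
proba = log_scores_to_proba([-1000.0, -1001.0])
```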

For general binary/boolean data:

import numpy as np
from scipy.sparse import csr_matrix
from laplaciannb import LaplacianNB

# Create sparse binary matrix directly
row = [0, 0, 1, 1, 2, 2]
col = [1, 5, 2, 6, 1, 3]
data = [1, 1, 1, 1, 1, 1]
X = csr_matrix((data, (row, col)), shape=(3, 10), dtype=np.bool_)
y = [0, 1, 0]

# Train and predict
clf = LaplacianNB(alpha=1.0)
clf.fit(X, y)
predictions = clf.predict(X)
probabilities = clf.predict_proba(X)

sklearn Ecosystem Integration

Full Pipeline Example:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.base import BaseEstimator, TransformerMixin
from laplaciannb import LaplacianNB
from laplaciannb.fingerprint_utils import rdkit_to_csr

# Custom transformer for pipelines
class RDKitFingerprintTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, radius=2):
        self.radius = radius

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return rdkit_to_csr(X, radius=self.radius)

# Create pipeline
pipeline = Pipeline([
    ('fingerprints', RDKitFingerprintTransformer(radius=2)),
    ('classifier', LaplacianNB(alpha=1.0))
])

# Grid search (smiles_data is a list of SMILES strings, y the matching labels)
param_grid = {
    'classifier__alpha': [0.1, 1.0, 10.0],
    'fingerprints__radius': [1, 2, 3]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(smiles_data, y)  # Use SMILES directly in pipeline

# Cross-validation
cv_scores = cross_val_score(pipeline, smiles_data, y, cv=5)
print(f"CV Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# Direct sparse matrix usage (for pre-converted data)
X_sparse = rdkit_to_csr(smiles_data, radius=2)
clf = LaplacianNB(alpha=1.0)
scores = cross_val_score(clf, X_sparse, y, cv=5)
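The `step__parameter` keys in `param_grid` follow scikit-learn's convention of routing a parameter to the pipeline step named before the double underscore. The standard-library sketch below (helper names are hypothetical) expands the grid from the example above and shows how each key splits, which also makes the cost visible: 9 candidates times 5 folds is 45 fits, plus one refit on the full data with GridSearchCV's default `refit=True`.

```python
from itertools import product

param_grid = {
    "classifier__alpha": [0.1, 1.0, 10.0],
    "fingerprints__radius": [1, 2, 3],
}


def expand_grid(grid):
    """List every parameter combination in the grid."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def split_by_step(params):
    """Group 'step__name' keys by pipeline step."""
    routed = {}
    for key, value in params.items():
        step, _, name = key.partition("__")
        routed.setdefault(step, {})[name] = value
    return routed


candidates = expand_grid(param_grid)
n_fits = len(candidates) * 5  # cv=5, excluding the final refit
```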

🔥 Key Features & Advantages

Memory Efficiency

Sparse matrix support: Handle 2^32 feature spaces with minimal memory
Lossless fingerprint conversion: Convert RDKit fingerprints without data loss
Automatic sparsity detection: Works seamlessly with both sparse and dense data

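# Handle massive feature spaces efficiently
X = rdkit_to_csr(smiles_list, radius=2)  # Shape: (n_samples, 4294967296)
# Count the index arrays too, not just the stored values
nbytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"Memory usage: {nbytes / 1024**2:.1f} MB")  # Only a few MB!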
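To see why a 2^32-column matrix stays small without needing RDKit, a CSR matrix can be built directly from positive bit indices. This sketch uses made-up indices and assumes NumPy and SciPy are installed; because the column count exceeds the 32-bit range, the index arrays are stored as 64-bit integers.

```python
import numpy as np
from scipy.sparse import csr_matrix

N_BITS = 2**32

# Positive bit indices per sample (made-up values, sorted within each row)
rows = [
    [7, 1_000_000, 4_000_000_000],
    [7, 42],
    [3_999_999_999],
]

# CSR layout: all column indices concatenated, plus row boundaries
indices = np.array([b for bits in rows for b in bits], dtype=np.int64)
indptr = np.cumsum([0] + [len(bits) for bits in rows]).astype(np.int64)
data = np.ones(len(indices), dtype=np.bool_)

X = csr_matrix((data, indices, indptr), shape=(len(rows), N_BITS))

# Memory is driven by the number of stored bits, not by the column count
total_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"shape={X.shape}, nnz={X.nnz}, bytes={total_bytes}")
```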

Performance

Optimized for binary data: Fast prediction using only positive bit indices
sklearn compatible: Drop-in replacement for other Naive Bayes classifiers
Parallel processing: Supports joblib parallelization

Molecular Informatics

RDKit integration: Direct conversion from molecular structures
Flexible fingerprints: Support for Morgan, MACCS, and custom fingerprints
Chemical space analysis: Ideal for QSAR/SAR modeling

📚 Examples & Tutorials

Interactive Examples

Explore the comprehensive examples in the /examples directory:

quickstart_example.py: Complete demonstration with molecular data
basic_usage_tutorial.ipynb: Step-by-step Jupyter notebook
sklearn_integration_tutorial.ipynb: Advanced sklearn integration
bayes_tutorial.ipynb: Deep dive into Naive Bayes concepts

Run the Quickstart

# Clone the repository
git clone https://github.com/rdkit/laplaciannb.git
cd laplaciannb

# Install with examples
pip install -e ".[dev]"

# Run comprehensive example
python examples/quickstart_example.py

Example Outputs

The quickstart example prints output similar to the following:

BASIC LAPLACIANNB USAGE
Matrix shape: (10, 4294967296)
Matrix sparsity: 0.999998
Training completed in 0.002 seconds
Test Accuracy: 1.000

SPARSE MATRIX EFFICIENCY
Radius   Features        Sparsity   Train Time   Accuracy
1        4,294,967,296   0.999999   0.001        1.000
2        4,294,967,296   0.999998   0.002        1.000
3        4,294,967,296   0.999997   0.003        1.000

MEMORY EFFICIENCY
Sparse matrix memory: 0.12 MB
Dense equivalent would require 40,000+ MB!
✓ Designed specifically for extremely sparse binary features
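The "40,000+ MB" figure can be checked with simple arithmetic: 10 samples times 2^32 one-byte cells is 40,960 MiB. The sketch below reproduces that number and gives a rough CSR estimate for comparison. The 8-byte indices, 1-byte values, and the on-bit count of 500 are assumptions chosen for illustration, not measurements from the package.

```python
N_BITS = 2**32
MIB = 1024**2


def dense_mib(n_samples, n_bits=N_BITS, bytes_per_cell=1):
    """Memory of a dense boolean array with one byte per cell, in MiB."""
    return n_samples * n_bits * bytes_per_cell / MIB


def csr_mib(n_samples, nnz, index_bytes=8, value_bytes=1):
    """Approximate CSR memory: one index and one value per stored bit,
    plus a row-pointer array of length n_samples + 1."""
    indptr_bytes = (n_samples + 1) * index_bytes
    return (nnz * (index_bytes + value_bytes) + indptr_bytes) / MIB


dense = dense_mib(10)
sparse = csr_mib(10, nnz=500)  # 500 on-bits in total is a made-up figure
print(f"dense: {dense:,.0f} MiB, sparse: {sparse:.4f} MiB")
```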


Legacy Usage (Deprecated)

⚠️ DEPRECATION NOTICE: The legacy API is deprecated and will be removed in a future release. Please migrate to the modern sklearn-compatible API above.

# For backward compatibility only. The legacy API emits deprecation
# warnings, which are silenced here.
import warnings

import numpy as np

warnings.filterwarnings("ignore", category=FutureWarning)

from laplaciannb.legacy import LaplacianNB as LegacyLaplacianNB

# Legacy format (sets of bit indices)
X_sets = np.array([{1, 5, 10}, {2, 6, 11}, {1, 3, 7}], dtype=object)
y = [0, 1, 0]

clf = LegacyLaplacianNB(alpha=1.0)
clf.fit(X_sets, y)
predictions = clf.predict(X_sets)
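A global `warnings.filterwarnings("ignore", ...)` hides every FutureWarning in the process, including ones from unrelated libraries. A narrower option is to silence them only around the legacy calls with `warnings.catch_warnings()`. In this sketch, `legacy_call` is a hypothetical stand-in for any deprecated function, not part of the package.

```python
import warnings


def legacy_call():
    """Stand-in for a deprecated API that emits a FutureWarning."""
    warnings.warn("legacy API is deprecated", FutureWarning, stacklevel=2)
    return "ok"


def call_quietly(func):
    """Run func with FutureWarning suppressed, then restore the old filters."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        return func()


result = call_quietly(legacy_call)
```

Outside `call_quietly`, the previous warning filters apply again, so deprecations elsewhere in the code base remain visible.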

Migration Guide

Migrating from legacy to modern implementation is easy:

Update imports:

# Before (deprecated)
from laplaciannb.legacy import LaplacianNB

# After (recommended)
from laplaciannb import LaplacianNB
from laplaciannb.fingerprint_utils import convert_fingerprints

Convert input data:

# Convert fingerprint sets to sklearn format
X = convert_fingerprints(your_fingerprint_sets, n_bits=your_size)

Same API for basic usage:

clf = LaplacianNB(alpha=1.0)
clf.fit(X, y)
predictions = clf.predict(X)

📖 Detailed migration instructions: MIGRATION_GUIDE.md
📅 Deprecation timeline: DEPRECATION_TIMELINE.md
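Conceptually, converting the legacy input means flattening each set of bit indices into the two index arrays of the CSR layout. The sketch below shows that mapping in plain Python. `sets_to_csr_arrays` is a hypothetical helper written for illustration; it is not the package's `convert_fingerprints` and makes no claim about how that function works internally.

```python
def sets_to_csr_arrays(bit_sets):
    """Flatten sets of positive bit indices into CSR index arrays.

    Returns (indices, indptr): the column indices of row i are
    indices[indptr[i]:indptr[i + 1]].
    """
    indices = []
    indptr = [0]
    for bits in bit_sets:
        indices.extend(sorted(bits))  # keep column indices ordered per row
        indptr.append(len(indices))
    return indices, indptr


legacy_input = [{1, 5, 10}, {2, 6, 11}, {1, 3, 7}]
indices, indptr = sets_to_csr_arrays(legacy_input)

# With SciPy available, these arrays could be passed on as
# csr_matrix((ones, indices, indptr), shape=(n_samples, n_bits)).
```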

Basic Usage with LaplacianNB

import numpy as np
from laplaciannb import LaplacianNB

# Create sample data (sets of positive bit indices)
X = np.array([
    {1, 5, 10, 15},      # Sample 1: bits 1,5,10,15 are on
    {2, 6, 11, 16},      # Sample 2: bits 2,6,11,16 are on
    {1, 3, 7, 12},       # Sample 3: bits 1,3,7,12 are on
], dtype=object)
y = np.array([0, 1, 0])  # Class labels

# Train the classifier
clf = LaplacianNB()
clf.fit(X, y)

# Make predictions
predictions = clf.predict(X)
probabilities = clf.predict_proba(X)

RDKit Fingerprint Integration

from rdkit import Chem
from rdkit.Chem import AllChem
from laplaciannb import LaplacianNB, convert_fingerprints

# Generate molecular fingerprints
molecules = [Chem.MolFromSmiles(smi) for smi in ['CCO', 'CC', 'CCC']]
fingerprints = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in molecules]

# Convert to sklearn-compatible format
X = convert_fingerprints(fingerprints, output_format='csr')
y = [0, 1, 0]

# Train classifier
clf = LaplacianNB()
clf.fit(X, y)

Advanced Fingerprint Conversion

import numpy as np
from laplaciannb import RDKitFingerprintConverter

# `fingerprints` is the list of RDKit bit vectors from the previous example

# Create converter with custom settings
converter = RDKitFingerprintConverter(
    n_bits=2048,
    output_format='auto',  # Automatically choose sparse/dense
    dtype=np.float32
)

# Convert fingerprints
X_dense = converter.to_dense(fingerprints)
X_sparse = converter.to_csr(fingerprints)

# Get statistics
stats = converter.get_statistics(fingerprints)
print(f"Sparsity: {stats['sparsity']:.2%}")
print(f"Average on-bits: {stats['avg_on_bits']:.1f}")
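The two statistics printed above have simple definitions that can be reproduced by hand. This sketch computes them from sets of on-bit indices. `fingerprint_stats` is a hypothetical helper, and the dictionary keys merely mirror the ones used in the example; it does not describe how `get_statistics` is implemented.

```python
def fingerprint_stats(bit_sets, n_bits):
    """Compute sparsity and average on-bit count for a list of fingerprints.

    bit_sets: one set of positive bit indices per fingerprint.
    n_bits:   length of each fingerprint.
    """
    n_samples = len(bit_sets)
    total_on = sum(len(bits) for bits in bit_sets)
    return {
        "sparsity": 1 - total_on / (n_samples * n_bits),  # fraction of off-bits
        "avg_on_bits": total_on / n_samples,
    }


stats = fingerprint_stats([{1, 5, 10}, {2, 6}, {7}], n_bits=2048)
print(f"Sparsity: {stats['sparsity']:.2%}")
print(f"Average on-bits: {stats['avg_on_bits']:.1f}")
```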

Development

Contributing

We welcome contributions! Please see our development setup:

# Clone the repository
git clone https://github.com/rdkit/laplaciannb.git
cd laplaciannb

# Install in development mode with test dependencies
pip install -e ".[test]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest tests/

# Run quality checks
pre-commit run --all-files

CI/CD Pipeline

Code Quality: Ruff linting and formatting
Testing: Multi-Python version testing with coverage
Security: Bandit security scanning
Auto-publishing: Development versions on merge to develop
Dependency Management: Dependabot for automated updates

Project Structure

laplaciannb/
├── src/laplaciannb/           # Main package
│   ├── LaplacianNB_new.py     # Modern implementation
│   ├── fingerprint_utils.py   # Conversion utilities
│   └── legacy/                # Deprecated legacy API
├── tests/                     # Test suite
├── .github/                   # CI/CD workflows
└── docs/                      # Documentation

Literature

Nidhi; Glick, M.; Davies, J. W.; Jenkins, J. L. Prediction of biological targets
for compounds using multiple-category Bayesian models trained on chemogenomics
databases. J. Chem. Inf. Model. 2006, 46, 1124–1133.
https://doi.org/10.1021/ci060003g

Lam PY, Kutchukian P, Anand R, et al. Cyp1 inhibition prevents doxorubicin-induced cardiomyopathy
in a zebrafish heart-failure model. ChemBioChem. 2020:cbic.201900741.
https://doi.org/10.1002/cbic.201900741

Authors & Maintainers

Bartosz Baranowski (bartosz.baranowski@novartis.com)
Edgar Harutyunyan (edgar.harutyunyan_ext@novartis.com)

Changelog

v0.7.0 (Latest)

sklearn integration: accepts standard sklearn input, allowing full integration with the sklearn framework
Enhanced deprecation strategy with comprehensive migration support
Legacy input detection in new version with helpful error messages
Dependabot configuration for automated dependency updates

v0.6.1

Fixes for scikit-learn 1.7, rdkit 2025+ compatibility
Move to uv build system

v0.6.0

Move to pdm build system

v0.5.0

Initial public release

License

This project is licensed under the BSD 3-Clause License. See the LICENSE file for details.
