Generate collision-free Morgan fingerprints with hashed or unfolded mode.
bit_collision_free_MF
A Python package for generating molecular fingerprints without bit collisions.
Description
bit_collision_free_MF generates count-based Morgan fingerprints while eliminating bit collisions, which can significantly improve the accuracy and reliability of molecular fingerprints in cheminformatics applications.
Two modes are supported:
- Hashed mode (default): Uses RDKit's hashed Morgan fingerprint and automatically searches for the shortest fingerprint length that avoids all bit collisions, growing the length exponentially until no collision occurs and then narrowing it with a binary search. Collision detection compares Morgan invariants (unique substructure identifiers), so it catches all collision types, including collisions between different substructures at the same radius.
- Unfolded mode: Builds a direct mapping from Morgan invariant IDs to column indices. Each unique substructure gets its own column, guaranteeing zero collisions with the minimal possible number of columns. This mode is especially useful for large datasets where the hashed mode would produce very long fingerprints, and it provides full interpretability via get_invariant_mapping().
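The hashed-mode length search can be sketched in plain Python. This is a minimal illustration, not the package's implementation: the names has_collision and find_collision_free_length are hypothetical, and it assumes a hashed fingerprint folds an invariant ID onto bit `id % n_bits`. Because collision-freeness under modulo folding is not strictly monotonic in the length, the binary search is guaranteed to return a collision-free length but not necessarily the smallest one that exists.

```python
def has_collision(invariant_ids, n_bits):
    """Return True if two distinct invariant IDs fold onto the same bit.

    Assumption: folding is modeled as `invariant_id % n_bits`.
    """
    seen = set()
    for inv in invariant_ids:
        bit = inv % n_bits
        if bit in seen:
            return True
        seen.add(bit)
    return False


def find_collision_free_length(invariant_ids, start=16):
    """Search for a short collision-free length (hypothetical helper)."""
    ids = set(invariant_ids)

    # Phase 1: exponential growth until a collision-free length is found.
    hi = max(start, 1)
    while has_collision(ids, hi):
        hi *= 2

    # Phase 2: binary search. `hi` is always a length verified to be
    # collision-free; `lo` starts at the pigeonhole lower bound.
    lo = max(len(ids), 1)
    while lo < hi:
        mid = (lo + hi) // 2
        if has_collision(ids, mid):
            lo = mid + 1
        else:
            hi = mid
    return hi


# Made-up invariant IDs that all collide at length 16 (each is 3 mod 16).
example_ids = [3, 19, 35]
length = find_collision_free_length(example_ids)
print(length, has_collision(example_ids, length))
```

The returned length is only ever set to a value that passed the collision check, so the result is safe to use even where the monotonicity assumption fails.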
Installation
Requirements
- Python 3.9 or higher
- numpy
- pandas
- rdkit
Simple Installation
pip install -U bit_collision_free_MF
This will automatically install all dependencies, including RDKit.
Manual Installation
# Install dependencies
pip install numpy pandas rdkit
# Install the package
pip install -U bit_collision_free_MF
For development installation:
git clone https://github.com/Shifa-Zhong/bit_collision_free_MF.git
cd bit_collision_free_MF
pip install -e .
Features
- Two modes: "hashed" (optimized-length hashed fingerprint) and "unfolded" (direct invariant-to-column mapping)
- Invariant-based collision detection: compares actual substructure identifiers, not just radii, to guarantee truly collision-free fingerprints
- Count-based output: generates count fingerprints (not binary), preserving substructure frequency information
- Unfolded mode benefits: minimal column count (= number of unique substructures), lower memory usage for large datasets, and full interpretability via get_invariant_mapping()
- Supports all radius values including radius=0
- Consistent zero-column removal: columns identified during fit() are reused in transform(), ensuring train/test dimensionality alignment
- Feature names (fp_0, fp_1, ...) are 0-indexed to match bit positions in RDKit's bitInfo, enabling correct substructure interpretation
- Easy CSV export with customizable headers
- Seamless integration with pandas and NumPy
Usage
Basic Usage (Hashed Mode — default)
from bit_collision_free_MF import generate_fingerprints, save_fingerprints
import pandas as pd
# Load your data
data = pd.read_csv('your_molecules.csv')
# Generate fingerprints (hashed mode, default)
fingerprints, fp_generator = generate_fingerprints(
    data,
    smiles_column='smiles',
    radius=1,
    remove_zero_columns=True
)
# Save fingerprints to CSV
save_fingerprints(
    fingerprints,
    fp_generator,
    output_path='path/to/output.csv',
    include_header=True
)
Unfolded Mode
Unfolded mode maps each unique substructure to its own column, producing the smallest possible collision-free fingerprint. This is recommended for large datasets or when interpretability is important.
from bit_collision_free_MF import generate_fingerprints
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem, Draw
# Load your data
data = pd.read_csv('your_molecules.csv')
# Generate fingerprints in unfolded mode
fingerprints, fp_generator = generate_fingerprints(
    data,
    smiles_column='smiles',
    radius=2,
    mode="unfolded",
    remove_zero_columns=True
)
# Interpretability: inspect what substructure each column represents
mapping = fp_generator.get_invariant_mapping()  # {col_index: invariant_id}
# Visualize a specific substructure with RDKit
inv_id = mapping[42]  # invariant ID for column 42
mol = Chem.MolFromSmiles("c1ccccc1O")
bi = {}
# Unhashed Morgan fingerprint; bi is filled with the atom environments per invariant ID
fp = AllChem.GetMorganFingerprint(mol, radius=2, bitInfo=bi)
if inv_id in bi:  # draw only if this molecule contains the substructure
    img = Draw.DrawMorganBit(mol, inv_id, bi)
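To make the invariant-to-column idea concrete, here is a small self-contained sketch. It is not the package's code: fit_invariant_mapping and transform_counts are hypothetical helpers, each molecule is represented as a plain {invariant_id: count} dictionary, the invariant IDs are made up, and sorting the IDs to fix the column order and dropping IDs unseen during fitting are both assumptions.

```python
def fit_invariant_mapping(molecule_counts):
    """Assign one column per unique invariant ID seen in the fitting set.

    Returns {col_index: invariant_id}; sorted order is an assumption.
    """
    unique_ids = sorted({inv for counts in molecule_counts for inv in counts})
    return {col: inv for col, inv in enumerate(unique_ids)}


def transform_counts(molecule_counts, mapping):
    """Build count rows; invariant IDs absent from the mapping are ignored."""
    inv_to_col = {inv: col for col, inv in mapping.items()}
    rows = []
    for counts in molecule_counts:
        row = [0] * len(mapping)
        for inv, n in counts.items():
            col = inv_to_col.get(inv)
            if col is not None:
                row[col] = n
        rows.append(row)
    return rows


# Made-up invariant IDs and counts for two "molecules".
train = [
    {864942730: 2, 2246728737: 1},
    {2246728737: 3, 3217380708: 1},
]
mapping = fit_invariant_mapping(train)
X_train = transform_counts(train, mapping)

# A new molecule with one known and one unseen substructure.
X_new = transform_counts([{864942730: 1, 999: 1}], mapping)
print(mapping)
print(X_train, X_new)
```

Since every column is created from an ID that actually occurs, the number of columns equals the number of unique substructures and no two substructures can share a column.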
Using the CollisionFreeMorganFP Class Directly
from bit_collision_free_MF import CollisionFreeMorganFP
import pandas as pd
# Load your data
data = pd.read_csv('your_molecules.csv')
smiles_list = data['smiles'].tolist()
# Create and fit the fingerprint generator
fp_generator = CollisionFreeMorganFP(radius=1) # hashed mode (default)
# fp_generator = CollisionFreeMorganFP(radius=1, mode="unfolded") # or unfolded mode
fp_generator.fit(smiles_list, remove_zero_columns=True)
# Generate fingerprints
fingerprints = fp_generator.transform(smiles_list)
# Get feature names (fp_0, fp_1, ... aligned with bit indices)
feature_names = fp_generator.get_feature_names()
# Create a DataFrame with the fingerprints
result_df = pd.DataFrame(fingerprints, columns=feature_names)
result_df.to_csv('fingerprints.csv', index=False)
Train/Test Split with Consistent Dimensions
from bit_collision_free_MF import CollisionFreeMorganFP
# train_smiles and test_smiles are lists of SMILES strings prepared beforehand
# fit() on training set records which columns are all-zero
fp_gen = CollisionFreeMorganFP(radius=1, mode="unfolded")
fp_gen.fit(train_smiles, remove_zero_columns=True)
# transform() reuses the same mapping and zero-column mask for both sets
X_train = fp_gen.transform(train_smiles)
X_test = fp_gen.transform(test_smiles)
# X_train.shape[1] == X_test.shape[1] guaranteed
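The dimensionality guarantee comes from deciding which columns to keep once, during fitting, and reusing that decision afterwards. The sketch below illustrates the pattern on plain lists; ZeroColumnFilter is a hypothetical class, not part of the package, and naming kept columns by their original index (fp_<index>) is an assumption about how the 0-indexed feature names line up with bit positions.

```python
class ZeroColumnFilter:
    """Record the non-zero columns at fit time and reuse them in transform."""

    def fit(self, rows):
        n_cols = len(rows[0]) if rows else 0
        # Keep a column only if at least one training row is non-zero there.
        self.keep_ = [
            j for j in range(n_cols) if any(row[j] != 0 for row in rows)
        ]
        return self

    def transform(self, rows):
        # Apply the stored mask unchanged, whatever the new data contains.
        return [[row[j] for j in self.keep_] for row in rows]

    def feature_names(self):
        # Assumption: names keep the original 0-based column index.
        return [f"fp_{j}" for j in self.keep_]


train_rows = [[0, 2, 0, 1], [0, 0, 0, 3]]
test_rows = [[5, 0, 0, 0]]

zc = ZeroColumnFilter().fit(train_rows)
X_train_kept = zc.transform(train_rows)
X_test_kept = zc.transform(test_rows)
print(zc.feature_names(), X_train_kept, X_test_kept)
```

Note that the test row's only non-zero value sits in a column the training set never used, so it is dropped; the mask is never recomputed on test data, which is what keeps both matrices the same width.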
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
For academic inquiries or collaboration, please contact:
- Shifa Zhong (sfzhong@tongji.edu.cn)
- Jibai Li (51263903065@stu.ecnu.edu.cn)
Download files
File details
Details for the file bit_collision_free_mf-1.0.0.tar.gz.
File metadata
- Download URL: bit_collision_free_mf-1.0.0.tar.gz
- Upload date:
- Size: 12.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 211963ff10cfab84616e7b2a195c7627ebe8f28163795e6cb7309278cad1bb16 |
| MD5 | b45ce61d0dc6e99a3c88a4cd0dbfabb1 |
| BLAKE2b-256 | 037a30ca83b59313976e27baea598d655b9d6d66d2c82d5f5b5d464f4d0add8a |
File details
Details for the file bit_collision_free_mf-1.0.0-py3-none-any.whl.
File metadata
- Download URL: bit_collision_free_mf-1.0.0-py3-none-any.whl
- Upload date:
- Size: 11.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7c1e1e689f533464dd2bc18321275fcff2a5d0a4657a22ea407ae2e532c4d8f5 |
| MD5 | 4ffdae1d5c0b82deb42185bf8dc33ad3 |
| BLAKE2b-256 | e87ce861396a2f119bd58b3e42cd5f755b23ff725bc7fdad9ea2a3981be883e0 |