A lightweight implementation of a multinomial Naive Bayes classifier for Annotation Transfer of single-cell data.
CAP-Naive-Bayes
A lightweight, extensible implementation of a multinomial Naive Bayes classifier in pure Python. It is designed for Annotation Transfer of single-cell data, allowing you to fit and predict on large datasets efficiently using out-of-core chunked processing.
Main Features:
- Out-of-core chunked processing: Efficiently handle large datasets without loading everything into memory.
- Support for missing features: Can handle datasets where some features are missing during prediction.
- Flexible data formats: Supports dense NumPy arrays, SciPy sparse matrices, AnnData/HDF5-backed data, and Zarr arrays.
Installation
pip install -U cap-naive-bayes
Usage
Basic Usage
>> import numpy as np
>> import pandas as pd
>> from cap_naive_bayes import NaiveBayesModel
>> count_matrix = np.array([
       [2, 1, 0, 0],
       [2, 0, 0, 0],
       [1, 0, 0, 0],
       [1, 0, 1, 1],
   ])
>> obs = pd.DataFrame({
       'cell_type': ['a', 'a', 'a', 'b'],
   })
>> features = pd.Index(['g1', 'g2', 'g3', 'g4'])
>> model = NaiveBayesModel()
>> model.fit(
       X=count_matrix,
       obs=obs,
       features=features,
   )
>> model  # contains the log class priors and the per-feature log-likelihoods
                        g1        g2        g3        g4     prior
labelset  label
cell_type a      -0.510826 -1.609438 -2.302585 -2.302585 -0.287682
          b      -1.252763 -1.945910 -1.252763 -1.252763 -1.386294
>> pred = model.predict(
X=count_matrix,
labelset="cell_type",
features=features,
)
>> pred
  cell_type  cell_type_conf
0         a        0.948776
1         a        0.929726
2         a        0.863014
3         b        0.564414
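The values printed above are consistent with a textbook multinomial Naive Bayes using add-one (Laplace) smoothing. The NumPy sketch below recomputes them by hand; it is not the package's own code, and the smoothing constant alpha = 1 and all variable names are assumptions made for illustration.

```python
import numpy as np

X = np.array([
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 1],
])
labels = np.array(["a", "a", "a", "b"])
alpha = 1.0  # assumed smoothing constant (add-one / Laplace)

# Encode labels as integer codes and build a one-hot membership matrix.
classes, y = np.unique(labels, return_inverse=True)
onehot = np.eye(len(classes))[y]              # (n_cells, n_classes)

# Sufficient statistics: per-class feature counts and per-class cell counts.
feature_counts = onehot.T @ X                 # (n_classes, n_features)
class_counts = onehot.sum(axis=0)             # (n_classes,)

# Log class priors and smoothed per-feature log-likelihoods.
log_prior = np.log(class_counts / class_counts.sum())
smoothed = feature_counts + alpha
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# Joint log-probability per cell and class, then a numerically stable softmax.
joint = X @ log_likelihood.T + log_prior      # (n_cells, n_classes)
joint -= joint.max(axis=1, keepdims=True)
posterior = np.exp(joint)
posterior /= posterior.sum(axis=1, keepdims=True)

predicted = classes[posterior.argmax(axis=1)]
confidence = posterior.max(axis=1)
```

Here log_likelihood and log_prior correspond to the g1..g4 and prior columns of the model table, and confidence corresponds to the cell_type_conf column, i.e. the posterior probability of the predicted label.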
Chunked Processing
For very large X (e.g. Dask, Zarr, or HDF5-backed arrays), pass an explicit chunk size, or pass chunk=None to let the model infer it from X.chunks:
# inference of chunk size from .chunks attribute
model.fit(large_zarr_array, obs_df, feature_names, chunk=None)
# explicit chunking
model.predict(X_test, chunk=500)
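Chunking works for this kind of model because fitting only needs per-class feature counts, and those can be accumulated one block of rows at a time. The sketch below illustrates the idea on a dense in-memory array; it is not the package's implementation, and the helper names iter_row_chunks and chunked_feature_counts are hypothetical.

```python
import numpy as np

def iter_row_chunks(X, chunk):
    """Yield (start, block) pairs; only one block of rows is materialised at a time."""
    for start in range(0, X.shape[0], chunk):
        # Slicing is assumed to return something np.asarray can convert,
        # as is the case for NumPy arrays and dense on-disk array types.
        yield start, np.asarray(X[start:start + chunk])

def chunked_feature_counts(X, y, n_classes, chunk=500):
    """Accumulate per-class feature counts block by block."""
    counts = np.zeros((n_classes, X.shape[1]))
    for start, block in iter_row_chunks(X, chunk):
        codes = y[start:start + block.shape[0]]
        # Unbuffered add: rows that share a class code are summed correctly.
        np.add.at(counts, codes, block)
    return counts

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(1200, 6))     # toy count matrix
y = rng.integers(0, 3, size=1200)        # toy integer class codes

full = chunked_feature_counts(X, y, n_classes=3, chunk=len(X))
chunked = chunked_feature_counts(X, y, n_classes=3, chunk=500)
```

Both calls produce the same count matrix, so in this sketch the chunk size only affects peak memory, not the fitted statistics.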
Feature space alignment
When the feature space of X does not match the model's feature space, you can specify the features to use during prediction:
fs_train = pd.Index(['f1', 'f2', 'f3', 'f4', 'f5'])
X_train = ... # matrix with 5 columns
model.fit(X_train, obs=..., features=fs_train)
fs_test = pd.Index(['f1','f4','f5', 'f6'])
X_test = ... # matrix with 4 columns
pred = model.predict(X_test, features=fs_test)  # valid: the model uses only the shared features 'f1', 'f4', 'f5' from both its own parameters and X_test
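One way to picture this alignment is as an index intersection followed by column selection on both sides. The pandas sketch below shows that idea only; how the package aligns features internally is not documented here, and the toy X_test values are made up.

```python
import numpy as np
import pandas as pd

fs_train = pd.Index(['f1', 'f2', 'f3', 'f4', 'f5'])
fs_test = pd.Index(['f1', 'f4', 'f5', 'f6'])

# Features present in both the model and the test matrix.
shared = fs_train.intersection(fs_test)

# Column positions of the shared features on each side.
train_cols = fs_train.get_indexer(shared)   # columns of the model's parameters
test_cols = fs_test.get_indexer(shared)     # columns of X_test

X_test = np.arange(8).reshape(2, 4)         # toy stand-in for a 4-column matrix
X_aligned = X_test[:, test_cols]            # drops 'f6', which the model never saw
```

In this sketch, 'f2' and 'f3' are simply left out of the model side and 'f6' is dropped from the test side, so both operands end up with the same three columns in the same order.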
Multiple labelsets
You can fit the model and make predictions on multiple labelsets by passing multiple columns in the obs DataFrame:
obs = pd.DataFrame({
'cell_type': ['a', 'a', 'a', 'b'],
'treatment': ['control', 'control', 'treatment', 'treatment']
})
model.fit(X_train, obs=obs, features=fs_train)
pred = model.predict(X_test, features=fs_test)
License & Acknowledgments
This project is released under the BSD 3-Clause License.
It also incorporates code derived from scikit-learn, which is licensed under the BSD 3‑Clause “New” or “Revised” License.
- scikit-learn
Copyright (C) 2007–2024 The scikit-learn developers
BSD 3‑Clause License
File details
Details for the file cap_naive_bayes-0.1.3.tar.gz.
File metadata
- Download URL: cap_naive_bayes-0.1.3.tar.gz
- Upload date:
- Size: 41.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4d1145517ca38327684879a2a9e9a3e0d4c517d75f6e6764ed8615238bb7c58b |
| MD5 | 11b021d7eaa73bb7255302d4ed035e5a |
| BLAKE2b-256 | 3256ea69dd557b4882d591fa9c3fea077e181fe22fb6e0be0d69f7ccbd17512e |
File details
Details for the file cap_naive_bayes-0.1.3-py3-none-any.whl.
File metadata
- Download URL: cap_naive_bayes-0.1.3-py3-none-any.whl
- Upload date:
- Size: 6.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a984d15107af323def10b93afcddc1c88f9f1765c0d5c74572ac8c035ac5a55c |
| MD5 | 2acfe2a894d2b3bc1e7e45fcc03c2a5c |
| BLAKE2b-256 | 693be552817be12701df67c1a8b41f3000c36aa2d36726004edf5d5350f22cd2 |