Count splitting for random sampled count matrices
README
What is this repository for?
This is a Python implementation of the count splitting approach from Anna Neufeld et al.'s paper. It addresses the "double dipping" problem that arises in DEG analysis when the same counts are used both to define the clusters and to test for differences between them. With count splitting, clusters are defined on a training split and the DEG analysis is run on a test split. Make sure to cite them too, even if you're using my Python package!
Check out their paper here: https://arxiv.org/abs/2207.00554
And their R implementation here: https://anna-neufeld.github.io/countsplit/
How do I get set up?
python3 -m pip install count_split
You can also install using the setup.py script in the distribution like so:
python3 setup.py install
How do I use this package?
This package assumes that the input matrix is organized with samples in columns and variables in rows. For single-cell experiments, this means cells in columns and genes in rows. Make sure that this is the case, or transpose the matrix when calling the pertinent function.
To keep memory use low, the split is done piecemeal, breaking the columns into bins. If you have memory issues, try decreasing bin_size to something lower (default: bin_size=5000).
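To make the orientation and binning concrete, here is a minimal sketch of the general idea using binomial thinning in NumPy. It is not this package's internal code: the helper name `thin_counts_in_bins` and its parameters are hypothetical, and the package may sample counts differently. It only illustrates a genes x cells matrix being split one column bin at a time so that the two parts sum back to the original counts.

```python
import numpy as np

def thin_counts_in_bins(counts, frac=0.5, bin_size=5000, seed=0):
    """Hypothetical helper: split a genes x cells count matrix into two parts.

    Each count is thinned with a binomial draw (probability `frac`),
    and the columns are processed in bins of `bin_size` so that the
    random draws are generated one column bin at a time.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    part1 = np.zeros_like(counts)
    for start in range(0, counts.shape[1], bin_size):
        stop = min(start + bin_size, counts.shape[1])
        block = counts[:, start:stop]
        # Binomial thinning: each count keeps roughly `frac` of its reads.
        part1[:, start:stop] = rng.binomial(block, frac)
    # Whatever was not assigned to the first part goes to the second.
    part2 = counts - part1
    return part1, part2

rng = np.random.default_rng(1)
counts = rng.negative_binomial(0.1, 0.1, size=(100, 250))  # genes x cells
train, test = thin_counts_in_bins(counts, frac=0.5, bin_size=100)
```

A smaller `bin_size` in this sketch means fewer columns are drawn at once, which is the same trade-off the bin_size argument above is meant to control.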
** If you've got a dense or sparse matrix:
- Note that if you're using scanpy/anndata, the hdf5 file will often have an "X" object, which is typically a sparse matrix.
import numpy as np
from scipy.sparse import csc_matrix
from count_split.count_split import multi_split

# Simulated counts: 1000 variables (rows) by 5000 samples (columns)
in_mat = np.random.negative_binomial(.1, .1, size=(1000, 5000))
mat1, mat2 = multi_split(in_mat,
                         percent_vect=[0.5, 0.5],
                         bin_size=5000)

# It also works for sparse matrices:
mat1, mat2 = multi_split(csc_matrix(in_mat),
                         percent_vect=[0.5, 0.5],
                         bin_size=5000)
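For the scanpy/anndata case mentioned above, keep in mind that AnnData stores observations (cells) in rows and variables (genes) in columns, which is the opposite of the layout this package expects. The sketch below uses a small random sparse matrix as a stand-in for `adata.X` (the `adata` object is hypothetical and not loaded here) and shows the transpose to genes x cells in CSC format.

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

rng = np.random.default_rng(0)

# Stand-in for adata.X: 50 cells (rows) by 20 genes (columns), sparse CSR.
cells_by_genes = csr_matrix(rng.negative_binomial(0.1, 0.1, size=(50, 20)))

# Transpose so that genes are rows and cells are columns,
# then store as CSC, the sparse format used in the example above.
genes_by_cells = csc_matrix(cells_by_genes.T)

print(genes_by_cells.shape)
```

The transposed matrix could then be passed where `csc_matrix(in_mat)` appears in the example above.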
** If you've got an hdf5 file with a dense matrix stored under a specified key (default key is "infile"), you can split that too:
from count_split.count_split import split_mat_counts_h5
split_mat_counts_h5(in_mat_file, out_mat_file_1, out_mat_file_2, percent_1=0.5, bin_size=5000, key="infile")
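If you do not yet have such a file, the following sketch writes a dense count matrix to HDF5 under the key "infile" using h5py. It assumes the package reads a standard h5py dataset stored at that key, which is not confirmed here; the file name `counts.h5` and the temporary directory are hypothetical choices for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(0)
dense_counts = rng.negative_binomial(0.1, 0.1, size=(100, 50))  # genes x cells

# Hypothetical location for the input file.
tmp_dir = tempfile.mkdtemp()
in_mat_file = os.path.join(tmp_dir, "counts.h5")

# Store the dense matrix under the key "infile".
with h5py.File(in_mat_file, "w") as f:
    f.create_dataset("infile", data=dense_counts)

# Read it back to confirm the layout before splitting.
with h5py.File(in_mat_file, "r") as f:
    stored_shape = f["infile"].shape
    stored = f["infile"][:]

print(stored_shape)
```

The resulting `in_mat_file` path plays the role of the first argument in the call above.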
License
This package is available via the AGPLv3 license.
Who do I talk to?
- Repo owner/admin: scottyler89+bitbucket@gmail.com