Count splitting for random sampled count matrices

README

What is this repository for?

This is a Python implementation of the count splitting approach from the paper by Anna Neufeld et al. It addresses the "double dipping" problem in differentially expressed gene (DEG) analysis: clusters are defined using a training split, and the DEG analysis between those clusters is then run on a separate test split. Make sure to cite them too, even if you're using my Python package!

Check out their paper here: https://arxiv.org/abs/2207.00554

Their R implementation is here: https://anna-neufeld.github.io/countsplit/

How do I get set up?

python3 -m pip install count_split

You can also install using the setup.py script in the distribution like so: python3 setup.py install

How do I use this package?

This package assumes that the input matrix is organized with samples in columns and variables in rows. For single-cell experiments, this means cells in columns and genes in rows. Make sure that this is the case, or transpose the matrix when calling the pertinent function.

To keep memory use low, the split is done piecemeal, breaking the columns into bins. If you have memory issues, try decreasing bin_size (default: bin_size=5000).
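To make the binning idea concrete, here is a minimal NumPy sketch of splitting a genes x cells count matrix one column bin at a time. It uses binomial thinning, the Poisson count splitting construction described in the cited paper; it is not taken from this package's source, and the helper name `thin_counts_in_bins` and its arguments are hypothetical. The package's own internals may differ.

```python
import numpy as np


def thin_counts_in_bins(counts, percent_1=0.5, bin_size=5000, seed=0):
    """Split a genes x cells count matrix into two parts, one column bin at a time.

    Hypothetical helper for illustration only; not part of count_split.
    """
    counts = np.asarray(counts)
    rng = np.random.default_rng(seed)
    part_1 = np.empty_like(counts)
    n_cols = counts.shape[1]
    for start in range(0, n_cols, bin_size):
        stop = min(start + bin_size, n_cols)
        block = counts[:, start:stop]
        # Each count is thinned independently: part_1 ~ Binomial(count, percent_1).
        part_1[:, start:stop] = rng.binomial(block, percent_1)
    # Whatever was not assigned to part_1 goes to part_2.
    part_2 = counts - part_1
    return part_1, part_2


rng = np.random.default_rng(1)
in_mat = rng.poisson(2.0, size=(100, 250))  # 100 genes x 250 cells
mat1, mat2 = thin_counts_in_bins(in_mat, percent_1=0.5, bin_size=100)
```

By construction the two parts add back up to the input, which makes `(mat1 + mat2 == in_mat).all()` a cheap sanity check in this sketch.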

** If you've got a dense or sparse matrix:

  • Note that if you're using scanpy/anndata, the hdf5 file will often have an "X" object, which is typically a sparse matrix.
import numpy as np
from scipy.sparse import csc_matrix
from count_split.count_split import multi_split

in_mat = np.random.negative_binomial(.1, .1, size=(1000,5000))

mat1, mat2 = multi_split(in_mat, 
                percent_vect=[0.5, 0.5],
                bin_size = 5000)

## It also works for sparse matrices:
mat1, mat2 = multi_split(csc_matrix(in_mat), 
                percent_vect=[0.5, 0.5],
                bin_size = 5000)
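One thing to watch with the scanpy/anndata case mentioned above: anndata stores cells in rows and genes in columns, which is the transpose of what this package expects. The sketch below uses a randomly generated SciPy sparse matrix as a stand-in for an anndata "X" (no anndata object is actually loaded here) and flips it to genes x cells.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# Stand-in for an anndata-style "X": 300 cells (rows) x 50 genes (columns).
cells_by_genes = csr_matrix(rng.poisson(0.3, size=(300, 50)))

# Transpose to genes x cells; CSC keeps column (cell) slices cheap.
genes_by_cells = cells_by_genes.T.tocsc()

print(genes_by_cells.shape)  # (50, 300)
```

The resulting `genes_by_cells` matrix has the samples-in-columns layout described earlier, so it is in the orientation that `multi_split` assumes.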

** If you've got an hdf5 file with a dense matrix stored under a specified key (default key is "infile"), you can split that too:

from count_split.count_split import split_mat_counts_h5
split_mat_counts_h5(in_mat_file, out_mat_file_1, out_mat_file_2, percent_1=0.5, bin_size=5000, key="infile")
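If you don't yet have such a file, the following sketch shows one way to write a dense genes x cells matrix under the "infile" key with h5py. The file name `counts.h5` and the temporary directory are hypothetical, and the genes-in-rows layout is assumed from the orientation described earlier rather than confirmed for the hdf5 path.

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(0)
in_mat = rng.poisson(1.0, size=(200, 300))  # 200 genes x 300 cells

# Hypothetical location for the input file.
tmp_dir = tempfile.mkdtemp()
in_mat_file = os.path.join(tmp_dir, "counts.h5")

# Store the dense matrix under the key that split_mat_counts_h5 looks up by default.
with h5py.File(in_mat_file, "w") as f:
    f.create_dataset("infile", data=in_mat)

# Read the shape back to confirm the dataset was written as expected.
with h5py.File(in_mat_file, "r") as f:
    stored_shape = f["infile"].shape
```

The path in `in_mat_file` can then be passed as the first argument to `split_mat_counts_h5`, with `key="infile"`.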

License

This package is available via the AGPLv3 license.

Who do I talk to?
