Count splitting for random sampled count matrices

What is this repository for?

This is a Python implementation of the count splitting approach from Anna Neufeld's paper, which fixes the "double dipping" problem in DEG analysis: clusters are defined on a training split, and differential expression is then tested on an independent test split. If you use this Python package, please cite their paper as well!

Check out their paper here: https://arxiv.org/abs/2207.00554, and their R implementation here: https://anna-neufeld.github.io/countsplit/

How do I get set up?

python3 -m pip install count_split

You can also install using the setup.py script in the distribution like so: python3 setup.py install

How do I use this package?

This package assumes that the input matrix is organized with samples in columns and variables in rows. For single-cell experiments, this means cells in columns and genes in rows. Make sure that this is the case, or transpose the matrix before calling the pertinent function. To keep memory use low, the split is performed piecemeal, breaking the columns into bins. If you run into memory issues, try decreasing bin_size (default: bin_size=5000).
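If your matrix is oriented the other way around (samples in rows), a plain transpose fixes it. A minimal sketch with a hypothetical toy matrix:

```python
import numpy as np

# Hypothetical example: a cells-by-genes matrix (samples in rows),
# which is the opposite of what this package expects.
cells_by_genes = np.random.poisson(1.0, size=(100, 500))

# Transpose so that genes (variables) are rows and cells (samples) are columns.
genes_by_cells = cells_by_genes.T
```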

** If you've got a dense or sparse matrix:

  • Note that if you're using scanpy/anndata, the hdf5 file will often have an "X" object, that is typically a sparse matrix.
import numpy as np
from scipy.sparse import csc_matrix
from count_split.count_split import multi_split

in_mat = np.random.negative_binomial(.1, .1, size=(1000,5000))

mat1, mat2 = multi_split(in_mat, 
                percent_vect=[0.5, 0.5],
                bin_size = 5000)

## It also works for sparse matrices:
mat1, mat2 = multi_split(csc_matrix(in_mat), 
                percent_vect=[0.5, 0.5],
                bin_size = 5000)
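To build intuition for what the split does, here is a sketch of the principle behind count splitting (this is an illustration using plain NumPy, not this package's internal code): each count is thinned binomially, so the two resulting matrices always sum back to the original.

```python
import numpy as np

# Illustration of the principle behind count splitting: each entry of the
# count matrix is split by a binomial draw, so mat1 + mat2 reconstructs the
# original matrix exactly.
rng = np.random.default_rng(0)
in_mat = rng.negative_binomial(5, 0.5, size=(10, 20))

mat1 = rng.binomial(in_mat, 0.5)  # "training" split (each count thinned at p=0.5)
mat2 = in_mat - mat1              # "test" split is the remainder

assert (mat1 + mat2 == in_mat).all()
```

The same property holds for the splits returned by multi_split, which is why the two halves can be treated as independent datasets for clustering and testing.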

** If you've got an hdf5 file with a dense matrix stored under a specified key (default key is "infile"), you can split that too:

from count_split.count_split import split_mat_counts_h5
split_mat_counts_h5(in_mat_file, out_mat_file_1, out_mat_file_2, percent_1=0.5, bin_size=5000, key="infile")
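If you don't already have such a file, one way to produce it is with h5py. A hedged sketch (the file name and matrix here are hypothetical; the key "infile" matches the function's default):

```python
import h5py
import numpy as np

# Hypothetical setup: write a dense count matrix into an hdf5 file under the
# key "infile", the default key expected by split_mat_counts_h5.
mat = np.random.poisson(1.0, size=(50, 100))
with h5py.File("counts.h5", "w") as f:
    f.create_dataset("infile", data=mat)
```

The resulting "counts.h5" can then be passed as in_mat_file in the call above.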

License

This package is available via the AGPLv3 license.

Who do I talk to?
