PyCytoData

An elegant data analysis tool for CyTOF.

This package is an all-in-one CyTOF data analysis package for your experiments. From loading datasets to DR and evaluation, you have a consistent interface and readable code every step of the way. There is also support for some of HDCytoData's benchmark datasets, as originally implemented in R by Weber & Soneson (2019). Why wait? Start your PyCytoData journey right here, right now!

Installation

You can install PyCytoData easily from pip:

pip install PyCytoData

or from conda:

conda install pycytodata -c kevin931 -c bioconda

If you wish to use CytofDR along with PyCytoData, you can optionally install it as well:

pip install CytofDR

For more information on optional dependencies or installation details, see our documentation.

Install and Load Benchmark Datasets

You can load the data easily with the following python snippet:

>>> from PyCytoData import DataLoader

>>> exprs = DataLoader.load_dataset(dataset = "levine13")
>>> exprs.expression_matrix # Expression matrix
>>> exprs.cell_types # Cell types
>>> exprs.sample_index # Sample index
>>> exprs.features # The feature/marker names

The resulting exprs is a PyCytoData object, which is easy to use. The expression matrix, cell types (if available), and sample index are directly accessible with attributes, and they are all stored as numpy.array. You can also access some metadata of the object with the following attributes:

>>> exprs.n_cells
>>> exprs.n_cell_types
>>> exprs.n_samples
>>> exprs.n_features

All of this metadata is set automatically, and safeguards are in place against unintended changes. You can also add a sample with the following:

>>> exprs.add_sample(expression_matrix, cell_types, sample_index) # All inputs should be ArrayLike
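Conceptually, adding a sample amounts to stacking the new arrays onto the existing ones. A minimal numpy sketch of that idea (the variable names here are illustrative, not the package's internals):

```python
import numpy as np

# Existing data: 3 cells x 2 markers, all from sample "A"
expression_matrix = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
cell_types = np.array(["T", "B", "T"])
sample_index = np.array(["A", "A", "A"])

# A new sample: 2 cells from sample "B"
new_exprs = np.array([[7.0, 8.0], [9.0, 10.0]])
new_types = np.array(["NK", "B"])
new_index = np.array(["B", "B"])

# Stacking mirrors what add_sample does conceptually:
# n_cells grows while n_features stays fixed.
expression_matrix = np.vstack([expression_matrix, new_exprs])
cell_types = np.concatenate([cell_types, new_types])
sample_index = np.concatenate([sample_index, new_index])

print(expression_matrix.shape)  # (5, 2)
```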

Note: The data are downloaded from a server instead of being shipped with this package. Each dataset only needs to be downloaded once, which is automatically managed. During the first-time download of the data, a command-line confirmation is needed.

Bring Your Own Dataset (BYOD)

Yes, you read that right! You can load your own datasets. Currently, we only support reading delimited plain-text files. The data need to have cells as rows and features as columns. To load them as a PyCytoData object, you can simply do the following:

>>> from PyCytoData import FileIO

>>> FileIO.load_delim(files="/path", # Path to file
...                   col_names=True, # Whether the first row is feature (column) names 
...                   delim="\t" # Delimiter
...                  ) 

If your experiment has multiple samples, you can simply import them together:

>>> from PyCytoData import FileIO

>>> expression_paths = ["path1", "path2", "path3"]
>>> FileIO.load_delim(files=expression_paths, # Path to file
...                   col_names=True, # Whether the first row is feature (column) names 
...                   delim="\t" # Delimiter
...                  ) 

In this case, the expression matrices are concatenated automatically without any normalization. To access a particular sample, use the sample_index attribute along with standard numpy indexing techniques.
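For example, with a numpy boolean mask (assuming, as above, that sample_index stores one label per cell):

```python
import numpy as np

# Toy data: 4 cells x 2 markers from two samples
expression_matrix = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
sample_index = np.array(["sample1", "sample1", "sample2", "sample2"])

mask = sample_index == "sample2"          # boolean mask selecting one sample
sample2_exprs = expression_matrix[mask]   # rows belonging to that sample
print(sample2_exprs.shape)  # (2, 2)
```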

Note: This technique does not automatically load cell types. In fact, it does not support mixed-datatype arrays, except for column names. You will need to read in cell types separately and set them using the cell_types attribute of the object.
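A minimal sketch of that workflow, reading a one-label-per-line cell-type file with plain Python (the file path and final assignment are illustrative):

```python
import numpy as np
import os
import tempfile

# Write a toy cell-type file: one label per cell, one per line
path = os.path.join(tempfile.gettempdir(), "cell_types.txt")
with open(path, "w") as f:
    f.write("CD4 T\nCD8 T\nB cell\n")

# Read the labels back as a 1-D string array
with open(path) as f:
    cell_types = np.array(f.read().splitlines())

print(cell_types)  # ['CD4 T' 'CD8 T' 'B cell']
# exprs.cell_types = cell_types  # then assign to your PyCytoData object
```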

Preprocessing

Currently, levine13, levine32, and samusik have all been mostly preprocessed. All you need to do is perform the arcsinh transformation. You can simply do this:

>>> from PyCytoData import DataLoader

>>> exprs = DataLoader.load_dataset(dataset = "levine13", preprocess=True)
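The arcsinh transform itself is simple: each count x is mapped to arcsinh(x / c), where the cofactor c = 5 is the conventional choice for CyTOF (whether PyCytoData uses exactly this cofactor should be checked against its documentation):

```python
import numpy as np

counts = np.array([0.0, 5.0, 100.0, 10000.0])
cofactor = 5.0  # conventional for CyTOF; an assumption here, not package-verified
transformed = np.arcsinh(counts / cofactor)

# The transform is monotone and compresses the heavy right tail
print(np.round(transformed, 3))
```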

When you perform BYOD, you can have much more flexibility:

>>> from PyCytoData import FileIO

>>> byod = FileIO.load_delim(files="/path", # Path to file
...                          col_names=True, # Whether the first row is feature (column) names 
...                          delim="\t" # Delimiter
...                         )
>>> byod.lineage_channels = ["CD4", "CD8", "FoxP3", "CD15"]
>>> byod.preprocess(arcsinh=True,
...                 gate_debris_removal=True,
...                 gate_intact_cells=True,
...                 gate_live_cells=True,
...                 gate_center_offset_residual=True,
...                 bead_normalization=True)

>>> byod.expression_matrix # This is preprocessed

As the example shows, we support several distinct preprocessing steps, and of course you can use any subset of them to suit your needs. By default, we automatically detect the necessary channels, such as "Bead1" or "Center". However, if your dataset uses unconventional channel names, our auto-detection may fail. In that case, you can perform a manual override:

>>> byod.preprocess(arcsinh=True,
...                 gate_debris_removal=True,
...                 gate_intact_cells=True,
...                 gate_live_cells=True,
...                 gate_center_offset_residual=True,
...                 bead_normalization=True,
...                 bead_channels = ["1bead", "2bead"],
...                 time_channel = ["clock"])

Dimension Reduction

If you wish to run DR on your dataset, you can easily do so as well if you have CytofDR installed (assuming you have loaded and preprocessed the dataset accordingly):

>>> exprs.run_dr_methods(methods = ["PCA", "UMAP", "ICA"])
Running PCA
Running ICA
Running UMAP
>>> type(exprs.reductions)
<class 'CytofDR.dr.Reductions'>

The reductions attribute is a Reductions object from CytofDR. You can perform all downstream DR workflows as usual.
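As a rough illustration of what a DR method does to an expression matrix, here is a minimal PCA via numpy's SVD. This is only a conceptual sketch, not CytofDR's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))       # toy data: 100 cells x 10 markers

Xc = X - X.mean(axis=0)              # center each marker
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
embedding = Xc @ Vt[:2].T            # project onto the top-2 principal components

print(embedding.shape)  # (100, 2)
```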

Datasets Supported

We only support the following datasets as of now. The Literal is the string used in this package to refer to each dataset, whereas the Dataset Name is what the dataset is more commonly known as.

| Dataset Name | Literal  |
|--------------|----------|
| Levine-13dim | levine13 |
| Levine-32dim | levine32 |
| Samusik      | samusik  |

More datasets will be added in the future to be fully compatible with HDCytoData and to potentially incorporate other databases.

Documentation

For detailed documentation along with tutorials and API Reference, please visit our Official Documentation. This is automatically updated with each update.

If you prefer to build documentation on your own, refer to this guide for more details.

Latest Release: 0.1.3

This is a minor release that fixes a critical bug that affects all previous releases. Update is strongly recommended. A few quality-of-life improvements are included as well.

Bug Fixes

  • Fixed a critical issue with subsetting channels not updating the internal indices for lineage channels.
  • Fixed a verbiage error for subsetting error messages. Now, it is explicitly stated that integer indexing is not supported.
  • Updated documentation to fix typos.

Changes and New Features

  • Updated CI pipeline to include newest Python releases.
  • Added our logo usage policy.
  • Clarified the python version needed to run PyCytoData.
  • No new software features were added.

References

If you used PyCytoData in your research, or with Cytomulate as part of the pipeline, please cite our paper:

Yang, Y., Wang, K., Lu, Z. et al. Cytomulate: accurate and efficient simulation of CyTOF data. Genome Biol 24, 262 (2023). https://doi.org/10.1186/s13059-023-03099-1

or use our BibTeX:

@article{Yang2023,
  author={Yang, Yuqiu and Wang, Kaiwen and Lu, Zeyu and Wang, Tao and Wang, Xinlei},
  title={Cytomulate: accurate and efficient simulation of CyTOF data},
  journal={Genome Biology},
  volume={24},
  number={262},
  year={2023},
  publisher={Springer}
}

If you use PyCytoData to perform DR, citing our DR review paper is highly appreciated:

Wang, K., Yang, Y., Wu, F. et al. Comparative analysis of dimension reduction methods for cytometry by time-of-flight data. Nat Commun 14, 1836 (2023). https://doi.org/10.1038/s41467-023-37478-w

or

@article{wang2023comparative,
  title={Comparative analysis of dimension reduction methods for cytometry by time-of-flight data},
  author={Wang, Kaiwen and Yang, Yuqiu and Wu, Fangjiang and Song, Bing and Wang, Xinlei and Wang, Tao},
  journal={Nature communications},
  volume={14},
  number={1},
  pages={1--18},
  year={2023},
  publisher={Nature Publishing Group UK London}
}

If you use the builtin datasets, please visit our Reference Page and cite the papers accordingly.
