Skip to main content

TileDB-based array storage for genomics data collections.

Project description

PyPI-Server Unit tests

Cell Arrays

Cell Arrays is a Python package that provides a TileDB-backed store for large collections of genomic experimental data, such as millions of cells across multiple single-cell experiment objects.

The CellArrDataset is designed to store single-cell RNA-seq datasets but can be generalized to store any 2-dimensional experimental data.

Install

To get started, install the package from PyPI

pip install cellarr

Usage

Create a CellArrDataset

Creating a CellArrDataset generates three TileDB files in the specified output directory:

  • gene_metadata: Contains feature annotations.
  • cell_metadata: Contains cell or sample metadata.
  • matrix: A TileDB-backed sparse array containing expression vectors.

The TileDB matrix file is stored in a cell X gene orientation. This orientation is chosen because the fastest-changing dimension as new files are added to the collection is usually the cells rather than genes.

Note: Currently only supports either paths to H5AD or AnnData objects

To build a CellArrDataset from a collection of H5AD or AnnData objects:

import anndata
import numpy as np
import tempfile
from cellarr import build_cellarrdataset, CellArrDataset

# Create a temporary directory
tempdir = tempfile.mkdtemp()

# Read AnnData objects
adata1 = anndata.read_h5ad("path/to/object1.h5ad")
# or just provide the path
adata2 = "path/to/object2.h5ad"

# Build CellArrDataset
dataset = build_cellarrdataset(
     output_path=tempdir,
     h5ad_or_adata=[adata1, adata2],
     matrix_dim_dtype=np.float32
)

TODO: This following section does not work yet.

Users have the option to reuse the dataset object retuned when building the dataset or by creating a CellArrDataset object by initializng it to the path where the files were created.

# Create a CellArrDataset object from the existing dataset
dataset = CellArrDataset(dataset_path=tempdir)

# Query data from the dataset
expression_data = dataset[10, ["gene1", "gene10", "gene500"]]

Note

This project has been set up using PyScaffold 4.5. For details and usage information on PyScaffold see https://pyscaffold.org/.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cellarr-0.0.2.tar.gz (99.1 kB view hashes)

Uploaded Source

Built Distribution

cellarr-0.0.2-py3-none-any.whl (13.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page