Genomic region based arrays backed by TileDB

Genomic Arrays based on TileDB

GenomicArrays is a Python package for converting genomic data from BigWig format to TileDB arrays.

Installation

Install the package from PyPI

pip install genomicarrays

Quick Start

Build a GenomicArray

Building a GenomicArray generates 3 TileDB files in the specified output directory:

  • feature_annotation: A TileDB file containing input feature intervals.
  • sample_metadata: A TileDB file containing sample metadata; each BigWig file is treated as one sample.
  • A matrix TileDB file named by the layer_matrix_name parameter. This allows the package to store multiple matrices, e.g. 'coverage', 'some_computed_statistic', for the same feature intervals and sample metadata.

The organization is inspired by the SummarizedExperiment data structure. The TileDB matrix file is stored in a features X samples orientation.
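
To make the features X samples orientation concrete, here is a minimal in-memory sketch using NumPy only. It does not use TileDB or the genomicarrays API; the intervals, sample names, and the `coverage` variable are hypothetical and only illustrate how rows and columns line up with the two annotation tables.

```python
import numpy as np

# Hypothetical inputs: three feature intervals and two BigWig samples.
feature_annotation = [("chr1", 1000, 1500), ("chr1", 2000, 2500), ("chr2", 500, 900)]
sample_metadata = ["sample_a.bw", "sample_b.bw"]

# One row per feature, one column per sample; NaN marks "no data".
coverage = np.full(
    (len(feature_annotation), len(sample_metadata)), np.nan, dtype=np.float32
)
coverage[0, 0] = 1.0
coverage[0, 1] = 0.5

row = coverage[0, :]     # all samples for the first feature
column = coverage[:, 1]  # all features for the second sample
print(coverage.shape)    # (3, 2)
```

Row i of the matrix corresponds to entry i of the feature annotation, and column j to entry j of the sample metadata.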

[Figure: GenomicArray structure]

To build a GenomicArray from a collection of BigWig files:

import os
import tempfile

import numpy as np
import pandas as pd
import genomicarrays as garr

# Create a temporary directory; this is where the
# output files are created. Pick your location here.
tempdir = tempfile.mkdtemp()

# List BigWig paths
bw_dir = "your/bigwig/dir"
files = os.listdir(bw_dir)
bw_files = [os.path.join(bw_dir, f) for f in files]

features = pd.DataFrame({
     "seqnames": ["chr1", "chr1"],
     "starts": [1000, 2000],
     "ends": [1500, 2500]
})

# Build GenomicArray
dataset = garr.build_genomicarray(
     files=bw_files,
     output_path=tempdir,
     features=features,
     # Specify a fasta file to extract sequences
     # for each region in features
     genome_fasta="path/to/genome.fasta",
     # aggregate function to summarize multiple values
     # from a BigWig file within an input feature interval.
     feature_annotation_options=garr.FeatureAnnotationOptions(
        aggregate_function = np.nanmean
     ),
     # for parallel processing multiple bigwig files
     num_threads=4
)
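
The listing step above picks up every entry in the directory, including files that are not BigWig files. A minimal sketch of a stricter listing follows. The helper `list_bigwig_files` and the suffix set are hypothetical and not part of genomicarrays; the set assumes BigWig files end in `.bw` or `.bigwig` (case-insensitive).

```python
from pathlib import Path

# Suffixes treated as BigWig here; an assumption for this sketch.
BIGWIG_SUFFIXES = {".bw", ".bigwig"}


def list_bigwig_files(directory):
    """Return sorted paths of regular files with a BigWig-like suffix."""
    return sorted(
        str(p)
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in BIGWIG_SUFFIXES
    )
```

Under these assumptions, `bw_files = list_bigwig_files(bw_dir)` could stand in for the two listing lines. Sorting keeps the sample order reproducible across runs.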

[!NOTE]

  • The aggregate function is expected to return either a scalar value or a 1-dimensional NumPy ndarray. If the latter, users need to specify the expected length of the returned array, e.g.
        feature_annotation_options=garr.FeatureAnnotationOptions(
            aggregate_function = my_custom_func,
            expected_agg_function_length = 10,
        ),

  • The build process stores missing intervals from a BigWig file as np.nan. Choose an aggregate function that handles np.nan (for example, np.nanmean as used above).
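
As an illustration of both points, the sketch below defines a NaN-aware aggregate function that returns a fixed-length 1-dimensional array. It assumes the function receives the per-base values of one interval as a 1-dimensional array, which the text above does not confirm. The name `binned_nanmean` and the constant `N_BINS` are hypothetical.

```python
import numpy as np

N_BINS = 10  # should match expected_agg_function_length


def binned_nanmean(values):
    """Summarize an interval as N_BINS bin means, ignoring NaN values."""
    values = np.asarray(values, dtype=float)
    out = np.full(N_BINS, np.nan)
    for i, chunk in enumerate(np.array_split(values, N_BINS)):
        valid = chunk[~np.isnan(chunk)]
        if valid.size:  # empty or all-NaN bins stay NaN
            out[i] = valid.mean()
    return out


signal = np.array([1.0, np.nan, 3.0, 5.0])
print(np.mean(signal))     # nan
print(np.nanmean(signal))  # 3.0
print(binned_nanmean(np.arange(20.0)).shape)  # (10,)
```

A plain `np.mean` propagates a single missing value to the whole interval, which is why a NaN-aware function is the safer choice here.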

Query a GenomicArrayDataset

Users can either reuse the dataset object returned when building the arrays or create a GenomicArrayDataset object from the path where the files were created.

# Create a GenomicArrayDataset object from the existing dataset
# (assumes the class is exported at the package top level)
dataset = garr.GenomicArrayDataset(dataset_path=tempdir)

# Query data for the first 10 regions across all samples
coverage_data = dataset[0:10, :]

print(coverage_data.matrix)
print(coverage_data.feature_annotation)
 ## output 1
 array([[1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , nan]], dtype=float32)

 ## output 2
 seqnames  starts  ends  genarr_feature_index
 0      chr1     300   315                     0
 1      chr1     320   335                     1
 2      chr1     340   355                     2
 3      chr1     360   375                     3
 4      chr1     380   395                     4
 5      chr1     400   415                     5
 6      chr1     420   435                     6
 7      chr1     440   455                     7
 8      chr1     460   475                     8
 9      chr1     480   495                     9
 10     chr1     500   515                    10
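
Because the query result holds missing values as nan, downstream summaries should also be NaN-aware. The sketch below rebuilds the matrix printed in output 1 by hand rather than querying a dataset, and assumes the matrix is a plain NumPy array, as the printed output suggests. The variable names are illustrative.

```python
import numpy as np

# Matrix copied from "output 1" above: rows are features, columns are samples.
matrix = np.array([[1.0, 0.5]] * 10 + [[1.0, np.nan]], dtype=np.float32)

# Mean per sample (column), ignoring missing intervals.
per_sample_mean = np.nanmean(matrix, axis=0)

# Number of missing intervals per sample.
missing_per_sample = np.isnan(matrix).sum(axis=0)

# Boolean mask of features that have a value in every sample.
complete_rows = ~np.isnan(matrix).any(axis=1)
```

With a real query result, the same calls would presumably apply to the matrix attribute of the returned object.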

Note

This project has been set up using PyScaffold 4.6. For details and usage information on PyScaffold see https://pyscaffold.org/.
