Genomic region based arrays backed by TileDB

Genomic Arrays based on TileDB

GenomicArrays is a Python package for converting genomic data from BigWig format to TileDB arrays.

Installation

Install the package from PyPI

pip install genomicarrays

Quick Start

Build a GenomicArray

Building a GenomicArray generates 3 TileDB files in the specified output directory:

  • feature_annotation: A TileDB file containing input feature intervals.
  • sample_metadata: A TileDB file containing sample metadata. Each BigWig file is treated as one sample.
  • A matrix TileDB file named by the layer_matrix_name parameter. This lets the package store multiple matrices, e.g. 'coverage' and 'some_computed_statistic', that share the same feature intervals and sample metadata.

The organization is inspired by the SummarizedExperiment data structure. The TileDB matrix file is stored in a features X samples orientation.
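
As a rough illustration of the features X samples orientation, the sketch below builds an in-memory stand-in with NumPy. It does not read or write TileDB, and the names feature_annotation, sample_metadata, and coverage are hypothetical placeholders for the three components described above.

```python
import numpy as np

# Hypothetical stand-ins for the three components; nothing here touches TileDB.
feature_annotation = [
    {"seqnames": "chr1", "starts": 1000, "ends": 1500},
    {"seqnames": "chr1", "starts": 2000, "ends": 2500},
    {"seqnames": "chr2", "starts": 300, "ends": 900},
]
sample_metadata = [{"sample": "a.bw"}, {"sample": "b.bw"}]

# One row per feature, one column per sample (features X samples).
# Cells start as NaN, meaning "no value recorded yet".
coverage = np.full(
    (len(feature_annotation), len(sample_metadata)), np.nan, dtype=np.float32
)
coverage[0, 0] = 1.0
coverage[0, 1] = 0.5

# Row i pairs with feature_annotation[i]; column j pairs with sample_metadata[j].
row = coverage[1, :]  # all samples for the second feature
col = coverage[:, 0]  # all features for the first sample
```

Because rows and columns are positional, the feature and sample tables must keep the same order as the matrix axes.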

[Figure: GenomicArray structure]

To build a GenomicArray from a collection of BigWig files:

import os
import tempfile

import numpy as np
import pandas as pd

import genomicarrays as garr

# Create a temporary directory where the output files
# are written. Replace this with your own location.
tempdir = tempfile.mkdtemp()

# List BigWig paths
bw_dir = "your/bigwig/dir"
files = os.listdir(bw_dir)
bw_files = [os.path.join(bw_dir, f) for f in files]

features = pd.DataFrame({
     "seqnames": ["chr1", "chr1"],
     "starts": [1000, 2000],
     "ends": [1500, 2500]
})

# Build GenomicArray
dataset = garr.build_genomicarray(
     files=bw_files,
     output_path=tempdir,
     features=features,
     # Specify a fasta file to extract sequences
     # for each region in features
     genome_fasta="path/to/genome.fasta",
     # Aggregate function used to summarize multiple values
     # from a BigWig file within an input feature interval.
     feature_annotation_options=garr.FeatureAnnotationOptions(
        aggregate_function = np.nanmean
     ),
     # for parallel processing multiple bigwig files
     num_threads=4
)

The build process stores intervals that are missing from a BigWig file as np.nan, so choose an aggregate function that tolerates np.nan, such as np.nanmean in the example above.
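
To see why a NaN-aware aggregate matters, the short sketch below compares np.mean and np.nanmean on a made-up set of values for one interval. The values array is hypothetical and is not produced by the package.

```python
import numpy as np

# Hypothetical values for one interval; np.nan marks a position
# with no signal in the BigWig file.
values = np.array([1.0, np.nan, 3.0])

# A plain mean propagates the NaN, so the whole interval becomes NaN.
plain_mean = np.mean(values)

# A NaN-aware mean ignores the missing position and averages the rest.
nan_aware_mean = np.nanmean(values)

print(plain_mean, nan_aware_mean)
```

The same reasoning applies to other NumPy reductions with NaN-aware variants, such as np.nansum or np.nanmax.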

Query a GenomicArrayDataset

You can either reuse the dataset object returned by the build step or create a GenomicArrayDataset object from the path where the files were written.

# Create a GenomicArrayDataset object from the existing dataset
dataset = garr.GenomicArrayDataset(dataset_path=tempdir)

# Query data for the first 10 regions across all samples
coverage_data = dataset[0:10, :]

print(coverage_data.matrix)
print(coverage_data.feature_annotation)
 ## output 1
 array([[1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , nan]], dtype=float32)

 ## output 2
 seqnames  starts  ends  genarr_feature_index
 0      chr1     300   315                     0
 1      chr1     320   335                     1
 2      chr1     340   355                     2
 3      chr1     360   375                     3
 4      chr1     380   395                     4
 5      chr1     400   415                     5
 6      chr1     420   435                     6
 7      chr1     440   455                     7
 8      chr1     460   475                     8
 9      chr1     480   495                     9
 10     chr1     500   515                    10
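
The returned matrix can then be summarized with ordinary NumPy calls. The sketch below assumes the matrix is a NumPy array, as the printed output suggests, and rebuilds the values from output 1 by hand instead of querying a dataset.

```python
import numpy as np

# Rebuild the matrix shown in output 1: ten rows of [1.0, 0.5]
# followed by one row whose second sample is missing.
matrix = np.array([[1.0, 0.5]] * 10 + [[1.0, np.nan]], dtype=np.float32)

# Mean coverage per sample across regions, skipping missing values.
per_sample_mean = np.nanmean(matrix, axis=0)

# Number of regions with no value, per sample.
missing_per_sample = np.isnan(matrix).sum(axis=0)

print(per_sample_mean)
print(missing_per_sample)
```

Reducing along axis=0 collapses the feature axis and yields one value per sample; axis=1 would instead yield one value per region.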

Note

This project has been set up using PyScaffold 4.6. For details and usage information on PyScaffold see https://pyscaffold.org/.
