Skip to main content

Genomic region based arrays backed by TileDB

Project description

Project generated with PyScaffold

Genomic Arrays based on TileDB

GenomicArrays is a Python package for converting genomic data from BigWig format to TileDB arrays.

Installation

Install the package from PyPI

pip install genomicarrays

Quick Start

Build a GenomicArray

Building a GenomicArray generates 3 TileDB files in the specified output directory:

  • feature_annotation: A TileDB file containing input feature intervals.
  • sample_metadata: A TileDB file containing sample metadata, each BigWig file is considered a sample.
  • A matrix TileDB file named by the layer_matrix_name parameter. This allows the package to store multiple different matrices, e.g. 'coverage', 'some_computed_statistic', for the same interval, and sample metadata attributes.

The organization is inspired by the SummarizedExperiment data structure. The TileDB matrix file is stored in a features X samples orientation.

GenomicArray structure

To build a GenomicArray from a collection of BigWig files:

import numpy as np
import tempfile
import genomicarrays as garr

# Create a temporary directory, this is where the
# output files are created. Pick your location here.
tempdir = tempfile.mkdtemp()

# List BigWig paths
bw_dir = "your/biwig/dir"
files = os.listdir(bw_dir)
bw_files = [f"{bw_dir}/{f}" for f in files]

features = pd.DataFrame({
     "seqnames": ["chr1", "chr1"],
     "starts": [1000, 2000],
     "ends": [1500, 2500]
})

# Build GenomicArray
dataset = garr.build_genomicarray(
     files=bw_files,
     output_path=tempdir,
     features=features,
     # Specify a fasta file to extract sequences
     # for each region in features
     genome_fasta="path/to/genome.fasta",
     # agg function to summarize mutiple values
     # from bigwig within an input feature interval.
     feature_annotation_options=garr.FeatureAnnotationOptions(
        aggregate_function = np.nanmean
     ),
     # for parallel processing multiple bigwig files
     num_threads=4
)

The build process stores missing intervals from a bigwig file as np.nan. The default is to choose an aggregate functions that works with np.nan.

Query a GenomicArrayDataset

Users have the option to reuse the dataset object retuned when building the arrays or by creating a GenomicArrayDataset object by initializing it to the path where the files were created.

# Create a GenomicArrayDataset object from the existing dataset
dataset = GenomicArrayDataset(dataset_path=tempdir)

# Query data for the first 10 regions across all samples
coverage_data = dataset[0:10, :]

print(expression_data.matrix)
print(expression_data.feature_annotation)
 ## output 1
 array([[1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , nan]], dtype=float32)

 ## output 2
 seqnames  starts  ends  genarr_feature_index
 0      chr1     300   315                     0
 1      chr1     320   335                     1
 2      chr1     340   355                     2
 3      chr1     360   375                     3
 4      chr1     380   395                     4
 5      chr1     400   415                     5
 6      chr1     420   435                     6
 7      chr1     440   455                     7
 8      chr1     460   475                     8
 9      chr1     480   495                     9
 10     chr1     500   515                    10

Note

This project has been set up using PyScaffold 4.6. For details and usage information on PyScaffold see https://pyscaffold.org/.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

genomicarrays-0.0.2.tar.gz (100.1 kB view details)

Uploaded Source

Built Distribution

GenomicArrays-0.0.2-py3-none-any.whl (18.5 kB view details)

Uploaded Python 3

File details

Details for the file genomicarrays-0.0.2.tar.gz.

File metadata

  • Download URL: genomicarrays-0.0.2.tar.gz
  • Upload date:
  • Size: 100.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for genomicarrays-0.0.2.tar.gz
Algorithm Hash digest
SHA256 a0960a272e4773aaddfc3b4a51d1f6a1a7f1b315fa2bda371ad48f0e83fbbb25
MD5 467ee08e637d631b9b8726d08d961006
BLAKE2b-256 12040bb9868ece917f5eb3f0459c2b21fbe19aef86f15e845b64968fa7484c2f

See more details on using hashes here.

File details

Details for the file GenomicArrays-0.0.2-py3-none-any.whl.

File metadata

File hashes

Hashes for GenomicArrays-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 4506bdd412e06538e48c01f0d20f013ad6f3c72dcc623dc9f21f5fa3d14da2a0
MD5 4f3383efa7211000348a839bf9e31419
BLAKE2b-256 0515111cab156500c00fa0f59dfc6eb4d581d511cc150d0ba497190505332d78

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page