Skip to main content

Genomic region based arrays backed by TileDB

Project description

Project generated with PyScaffold

Genomic Arrays based on TileDB

GenomicArrays is a Python package for converting genomic data from BigWig format to TileDB arrays.

Installation

Install the package from PyPI

pip install genomicarrays

Quick Start

Build a GenomicArray

Building a GenomicArray generates 3 TileDB files in the specified output directory:

  • feature_annotation: A TileDB file containing input feature intervals.
  • sample_metadata: A TileDB file containing sample metadata, each BigWig file is considered a sample.
  • A matrix TileDB file named by the layer_matrix_name parameter. This allows the package to store multiple different matrices, e.g. 'coverage', 'some_computed_statistic', for the same interval, and sample metadata attributes.

The organization is inspired by the SummarizedExperiment data structure. The TileDB matrix file is stored in a features X samples orientation.

GenomicArray structure

To build a GenomicArray from a collection of BigWig files:

import numpy as np
import tempfile
import genomicarrays as garr

# Create a temporary directory, this is where the
# output files are created. Pick your location here.
tempdir = tempfile.mkdtemp()

# List BigWig paths
bw_dir = "your/biwig/dir"
files = os.listdir(bw_dir)
bw_files = [f"{bw_dir}/{f}" for f in files]

features = pd.DataFrame({
     "seqnames": ["chr1", "chr1"],
     "starts": [1000, 2000],
     "ends": [1500, 2500]
})

# Build GenomicArray
dataset = garr.build_genomicarray(
     files=bw_files,
     output_path=tempdir,
     features=features,
     # Specify a fasta file to extract sequences
     # for each region in features
     genome_fasta="path/to/genome.fasta",
     # agg function to summarize mutiple values
     # from bigwig within an input feature interval.
     feature_annotation_options=garr.FeatureAnnotationOptions(
        aggregate_function = np.nanmean
     ),
     # for parallel processing multiple bigwig files
     num_threads=4
)

The build process stores missing intervals from a bigwig file as np.nan. The default is to choose an aggregate functions that works with np.nan.

Query a GenomicArrayDataset

Users have the option to reuse the dataset object retuned when building the arrays or by creating a GenomicArrayDataset object by initializing it to the path where the files were created.

# Create a GenomicArrayDataset object from the existing dataset
dataset = GenomicArrayDataset(dataset_path=tempdir)

# Query data for the first 10 regions across all samples
coverage_data = dataset[0:10, :]

print(expression_data.matrix)
print(expression_data.feature_annotation)
 ## output 1
 array([[1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , nan]], dtype=float32)

 ## output 2
 seqnames  starts  ends  genarr_feature_index
 0      chr1     300   315                     0
 1      chr1     320   335                     1
 2      chr1     340   355                     2
 3      chr1     360   375                     3
 4      chr1     380   395                     4
 5      chr1     400   415                     5
 6      chr1     420   435                     6
 7      chr1     440   455                     7
 8      chr1     460   475                     8
 9      chr1     480   495                     9
 10     chr1     500   515                    10

Note

This project has been set up using PyScaffold 4.6. For details and usage information on PyScaffold see https://pyscaffold.org/.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

genomicarrays-0.0.3.tar.gz (101.0 kB view details)

Uploaded Source

Built Distribution

GenomicArrays-0.0.3-py3-none-any.whl (19.3 kB view details)

Uploaded Python 3

File details

Details for the file genomicarrays-0.0.3.tar.gz.

File metadata

  • Download URL: genomicarrays-0.0.3.tar.gz
  • Upload date:
  • Size: 101.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for genomicarrays-0.0.3.tar.gz
Algorithm Hash digest
SHA256 188d52f1e5d8680bce70db2e118eacf13f0e631fb90326ea212675de80d932f4
MD5 52adab43294c6d7c28a6607cd5fca436
BLAKE2b-256 fa438e94c45727628cb1eab20920559dcb5705cf472dcd8282b78b7ed9ecbf3f

See more details on using hashes here.

File details

Details for the file GenomicArrays-0.0.3-py3-none-any.whl.

File metadata

File hashes

Hashes for GenomicArrays-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 fe5cfc2438d5be5d15b5b788b416944af32045c4003eb7571f3fbe89ad30d8ea
MD5 3a2cb2166114630a2b8b3c6b2e8a247e
BLAKE2b-256 614871450b17180202707038b72b9fff34d8391f54e2e1d0eaa3d855f8682c0b

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page