Genomic region based arrays backed by TileDB

Genomic Arrays based on TileDB

GenomicArrays is a Python package for converting genomic data from BigWig format to TileDB arrays.

Installation

Install the package from PyPI:

pip install genomicarrays

Quick Start

Build a GenomicArray

Building a GenomicArray generates 3 TileDB files in the specified output directory:

  • feature_annotation: A TileDB file containing input feature intervals.
  • sample_metadata: A TileDB file containing sample metadata. Each BigWig file is treated as one sample.
  • A matrix TileDB file named by the layer_matrix_name parameter. This lets the package store multiple matrices, e.g. 'coverage' and 'some_computed_statistic', for the same intervals and sample metadata.

The organization is inspired by the SummarizedExperiment data structure. The TileDB matrix file is stored in a features X samples orientation.
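As a rough illustration of the features X samples orientation, the sketch below uses plain Python lists rather than TileDB. The sample names and values are made up for the example, and value_for is a hypothetical helper, not part of the package.

```python
# Illustration only: a plain-Python stand-in for the features x samples layout.
# Rows correspond to feature intervals, columns to samples (BigWig files).
features = [("chr1", 1000, 1500), ("chr1", 2000, 2500)]  # one row per interval
samples = ["sample_a.bw", "sample_b.bw"]  # hypothetical file names, one column each

# coverage[i][j] holds the aggregated value of feature i in sample j.
# The numbers are invented; NaN marks an interval missing from a sample.
coverage = [
    [0.5, 1.5],
    [2.0, float("nan")],
]


def value_for(matrix, feature_index, sample_index):
    """Look up the value for one feature (row) in one sample (column)."""
    return matrix[feature_index][sample_index]


n_features, n_samples = len(coverage), len(coverage[0])
```

Here value_for(coverage, 0, 1) reads the first interval's value in the second sample.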

GenomicArray structure

To build a GenomicArray from a collection of BigWig files:

import os
import tempfile

import numpy as np
import pandas as pd
import genomicarrays as garr

# Create a temporary directory, this is where the
# output files are created. Pick your location here.
tempdir = tempfile.mkdtemp()

# List BigWig paths
bw_dir = "your/bigwig/dir"
files = os.listdir(bw_dir)
bw_files = [os.path.join(bw_dir, f) for f in files]

features = pd.DataFrame({
     "chrom": ["chr1", "chr1"],
     "start": [1000, 2000],
     "end": [1500, 2500]
})

# Build GenomicArray
garr.build_genomicarray(
     files=bw_files,
     output_path=tempdir,
     features=features,
     # aggregate function to summarize multiple values
     # from a BigWig file within an input feature interval.
     feature_annotation_options=garr.FeatureAnnotationOptions(
        aggregate_function=np.nanmean
     ),
     # number of threads for processing multiple BigWig files in parallel
     num_threads=4
)
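The directory listing above picks up every entry in the folder, including files that are not BigWig files. A minimal sketch of a stricter listing follows, assuming BigWig files end in .bw or .bigwig (case-insensitive). The helper name list_bigwig_files and the suffix set are illustrative, not part of genomicarrays.

```python
from pathlib import Path

# Suffixes treated as BigWig in this sketch (an assumption; adjust as needed).
BIGWIG_SUFFIXES = {".bw", ".bigwig"}


def list_bigwig_files(directory):
    """Return sorted paths of BigWig files located directly inside `directory`."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        # Skip subdirectories and anything without a BigWig-like suffix.
        if path.is_file() and path.suffix.lower() in BIGWIG_SUFFIXES
    )
```

The returned list can then be passed as the files argument; sorting keeps the sample order reproducible across runs.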

The build process stores intervals that are missing from a BigWig file as np.nan. The default aggregate function is chosen to work with np.nan, and any custom aggregate function should handle np.nan as well.
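To see why NaN handling matters, the sketch below compares a plain mean with a NaN-skipping mean in pure Python. nan_aware_mean is a hypothetical stand-in for np.nanmean. It assumes the aggregate function receives the values for one interval and returns a single number, which this page does not confirm.

```python
import math


def nan_aware_mean(values):
    """Mean of the non-NaN entries; NaN if there are no non-NaN entries."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


# Values for one interval, with one missing stretch recorded as NaN.
interval_values = [1.0, math.nan, 3.0]

# A plain mean propagates the NaN, so the whole interval becomes NaN.
plain = sum(interval_values) / len(interval_values)

# Skipping NaN entries averages only the observed values.
skipped = nan_aware_mean(interval_values)
```

With these inputs, plain is NaN while skipped averages only 1.0 and 3.0.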

Note

This project has been set up using PyScaffold 4.6. For details and usage information on PyScaffold see https://pyscaffold.org/.
