Genomic region based arrays backed by TileDB
Project description
Genomic Arrays based on TileDB
GenomicArrays is a Python package for converting genomic data from BigWig format to TileDB arrays.
Installation
Install the package from PyPI
pip install genomicarrays
Quick Start
Build a GenomicArray
Building a GenomicArray
generates 3 TileDB files in the specified output directory:
feature_annotation
: A TileDB file containing input feature intervals.sample_metadata
: A TileDB file containing sample metadata, each BigWig file is considered a sample.- A matrix TileDB file named by the
layer_matrix_name
parameter. This allows the package to store multiple different matrices, e.g. 'coverage', 'some_computed_statistic', for the same interval, and sample metadata attributes.
The organization is inspired by the SummarizedExperiment data structure. The TileDB matrix file is stored in a features X samples orientation.
To build a GenomicArray
from a collection of BigWig
files:
import numpy as np
import tempfile
import genomicarrays as garr
# Create a temporary directory, this is where the
# output files are created. Pick your location here.
tempdir = tempfile.mkdtemp()
# List BigWig paths
bw_dir = "your/biwig/dir"
files = os.listdir(bw_dir)
bw_files = [f"{bw_dir}/{f}" for f in files]
features = pd.DataFrame({
"seqnames": ["chr1", "chr1"],
"starts": [1000, 2000],
"ends": [1500, 2500]
})
# Build GenomicArray
dataset = garr.build_genomicarray(
files=bw_files,
output_path=tempdir,
features=features,
# Specify a fasta file to extract sequences
# for each region in features
genome_fasta="path/to/genome.fasta",
# agg function to summarize mutiple values
# from bigwig within an input feature interval.
feature_annotation_options=garr.FeatureAnnotationOptions(
aggregate_function = np.nanmean
),
# for parallel processing multiple bigwig files
num_threads=4
)
The build process stores missing intervals from a bigwig file as np.nan
. The
default is to choose an aggregate functions that works with np.nan
.
Query a GenomicArrayDataset
Users have the option to reuse the dataset
object retuned when building the arrays or by creating a GenomicArrayDataset
object by initializing it to the path where the files were created.
# Create a GenomicArrayDataset object from the existing dataset
dataset = GenomicArrayDataset(dataset_path=tempdir)
# Query data for the first 10 regions across all samples
coverage_data = dataset[0:10, :]
print(expression_data.matrix)
print(expression_data.feature_annotation)
## output 1
array([[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , nan]], dtype=float32)
## output 2
seqnames starts ends genarr_feature_index
0 chr1 300 315 0
1 chr1 320 335 1
2 chr1 340 355 2
3 chr1 360 375 3
4 chr1 380 395 4
5 chr1 400 415 5
6 chr1 420 435 6
7 chr1 440 455 7
8 chr1 460 475 8
9 chr1 480 495 9
10 chr1 500 515 10
Note
This project has been set up using PyScaffold 4.6. For details and usage information on PyScaffold see https://pyscaffold.org/.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file genomicarrays-0.0.2.tar.gz
.
File metadata
- Download URL: genomicarrays-0.0.2.tar.gz
- Upload date:
- Size: 100.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.20
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | a0960a272e4773aaddfc3b4a51d1f6a1a7f1b315fa2bda371ad48f0e83fbbb25 |
|
MD5 | 467ee08e637d631b9b8726d08d961006 |
|
BLAKE2b-256 | 12040bb9868ece917f5eb3f0459c2b21fbe19aef86f15e845b64968fa7484c2f |
File details
Details for the file GenomicArrays-0.0.2-py3-none-any.whl
.
File metadata
- Download URL: GenomicArrays-0.0.2-py3-none-any.whl
- Upload date:
- Size: 18.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.20
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 4506bdd412e06538e48c01f0d20f013ad6f3c72dcc623dc9f21f5fa3d14da2a0 |
|
MD5 | 4f3383efa7211000348a839bf9e31419 |
|
BLAKE2b-256 | 0515111cab156500c00fa0f59dfc6eb4d581d511cc150d0ba497190505332d78 |