Genomic region based arrays backed by TileDB
Project description
Genomic Arrays based on TileDB
GenomicArrays is a Python package for converting genomic data from BigWig format to TileDB arrays.
Installation
Install the package from PyPI
pip install genomicarrays
Quick Start
Build a GenomicArray
Building a GenomicArray generates 3 TileDB files in the specified output directory:
feature_annotation: A TileDB file containing input feature intervals.sample_metadata: A TileDB file containing sample metadata, each BigWig file is considered a sample.- A matrix TileDB file named by the
layer_matrix_nameparameter. This allows the package to store multiple different matrices, e.g. 'coverage', 'some_computed_statistic', for the same interval, and sample metadata attributes.
The organization is inspired by the SummarizedExperiment data structure. The TileDB matrix file is stored in a features X samples orientation.
To build a GenomicArray from a collection of BigWig files:
import numpy as np
import tempfile
import genomicarrays as garr
# Create a temporary directory, this is where the
# output files are created. Pick your location here.
tempdir = tempfile.mkdtemp()
# List BigWig paths
bw_dir = "your/biwig/dir"
files = os.listdir(bw_dir)
bw_files = [f"{bw_dir}/{f}" for f in files]
features = pd.DataFrame({
"seqnames": ["chr1", "chr1"],
"starts": [1000, 2000],
"ends": [1500, 2500]
})
# Build GenomicArray
dataset = garr.build_genomicarray(
files=bw_files,
output_path=tempdir,
features=features,
# Specify a fasta file to extract sequences
# for each region in features
genome_fasta="path/to/genome.fasta",
# agg function to summarize mutiple values
# from bigwig within an input feature interval.
feature_annotation_options=garr.FeatureAnnotationOptions(
aggregate_function = np.nanmean
),
# for parallel processing multiple bigwig files
num_threads=4
)
[!NOTE]
- The aggregate function is expected to return either a scalar value or a 1-dimensional NumPy ndarray. If the later, users need to specify the expected dimension of the return array. e.g.
feature_annotation_options=garr.FeatureAnnotationOptions( aggregate_function = my_custom_func, expected_agg_function_length = 10, ),- The build process stores missing intervals from a bigwig file as
np.nan. The default is to choose an aggregate functions that works withnp.nan.
Query a GenomicArrayDataset
Users have the option to reuse the dataset object retuned when building the arrays or by creating a GenomicArrayDataset object by initializing it to the path where the files were created.
# Create a GenomicArrayDataset object from the existing dataset
dataset = GenomicArrayDataset(dataset_path=tempdir)
# Query data for the first 10 regions across all samples
coverage_data = dataset[0:10, :]
print(expression_data.matrix)
print(expression_data.feature_annotation)
## output 1
array([[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , 0.5],
[1. , nan]], dtype=float32)
## output 2
seqnames starts ends genarr_feature_index
0 chr1 300 315 0
1 chr1 320 335 1
2 chr1 340 355 2
3 chr1 360 375 3
4 chr1 380 395 4
5 chr1 400 415 5
6 chr1 420 435 6
7 chr1 440 455 7
8 chr1 460 475 8
9 chr1 480 495 9
10 chr1 500 515 10
Note
This project has been set up using PyScaffold 4.6. For details and usage information on PyScaffold see https://pyscaffold.org/.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file genomicarrays-0.2.2.tar.gz.
File metadata
- Download URL: genomicarrays-0.2.2.tar.gz
- Upload date:
- Size: 113.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
27a974dbd1d46907f460e9fcebe483815ba88b55a5adc9bc378235794ff19a97
|
|
| MD5 |
2627bb340a207b8d4a5e5e0728032818
|
|
| BLAKE2b-256 |
c7806d162c08865d481ac8356294b672fc70afd471ca52c72a28e6e2ed890987
|
Provenance
The following attestation bundles were made for genomicarrays-0.2.2.tar.gz:
Publisher:
publish-pypi.yml on CellArr/GenomicArrays
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
genomicarrays-0.2.2.tar.gz -
Subject digest:
27a974dbd1d46907f460e9fcebe483815ba88b55a5adc9bc378235794ff19a97 - Sigstore transparency entry: 166990406
- Sigstore integration time:
-
Permalink:
CellArr/GenomicArrays@e9e0eb6320619efa0ce547004868a88dab515de1 -
Branch / Tag:
refs/tags/0.2.2 - Owner: https://github.com/CellArr
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-pypi.yml@e9e0eb6320619efa0ce547004868a88dab515de1 -
Trigger Event:
push
-
Statement type:
File details
Details for the file GenomicArrays-0.2.2-py3-none-any.whl.
File metadata
- Download URL: GenomicArrays-0.2.2-py3-none-any.whl
- Upload date:
- Size: 19.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a97007a61f2e0bc0efd569f8f1877704687c0a3a3b5563c6495df8e6bb04c9d5
|
|
| MD5 |
cf55ce32f0b803cd4e136a9160043aaf
|
|
| BLAKE2b-256 |
4b20b47f2639674db48f5f0a639044046f139b7dadb18fade8595351df99e0f7
|
Provenance
The following attestation bundles were made for GenomicArrays-0.2.2-py3-none-any.whl:
Publisher:
publish-pypi.yml on CellArr/GenomicArrays
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
genomicarrays-0.2.2-py3-none-any.whl -
Subject digest:
a97007a61f2e0bc0efd569f8f1877704687c0a3a3b5563c6495df8e6bb04c9d5 - Sigstore transparency entry: 166990407
- Sigstore integration time:
-
Permalink:
CellArr/GenomicArrays@e9e0eb6320619efa0ce547004868a88dab515de1 -
Branch / Tag:
refs/tags/0.2.2 - Owner: https://github.com/CellArr
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish-pypi.yml@e9e0eb6320619efa0ce547004868a88dab515de1 -
Trigger Event:
push
-
Statement type: