Genomic region based arrays backed by TileDB


Genomic Arrays based on TileDB

GenomicArrays is a Python package for converting genomic data from BigWig format to TileDB arrays.

Installation

Install the package from PyPI

pip install genomicarrays

Quick Start

Build a `GenomicArray`

Building a GenomicArray generates 3 TileDB files in the specified output directory:

feature_annotation: A TileDB file containing input feature intervals.
sample_metadata: A TileDB file containing sample metadata. Each BigWig file is treated as one sample.
A matrix TileDB file named by the layer_matrix_name parameter. This lets the package store several matrices, e.g. 'coverage' and 'some_computed_statistic', that share the same feature intervals and sample metadata.

The organization is inspired by the SummarizedExperiment data structure. The TileDB matrix file is stored in a features X samples orientation.
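
To make the orientation concrete, here is a small in-memory sketch of the layout. It is only an illustration built with NumPy and plain dictionaries, not the package's API or TileDB's on-disk format; the `sample` key and the file names are hypothetical.

```python
import numpy as np

# Hypothetical in-memory stand-ins for the three components.
feature_annotation = [
    {"seqnames": "chr1", "starts": 1000, "ends": 1500},
    {"seqnames": "chr1", "starts": 2000, "ends": 2500},
]
sample_metadata = [
    {"sample": "a.bw"},
    {"sample": "b.bw"},
    {"sample": "c.bw"},
]

# One matrix per layer: rows follow feature_annotation,
# columns follow sample_metadata (features X samples).
coverage = np.full(
    (len(feature_annotation), len(sample_metadata)), np.nan, dtype=np.float32
)

# Value for feature 0 (chr1:1000-1500) in sample 1 (b.bw).
coverage[0, 1] = 0.5

print(coverage.shape)
```

Row `i` of any layer always refers to entry `i` of the feature annotation, and column `j` to entry `j` of the sample metadata, which is why several layers can share the same two annotation files.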


To build a GenomicArray from a collection of BigWig files:

import os
import tempfile

import numpy as np
import pandas as pd
import genomicarrays as garr

# Create a temporary directory; this is where the
# output files are created. Pick your own location here.
tempdir = tempfile.mkdtemp()

# List BigWig paths
bw_dir = "your/bigwig/dir"
files = os.listdir(bw_dir)
bw_files = [os.path.join(bw_dir, f) for f in files]

features = pd.DataFrame({
     "seqnames": ["chr1", "chr1"],
     "starts": [1000, 2000],
     "ends": [1500, 2500]
})

# Build GenomicArray
dataset = garr.build_genomicarray(
     files=bw_files,
     output_path=tempdir,
     features=features,
     # Specify a fasta file to extract sequences
     # for each region in features
     genome_fasta="path/to/genome.fasta",
     # Aggregate function used to summarize multiple values
     # from a bigwig file within an input feature interval.
     feature_annotation_options=garr.FeatureAnnotationOptions(
        aggregate_function = np.nanmean
     ),
     # for parallel processing multiple bigwig files
     num_threads=4
)
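
The directory listing in the example passes every entry in bw_dir to the builder, including any non-BigWig files that happen to live there. A minimal sketch of a stricter listing follows; the helper name `list_bigwig_files` and the accepted extensions are assumptions for illustration, not part of the package.

```python
from pathlib import Path


def list_bigwig_files(bw_dir, suffixes=(".bw", ".bigwig")):
    """Return sorted paths of regular files in bw_dir with a BigWig-like extension.

    The extension check is case-insensitive, so "x.bigWig" is accepted.
    """
    return sorted(
        str(p)
        for p in Path(bw_dir).iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )


# Usage (hypothetical directory):
# bw_files = list_bigwig_files("your/bigwig/dir")
```

Sorting keeps the file order stable between runs, which is useful if each file becomes one sample.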

The build process records intervals that are missing from a bigwig file as np.nan. The default is to use an aggregate function that works with np.nan.
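
The difference between a NaN-aware aggregate and a plain one can be seen with NumPy alone. The values below are made up for illustration and do not come from a real bigwig file.

```python
import numpy as np

# Hypothetical values inside one feature interval; positions with
# no data in the bigwig file are represented as np.nan.
values = np.array([1.0, np.nan, 3.0, np.nan], dtype=np.float32)

plain = np.mean(values)      # nan: one missing value propagates to the result
robust = np.nanmean(values)  # 2.0: missing values are skipped

print(plain, robust)
```

Note that np.nanmean still returns nan (with a RuntimeWarning) when every value in the interval is missing.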

Query a `GenomicArrayDataset`

Users can either reuse the dataset object returned when building the arrays, or create a GenomicArrayDataset object by pointing it at the path where the files were created.

# Create a GenomicArrayDataset object from the existing dataset
dataset = garr.GenomicArrayDataset(dataset_path=tempdir)

# Query data for the first 10 regions across all samples
coverage_data = dataset[0:10, :]

print(coverage_data.matrix)
print(coverage_data.feature_annotation)

 ## output 1
 array([[1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , 0.5],
      [1. , nan]], dtype=float32)

 ## output 2
 seqnames  starts  ends  genarr_feature_index
 0      chr1     300   315                     0
 1      chr1     320   335                     1
 2      chr1     340   355                     2
 3      chr1     360   375                     3
 4      chr1     380   395                     4
 5      chr1     400   415                     5
 6      chr1     420   435                     6
 7      chr1     440   455                     7
 8      chr1     460   475                     8
 9      chr1     480   495                     9
 10     chr1     500   515                    10
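
Once a slice has been retrieved, the matrix can be summarized with ordinary NumPy calls. The sketch below assumes the matrix is a NumPy array, as the printed output suggests, and rebuilds the values from output 1 by hand instead of querying a dataset.

```python
import numpy as np

# Same values as output 1: ten rows of [1.0, 0.5] and a final row
# where the second sample has no data.
matrix = np.array([[1.0, 0.5]] * 10 + [[1.0, np.nan]], dtype=np.float32)

# Mean signal per sample (column), skipping missing values.
per_sample_mean = np.nanmean(matrix, axis=0)

# Row positions (features) where at least one sample is missing.
rows_with_gaps = np.flatnonzero(np.isnan(matrix).any(axis=1))

print(per_sample_mean)
print(rows_with_gaps)
```

The row positions returned here line up with the rows of the feature annotation in output 2, so a gap at position 10 refers to the interval chr1:500-515.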

Note

This project has been set up using PyScaffold 4.6. For details and usage information on PyScaffold see https://pyscaffold.org/.

Release history

0.2.2: Jan 30, 2025
0.2.1: Jan 29, 2025
0.2.0: Dec 20, 2024 (this version)
0.0.3: Nov 20, 2024
0.0.2: Nov 6, 2024
0.0.1: Nov 4, 2024


