
genome-loader

Pipeline for efficient genomic data processing.

Installation

Recommended installation with conda/mamba:

mamba env create -n genome-loader -f environment.yml

If you want to use the PyTorch datasets, samplers, and other data-oriented classes, create the environment from torch-environment.yml instead:

mamba env create -n genome-loader -f torch-environment.yml

The PyTorch dependency itself is not included, since installing it requires platform-specific instructions.

Then add the package to your Python path, e.g. by adding this line to your .bashrc:

export PYTHONPATH=${PYTHONPATH:+${PYTHONPATH}:}/path/to/genome-loader
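Alternatively, the path can be added from within a Python session; a minimal sketch (the repository path below is a placeholder, not an actual install location):

```python
import sys

repo = "/path/to/genome-loader"  # placeholder; use your actual clone location
if repo not in sys.path:
    sys.path.append(repo)  # same effect as the PYTHONPATH export, for this session only
```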

 


HDF5 Writers

Command-line tools for writing genome data to HDF5 format.

 

writefasta

Converts a FASTA file into a char-array (default) or one-hot encoded HDF5 file.

  File Format

  • Group: [chrom]
  • Dataset: "sequence" if char array, "onehot" if one-hot encoded
  • Attributes: "id" - dataset name associated with the file

  Usage

gloader writefasta [FASTA] --output/--directory [OUT] {OPTIONS}

Required Arguments

  • FASTA: Positional argument, FASTA file to write to HDF5
  • -o/--output: Full path and file name of output (NOTE: cannot use both -o and -d flags)
  • -d/--directory: Directory to write HDF5 output

One-Hot Encoding Arguments

  • -e/--encode: Flag that denotes output in one-hot encoding
  • -s/--spec: Ordered string of non-repeating chars denoting the encoded bases and their order, e.g. "ACGT" (Default: "ACGTN")

Optional Arguments

  • -c/--chroms: Chromosomes to write (Default: ALL)
  • -n/--name: Output file name if --directory given; ignored if using --output flag. Defaults to the input FASTA name
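The --spec string determines both which bases are encoded and the column order of the one-hot output. A minimal pure-Python sketch of that scheme (`one_hot` is an illustrative helper, not part of the package):

```python
def one_hot(seq, spec="ACGTN"):
    """One-hot encode `seq`; column order follows `spec`, mirroring --spec."""
    index = {base: i for i, base in enumerate(spec)}
    rows = []
    for base in seq.upper():
        row = [0] * len(spec)
        row[index[base]] = 1
        rows.append(row)
    return rows

one_hot("ACGT", spec="ACGT")  # identity-like rows: one column per base, in spec order
```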

 

writefrag

Writes BAM ATAC fragment depth into HDF5 file.

  File Format

  • Group: [chrom]
  • Dataset: "depth" - 0-based array with depth per position
  • Attributes:
    • "id" - dataset name associated with file
    • "count_method" - method used to count fragments

  Usage

gloader writefrag [BAM] --output/--directory [OUT] {OPTIONS}

Required Arguments

  • BAM: Positional argument, BAM file to parse and write to HDF5
  • -o/--output: Full path and file name of output (NOTE: cannot use both -o and -d flags)
  • -d/--directory: Directory to write HDF5 output

Optional Arguments

  • -c/--chroms: Chromosomes to write (Default: ALL)
  • -l/--lens: Lengths of provided chroms (Auto retrieved if not provided)
  • -n/--name: Output file name if --directory given; ignored if using --output flag. Defaults to the input BAM name
  • --ignore_offset: Don't offset Tn5 cut sites (+4 bp on + strand, -5 bp on - strand, 0-based)
  • --method: Method used to count fragments. Choice of "cutsite"|"midpoint"|"fragment" (Default: "cutsite")
    • cutsite: Count both Tn5 cut sites
    • midpoint: Count the midpoint between Tn5 cut sites
    • fragment: Count all positions between Tn5 cut sites
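The three counting methods can be illustrated with a toy helper (`counted_positions` is hypothetical code, assuming 0-based half-open fragment coordinates; the package's exact conventions may differ):

```python
def counted_positions(start, end, method="cutsite", tn5_offset=True):
    """Return the 0-based positions a fragment [start, end) contributes to.

    With tn5_offset, cut sites are shifted +4 on the + strand and -5 on the
    - strand, as described above (disabled by --ignore_offset).
    """
    if tn5_offset:
        start, end = start + 4, end - 5
    if method == "cutsite":
        return [start, end - 1]            # both Tn5 cut sites
    if method == "midpoint":
        return [(start + end - 1) // 2]    # midpoint between the cut sites
    if method == "fragment":
        return list(range(start, end))     # every position between cut sites
    raise ValueError(f"unknown method: {method}")
```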

 

writecoverage

Writes BAM allelic coverage into HDF5 file.

  File Format

  • Group: [chrom]
  • Dataset: "coverage" - 4 x N matrix with rows ordered A, C, G, T, giving per-allele coverage at each position (0-based)
  • Attributes: "id" - dataset name associated with the file

  Usage

gloader writecoverage [BAM] --output/--directory [OUT] {OPTIONS}

Required Arguments

  • BAM: Positional argument, BAM file to parse and write to HDF5
  • -o/--output: Full path and file name of output (NOTE: cannot use both -o and -d flags)
  • -d/--directory: Directory to write HDF5 output

Optional Arguments

  • -c/--chroms: Chromosomes to write (Default: ALL)
  • -n/--name: Output file name if --directory given; ignored if using --output flag. Defaults to the input BAM name
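The 4 x N layout above can be reproduced with a toy example (hypothetical code; real input would come from BAM alignments rather than (start, sequence) tuples):

```python
ROW = {"A": 0, "C": 1, "G": 2, "T": 3}  # row order of the "coverage" dataset

def allele_coverage(reads, length):
    """Build a 4 x length per-allele coverage matrix from (start, seq) reads."""
    cov = [[0] * length for _ in ROW]
    for start, seq in reads:
        for offset, base in enumerate(seq.upper()):
            if base in ROW:                 # skip N and other ambiguity codes
                cov[ROW[base]][start + offset] += 1
    return cov
```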

 

Python Functions

Python functions for directly loading and parsing genome data.

Argument-level usage can be found in the docstrings within each script (located in genome_loader/).

 

encode_data.py

Contains functions for creating one-hot encoded data.

  • encode_sequence: Encodes input data into one-hot encoded format
  • encode_from_fasta: Create one-hot encoded data directly from FASTA
  • encode_from_h5: Create one-hot encoded data from char-array encoded H5
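As a rough picture of the input that encode_from_fasta starts from, a minimal FASTA parser (illustrative only; `read_fasta` is not the package's function, which presumably uses a proper FASTA reader):

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {record_id: sequence} dict."""
    records, name, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:            # close out the previous record
                records[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records
```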

 

get_encoded.py

Contains functions for loading and transforming one-hot encoded data.

  • get_encoded_haps: Creates one-hot encoded haplotypes from one-hot encoded data

 

get_data.py

Functions that retrieve non-encoded data from files.

  • get_frag_depth: Retrieve fragment depths from a BAM file
  • get_allele_coverage: Retrieve per-allele coverage from a BAM file

 

load_data.py

Functions that read non-encoded data from files.

  • load_vcf: Read a VCF and load SNPs/genotypes into a dataframe
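For intuition, the core of a VCF-to-records loader looks roughly like this (a simplified sketch; `parse_vcf_snps` is hypothetical, and the real function returns a dataframe and also handles genotypes):

```python
def parse_vcf_snps(lines):
    """Extract biallelic SNP records from VCF text lines."""
    snps = []
    for line in lines:
        if line.startswith("#"):             # skip header lines
            continue
        chrom, pos, _vid, ref, alt = line.split("\t")[:5]
        if len(ref) == 1 and len(alt) == 1:  # keep single-base substitutions
            snps.append({"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt})
    return snps
```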

 

load_h5.py

Functions that load H5 data to python objects.

  • load_onehot_h5: Load a one-hot encoded genome from H5 into a dictionary
  • load_depth_h5: Load read depths from H5 into a dictionary
  • load_coverage_h5: Load allele coverage from H5 into a dictionary
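Assuming the per-chromosome layouts described above and the h5py library, a load-to-dict helper can be sketched as follows (the demo writes a toy file first; `load_h5_dict` is illustrative, not the package's function):

```python
import os
import tempfile

import h5py
import numpy as np

def load_h5_dict(path, dataset, chroms=None):
    """Load `dataset` (e.g. "onehot", "depth", or "coverage") from each
    chromosome group of an H5 file into a {chrom: array} dict."""
    out = {}
    with h5py.File(path, "r") as f:
        for chrom in (chroms or list(f.keys())):
            out[chrom] = f[chrom][dataset][:]
    return out

# Toy round trip mirroring the writefasta one-hot layout
path = os.path.join(tempfile.mkdtemp(), "toy.h5")
with h5py.File(path, "w") as f:
    f.create_group("chr1").create_dataset("onehot", data=np.eye(4, dtype=np.uint8))
loaded = load_h5_dict(path, "onehot")
```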
