
genome-loader

Pipeline for efficient genomic data processing.

Installation

Recommended installation with conda/mamba:

mamba env create -n genome-loader -f environment.yml

If you want to use the PyTorch datasets, samplers, and other data-oriented classes, create the environment from torch-environment.yml instead:

mamba env create -n genome-loader -f torch-environment.yml

PyTorch itself is not included in this environment, since installing it requires special instructions.

Then, add the package to your Python path, e.g. by adding this line to your .bashrc:

export PYTHONPATH=${PYTHONPATH:+${PYTHONPATH}:}/path/to/genome-loader

 

Table of Contents

  • HDF5 Writers: writefasta, writefrag, writecoverage
  • Python Functions: encode_data.py, get_encoded.py, get_data.py, load_data.py, load_h5.py
HDF5 Writers

Command line tools for writing genome data to HDF5 format

 

writefasta

Converts a FASTA file into a char-array (default) or one-hot encoded HDF5 file.

  File Format

  • Group: [chrom]
  • Dataset: "sequence" if char array, "onehot" if one-hot encoded
  • Attributes: "id" - dataset name associated with file

  Usage

gloader writefasta [FASTA] --output/--directory [OUT] {OPTIONS}

Required Arguments

  • FASTA: Positional argument, FASTA file to write to HDF5
  • -o/--output: Full path and file name of output (NOTE: Cannot use both -o and -d flags)
  • -d/--directory: Directory to write hdf5 output

One-Hot Encoding Arguments

  • -e/--encode: Flag that denotes output in one-hot encoding
  • -s/--spec: Ordered string of non-repeating chars. Denotes the encoded bases and their order, e.g. "ACGT" (Default: "ACGTN")

Optional Arguments

  • -c/--chroms: Chromosomes to write (Default: ALL)
  • -n/--name: Output file name if --directory is given; ignored if using the --output flag. Defaults to the input FASTA name
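For illustration, here is a minimal numpy sketch of what one-hot encoding under a spec string produces. This is not the package's implementation, just a model of the output format; bases absent from the spec would raise a KeyError here:

```python
import numpy as np

def one_hot_encode(seq: str, spec: str = "ACGTN") -> np.ndarray:
    """One-hot encode a sequence using an ordered string of non-repeating bases.

    Returns a len(seq) x len(spec) matrix with a single 1 per row, in the
    column given by the base's position in spec.
    """
    idx = {base: i for i, base in enumerate(spec)}
    out = np.zeros((len(seq), len(spec)), dtype=np.uint8)
    for pos, base in enumerate(seq.upper()):
        out[pos, idx[base]] = 1
    return out

encoded = one_hot_encode("ACGTN")
```

With the default spec "ACGTN", each position becomes a length-5 row vector, so "A" maps to [1, 0, 0, 0, 0] and "N" to [0, 0, 0, 0, 1].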

 

writefrag

Writes BAM ATAC fragment depth into HDF5 file.

  File Format

  • Group: [chrom]
  • Dataset: "depth" - 0-based array with depth per position
  • Attributes:
    • "id" - dataset name associated with file
    • "count_method" - method used to count fragments

  Usage

gloader writefrag [BAM] --output/--directory [OUT] {OPTIONS}

Required Arguments

  • BAM: Positional argument, BAM file to parse and write to H5
  • -o/--output: Full path and file name of output (NOTE: Cannot use both -o and -d flags)
  • -d/--directory: Directory to write hdf5 output

Optional Arguments

  • -c/--chroms: Chromosomes to write (Default: ALL)
  • -l/--lens: Lengths of provided chroms (Auto retrieved if not provided)
  • -n/--name: Output file name if --directory is given; ignored if using the --output flag. Defaults to the input BAM name
  • --ignore_offset: Don't offset Tn5 cut sites (+4 bp on + strand, -5 bp on - strand, 0-based)
  • --method: Method used to count fragments. Choice of "cutsite"|"midpoint"|"fragment" (Default: "cutsite")
    • cutsite: Count both Tn5 cut sites
    • midpoint: Count the midpoint between Tn5 cut sites
    • fragment: Count all positions between Tn5 cut sites
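The three counting methods can be sketched as follows. This is a simplified numpy illustration assuming fragments arrive as 0-based half-open (start, end) intervals; it is not the package's actual code, which parses BAM records:

```python
import numpy as np

def count_fragments(frags, chrom_len, method="cutsite", offset=True):
    """Count ATAC fragments into a 0-based per-position depth array.

    frags: iterable of (start, end) 0-based half-open intervals, where
    start is the + strand cut site and end - 1 is the - strand cut site.
    """
    depth = np.zeros(chrom_len, dtype=np.int64)
    for start, end in frags:
        if offset:
            # Shift Tn5 cut sites: +4 bp on the + strand, -5 bp on the - strand
            start, end = start + 4, end - 5
        if method == "cutsite":
            # Count both Tn5 cut sites
            depth[start] += 1
            depth[end - 1] += 1
        elif method == "midpoint":
            # Count the midpoint between the two cut sites
            depth[(start + end - 1) // 2] += 1
        elif method == "fragment":
            # Count every position between the cut sites
            depth[start:end] += 1
        else:
            raise ValueError(f"unknown method: {method}")
    return depth
```

For a single fragment spanning positions 0-9, "cutsite" increments positions 0 and 9, "midpoint" increments position 4, and "fragment" increments all ten positions.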

 

writecoverage

Writes BAM allelic coverage into HDF5 file.

  File Format

  • Group: [chrom]
  • Dataset: "coverage" - 4 x N matrix, rows ordered A, C, G, T, giving per-allele coverage per position (0-based)
  • Attributes: "id" - dataset name associated with file

  Usage

gloader writecoverage [BAM] --output/--directory [OUT] {OPTIONS}

Required Arguments

  • BAM: Positional argument, BAM file to parse and write to H5
  • -o/--output: Full path and file name of output (NOTE: Cannot use both -o and -d flags)
  • -d/--directory: Directory to write hdf5 output

Optional Arguments

  • -c/--chroms: Chromosomes to write (Default: ALL)
  • -n/--name: Output file name if --directory is given; ignored if using the --output flag. Defaults to the input BAM name
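The resulting "coverage" dataset is easy to summarize with numpy. For example, with a small hypothetical 4 x N matrix in the documented layout (rows ordered A, C, G, T; the values below are made up for illustration):

```python
import numpy as np

# Hypothetical 4 x N coverage matrix shaped like the "coverage" dataset
# written by `gloader writecoverage` (rows ordered A, C, G, T).
coverage = np.array([
    [5, 0, 1, 0],   # A
    [0, 7, 0, 1],   # C
    [1, 0, 6, 0],   # G
    [0, 1, 0, 9],   # T
])

total = coverage.sum(axis=0)             # total read depth per position
major = coverage.argmax(axis=0)          # row index of the most-covered allele
alleles = np.array(list("ACGT"))[major]  # most-covered base per position
```

Summing over axis 0 collapses the four allele rows into overall depth, while argmax over the same axis recovers the dominant base at each position.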

 

Python Functions

Python functions for directly loading and parsing genome data.

Argument-level usage can be found in the docstrings within each script (located in genome_loader/).

 

encode_data.py

Contains functions for creating one-hot encoded data.

  • encode_sequence: Encodes input data into one-hot encoded format
  • encode_from_fasta: Create one-hot encoded data directly from FASTA
  • encode_from_h5: Create one-hot encoded data from char-array encoded H5

 

get_encoded.py

Contains functions for loading and transforming one-hot encoded data.

  • get_encoded_haps: Creates one-hot encoded haplotypes from one-hot encoded data

 

get_data.py

Functions that retrieve non-encoded data from files.

  • get_frag_depth: Retrieve fragment depths from a BAM file
  • get_allele_coverage: Retrieve per-allele coverage from a BAM file

 

load_data.py

Functions that read non-encoded data from files.

  • load_vcf: Read a VCF and load SNPs/genotypes into a dataframe

 

load_h5.py

Functions that load H5 data to python objects.

  • load_onehot_h5: Load a one-hot encoded genome from H5 into a dictionary
  • load_depth_h5: Load read depths from H5 into a dictionary
  • load_coverage_h5: Load allele coverage from H5 into a dictionary
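As an example of the dictionary-of-arrays shape these loaders produce, the per-chromosome layout described under writefrag can be read with h5py. This is a sketch built only on the documented file layout; the package's own load_depth_h5 may differ in signature and options:

```python
import os
import tempfile

import h5py
import numpy as np

def load_depth_dict(path):
    """Load each chromosome's "depth" dataset into a dict keyed by chrom,
    following the layout written by `gloader writefrag` (one group per
    chromosome, with a "depth" dataset inside)."""
    with h5py.File(path, "r") as f:
        return {chrom: f[chrom]["depth"][:] for chrom in f}

# Demo: write a tiny file in the documented layout, then load it back.
path = os.path.join(tempfile.mkdtemp(), "demo_depth.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("chr1")
    grp.create_dataset("depth", data=np.array([0, 2, 1], dtype=np.int64))
    grp.attrs["id"] = "demo"

depths = load_depth_dict(path)
```

Reading with `[:]` materializes each dataset as a numpy array, so the file can be closed while the returned dictionary remains usable.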
