genome-loader
Pipeline for efficient genomic data processing.
Installation
Recommended installation with conda/mamba:
mamba env create -n genome-loader -f environment.yml
If you want to use the PyTorch datasets, samplers, and other data-oriented classes, install from torch.conda-lock.yml instead:
mamba env create -n genome-loader -f torch.conda-lock.yml
A PyTorch dependency is otherwise not included, since installing it requires special instructions.
Then add the package to your Python path, e.g. by adding this line to your .bashrc:
export PYTHONPATH=${PYTHONPATH:+${PYTHONPATH}:}/path/to/genome-loader
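As a quick sanity check, the package should then be importable (assuming the path above points at the repository root, which contains the genome_loader/ package):

# Should print the package location once PYTHONPATH is set correctly.
import genome_loader
print(genome_loader.__file__)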
Table of Contents
- HDF5 Writers
- Python Functions
HDF5 Writers
Command line tools for writing genome data to HDF5 format
writefasta
Converts a FASTA file into a char-array (default) or one-hot encoded HDF5 file.
File Format
- Group: [chrom]
- Dataset: "sequence" if char array, "onehot" if one-hot encoded
- Attributes: "id" - dataset name associated with file
Usage
gloader writefasta [FASTA] --output/--directory [OUT] {OPTIONS}
Required Arguments
- FASTA: Positional argument, FASTA file to write to HDF5
- -o/--output: Full path and file name of output (NOTE: Cannot use both -o and -d flags)
- -d/--directory: Directory to write hdf5 output
One-Hot Encoding Arguments
- -e/--encode: Flag that denotes output in one-hot encoding
- -s/--spec: Ordered string of non-repeating characters denoting the encoded bases and their order, e.g. "ACGT" (Default: "ACGTN")
Optional Arguments
- -c/--chroms: Chromosomes to write (Default: ALL)
- -n/--name: Output file name if --directory given; ignored if --output is used. Defaults to the input FASTA name
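The resulting HDF5 file can be read back with h5py. A minimal sketch, assuming an output file genome.h5 containing chromosome chr1 (both names are placeholders):

import h5py

with h5py.File("genome.h5", "r") as f:
    seq = f["chr1"]["sequence"][:]  # char array; the dataset is "onehot" if -e/--encode was used
    print(seq[:10])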
writefrag
Writes ATAC fragment depth from a BAM file into an HDF5 file.
File Format
- Group: [chrom]
- Dataset: "depth" - 0-based array with depth per position
- Attributes: "id" - dataset name associated with file; "count_method" - method used to count fragments
Usage
gloader writefrag [BAM] --output/--directory [OUT] {OPTIONS}
Required Arguments
- BAM: Positional argument, BAM file to parse and write to H5
- -o/--output: Full path and file name of output (NOTE: Cannot use both -o and -d flags)
- -d/--directory: Directory to write hdf5 output
Optional Arguments
- -c/--chroms: Chromosomes to write (Default: ALL)
- -l/--lens: Lengths of provided chroms (Auto retrieved if not provided)
- -n/--name: Output file name if --directory given; ignored if --output is used. Defaults to the input BAM name
- --ignore_offset: Don't offset Tn5 cut sites (+4 bp on + strand, -5 bp on - strand, 0-based)
- --method: Method used to count fragments. Choice of "cutsite" | "midpoint" | "fragment" (Default: "cutsite")
  - cutsite: Count both Tn5 cut sites
  - midpoint: Count the midpoint between Tn5 cut sites
  - fragment: Count all positions between Tn5 cut sites
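The depth arrays can likewise be read back with h5py. A minimal sketch, assuming an output file sample.h5 with chromosome chr1 (placeholder names; the exact location of the "id" and "count_method" attributes is an assumption):

import h5py

with h5py.File("sample.h5", "r") as f:
    print(dict(f.attrs))           # e.g. "id" and "count_method", assuming file-level attributes
    depth = f["chr1"]["depth"][:]  # 0-based array, one depth value per position
    print(depth[:10])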
writecoverage
Writes BAM allelic coverage into HDF5 file.
File Format
- Group: [chrom]
- Dataset: "coverage" - 4 x N matrix ordered A, C, G, T showing per-allele coverage per position (0-based)
- Attributes: "id" - dataset name associated with file
Usage
gloader writecoverage [BAM] --output/--directory [OUT] {OPTIONS}
Required Arguments
- BAM: Positional argument, BAM file to parse and write to H5
- -o/--output: Full path and file name of output (NOTE: Cannot use both -o and -d flags)
- -d/--directory: Directory to write hdf5 output
Optional Arguments
- -c/--chroms: Chromosomes to write (Default: ALL)
- -n/--name: Output file name if --directory given; ignored if --output is used. Defaults to the input BAM name
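A minimal sketch of reading the coverage matrix back with h5py, assuming an output file coverage.h5 with chromosome chr1 (placeholder names):

import h5py

with h5py.File("coverage.h5", "r") as f:
    cov = f["chr1"]["coverage"][:]  # 4 x N matrix, rows ordered A, C, G, T
    a_cov = cov[0]                  # per-position coverage for allele A
    print(a_cov[:10])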
Python Functions
Python functions for directly loading and parsing genome data.
Argument-level usage can be found in the docstrings within each script (located in genome_loader/).
encode_data.py
Contains functions for creating one-hot encoded data.
- encode_sequence: Encodes input data into one-hot encoded format
- encode_from_fasta: Create one-hot encoded data directly from FASTA
- encode_from_h5: Create one-hot encoded data from char-array encoded H5
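A minimal sketch of the intended use; the exact argument names and return types are assumptions, so check the docstrings in genome_loader/encode_data.py for the real interface:

from genome_loader.encode_data import encode_from_fasta

# Hypothetical call: the signature is assumed, see the docstring for the real one.
onehot = encode_from_fasta("genome.fa")
print(type(onehot))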
get_encoded.py
Contains functions for loading and transforming one-hot encoded data.
- get_encoded_haps: Creates one-hot encoded haplotypes from one-hot encoded data
get_data.py
Functions that retrieve non-encoded data from files.
- get_frag_depth: Retrieve fragment depths from a BAM file
- get_allele_coverage: Retrieve per-allele coverage from a BAM file
load_data.py
Functions that read non-encoded data from files.
- load_vcf: Read a VCF and load SNPs/genotypes into a dataframe
load_h5.py
Functions that load H5 data into Python objects.
- load_onehot_h5: Load a one-hot encoded genome from H5 into a dictionary
- load_depth_h5: Load read depths from H5 into a dictionary
- load_coverage_h5: Load allele coverage from H5 into a dictionary
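A minimal sketch of loading one of these dictionaries; the exact signatures are assumptions, so check the docstrings in genome_loader/load_h5.py:

from genome_loader.load_h5 import load_depth_h5

# Hypothetical call: assumes the loader takes an HDF5 path and returns a
# dict keyed by chromosome, as described above.
depths = load_depth_h5("sample.h5")
print(depths["chr1"][:10])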