Filters .gtf file of suspected HIV isoforms and confirms the isoform identities
Project description
HIV Isoform Checker
This package takes a .gtf file of preliminarily filtered HIV transcripts from ONT sequencing and filters them to include only correctly assigned transcripts using the following filters.
- FILTER 1: only include class codes =, J, and m
- FILTER 2: only include samples with end values >= min_end_bp and start values <= max_start_bp
- FILTER 3: get rid of any samples with read errors/small gaps
- FILTER 4: keep only correct Env samples
- FILTER 5: keep only correct Nef samples(long samples added to possible_misassigned)
- FILTER 6: keep only correct Rev samples(long samples added to possible_misassigned)
- FILTER 7: keep only correct Tat samples(long samples added to possible_misassigned)
- FILTER 8: keep only correct Vif samples
- FILTER 9: keep only correct Vpr samples
- FILTER 10: check possible_misassigned for partial splice compatibility (vif -> vpr -> unslpiced_tat -> env)
Note: This code currently relies on a very specific setup of the gtf file to work properly. The note must be in the order designated in the []. transcript entry = ref genome, analysis_pathway, transcript, start, end, ".", "+", ".", [transcript id; gene id; gene_name; xloc; ref_gene_id; contained_in; cmp_ref; class_code; tss_id] exon entry = ref genome, analysis_pathway, exon, start, end, ".", "+", ".", [transcript id; gene id; exon number]
Installation
Fast install:
pip install HIV_Isoform_Checker
Usage
HIV_Isoform_Checker [options] input_file_name output_file_prefix ref_file_name
positional arguments: input_file_name Designates input file to be filtered. This is required. output_file_prefix Designates output file prefix. This is required. ref_file_name Designates reference CDS file name. This should be a python file with only a dictionary with the splice donor sites, splice acceptor sites and gene CDS regions defined. This is required. An example is available in the test data set.
options:
Arguement | Function |
---|---|
-h, --help | show this help message and exit |
-g value, --gap value | Sets gap tolerance. Default is 15. |
-a value, --startBP value | Sets maximum starting bp. Default is 700. |
-z value, --endBP value | Sets minimum ending bp. Default is 9500. |
-l value, --lengthFS value | Sets maximum fully spliced transcript length. Default is 2100. |
-n value, --NCE value | When set to True, csv file will have y/n columns for the precence of NCEs. Default is False. |
License
MIT - Copyright (c) 2023 Jessica Lauren Albert
HIV_Isoform_Checker
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for hiv_isoform_checker-1.1.0.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | 07a5623e3feb250b205a769da8445ed2ae42225a15575f7a474f5b83ed37012d |
|
MD5 | 4cb3daf354336fac9d9e3745efb18d6a |
|
BLAKE2b-256 | 088a324b187050adde692d071cfd682223d3de9042d0f242d0706bd22ec70fd4 |
Hashes for HIV_Isoform_Checker-1.1.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | ff379595f8f7d71b768c12129a9611fd6a80eedbea2341ca2493829e486357a0 |
|
MD5 | e08b3cfd46b9607544c11959022bf711 |
|
BLAKE2b-256 | 04cbc858104cceaf72f97354d69345c3a8f98bbc52c2b59d65cdf13938d31b5a |