Extract longest transcript or longest CDS transcript from GTF annotation file or gencode transcripts fasta file.
GetTransTool Package
There are four methods to extract the longest transcript, or the longest CDS region with the longest transcript, from a transcripts fasta file or a GTF file.
1. Extract longest transcript from gencode transcripts fasta file.
2. Extract longest transcript from gtf format annotation file based on gencode/ensembl/ucsc database.
3. Extract longest CDS region with longest transcript from gencode database transcripts fasta file.
4. Extract longest CDS region with longest transcript from gtf format annotation file based on ensembl/ucsc database.
Install
$ pip install GetTransTool
Usage
1. Extract longest transcript from gencode transcripts fasta file:
help information:
$ GetLongestTransFromGencode -h
usage: GetLongestTransFromGencode --file gencode.vM28.transcripts.fa.gz --outfile longest_trans.fa
Get longest transcripts from gencode transcripts fasta file.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-f transfile, --file transfile
input your transcripts file with ".gz" format. (gencode.vM28.transcripts.fa.gz)
-o longestfile, --outfile longestfile
output your longest transcript file. (longest_trans.fa)
Thank your for your support, if you have any questions or suggestions please contact me: 3219030654@stu.cpu.edu.cn.
usage:
$ GetLongestTransFromGencode --file gencode.vM28.transcripts.fa.gz --outfile longest_trans_gencode.fa
Your job is running, please wait...
Your job is done!
Running with 32.33 seconds!
Three files are produced: name_changed.fa, longest_transcripts_info.csv, and longest_trans_gencode.fa.
name_changed.fa:
>4933401J01Rik|ENSMUSG00000102693.2|ENSMUST00000193812.2|1070
AAGGAAAGAGGATAACACTTGAAATGTAAATAAAGAAAATACCTAATAAAAATAAATAAA
AACATGCTTTCAAAGGAAATAAAAAGTTGGATTCAAAAATTTAACTTTTGCTCATTTGGT
ATAATCAAGGAAAAGACCTTTGCATATAAAATATATTTTGAATAAAATTCAGTGGAAGAA
...
longest_transcripts_info.csv:
This file contains the exon length information of the longest transcripts.
fullname,gene_name,translength
snoZ196_ENSMUSG00002074855.1|ENSMUST00020182568.1|35,snoZ196,35
snoZ159_ENSMUSG00002075734.1|ENSMUST00020182611.1|87,snoZ159,87
n-R5s93_ENSMUSG00000119639.1|ENSMUST00000240071.1|119,n-R5s93,119
...
longest_trans_gencode.fa:
This is the filtered fasta file containing only the longest transcripts.
>4933401J01Rik|ENSMUSG00000102693.2|ENSMUST00000193812.2|1070
AAGGAAAGAGGATAACACTTGAAATGTAAATAAAGAAAATACCTAATAAAAATAAATAAA
AACATGCTTTCAAAGGAAATAAAAAGTTGGATTCAAAAATTTAACTTTTGCTCATTTGGT
ATAATCAAGGAAAAGACCTTTGCATATAAAATATATTTTGAATAAAATTCAGTGGAAGAA
...
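The selection step can be illustrated with a minimal sketch. It is not GetTransTool's own code. It assumes the standard GENCODE transcript header layout, where `|`-separated fields hold the transcript ID, gene ID, two Havana IDs, transcript name, gene name, length and biotype. The function names `read_fasta` and `longest_per_gene` are hypothetical, and keeping the first transcript seen on a length tie is an assumption.

```python
import gzip
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a plain or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_gene(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Keep the longest transcript per gene ID (first seen wins on ties)."""
    best: Dict[str, Tuple[str, str]] = {}
    for header, seq in records:
        fields = header.split("|")
        # Assumed GENCODE layout: field 0 = transcript ID, 1 = gene ID,
        # 5 = gene name.
        transcript_id, gene_id, gene_name = fields[0], fields[1], fields[5]
        new_header = f"{gene_name}|{gene_id}|{transcript_id}|{len(seq)}"
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (new_header, seq)
    return best
```

Calling `longest_per_gene(read_fasta("gencode.vM28.transcripts.fa.gz"))` would return one renamed record per gene, in the same `gene_name|gene_id|transcript_id|length` header style shown in the sample output above.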
2. Extract longest transcript from gtf format annotation file based on gencode/ensembl/ucsc database:
help information:
$ GetLongestTransFromGTF -h
usage: GetLongestTransFromGTF --database ensembl --gtffile Homo_sapiens.GRCh38.101.gtf.gz --genome Homo_sapiens.GRCh38.dna.primary_assembly.fa --outfile longest_trans.fa
Extract longest transcript from gtf format annotation file based on gencode/ensembl/ucsc database.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-d databse, --database databse
which annotation database you choose. (default="ensembl", ucsc/ensembl/gencode)
-g gtffile, --gtffile gtffile
input your GTF file with ".gz" format.
-fa genome, --genome genome
your genome fasta file matched with your GTF file with ".gz" format. (Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz)
-o longestfile, --outfile longestfile
output your longest transcript file. (longest_trans.fa)
Thank your for your support, if you have any questions or suggestions please contact me: 3219030654@stu.cpu.edu.cn.
usage:
$ GetLongestTransFromGTF --database ensembl --gtffile Homo_sapiens.GRCh38.103.gtf.gz --genome Homo_sapiens.GRCh38.dna.primary_assembly.fa --outfile longest_trans_ensembl.fa
Your job is running, please wait...
Your job is done!
Running with 159.51 seconds!
for ucsc:
$ GetLongestTransFromGTF --database ucsc --gtffile hg19.ncbiRefSeq.gtf.gz --genome hg19.fa --outfile longest_trans_ucsc.fa
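For a GTF file, a transcript's length is usually taken as the sum of its exon lengths. The sketch below shows that calculation in plain Python. It is an illustration under stated assumptions, not the package's implementation: it assumes every `exon` record carries quoted `gene_id` and `transcript_id` attributes, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches attribute pairs such as: gene_id "ENSG00000000001";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def transcript_lengths(gtf_lines):
    """Sum exon lengths per (gene_id, transcript_id) from GTF lines."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue  # only exon records contribute to transcript length
        attrs = dict(ATTR_RE.findall(cols[8]))
        start, end = int(cols[3]), int(cols[4])
        # GTF coordinates are 1-based and inclusive.
        lengths[(attrs["gene_id"], attrs["transcript_id"])] += end - start + 1
    return dict(lengths)


def longest_transcript_per_gene(lengths):
    """Map gene_id -> (transcript_id, length) for the longest transcript."""
    best = {}
    for (gene_id, tx_id), length in lengths.items():
        if gene_id not in best or length > best[gene_id][1]:
            best[gene_id] = (tx_id, length)
    return best
```

The lines can come from any iterable, for example a file opened with `gzip.open(path, "rt")`.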
3. Extract longest CDS region with longest transcript from gencode database transcripts fasta file.
help information:
$ GetCDSLongestFromGencode -h
usage: GetCDSLongestFromGencode --file gencode.vM28.pc_transcripts.fa.gz --outfile longest_cds_trans.fa
Extract longest CDS regeion with longest transcript from gencode database transcripts fasta file.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-f transfile, --file transfile
input your protein-coding transcripts file with ".gz" format. (gencode.vM28.pc_transcripts.fa.gz)
-o longestfile, --outfile longestfile
output your longest transcript file. (longest_cds_trans.fa)
Thank your for your support, if you have any questions or suggestions please contact me: 3219030654@stu.cpu.edu.cn.
usage:
$ GetCDSLongestFromGencode --file gencode.vM28.pc_transcripts.fa.gz --outfile longest_cds_trans_gencode.fa
Your job is running, please wait...
Your job is done!
Running with 17.67 seconds!
Four files are produced: name_changed.fa, All_transcripts_cds_info.csv, longest_cds_transcripts_info.csv, and longest_cds_trans_gencode.fa.
name_changed.fa:
>Xkr4_ENSMUSG00000051951.6_ENSMUST00000070533.5_151_2094_3634
GCGGCGGCGGGCGAGCGGGCGCTGGAGTAGGAGCTGGGGAGCGGCGCGGCCGGGGAAGGA
AGCCAGGGCGAGGCGAGGAGGTGGCGGGAGGAGGAGACAGCAGGGACAGGTGTCAGATAA
AGGAGTGCTCTCCTCCGCTGCCGAGGCATCATGGCCGCTAAGTCAGACGGGAGGCTGAAG
...
All_transcripts_cds_info.csv:
This file contains the CDS and exon length information of all transcripts.
fullname,gene_name,translength,cdslength
>mt-Nd6_ENSMUSG00000064368.1_ENSMUST00000082419.1_1_519_519,>mt-Nd6,519,519
>mt-Nd5_ENSMUSG00000064367.1_ENSMUST00000082418.1_1_1824_1824,>mt-Nd5,1824,1824
>mt-Nd4l_ENSMUSG00000065947.1_ENSMUST00000084013.1_1_297_297,>mt-Nd4l,297,297
...
longest_cds_transcripts_info.csv:
fullname,gene_name,translength,cdslength
>mt-Nd6_ENSMUSG00000064368.1_ENSMUST00000082419.1_1_519_519,>mt-Nd6,519,519
>mt-Nd5_ENSMUSG00000064367.1_ENSMUST00000082418.1_1_1824_1824,>mt-Nd5,1824,1824
>mt-Nd4l_ENSMUSG00000065947.1_ENSMUST00000084013.1_1_297_297,>mt-Nd4l,297,297
...
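One way to go from a table of all transcripts to one row per gene is sketched below. The tie-breaking rule is an assumption, since the exact rule is not documented here: the sketch prefers the longest CDS and, among equal CDS lengths, the longest transcript. The function name `pick_longest_cds` is hypothetical, and the column names follow the CSV header shown above.

```python
import csv
import io


def pick_longest_cds(csv_text):
    """Return one row per gene, maximising (cdslength, translength)."""
    best = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Compare CDS length first, then transcript length as a tie-breaker.
        key = (int(row["cdslength"]), int(row["translength"]))
        gene = row["gene_name"]
        if gene not in best or key > best[gene][0]:
            best[gene] = (key, row)
    return [row for _, row in best.values()]
```

To apply it to a real file, read the text first, for example with `open("All_transcripts_cds_info.csv").read()`.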
longest_cds_trans_gencode.fa:
>Xkr4_ENSMUSG00000051951.6_ENSMUST00000070533.5_151_2094_3634
GCGGCGGCGGGCGAGCGGGCGCTGGAGTAGGAGCTGGGGAGCGGCGCGGCCGGGGAAGGA
AGCCAGGGCGAGGCGAGGAGGTGGCGGGAGGAGGAGACAGCAGGGACAGGTGTCAGATAA
AGGAGTGCTCTCCTCCGCTGCCGAGGCATCATGGCCGCTAAGTCAGACGGGAGGCTGAAG
...
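The renamed headers appear to encode, separated by `_`, the gene name, gene ID, transcript ID, CDS start, CDS end and transcript length. This reading is inferred from the samples above (for example, `mt-Nd6` with `1_519_519` has a CDS length and a transcript length of 519) and is not confirmed by the package documentation. Under that assumption, a header can be parsed with a small helper. The names `CdsHeader` and `parse_cds_header` are hypothetical.

```python
from typing import NamedTuple


class CdsHeader(NamedTuple):
    gene_name: str
    gene_id: str
    transcript_id: str
    cds_start: int
    cds_end: int
    transcript_length: int

    @property
    def cds_length(self) -> int:
        # Coordinates are assumed to be 1-based and inclusive.
        return self.cds_end - self.cds_start + 1


def parse_cds_header(header: str) -> CdsHeader:
    """Split a renamed header from the right so gene names may contain '_'."""
    name, gene_id, tx_id, start, end, length = header.lstrip(">").rsplit("_", 5)
    return CdsHeader(name, gene_id, tx_id, int(start), int(end), int(length))
```

Splitting from the right with `rsplit("_", 5)` keeps the last five fields fixed, so an underscore inside the gene name does not shift the numeric fields.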
4. Extract longest CDS region with longest transcript from gtf format annotation file based on ensembl/ucsc database.
help information:
$ GetCDSLongestFromGTF -h
usage: GetCDSLongestFromGTF --database ensembl --gtffile Homo_sapiens.GRCh38.101.gtf.gz --genome Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz --outfile longest_cds_trans.fa
Extract longest CDS regeion with longest transcript from gtf format annotation file based on ensembl/ucsc database.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-d databse, --database databse
which annotation database you choose. (default="ensembl", ucsc/ensembl)
-g gtffile, --gtffile gtffile
input your GTF file with ".gz" format.
-fa genome, --genome genome
your genome fasta file matched with your GTF file with ".gz" format. (Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz)
-o cdslongestfile, --outfile cdslongestfile
output your longest transcript file. (longest_cds_trans.fa)
Thank your for your support, if you have any questions or suggestions please contact me: 3219030654@stu.cpu.edu.cn.
usage:
$ GetCDSLongestFromGTF --database ensembl --gtffile Homo_sapiens.GRCh38.103.gtf.gz --genome Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz --outfile longest_cds_trans_ensembl.fa
Your job is running, please wait...
Your job is done!
Running with 152.38 seconds!
for ucsc:
$ GetCDSLongestFromGTF --database ucsc --gtffile hg19.ncbiRefSeq.gtf.gz --genome hg19.fa.gz --outfile longest_cds_trans_ucsc.fa
END
Thank you for your support. If you have any questions or suggestions, please contact me: 3219030654@stu.cpu.edu.cn.