Genomic Position with Python (GPPY)

Genomic Positioning with Python

gppy is a lightweight, easy-to-install Python package (no third-party dependencies) for converting between genomic and transcript coordinates, intended to facilitate transcriptome and translatome analysis.

Main features include:

  • convert transcript/CDS coordinates/intervals to genomic coordinates/intervals in bed12 format and vice versa, correctly accounting for introns.
  • extract mRNA/CDS/UTR intervals from GTF and export in bed12 format.
  • extract metadata from GTF files (including gene names, biotypes, canonical status, and transcript/CDS/UTR lengths) and export in tabular format.

Installation

pip install gppy

# alternatively, download wheel and install
pip install gppy-version-py3-none-any.whl

Run without installation

Scripts in this package rely only on the Python standard library (tested with Python >= 3.7). No third-party dependencies are required. After downloading, all scripts can be run from the command line without installation.

wget https://raw.githubusercontent.com/mt1022/gppy/main/gppy/gtf.py

To run gppy:

# as package
gppy subcommand -h

# script
python gppy/gtf.py subcommand -h

How to specify coordinates when using gppy?

  • Genomic positions (gpos), genomic intervals (giv), transcriptomic positions (tpos), and transcriptomic intervals (tiv) are all 1-based, and intervals include both ends. For example, if a region spans the fifth to the tenth nucleotide of an mRNA, tiv should be (5, 10) and the tpos of the first nucleotide in this region is 5.
  • bed files generated by gppy represent genomic regions as zero-based half-open intervals, following common practice. For example, if a region spans the fifth to the tenth nucleotide of chr1, the first three columns for this region in bed will be chr1 4 10.
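The two conventions above differ only in the start coordinate. A minimal sketch of the conversion, assuming 1-based intervals are closed on both ends (the function names are hypothetical and not part of gppy):

```python
def one_based_to_bed(start, end):
    """Convert a 1-based closed interval to a 0-based half-open BED interval."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the 1-based inclusive end equals the BED exclusive end.
    return start - 1, end


def bed_to_one_based(bed_start, bed_end):
    """Convert a 0-based half-open BED interval to a 1-based closed interval."""
    if bed_start < 0 or bed_end <= bed_start:
        raise ValueError("expected 0 <= bed_start < bed_end")
    return bed_start + 1, bed_end


# The fifth to tenth nucleotide of chr1, as in the example above.
print(one_based_to_bed(5, 10))   # (4, 10)
print(bed_to_one_based(4, 10))   # (5, 10)
```

The interval length is `end - start + 1` in the 1-based form and `bed_end - bed_start` in the BED form; both give 6 here.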

Examples

Extract transcript length stats and metadata:

gppy txinfo -g test/human.chrY.gtf >test/human.chrY.txinfo.tsv
cut -f1-9,12,15,19-22 test/human.chrY.txinfo.tsv | head
# tx_name	gene_id	chrom	strand	nexon	tx_len	cds_len	utr5_len	utr3_len	gene_name	transcript_biotype	ccds	ensembl_canonical	mane_select	basic
# ENST00000431340	ENSG00000215601	Y	+	4	443	0	0	0	TSPY24P	unprocessed_pseudogene	False	True	False	True
# ENST00000415010	ENSG00000215603	Y	-	1	1191	0	0	0	ZNF92P1Y	processed_pseudogene	False	True	False	True
# ENST00000449381	ENSG00000231436	Y	-	8	1145	0	0	0	RBMY3AP	unprocessed_pseudogene	False	True	False	True
# ENST00000436888	ENSG00000225878	Y	-	1	1164	0	0	0	SERBP1P2	processed_pseudogene	False	True	False	True
# ENST00000421279	ENSG00000236435	Y	-	5	868	0	0	0	TSPY12P	unprocessed_pseudogene	False	True	False	True
# ENST00000430032	ENSG00000278478	Y	+	1	279	0	0	0		processed_pseudogene	False	True	False	True
# ENST00000557448	ENSG00000258991	Y	+	1	1267	0	0	0	DUX4L19	unprocessed_pseudogene	False	True	False	True
# ENST00000651670	ENSG00000237048	Y	+	4	1123	0	0	0	TTTY12	lncRNA	False	False	False	True
# ENST00000413466	ENSG00000237048	Y	+	3	1046	0	0	0	TTTY12	lncRNA	False	True	False	False

Note: if your GTF file is not formatted like those distributed by Ensembl, gppy may fail when trying to extract metadata. In such cases, you can try gppy txinfo_basic to get only basic information, including name, id, chrom, strand, and length-related features.
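To check whether a GTF carries the attributes that the metadata columns rely on, it can help to look at the ninth column directly. The sketch below parses an Ensembl-style attribute string with the standard library; it is an illustration, not gppy's own parser. The attribute string is hand-written from the first row of the table above, and the function name is hypothetical.

```python
import re

# Ensembl-style attributes look like: key "value"; key "value"; ...
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_attributes(field):
    """Return a dict mapping each attribute key to the list of its values."""
    attrs = {}
    for key, value in ATTR_RE.findall(field):
        # Keys such as `tag` can occur several times, so values are collected in lists.
        attrs.setdefault(key, []).append(value)
    return attrs


# Hand-written example in Ensembl style (an assumption, not copied from the test file).
field = (
    'gene_id "ENSG00000215601"; transcript_id "ENST00000431340"; '
    'gene_name "TSPY24P"; transcript_biotype "unprocessed_pseudogene"; '
    'tag "basic"; tag "Ensembl_canonical";'
)
attrs = parse_gtf_attributes(field)
print(attrs["transcript_id"])  # ['ENST00000431340']
print(attrs["tag"])            # ['basic', 'Ensembl_canonical']
```

If keys such as `gene_name` or `transcript_biotype` are missing from the result for your file, the full metadata extraction is likely to be affected.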

Extract CDS regions of each protein-coding transcript and export in bed12 format

gppy convert2bed -g test/human.chrY.gtf -t cds >test/human.chrY.cds.bed12
head test/human.chrY.cds.bed12
# Y	22501564	22514067	ENST00000303728	ENSG00000169789	+	0	0	0	3	69,116,256,	0,2644,12247,
# Y	22501564	22512665	ENST00000477123	ENSG00000169789	+	0	0	0	3	69,116,28,	0,2644,11073,
# Y	12709447	12859413	ENST00000651177	ENSG00000114374	+	0	0	0	44	96,149,80,113,219,116,252,139,153,105,207,137,134,88,343,96,212,241,150,121,131,279,126,126,170,109,147,147,223,221,191,174,142,751,124,226,130,186,221,89,157,213,96,135,	0,11141,12660,15665,17127,26164,26550,26963,28709,30077,47744,49058,51036,61620,64135,66023,67201,68571,69158,70078,76755,77074,80959,82051,83584,100731,101224,102187,103382,106676,108972,124240,128463,130416,131562,132792,133616,136885,137485,137791,146892,147185,148118,149831,
# Y	12709447	12859413	ENST00000338981	ENSG00000114374	+	0	0	0	44	96,149,80,113,219,116,252,139,153,105,207,137,134,88,343,96,212,241,150,121,131,279,126,126,170,109,147,147,223,221,191,174,142,751,124,226,130,186,221,89,157,213,96,135,	0,11141,12660,15665,17127,26164,26550,26963,28709,30077,47744,49058,51036,61620,64135,66023,67201,68571,69158,70078,76755,77074,80959,82051,83584,100731,101224,102187,103382,106676,108972,124240,128463,130416,131562,132792,133616,136885,137485,137791,146892,147185,148118,149831,
# Y	12847044	12859413	ENST00000453031	ENSG00000114374	+	0	0	0	5	109,157,213,96,135,	0,9295,9588,10521,12234,
# Y	22072325	22084839	ENST00000303804	ENSG00000169807	-	0	0	0	3	256,116,69,	0,9754,12445,
# Y	22073730	22084839	ENST00000472391	ENSG00000169807	-	0	0	0	3	28,116,69,	0,8349,11040,
# Y	20575871	20592343	ENST00000361365	ENSG00000198692	+	0	0	0	7	16,84,104,51,82,92,3,	0,3736,6718,8602,12152,13612,16469,
# Y	20575871	20592343	ENST00000382772	ENSG00000198692	+	0	0	0	6	16,84,104,82,92,3,	0,3736,6718,12152,13612,16469,
# Y	22992343	22992376	ENST00000602732	ENSG00000183753	+	0	0	0	1	33,	0,
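Columns 10 to 12 of each row hold the block count, the block sizes, and the block starts relative to column 2. A minimal sketch that expands one row into its genomic blocks, using the first row above as input (the function name is hypothetical and not part of gppy):

```python
def bed12_blocks(line):
    """Expand a bed12 row into 0-based half-open genomic (start, end) blocks."""
    fields = line.rstrip("\n").split("\t")
    chrom_start = int(fields[1])
    block_count = int(fields[9])
    # Size and start lists end with a trailing comma, so drop empty items.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if len(sizes) != block_count or len(starts) != block_count:
        raise ValueError("block count does not match block sizes/starts")
    return [(chrom_start + s, chrom_start + s + n) for s, n in zip(starts, sizes)]


row = ("Y\t22501564\t22514067\tENST00000303728\tENSG00000169789\t+\t0\t0\t0\t3"
       "\t69,116,256,\t0,2644,12247,")
blocks = bed12_blocks(row)
print(blocks)
# [(22501564, 22501633), (22504208, 22504324), (22513811, 22514067)]
print(sum(end - start for start, end in blocks))  # 441
```

The last block ends at column 3 of the row (22514067), and the summed block length of 441 nt is the CDS length of this transcript.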

Convert CDS regions in genome coordinates to transcriptome coordinates

awk -v OFS="\t" '{print $4, $2 + 1, $3, $6}' test/human.chrY.cds.bed12 >test/human.chrY.cds.giv.tsv
gppy giv2tiv -g test/human.chrY.gtf -i test/human.chrY.cds.giv.tsv >test/human.chrY.cds.tiv.tsv

head test/human.chrY.cds.giv.tsv
# ENST00000303728	22501565	22514067	+
# ENST00000477123	22501565	22512665	+
# ENST00000651177	12709448	12859413	+
# ENST00000338981	12709448	12859413	+
# ENST00000453031	12847045	12859413	+
# ENST00000303804	22072326	22084839	-
# ENST00000472391	22073731	22084839	-
# ENST00000361365	20575872	20592343	+
# ENST00000382772	20575872	20592343	+
# ENST00000602732	22992344	22992376	+

head test/human.chrY.cds.tiv.tsv
# ENST00000303728	22501565	22514067	+	228	668	exon	exon
# ENST00000477123	22501565	22512665	+	228	440	exon	exon
# ENST00000651177	12709448	12859413	+	587	8251	exon	exon
# ENST00000338981	12709448	12859413	+	946	8610	exon	exon
# ENST00000453031	12847045	12859413	+	1	710	exon	exon
# ENST00000303804	22072326	22084839	-	228	668	exon	exon
# ENST00000472391	22073731	22084839	-	228	440	exon	exon
# ENST00000361365	20575872	20592343	+	97	528	exon	exon
# ENST00000382772	20575872	20592343	+	79	459	exon	exon
# ENST00000602732	22992344	22992376	+	527	559	exon	exon

Convert CDS regions in transcriptome coordinates to genome coordinates

cut -f1,5,6 test/human.chrY.cds.tiv.tsv >test/human.chrY.cds.tiv2.tsv
gppy tiv2giv -g test/human.chrY.gtf -i test/human.chrY.cds.tiv2.tsv -a >test/human.chrY.cds.giv2.bed12

head test/human.chrY.cds.tiv2.tsv
# ENST00000303728	228	668
# ENST00000477123	228	440
# ENST00000651177	587	8251
# ENST00000338981	946	8610
# ENST00000453031	1	710
# ENST00000303804	228	668
# ENST00000472391	228	440
# ENST00000361365	97	528
# ENST00000382772	79	459
# ENST00000602732	527	559

head test/human.chrY.cds.giv2.bed12
# Y	22501564	22514067	ENST00000303728	ENSG00000169789	+	0	0	0	3	69,116,256,	0,2644,12247,	ENST00000303728	228	668
# Y	22501564	22512665	ENST00000477123	ENSG00000169789	+	0	0	0	3	69,116,28,	0,2644,11073,	ENST00000477123	228	440
# Y	12709447	12859413	ENST00000651177	ENSG00000114374	+	0	0	0	44	96,149,80,113,219,116,252,139,153,105,207,137,134,88,343,96,212,241,150,121,131,279,126,126,170,109,147,147,223,221,191,174,142,751,124,226,130,186,221,89,157,213,96,135,	0,11141,12660,15665,17127,26164,26550,26963,28709,30077,47744,49058,51036,61620,64135,66023,67201,68571,69158,70078,76755,77074,80959,82051,83584,100731,101224,102187,103382,106676,108972,124240,128463,130416,131562,132792,133616,136885,137485,137791,146892,147185,148118,149831,	ENST00000651177	587	8251
# Y	12709447	12859413	ENST00000338981	ENSG00000114374	+	0	0	0	44	96,149,80,113,219,116,252,139,153,105,207,137,134,88,343,96,212,241,150,121,131,279,126,126,170,109,147,147,223,221,191,174,142,751,124,226,130,186,221,89,157,213,96,135,	0,11141,12660,15665,17127,26164,26550,26963,28709,30077,47744,49058,51036,61620,64135,66023,67201,68571,69158,70078,76755,77074,80959,82051,83584,100731,101224,102187,103382,106676,108972,124240,128463,130416,131562,132792,133616,136885,137485,137791,146892,147185,148118,149831,	ENST00000338981	946	8610
# Y	12847044	12859413	ENST00000453031	ENSG00000114374	+	0	0	0	5	109,157,213,96,135,	0,9295,9588,10521,12234,	ENST00000453031	1	710
# Y	22072325	22084839	ENST00000303804	ENSG00000169807	-	0	0	0	3	256,116,69,	0,9754,12445,	ENST00000303804	228	668
# Y	22073730	22084839	ENST00000472391	ENSG00000169807	-	0	0	0	3	28,116,69,	0,8349,11040,	ENST00000472391	228	440
# Y	20575871	20592343	ENST00000361365	ENSG00000198692	+	0	0	0	7	16,84,104,51,82,92,3,	0,3736,6718,8602,12152,13612,16469,	ENST00000361365	97	528
# Y	20575871	20592343	ENST00000382772	ENSG00000198692	+	0	0	0	6	16,84,104,82,92,3,	0,3736,6718,12152,13612,16469,	ENST00000382772	79	459
# Y	22992343	22992376	ENST00000602732	ENSG00000183753	+	0	0	0	1	33,	0,	ENST00000602732	527	559

# apart from the three input columns appended by `-a`, the above should be identical to the CDS regions we extracted from GTF with `convert2bed`
head test/human.chrY.cds.bed12
# Y	22501564	22514067	ENST00000303728	ENSG00000169789	+	0	0	0	3	69,116,256,	0,2644,12247,
# Y	22501564	22512665	ENST00000477123	ENSG00000169789	+	0	0	0	3	69,116,28,	0,2644,11073,
# Y	12709447	12859413	ENST00000651177	ENSG00000114374	+	0	0	0	44	96,149,80,113,219,116,252,139,153,105,207,137,134,88,343,96,212,241,150,121,131,279,126,126,170,109,147,147,223,221,191,174,142,751,124,226,130,186,221,89,157,213,96,135,	0,11141,12660,15665,17127,26164,26550,26963,28709,30077,47744,49058,51036,61620,64135,66023,67201,68571,69158,70078,76755,77074,80959,82051,83584,100731,101224,102187,103382,106676,108972,124240,128463,130416,131562,132792,133616,136885,137485,137791,146892,147185,148118,149831,
# Y	12709447	12859413	ENST00000338981	ENSG00000114374	+	0	0	0	44	96,149,80,113,219,116,252,139,153,105,207,137,134,88,343,96,212,241,150,121,131,279,126,126,170,109,147,147,223,221,191,174,142,751,124,226,130,186,221,89,157,213,96,135,	0,11141,12660,15665,17127,26164,26550,26963,28709,30077,47744,49058,51036,61620,64135,66023,67201,68571,69158,70078,76755,77074,80959,82051,83584,100731,101224,102187,103382,106676,108972,124240,128463,130416,131562,132792,133616,136885,137485,137791,146892,147185,148118,149831,
# Y	12847044	12859413	ENST00000453031	ENSG00000114374	+	0	0	0	5	109,157,213,96,135,	0,9295,9588,10521,12234,
# Y	22072325	22084839	ENST00000303804	ENSG00000169807	-	0	0	0	3	256,116,69,	0,9754,12445,
# Y	22073730	22084839	ENST00000472391	ENSG00000169807	-	0	0	0	3	28,116,69,	0,8349,11040,
# Y	20575871	20592343	ENST00000361365	ENSG00000198692	+	0	0	0	7	16,84,104,51,82,92,3,	0,3736,6718,8602,12152,13612,16469,
# Y	20575871	20592343	ENST00000382772	ENSG00000198692	+	0	0	0	6	16,84,104,82,92,3,	0,3736,6718,12152,13612,16469,
# Y	22992343	22992376	ENST00000602732	ENSG00000183753	+	0	0	0	1	33,	0,
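The reason a single transcript interval can become several bed12 blocks is that it may cross exon junctions. The sketch below shows one way to split a transcript interval into genomic blocks; it illustrates the idea and is not gppy's implementation. The exon coordinates are a toy example, and all coordinates are 1-based closed intervals.

```python
def tiv_to_blocks(exons, strand, t_start, t_end):
    """Map a transcript interval onto genomic blocks.

    exons: 1-based closed genomic (start, end) pairs, sorted by genomic start.
    Returns 1-based closed genomic blocks sorted by genomic start.
    """
    # Transcript order follows genomic order on '+', and is reversed on '-'.
    ordered = exons if strand == "+" else exons[::-1]
    blocks = []
    consumed = 0  # transcript bases covered by exons already visited
    for g_start, g_end in ordered:
        length = g_end - g_start + 1
        # Part of the requested interval that falls inside this exon.
        lo = max(t_start, consumed + 1)
        hi = min(t_end, consumed + length)
        if lo <= hi:
            first, last = lo - consumed - 1, hi - consumed - 1  # 0-based offsets
            if strand == "+":
                blocks.append((g_start + first, g_start + last))
            else:
                blocks.append((g_end - last, g_end - first))
        consumed += length
    return sorted(blocks)


# Toy transcript with exons of 10, 20 and 5 nt (hypothetical coordinates).
exons = [(101, 110), (201, 220), (301, 305)]
print(tiv_to_blocks(exons, "+", 5, 15))  # [(105, 110), (201, 205)]
print(tiv_to_blocks(exons, "-", 5, 15))  # [(211, 220), (301, 301)]
```

In both cases the blocks add up to 11 nt, the length of the transcript interval (5, 15).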

Conversion between genomic and transcriptomic positions for individual sites

cut -f1,2 test/human.chrY.cds.tiv2.tsv >test/human.chrY.cds.start.tpos.tsv
gppy t2g -g test/human.chrY.gtf -i test/human.chrY.cds.start.tpos.tsv >test/human.chrY.cds.start.gpos.tsv

head test/human.chrY.cds.start.tpos.tsv
# ENST00000303728	228
# ENST00000477123	228
# ENST00000651177	587
# ENST00000338981	946
# ENST00000453031	1
# ENST00000303804	228
# ENST00000472391	228
# ENST00000361365	97
# ENST00000382772	79
# ENST00000602732	527

head test/human.chrY.cds.start.gpos.tsv
# ENST00000303728	228	Y	+	22501565
# ENST00000477123	228	Y	+	22501565
# ENST00000651177	587	Y	+	12709448
# ENST00000338981	946	Y	+	12709448
# ENST00000453031	1	Y	+	12847045
# ENST00000303804	228	Y	-	22084839
# ENST00000472391	228	Y	-	22084839
# ENST00000361365	97	Y	+	20575872
# ENST00000382772	79	Y	+	20575872
# ENST00000602732	527	Y	+	22992344

cut -f1,5 test/human.chrY.cds.start.gpos.tsv >test/human.chrY.cds.start.gpos2.tsv
gppy g2t -g test/human.chrY.gtf -i test/human.chrY.cds.start.gpos2.tsv >test/human.chrY.cds.start.tpos2.tsv

head test/human.chrY.cds.start.gpos2.tsv
# ENST00000303728	22501565
# ENST00000477123	22501565
# ENST00000651177	12709448
# ENST00000338981	12709448
# ENST00000453031	12847045
# ENST00000303804	22084839
# ENST00000472391	22084839
# ENST00000361365	20575872
# ENST00000382772	20575872
# ENST00000602732	22992344

head test/human.chrY.cds.start.tpos2.tsv
# ENST00000303728	22501565	228	exon
# ENST00000477123	22501565	228	exon
# ENST00000651177	12709448	587	exon
# ENST00000338981	12709448	946	exon
# ENST00000453031	12847045	1	exon
# ENST00000303804	22084839	228	exon
# ENST00000472391	22084839	228	exon
# ENST00000361365	20575872	97	exon
# ENST00000382772	20575872	79	exon
# ENST00000602732	22992344	527	exon
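The round trip above can be reproduced by hand for one transcript. The sketch below is an illustration of the position arithmetic, not gppy's code. It uses the five CDS blocks of ENST00000453031 from the bed12 output above, converted to 1-based closed intervals; because the CDS of this transcript starts at tpos 1, these blocks are assumed to describe transcript positions 1 to 710. The function names mirror the subcommands but are defined here only for the example.

```python
def t2g(exons, strand, tpos):
    """Transcript position (1-based) to genomic position (1-based)."""
    if tpos < 1:
        raise ValueError("tpos must be >= 1")
    ordered = exons if strand == "+" else exons[::-1]
    remaining = tpos
    for g_start, g_end in ordered:
        length = g_end - g_start + 1
        if remaining <= length:
            return g_start + remaining - 1 if strand == "+" else g_end - remaining + 1
        remaining -= length
    raise ValueError("tpos lies beyond the end of the given exons")


def g2t(exons, strand, gpos):
    """Genomic position to transcript position; None if gpos is not in an exon."""
    ordered = exons if strand == "+" else exons[::-1]
    consumed = 0  # transcript bases covered by exons already visited
    for g_start, g_end in ordered:
        if g_start <= gpos <= g_end:
            offset = gpos - g_start if strand == "+" else g_end - gpos
            return consumed + offset + 1
        consumed += g_end - g_start + 1
    return None


# CDS blocks of ENST00000453031 (plus strand), 1-based closed.
exons = [(12847045, 12847153), (12856340, 12856496), (12856633, 12856845),
         (12857566, 12857661), (12859279, 12859413)]
print(t2g(exons, "+", 1))           # 12847045
print(t2g(exons, "+", 710))         # 12859413
print(g2t(exons, "+", 12847045))    # 1
print(g2t(exons, "+", 12850000))    # None (falls between the first two blocks)
```

Returning None for a position outside every block is a choice made for this sketch; it does not describe how gppy reports such positions.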

Usage

List utilities

$ gppy -h
usage: gppy|gtf.py [-h] {txinfo,convert2bed,t2g,g2t,tiv2giv,giv2tiv,extract_thick} ...

GTF file manipulation

options:
  -h, --help            show this help message and exit

GTF operations:
  {txinfo,convert2bed,t2g,g2t,tiv2giv,giv2tiv,extract_thick}
                        supported operations
    txinfo              summary information of each transcript
    convert2bed         convert GTF to bed12 format
    t2g                 convert tpos to gpos
    g2t                 convert gpos to tpos
    tiv2giv             convert tiv to giv
    giv2tiv             convert giv to tiv
    extract_thick       Extract nested thick regions from bed12

Extract basic transcript information

$ gppy txinfo -h
usage: gppy|gtf.py txinfo [-h] [-g GTF]

options:
  -h, --help         show this help message and exit
  -g GTF, --gtf GTF  input gtf file (default: -)

Extract transcript/CDS/UTR features in GTF as bed12 format

$ gppy convert2bed -h
usage: gtf.py convert2bed [-h] [-g GTF] [-t {exon,cds,utr5,utr3}] [-e EXTEND]

options:
  -h, --help            show this help message and exit
  -g GTF, --gtf GTF     input gtf file (default: -)
  -t {exon,cds,utr5,utr3}, --type {exon,cds,utr5,utr3}
                        types of intervals to be converted to bed for each transcript (default: exon)
  -e EXTEND, --extend EXTEND
                        number of bases to extend at both sides (default: 0)

Convert transcript positions to genomic positions

$ gppy t2g -h
usage: gppy|gtf.py t2g [-h] [-g GTF] [-i INFILE]

options:
  -h, --help            show this help message and exit
  -g GTF, --gtf GTF     input gtf file (default: -)
  -i INFILE, --infile INFILE
                        tab-delimited file with the first two columns composed of tx_id and transcript coordinates (default: None)

Convert transcript intervals to genomic intervals (allow spliced regions)

$ gppy tiv2giv -h
usage: gppy|gtf.py tiv2giv [-h] [-g GTF] [-i INFILE] [-a]

options:
  -h, --help            show this help message and exit
  -g GTF, --gtf GTF     input gtf file (default: -)
  -i INFILE, --infile INFILE
                        tab-delimited file with the first three columns composed of tx_id, start and end coordinates (default: None)
  -a, --append          whether to append input at the end of the ouput (default: False)

Convert genomic positions to transcript positions

$ gppy g2t -h
usage: gppy|gtf.py g2t [-h] [-g GTF] [-i INFILE]

options:
  -h, --help            show this help message and exit
  -g GTF, --gtf GTF     input gtf file (default: -)
  -i INFILE, --infile INFILE
                        tab-delimited file with the first two columns composed of tx_id and genomic coordinates (default: None)

Convert genomic intervals to transcript intervals

$ gppy giv2tiv -h
usage: gppy|gtf.py giv2tiv [-h] [-g GTF] [-i INFILE]

options:
  -h, --help            show this help message and exit
  -g GTF, --gtf GTF     input gtf file (default: -)
  -i INFILE, --infile INFILE
                        tab-delimited file with the first three columns composed of tx_id, start and end coordinates (default: None)

Links

  • GTF format check and fix: AGAT

Other

Please use the issues section to report bugs or request features :)
