scTE: Quantifying transposable element (TE) expression from single-cell sequencing data
scTE takes as input:
- Aligned sequence reads (BAM/SAM format)
- The genomic location of TEs (BED format)
- The genomic location of genes (GTF format)
Note
This repository is a fork of https://github.com/JiekaiLab/scTE
Installation
From PyPI
$ pip install scte-quant
From conda
Installing with conda is recommended, since it improves reproducibility and makes dependencies easier to manage.
$ conda create -n scte --channel-priority 0 --override-channels -c bioconda -c conda-forge -c billsfriend scte
From source
$ git clone https://gitee.com/billsfriend/scTE
$ cd scTE
$ pip install .
Usage
Building genome indices
scTE builds genome indices for the fast alignment of reads to genes and TEs. These indices can be automatically generated using the commands:
$ scTE_build -g mm10 # Mouse
$ scTE_build -g hg38 # Human
$ scTE_build -g panTro6 # Chimpanzee
$ scTE_build -g macFas5 # Macaca fascicularis
$ scTE_build -g dm6 # Drosophila melanogaster
$ scTE_build -g danRer11 # Zebrafish
$ scTE_build -g xenTro9 # Xenopus tropicalis
These scripts will automatically download the genome annotations. For mouse:
$ ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M21/gencode.vM21.annotation.gtf.gz
$ http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/rmsk.txt.gz
Or for human:
$ ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz
$ http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz
Or for chimpanzee:
$ http://ftp.ensembl.org/pub/release-103/gtf/pan_troglodytes/Pan_troglodytes.Pan_tro_3.0.103.gtf.gz
$ https://hgdownload.soe.ucsc.edu/goldenPath/panTro6/database/rmsk.txt.gz
Or for Macaca fascicularis:
$ http://ftp.ensembl.org/pub/release-102/gtf/macaca_fascicularis/Macaca_fascicularis.Macaca_fascicularis_5.0.102.gtf.gz
$ http://hgdownload.soe.ucsc.edu/goldenPath/macFas5/database/rmsk.txt.gz
Or for Drosophila melanogaster:
$ http://ftp.ensembl.org/pub/release-103/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.32.103.gtf.gz
$ http://hgdownload.soe.ucsc.edu/goldenPath/dm6/database/rmsk.txt.gz
Or for zebrafish:
$ http://ftp.ensembl.org/pub/release-103/gtf/danio_rerio/Danio_rerio.GRCz11.103.gtf.gz
$ https://hgdownload.soe.ucsc.edu/goldenPath/danRer11/database/rmsk.txt.gz
Or for Xenopus tropicalis:
$ http://ftp.ensembl.org/pub/release-103/gtf/xenopus_tropicalis/Xenopus_tropicalis.Xenopus_tropicalis_v9.1.103.gtf.gz
$ https://hgdownload.soe.ucsc.edu/goldenPath/xenTro9/database/rmsk.txt.gz
mm10, hg38, panTro6, macFas5, dm6, danRer11, and xenTro9 are genome
assembly versions. To use a custom reference, pass the -te and -gene
options:
$ scTE_build -te TEs.bed -gene Genes.gtf -o custom
-te
BED file of transposable element annotations with at least 4 columns: chr, start, end, and TE name. Gzipped (.gz) files are supported.
-gene
GTF file of gene annotations. Gzipped (.gz) files are supported.
For other assembly versions and species, a TEs.bed derived from rmsk.txt.gz
on UCSC goldenPath and a Genes.gtf (<species>.gtf.gz) from
Ensembl are well tested and recommended.
Note that rmsk.txt.gz downloaded from UCSC goldenPath needs to be converted into 4-column BED format before it is supplied to the -te option.
A simple zcat rmsk.txt.gz | cut -f6-8,11 > rmsk.TE.bed will do.
For the preset genomes of the -g option, TEs in rmsk.txt.gz are filtered to include only LINE, SINE, LTR, Retroposon,
Satellite and DNA (DNA TE). Note that satellite DNA is not conventionally classified as a TE. To customize the TE
content of your genome indices, filter TEs.bed as needed before running scTE_build.
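The conversion and class filtering described above can also be sketched in Python. This is a minimal illustration and not part of scTE: the names rmsk_to_bed, convert_file and DEFAULT_CLASSES are hypothetical, the two example rows are made up for demonstration, and the code assumes the standard UCSC rmsk.txt column layout (chromosome, start, end in columns 6-8, repeat name in column 11, repeat class in column 12). Unlike a grep on the whole line, it matches the class column exactly, so uncertain classes such as "LTR?" are dropped unless you add them to the set.

```python
import gzip

# Repeat classes to keep (assumption: exact match on the repClass column).
DEFAULT_CLASSES = frozenset({"LINE", "SINE", "LTR", "Retroposon", "Satellite", "DNA"})


def rmsk_to_bed(lines, keep_classes=DEFAULT_CLASSES):
    """Yield 4-column BED rows (chr, start, end, TE name) from rmsk.txt lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip empty or malformed lines
        chrom, start, end = fields[5], fields[6], fields[7]
        rep_name, rep_class = fields[10], fields[11]
        if rep_class in keep_classes:
            yield "\t".join((chrom, start, end, rep_name))


def convert_file(rmsk_gz_path, bed_path, keep_classes=DEFAULT_CLASSES):
    """Convert a gzipped rmsk.txt file into a filtered 4-column BED file."""
    with gzip.open(rmsk_gz_path, "rt") as src, open(bed_path, "w") as dst:
        for row in rmsk_to_bed(src, keep_classes):
            dst.write(row + "\n")


# Two illustrative rows in rmsk.txt layout (values are for demonstration only).
example_rows = [
    "585\t463\t13\t6\t17\tchr1\t3000000\t3000097\t-192471874\t-\tL1_Mus3\tLINE\tL1\t-3055\t3592\t3466\t1",
    "585\t21\t0\t0\t0\tchr1\t3003152\t3003197\t-192468774\t+\t(TA)n\tSimple_repeat\tSimple_repeat\t1\t45\t0\t2",
]
bed_rows = list(rmsk_to_bed(example_rows))
print(bed_rows)
```

Calling convert_file("rmsk.txt.gz", "TEs.bed") would then produce a file suitable for the -te option, with the simple repeat row filtered out.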
For more information about the BED and GTF formats, see
UCSC. These annotations are
then processed and converted into genome indices. By default, the scTE
algorithm allocates reads first to gene exons, and then to TEs.
Hence TEs inside exon/UTR regions of genes annotated in GENCODE
contribute only to the gene, and not to the TE score. This behavior can
be changed by setting --mode/-m inclusive in scTE, which instructs
scTE to assign a read to both the TE and the gene if the read comes from a TE
inside an exon/UTR region of a gene. To remove the TEs inside
the introns of genes, set --mode/-m nointron in scTE.
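The three allocation modes can be pictured with a toy decision function. This is only a sketch of the rules as stated in the text, not scTE's actual implementation: the function assign_read and its boolean inputs are hypothetical, and it assumes that nointron otherwise behaves like the default (exclusive) mode and that reads falling only in introns are not counted toward the gene.

```python
def assign_read(in_exon, in_intron, in_te, mode="exclusive"):
    """Return the set of feature types ('gene', 'TE') a read is counted toward.

    in_exon   -- the read overlaps an exon/UTR of an annotated gene
    in_intron -- the read overlaps an intron of an annotated gene
    in_te     -- the read overlaps an annotated TE
    """
    if mode == "exclusive":
        # Genes take priority: a TE inside an exon contributes only to the gene.
        if in_exon:
            return {"gene"}
        return {"TE"} if in_te else set()
    if mode == "inclusive":
        # A read from a TE inside an exon counts for both the gene and the TE.
        hits = set()
        if in_exon:
            hits.add("gene")
        if in_te:
            hits.add("TE")
        return hits
    if mode == "nointron":
        # Assumption: same as exclusive, but TEs inside introns are removed.
        if in_exon:
            return {"gene"}
        return {"TE"} if (in_te and not in_intron) else set()
    raise ValueError(f"unknown mode: {mode!r}")


# A read from a TE that sits inside an exon, under each mode.
for mode in ("exclusive", "inclusive", "nointron"):
    print(mode, sorted(assign_read(True, False, True, mode)))
```

The only case where the modes differ for exonic reads is inclusive, which double-counts; nointron differs from the default only for intronic TEs.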
Analysis of 10x style scRNA-seq data
scTE takes a BAM/SAM file as input. Using an unfiltered alignment file is highly recommended.
For BAM files generated by
STARsolo and similar tools, the cell barcode and
UMI must be stored in the read's 'CR:Z' and 'UR:Z' tags respectively, as below:
$ scTE -i inp.bam -o out -x mm10.exclusive.idx --hdf5 True -CB CR -UMI UR
$ samtools view test.bam
A00269:12:H7YF2DMXX:2 0 chr10 55902580 255 50M * 0 0 GTTCTCTCCGTATGTGAGCATGGGAGATACATCCCAGAAAGGCAGAAGGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:49 nM:i:0 CR:Z:CTAGAGTGTTTCGCTC CY:Z:FFFFFFFFFFFFFFFF UR:Z:TACATGACGC UY:Z:FFFFFFFFFF
A00269:13:H7YF2DMXX:2 0 chr10 55902784 255 50M * 0 0 ATAATCTTTGAGATCTCTGGTGAAAATAAGTAGCATAAAGGACAGAATCA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:49 nM:i:0 CR:Z:CTAGAGTGTTTCGCTC CY:Z:FFFFFFFFFFFFFFFF UR:Z:TACATGACGC UY:Z:FFFFFFFFFF
A00269:14:H7YF2DMXX:2 0 chr13 67837311 255 50M * 0 0 CTGTTCATTATTTGAGGAAATCAGGACAGGAAATCAAACATGGCAGAATC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:49 nM:i:0 CR:Z:ATCGAGTGTTTCGCTC CY:Z:FFFFFFFFFFFFFFFF UR:Z:TACATGACGC UY:Z:FFFFFFFFFF
A00269:15:H7YF2DMXX:2 0 chr14 114380523 255 50M * 0 0 GATCCAGATTAATTGAGACTGTTGATCCTCCTACAGGGTCGCCCTTCTCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:49 nM:i:0 CR:Z:CTAGAGTGTTTCGCTC CY:Z:FFFFFFFFFFFFFFFF UR:Z:TACATGACGC UY:Z:FFFFFFFFFF
For BAM files generated by Cell Ranger
and similar tools, the cell barcode and UMI must be stored in the read's
'CB:Z' and 'UB:Z' tags respectively, as below:
$ scTE -i inp.bam -o out -x mm10.exclusive.idx --hdf5 True -CB CB -UMI UB
$ samtools view test.bam
A00519:758:HTCCHDSXY:3:2535:21296:19774 16 chr1 14021 0 90M * 0 0 TGGATTTCTATCTCCCTGGCTTGGTGCCAGTTCCTCCAAGTCGATGGCACCTCCCTCCCTCTCAACCACTTGAGCAAACTCCAAGACATC ,FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFF:FFFFF NH:i:5 HI:i:1 AS:i:88 nM:i:0 RG:Z:SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K:0:1:HTCCHDSXY:3 RE:A:I xf:i:0 CR:Z:CTCCCTCCACTGCGAC CY:Z:FFFFFFFFFFFFFFFF CB:Z:CTCCCTCCACTGCGAC-1 UR:Z:AAGGCGTAGTAG UY:Z:FFFFFFFFFFFF UB:Z:AAGGCGTAGTAG
A00519:758:HTCCHDSXY:1:1355:17237:31720 0 chr1 14260 0 90M * 0 0 CTCCCTCTCATCCCAGAGAAACAGGTCAGCTGGGAGCTTCTGCCCCCACTGCCTAGGGACCAACAGGGGCAGGAGGCAGTCACTGACCCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:5 HI:i:1 AS:i:88 nM:i:0 RG:Z:SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K:0:1:HTCCHDSXY:1 RE:A:I xf:i:0 CR:Z:TCGTCCACAGTATGAA CY:Z:FFFFFFFFFFFFFFFF CB:Z:TCGTCCACAGTATGAA-1 UR:Z:GACTTATTTTTT UY:Z:FFFFFFFFFFFF UB:Z:GACTTATTTTTT
A00519:758:HTCCHDSXY:3:2227:16703:32080 16 chr1 14411 1 90M * 0 0 TCAGTTCTTTATTGATTGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAG FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:3 HI:i:1 AS:i:88 nM:i:0 RG:Z:SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K:0:1:HTCCHDSXY:3 RE:A:I xf:i:0 CR:Z:TTGAGTGGTTGTGGCC CY:Z:FFFFFFFFFFFFFFFF CB:Z:TTGAGTGGTTGTGGCC-1 UR:Z:TATAATGCTCAG UY:Z:FFFFFFFFFFFF UB:Z:TATAATGCTCAG
A00519:758:HTCCHDSXY:3:2563:23665:33802 16 chr1 14411 1 90M * 0 0 TCAGTTCTTTATTGATTGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACAGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAG FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:3 HI:i:1 AS:i:88 nM:i:0 RG:Z:SC3_v3_NextGem_DI_CellPlex_Human_PBMC_10K:0:1:HTCCHDSXY:3 RE:A:I xf:i:0 CR:Z:TGTTGAGAGGCAATGC CY:Z:FFFFFFFFFFFFFFFF CB:Z:TGTTGAGAGGCAATGC-1 UR:Z:ACGGGTGTGGAG UY:Z:FFFFFFFFFFFF UB:Z:ACGGGTGTGGAG
-i
Input file: BAM/SAM file from CellRanger or STARsolo
-o
Output file prefix
-x
The filename of the index for the reference genome annotation generated by scTE_build
-p
Number of threads to use. Default: 1. scTE needs ~10 GB of memory per thread for the human and mouse genomes.
--hdf5
Save the output as an .h5ad file instead of a CSV file. Default: False
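Given the rough figure of ~10 GB per thread quoted for -p, a value for that option can be estimated from the memory you are willing to dedicate to the run. The helper below is a hypothetical back-of-the-envelope sketch, not something scTE provides; the 10 GB default simply restates the approximate number above and will vary with genome and data.

```python
def max_threads(available_gb, per_thread_gb=10.0, cpu_count=None):
    """Estimate a thread count for -p from available memory (in GB).

    Always returns at least 1; with less than per_thread_gb available,
    a single-threaded run may still exceed memory.
    """
    by_memory = int(available_gb // per_thread_gb)
    threads = max(1, by_memory)
    if cpu_count is not None:
        # Never suggest more threads than there are CPU cores.
        threads = min(threads, max(1, cpu_count))
    return threads


print(max_threads(64))               # memory-limited estimate
print(max_threads(64, cpu_count=4))  # additionally capped by core count
```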
scTE is tuned for the outputs of STARsolo and the Cell Ranger
pipeline, and accepts BAM files produced by either of these
two programs. For other aligners, the barcode should be stored in the
CR:Z or CB:Z tag, and the UMI in the UR:Z or UB:Z tag of the BAM
file.
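To check whether the records of an alignment file carry the expected tags, the optional fields of a SAM record can be inspected directly. The sketch below is not part of scTE: parse_tags and barcode_and_umi are hypothetical helper names, and the code relies only on the SAM convention that optional fields start at column 12 and have the form TAG:TYPE:VALUE, separated by tabs. The example record is the first STARsolo read shown above, rebuilt with tab separators.

```python
def parse_tags(sam_line):
    """Return the optional fields of one SAM record as {TAG: value}."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 columns are mandatory SAM fields
        tag, value_type, value = field.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return tags


def barcode_and_umi(sam_line, cb_tag="CR", umi_tag="UR"):
    """Return (cell barcode, UMI) from the given tags, or None where absent."""
    tags = parse_tags(sam_line)
    return tags.get(cb_tag), tags.get(umi_tag)


record = "\t".join([
    "A00269:12:H7YF2DMXX:2", "0", "chr10", "55902580", "255", "50M", "*", "0", "0",
    "GTTCTCTCCGTATGTGAGCATGGGAGATACATCCCAGAAAGGCAGAAGGG", "F" * 50,
    "NH:i:1", "HI:i:1", "AS:i:49", "nM:i:0",
    "CR:Z:CTAGAGTGTTTCGCTC", "CY:Z:FFFFFFFFFFFFFFFF",
    "UR:Z:TACATGACGC", "UY:Z:FFFFFFFFFF",
])
print(barcode_and_umi(record))              # STARsolo-style tags
print(barcode_and_umi(record, "CB", "UB"))  # Cell Ranger-style tags are absent here
```

If the second call returns values and the first does not, the file follows the Cell Ranger convention and -CB CB -UMI UB is the matching choice.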
Analysis of C1 style scRNA-seq data
If the UMI is missing or not
used by the scRNA-seq technology (for example, on the Fluidigm C1
platform), UMI handling can be disabled with the -UMI False switch
in scTE (the default is True). If the barcode is missing, barcode handling
can be disabled with -CB False (the default is True), in which case the
cell barcodes are taken from the names of the BAM files.
$ scTE -i inp.bam -o out -x mm10.exclusive.idx -CB False -UMI False
Multiple BAM files can be provided to scTE with the -i option:
$ scTE -i *.bam -o out -x mm10.exclusive.idx -CB False -UMI False
or
$ scTE -i input1.bam,input2.bam,... -o out -x mm10.exclusive.idx -CB False -UMI False
Analysis of scATAC-seq data
The genome indices were prebuilt using:
$ wget -c http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/rmsk.txt.gz -O mm10.te.txt.gz
$ zcat mm10.te.txt.gz | grep -E 'LINE|SINE|LTR|Retroposon' | cut -f6-8,11 >mm10.te.bed
$ scTEATAC_build -g mm10.te.bed -o mm10.te.atac
The BAM file can then be processed with the command:
$ scTEATAC -i input.bam -x mm10.te.atac.idx
Citation
If scTE is useful for your research, consider citing Nature Communications (2021)