Skip to main content

User-friendly tool to download TCGA STAR RNA-seq counts from the GDC portal

Project description

tcga-gdc-downloader

Tests Python License Platform

A command-line and graphical tool to download TCGA RNA-seq gene expression count data from the NCI GDC portal, assemble a ready-to-analyse gene × sample count matrix, and automatically annotate every sample with GDC clinical data and harmonised survival and subtype annotations from the PanCanAtlas Clinical Data Resource (Liu et al., Cell 2018).

No coding required. Works from a graphical interface or a single terminal command.


Features

  • Downloads all open-access STAR-Counts RNA-seq files for any of the 33 TCGA cancer projects
  • Assembles a gene × sample count matrix (genes as rows, samples as columns)
  • Fetches GDC clinical data (demographics, tumour stage, histology, survival)
  • Merges PanCanAtlas CDR annotations: four curated survival endpoints (OS, DSS, DFI, PFI) and molecular subtypes
  • Classifies every sample as Tumor or Normal from the TCGA barcode
  • Produces separate output files for all samples, tumor-only, and normal-only
  • Flags CDR-complete cases (full survival data) for clean analytical cohorts
  • Checkpoint system — safely resumes interrupted downloads without re-downloading
  • Graphical 9-step wizard (Streamlit) and command-line interface

Installation

Requires Python 3.10 or newer.

pip install tcga-gdc-downloader

For the graphical interface, install the GUI extra:

pip install "tcga-gdc-downloader[gui]"

Quick Start

Command line

# Discover files without downloading (recommended first step)
tcga-download --project TCGA-BRCA --dry-run

# Full download and annotation
tcga-download --project TCGA-BRCA --output ~/my_data

# Skip CDR annotation step
tcga-download --project TCGA-BRCA --output ~/my_data --no-cdr

# Resume an interrupted download
tcga-download --project TCGA-BRCA --output ~/my_data

# Start completely fresh
tcga-download --project TCGA-BRCA --output ~/my_data --fresh

Graphical interface

tcga-download --gui

A browser window opens with a 9-step wizard. No further terminal commands needed.

Python API

from tcga_downloader import GDCClient, build_count_matrix, run_cdr_pipeline

client = GDCClient()
hits = client.discover_star_files("TCGA-BRCA")

Output Files

For a project called TCGA-BRCA, the following files are written to your output directory:

File Contents
TCGA-BRCA_STAR_unstranded_merged_ALL.tsv All samples — counts + GDC clinical
TCGA-BRCA_STAR_unstranded_TUMOR_ONLY.tsv Tumor samples only
TCGA-BRCA_STAR_unstranded_NORMAL_ONLY.tsv Normal samples only (if present)
TCGA-BRCA_sample_metadata_clinical.tsv Metadata + clinical only, no counts
TCGA-BRCA_FULL_merged_with_CDR.tsv All samples — counts + GDC clinical + CDR
TCGA-BRCA_CDR_annotations.tsv CDR columns only
TCGA-BRCA_CDR_coverage_report.tsv Per-field CDR coverage statistics
TCGA-BRCA_FULL_merged_CDR_complete_cases.tsv Samples with complete survival data
TCGA-BRCA_CDR_unmatched_cases.txt Cases not found in CDR (if any)

Opening your files

# Python
import pandas as pd
df = pd.read_csv("TCGA-BRCA_FULL_merged_with_CDR.tsv", sep="\t", index_col=0)
# R
df <- read.table("TCGA-BRCA_FULL_merged_with_CDR.tsv", sep="\t", header=TRUE, row.names=1)

Supported Projects

All 33 TCGA cancer projects are supported:

TCGA-ACC TCGA-BLCA TCGA-BRCA TCGA-CESC TCGA-CHOL TCGA-COAD TCGA-DLBC TCGA-ESCA TCGA-GBM TCGA-HNSC TCGA-KICH TCGA-KIRC TCGA-KIRP TCGA-LAML TCGA-LGG TCGA-LIHC TCGA-LUAD TCGA-LUSC TCGA-MESO TCGA-OV TCGA-PAAD TCGA-PCPG TCGA-PRAD TCGA-READ TCGA-SARC TCGA-SKCM TCGA-STAD TCGA-TGCT TCGA-THCA TCGA-THYM TCGA-UCEC TCGA-UCS TCGA-UVM


GDC Authentication

TCGA STAR-Counts gene expression data is open access — no authentication token is required. A GDC token is only needed for controlled-access data (raw BAM files, genotype data), which this tool does not download.

To use a token if you have one:

tcga-download --project TCGA-BRCA --token ~/gdc-user-token.txt

PanCanAtlas CDR Annotations

The tool automatically downloads and merges the TCGA Pan-Cancer Clinical Data Resource (CDR) published in:

Liu, J. et al. (2018). An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell, 173(2), 400–416.e11. https://doi.org/10.1016/j.cell.2018.02.052

The CDR provides four curated survival endpoints that are more reliable than raw GDC fields for survival analysis:

Column Description
cdr_OS Overall survival event (0/1)
cdr_OS.time Overall survival time (days)
cdr_PFI Progression-free interval event
cdr_PFI.time Progression-free interval time (days)
cdr_DSS Disease-specific survival event
cdr_DSS.time Disease-specific survival time (days)
cdr_DFI Disease-free interval event
cdr_DFI.time Disease-free interval time (days)

Three audit flag columns are added to every sample:

Column Description
cdr_matched True if case was found in the CDR
cdr_subtype_available True if molecular subtype is populated
cdr_survival_complete True if OS, OS.time, PFI, PFI.time are all present

Samples not found in the CDR (e.g. cases added to GDC after the 2018 data freeze) are kept in all output files with cdr_matched = False and empty CDR columns — they are never silently dropped.


Requirements

  • Python >= 3.10
  • pandas >= 2.0
  • requests >= 2.31
  • openpyxl >= 3.1
  • tqdm >= 4.65
  • streamlit >= 1.35 (GUI only)

Limitations

  • Downloads one project at a time (no multi-project batch mode)
  • RNA-seq STAR-Counts only (mutations, CNV, methylation not included)
  • TCGA projects only (other GDC programs such as CPTAC and TARGET are not supported)
  • Open-access files only
  • CDR covers the TCGA 2018 data freeze; samples added after 2018 will not have CDR annotations
  • Tested on macOS, Linux, and Windows with Python 3.10, 3.11, 3.12, and 3.13

License

MIT License — see LICENSE for details.


Author

Orhan Nedim Kurt Independent Researcher


Citation

If you use this tool in your research, please cite:

Kurt, O.N. (2026). tcga-gdc-downloader: A tool for downloading and annotating 
TCGA RNA-seq data from the GDC portal. Zenodo. 
https://doi.org/10.5281/zenodo.18819588

DOI

Please also cite the PanCanAtlas CDR if you use the survival or subtype annotations:

Liu, J. et al. (2018). An Integrated TCGA Pan-Cancer Clinical Data Resource 
to Drive High-Quality Survival Outcome Analytics. Cell, 173(2), 400-416.e11.
https://doi.org/10.1016/j.cell.2018.02.052

Acknowledgements

Data downloaded via the NCI Genomic Data Commons (GDC) API. Survival and subtype annotations from the TCGA PanCanAtlas.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tcga_gdc_downloader-2.1.2.tar.gz (59.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tcga_gdc_downloader-2.1.2-py3-none-any.whl (50.1 kB view details)

Uploaded Python 3

File details

Details for the file tcga_gdc_downloader-2.1.2.tar.gz.

File metadata

  • Download URL: tcga_gdc_downloader-2.1.2.tar.gz
  • Upload date:
  • Size: 59.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for tcga_gdc_downloader-2.1.2.tar.gz
Algorithm Hash digest
SHA256 08551097b060ac261dded65d7598155b23812fcd6a67171bf7bfa1bbe87ea949
MD5 fca77320bfdf877d319d5400aaa7cb89
BLAKE2b-256 ce6305de2086ab16bee04bff744301ca34c11343e7baa2c63468918b8d894c87

See more details on using hashes here.

File details

Details for the file tcga_gdc_downloader-2.1.2-py3-none-any.whl.

File metadata

File hashes

Hashes for tcga_gdc_downloader-2.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 38c14504e5c04623f82f50e4127fd9d9f777f8a4ebaa9ce8b46a8741ccfce382
MD5 a9956c6d891f393629f1ebd1312953ca
BLAKE2b-256 1f411488009103167959e90de75da35a5ed51a28b46ddd14180dd6e46e0ac7ad

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page