CLI tool for preparing data submission to Gene Expression Omnibus
gut - gEO upload tool
Last updated: 2026-02-04
GEO is amazing. Uploading data to GEO is not. I wrote this tool to ease the pain of preparing all of the files and metadata associated with uploading a dataset of high-throughput sequencing data to GEO. The tool makes the process much less manual, tedious, and error-prone by requiring well-structured tabular input that can be checked automatically for common problems.
Installation & Prerequisites
Requirements:
- Python 3.12 or higher (breaking change from previous versions which supported Python 3.4+)
- STAR aligner for paired-end insert size calculation
- Recommended: conda for installing bioinformatics tools (STAR, samtools)
Install using pip:
pip install geo-upload-tool
Note: For bioinformatics dependencies like STAR and samtools, we recommend using conda as these tools have system-level dependencies that are easier to manage through conda:
conda install -c conda-forge -c bioconda star samtools
Getting Started
The quickest way to get started is to use the gut init command to create a new project directory with template files:
gut init my_geo_submission
cd my_geo_submission
This creates a directory with four template files:
- sample_info.csv - Template for sample metadata
- file_info.csv - Template for file paths and metadata
- .env.template - Template for FTP credentials
- README.md - Quick start guide
Edit the CSV files with your actual data, following the inline comments and examples. Then proceed with the workflow below.
Basic Usage
The entire process is driven from two CSV files, sample_info and file_info, described below.
Sample Info
This information makes up the SAMPLES section. The CSV should have exactly one row per sample and all of the following columns:
- Sample name: unique name for this sample
- source name: sample source, e.g. brain
- organism: name of the organism, e.g. human
- molecule: one of a set of controlled vocabulary values, listed below
- description (optional): description of the sample, if desired
You may include as many more columns in the file as you like, and they will all be added as characteristic: tag columns under the SAMPLES section.
NB: The Sample name column is used to cross reference with the file info, which is discussed next.
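To make the columns concrete, a minimal sample_info.csv might look like the following. The genotype and age columns are invented here purely to illustrate extra characteristic: columns; the sample names match the file info example below:

```csv
Sample name,source name,organism,molecule,description,genotype,age
A1,brain,Mus musculus,total RNA,wild type replicate 1,WT,8 weeks
A2,brain,Mus musculus,total RNA,knockout replicate 1,KO,8 weeks
```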
File Info
This information is used to derive the RAW FILES, PROCESSED DATA FILES, and PAIRED-END EXPERIMENTS sections, as well as the processed data file and raw file columns of the SAMPLES section. gut infers whether a file is raw or processed based on the rectype column (see below). The CSV should have at least one raw and at least one processed file per sample in the sample info file (GEO requires this).
The raw files are always fastq files, and there should be one row per fastq file per sample, e.g.:
Sample name,rectype,file type,instrument model,path,ref_fa,end
A1,PE fastq,fastq,Illumina HiSeq 2000,A1_R1.fastq.gz,hg38.fa,1
A1,PE fastq,fastq,Illumina HiSeq 2000,A1_R2.fastq.gz,hg38.fa,2
A2,SE fastq,fastq,Illumina HiSeq 2000,A2_R1.fastq.gz,hg38.fa,1
The processed files may be any other type of file, and there must be at least one processed file for each sample in the sample info file, e.g. continuing from above example:
A1,wig,na,na,A1.wig.gz,na,na
A1,csv,na,na,raw_counts.csv,na,na
A2,wig,na,na,A2.wig.gz,na,na
A2,csv,na,na,raw_counts.csv,na,na
Note the same file raw_counts.csv is provided for both samples, since the raw counts matrix often contains processed data for all samples. The CSV file must have column names in the first row and all of the following columns (optional ones are noted):
- Sample name: unique name for this sample, corresponds to sample info
- rectype:
  - for RAW files: either "PE fastq" or "SE fastq"
  - for PROCESSED files: anything appropriate for the file (e.g. csv, txt, wig, etc)
- file type:
  - for RAW files: one of a controlled vocabulary, listed below
  - for PROCESSED files: value ignored
- instrument model:
  - for RAW files: one of a controlled vocabulary, listed below; required
  - for PROCESSED files: value ignored
- path: the relative or absolute path to the file on your local system
- ref_fa (optional):
  - for RAW files only: a local path or URL to a fasta reference sequence that can be used to compute average insert size and standard deviation for paired-end datasets
  - supports local paths and URLs (http://, https://, ftp://)
  - URLs are automatically downloaded and cached in .cache/references/
  - example local path: /path/to/reference.fa
  - example URL: https://ftp.ensembl.org/pub/.../genome.fa.gz
- end: required only for rectype == "PE fastq": either 1 or 2, indicating the end of the fastq file
- alias (optional): a clean filename to use in the GEO submission directory instead of the original on-disk filename. When provided and non-empty, the symlink (or copy) created in the output directory will use the alias name, and the alias will appear as the file name in the GEO metadata. This is useful when raw pipeline filenames contain run-specific tokens that are irrelevant to the submission. Example: if the file on disk is DH-WT_IBH1-1_UNUSEFUL_INFO_R1_concat.fa.gz, you can set alias to IBH1-1_R1_concat.fa.gz and it will be uploaded under that name. Alias values must be unique within the submission.
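A minimal sketch of how the alias might be applied when staging a file. The function name stage_file is hypothetical; gut's actual implementation may differ:

```python
import os
import shutil

def stage_file(src, staging_dir, alias=None, copy=False):
    """Link (or copy) src into staging_dir, using alias as the name if given."""
    name = alias if alias else os.path.basename(src)
    dest = os.path.join(staging_dir, name)
    if copy:
        shutil.copy2(src, dest)
    else:
        # symlink to an absolute path so the link resolves from staging_dir
        os.symlink(os.path.abspath(src), dest)
    return dest
```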
Any additional columns in the file info file are silently ignored.
Validate
With the above CSVs prepared, you can validate them to make sure everything lines up as expected:
gut validate -o my_geo_submission sample_info.csv file_info.csv
The -o argument is the name of the directory that will be created to stage
the files (GEO requires the directory be named the same as your email). The
validation logic checks to make sure everything lines up between your samples
and files, e.g. make sure all samples are in both files, each sample has both
raw and processed files, etc.
Build
Once you have fixed all the problems and validation is successful, you can build the staged directory:
gut build -o my_geo_submission sample_info.csv file_info.csv
This will do the following:
- Symlink (or copy, with --copy) all of the raw and processed files into the staging directory
- Compute md5 checksums on all files
- Identify read length and single- or paired-endedness for any fastq files
- Calculate the average insert size and standard deviation for paired-end fastq files, using STAR and either --ref-fa=FA or the ref_fa file_info column to specify the reference sequence
- Construct a metadata file with SAMPLES, PROCESSED DATA FILES, RAW FILES, and PAIRED-END EXPERIMENTS sections filled out appropriately, saved as an excel file in the staging directory
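For instance, the md5 checksum step can be done with Python's hashlib; a straightforward sketch, not necessarily gut's exact code:

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Reading in chunks keeps memory flat even for multi-gigabyte fastq files.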
If all went well, the file metadata_TOFILL.xlsx should exist inside the
staging directory. As the TOFILL part suggests, you need to fill it out
some more, as the other sections (e.g. SERIES) are not yet complete, unless
you provided the other sections with the --addnl CLI flag (see below). I
suggest you create a copy named metadata_complete.xlsx or something in the
staging directory and fill that out. Be on the lookout for errors and blank
fields; I surely didn't think to check for every possible mistake.
If you wish to automate the whole honking process, you may also provide a
CSV file with the SERIES, PROTOCOLS, and DATA PROCESSING PIPELINE sections
filled out. The file should contain ONLY these sections, with fields taken
from the v2.1 template. There is an example file in the root of this repo
you may use as a starting point. Once the file is completed, you may provide
it to gut with the --addnl CLI option. The resulting metadata_TOFILL.xlsx
will have these fields incorporated, and if you were thorough, you might
not need to edit it at all. As per below, the metadata files created by gut
do not upload by default, so you will still have to copy or rename the
metadata file (e.g. to metadata.xlsx) for gut to know to upload it.
Using Reference Genomes from URLs
gut now supports downloading reference FASTA files directly from URLs, eliminating the need to manually download large genome files. This is particularly useful for paired-end insert size calculation.
Supported URL types:
- HTTP: http://example.com/genome.fa
- HTTPS: https://ftp.ensembl.org/pub/.../genome.fa.gz
- FTP: ftp://ftp.ncbi.nlm.nih.gov/.../genome.fa
How it works:
- URLs are automatically detected when you provide them via the --ref-fa flag or in the ref_fa column
- Files are downloaded once and cached in the .cache/references/ directory
- Subsequent builds reuse the cached file (no re-download)
- Both local paths and URLs work interchangeably
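The detection and cache-validation logic might look roughly like this. The helper names and cache filename scheme are invented for illustration; only the .url sidecar convention comes from the docs:

```python
import hashlib
import os
from urllib.parse import urlparse

def is_url(ref):
    """True if ref looks like an http(s) or ftp URL rather than a local path."""
    return urlparse(ref).scheme in ("http", "https", "ftp")

def cache_path(url, cache_dir=".cache/references"):
    """Deterministic local path for a downloaded reference, keyed on the URL."""
    name = os.path.basename(urlparse(url).path)
    key = hashlib.sha1(url.encode()).hexdigest()[:8]
    return os.path.join(cache_dir, f"{key}_{name}")

def is_cached(url, cache_dir=".cache/references"):
    """A cached file is valid only if its .url sidecar matches the URL."""
    path = cache_path(url, cache_dir)
    sidecar = path + ".url"
    if not (os.path.exists(path) and os.path.exists(sidecar)):
        return False
    with open(sidecar) as fh:
        return fh.read().strip() == url
```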
Example with CLI flag:
gut build -o my_geo_submission \
--ref-fa=https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
sample_info.csv file_info.csv
Example with per-sample URLs in file_info.csv:
Sample name,rectype,file type,instrument model,path,ref_fa,end
A1,PE fastq,fastq,Illumina HiSeq 2000,A1_R1.fq.gz,https://example.com/hg38.fa.gz,1
A1,PE fastq,fastq,Illumina HiSeq 2000,A1_R2.fq.gz,https://example.com/hg38.fa.gz,2
A2,PE fastq,fastq,Illumina HiSeq 2000,A2_R1.fq.gz,ftp://ftp.ensembl.org/.../mm10.fa,1
A2,PE fastq,fastq,Illumina HiSeq 2000,A2_R2.fq.gz,ftp://ftp.ensembl.org/.../mm10.fa,2
Cache management:
- Cached files are stored in
.cache/references/within your output directory - Each cached file has a
.urlsidecar file storing the original URL - To clean cache, simply delete files from
.cache/references/ - Cache is verified before reuse (URL must match)
Troubleshooting:
- Network timeout: Large files (2-3GB) may take 15-30 minutes to download
- 404 error: Verify the URL is correct and accessible in your browser
- Firewall issues: HTTP/HTTPS URLs are generally more reliable than FTP
- Manual fallback: You can always download the file manually and provide a local path
Upload
Once you have filled in the missing metadata and put the new file into the
staging directory, you are ready to upload. You will have to initiate the
upload process from the GEO website and receive an upload directory, e.g.
uploads/your@email.edu_mXoLeWqE. Python includes a built-in FTP client, and
gut uses it to upload just the staged files.
Credential Management
For security, gut no longer accepts passwords as command-line arguments. Instead, credentials can be provided through multiple secure methods:
Method 1: Environment Variables (Recommended for CI/CD)
export GEO_FTP_USER=geousername
export GEO_FTP_PASSWORD=geopassword
gut upload my_geo_submission uploads/your@email.edu_mXoLeWqE
Method 2: .env File (Recommended for Local Development)
Create a .env file in your project directory:
# .env file
GEO_FTP_USER=geousername
GEO_FTP_PASSWORD=geopassword
Make sure to add .env to your .gitignore to avoid committing credentials!
Then run the upload command:
gut upload my_geo_submission uploads/your@email.edu_mXoLeWqE
Method 3: Command Line Flag + Interactive Prompt (Most Secure)
gut upload --user geousername my_geo_submission uploads/your@email.edu_mXoLeWqE
# Password: [you will be prompted securely]
Method 4: Fully Interactive
gut upload my_geo_submission uploads/your@email.edu_mXoLeWqE
# FTP Username: [enter username]
# FTP Password: [enter password securely]
You can get the geousername and geopassword from the GEO website upon initiating an upload. Your submission should be done in a matter of hours to days, depending on how big your data are. Then the iteration begins.
Security Notes:
- Passwords are never displayed in logs or terminal output
- Interactive prompts use hidden input (getpass)
- Environment variables from .env files are never committed to version control
- Avoid including credentials in shell scripts that may be shared
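The credential resolution described above can be sketched as follows. The variable names GEO_FTP_USER and GEO_FTP_PASSWORD come from the docs, but the function, the hand-rolled .env parser, and the precedence (environment over .env, prompts last) are assumptions about a plausible implementation:

```python
import os
from getpass import getpass

def load_dotenv(path=".env"):
    """Minimal .env parser: KEY=VALUE lines, '#' comment lines ignored."""
    env = {}
    if os.path.exists(path):
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, val = line.partition("=")
                    env[key.strip()] = val.strip()
    return env

def resolve_credentials(cli_user=None, dotenv_path=".env"):
    """Resolve FTP credentials: CLI/env first, then .env, then prompts."""
    dotenv = load_dotenv(dotenv_path)
    user = cli_user or os.environ.get("GEO_FTP_USER") or dotenv.get("GEO_FTP_USER")
    password = os.environ.get("GEO_FTP_PASSWORD") or dotenv.get("GEO_FTP_PASSWORD")
    if user is None:
        user = input("FTP Username: ")
    if password is None:
        password = getpass("FTP Password: ")  # hidden input, never echoed
    return user, password
```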
NB: gut will upload everything in the staging directory except:
- files with TOFILL in the name
- the .cache directory, which contains a bunch of stuff gut made for processing the files
You can put other things in there you want to upload if you so desire.
Sometimes upload can fail part way through, especially when uploading many
large files. To avoid unnecessary re-uploads, the upload routine checks
for the presence of each file on the server before uploading, and if the
remote and local file sizes are the same, upload is skipped. You can turn
this behavior off and force upload every time by supplying --no-cache to the
upload command.
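The size-compare skip can be approximated with ftplib's SIZE command; this is a sketch of the idea, not gut's actual upload routine:

```python
import ftplib
import os

def should_upload(ftp, local_path, remote_name):
    """Upload only if the remote file is missing or its size differs."""
    try:
        remote_size = ftp.size(remote_name)  # SIZE command
    except ftplib.error_perm:
        return True  # file not on server yet
    return remote_size != os.path.getsize(local_path)

def upload_file(ftp, local_path, remote_name, no_cache=False):
    """Returns True if the file was uploaded, False if skipped."""
    if no_cache or should_upload(ftp, local_path, remote_name):
        with open(local_path, "rb") as fh:
            ftp.storbinary(f"STOR {remote_name}", fh)
        return True
    return False  # skipped: same size already on server
```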
Development
Setup
This project uses uv for dependency management and packaging.
Set up development environment:
# Clone the repository
git clone https://bitbucket.org/bucab/gut
cd gut
# Install dependencies (including dev dependencies)
uv sync --extra dev
# Install with optional bioinformatics dependencies (pysam for BAM/CRAM support)
uv sync --all-extras
Development Workflow
Running tests:
uv run pytest
Code quality checks:
# Format code with Black
uv run black .
# Lint with Ruff
uv run ruff check .
Pre-commit hooks:
This project uses pre-commit hooks to ensure code quality. Install them with:
uv run pre-commit install
The hooks will automatically run on staged files before each commit. To bypass the hooks (not recommended), use:
git commit --no-verify
Build & Release
Build distribution packages:
uv build
This creates both wheel (.whl) and source distribution (.tar.gz) in the dist/ directory.
Detailed Documentation
TODO
Controlled Field Values
molecule
If seq_template_v2.1.xls is to be believed, molecule must be precisely
one of:
- total RNA
- polyA RNA
- cytoplasmic RNA
- nuclear RNA
- genomic DNA
- protein
- other
rectype
These values are gut-specific, and used to help figure out what to do with the files. The files that end up in the RAW FILES section are:
- PE fastq
- SE fastq
Anything else ends up in the PROCESSED DATA FILES section (e.g. csv, txt, peak, wig, bed, gff, etc).
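In code terms, the classification reduces to a membership test (illustrative sketch):

```python
RAW_RECTYPES = {"PE fastq", "SE fastq"}

def is_raw(rectype):
    """RAW FILES section if rectype is a fastq record type, else PROCESSED."""
    return rectype in RAW_RECTYPES
```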
file type
These are the accepted filetype values:
- fastq
- Illumina_native_qseq
- Illumina_native
- SOLiD_native_csfasta
- SOLiD_native_qual
- sff
- 454_native_seq
- 454_native_qual
- Helicos_native
- srf
- PacBio_HDF5
instrument model
According to seq_template_v2.1.xls, instrument model must be one of:
- Illumina Genome Analyzer
- Illumina Genome Analyzer II
- Illumina Genome Analyzer IIx
- Illumina HiSeq 2000
- Illumina HiSeq 1000
- Illumina MiSeq
- Illumina NextSeq
- AB SOLiD System
- AB SOLiD System 2.0
- AB SOLiD System 3.0
- AB SOLiD 4 System
- AB SOLiD 4hq System
- AB SOLiD PI System
- AB 5500xl Genetic Analyzer
- AB 5500 Genetic Analyzer
- 454 GS
- 454 GS 20
- 454 GS FLX
- 454 GS Junior
- 454 GS FLX Titanium
- Helicos HeliScope
- PacBio RS
- PacBio RS II
- Complete Genomics
- Ion Torrent PGM