CLI tool for preparing data submission to Gene Expression Omnibus
gut - gEO upload tool
Last updated: 2026-02-04
GEO is amazing. Uploading data to GEO is not. I wrote this tool to ease the pain of preparing all of the files and metadata associated with uploading a dataset of high-throughput sequencing data to GEO. The tool makes the process much less manual, tedious, and error-prone by requiring well-structured tabular input that can be checked automatically for common problems.
Installation & Prerequisites
Requirements:
- Python 3.12 or higher (breaking change from previous versions which supported Python 3.4+)
- STAR aligner for paired-end insert size calculation
- Recommended: conda for installing bioinformatics tools (STAR, samtools)
Install using pip:
pip install geo-upload-tool
Note: For bioinformatics dependencies like STAR and samtools, we recommend using conda as these tools have system-level dependencies that are easier to manage through conda:
conda install -c conda-forge -c bioconda star samtools
Getting Started
The quickest way to get started is to use the gut init command to create a new project directory with template files:
gut init my_geo_submission
cd my_geo_submission
This creates a directory with four template files:
- sample_info.csv - Template for sample metadata
- file_info.csv - Template for file paths and metadata
- .env.template - Template for FTP credentials
- README.md - Quick start guide
Edit the CSV files with your actual data, following the inline comments and examples. Then proceed with the workflow below.
Basic Usage
The entire process is driven from two CSV files, sample_info and file_info, described below.
Sample Info
This information makes up the SAMPLES section. The CSV should have exactly one row per sample and all of the following columns:
- Sample name: unique name for this sample
- source name: sample source, e.g. brain
- organism: name of the organism, e.g. human
- molecule: one of a set of controlled vocabulary values, listed below
- description (optional): description of the sample, if desired
You may include as many more columns in the file as you like, and they will all be added as characteristic: tag columns under the SAMPLES section.
NB: The Sample name column is used to cross reference with the file info, which is discussed next.
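To make the columns concrete, a minimal sample_info.csv might look like the following. The genotype and age columns are invented here purely to illustrate extra characteristic: columns; the sample names match the file info example below:

```csv
Sample name,source name,organism,molecule,description,genotype,age
A1,brain,Mus musculus,total RNA,wild type replicate 1,WT,8 weeks
A2,brain,Mus musculus,total RNA,knockout replicate 1,KO,8 weeks
```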
File Info
This information is used to derive the RAW FILES, PROCESSED DATA FILES, and PAIRED-END EXPERIMENTS sections, as well as the processed data file and raw file columns of the SAMPLES section. gut infers whether a file is raw or processed based on the rectype column (see below). The CSV should have at least one raw and at least one processed file per sample in the sample info file (GEO requires this).
The raw files are always fastq files, and there should be one row per fastq file per sample, e.g.:
Sample name,rectype,file type,instrument model,path,ref_fa,end
A1,PE fastq,fastq,Illumina HiSeq 2000,A1_R1.fastq.gz,hg38.fa,1
A1,PE fastq,fastq,Illumina HiSeq 2000,A1_R2.fastq.gz,hg38.fa,2
A2,SE fastq,fastq,Illumina HiSeq 2000,A2_R1.fastq.gz,hg38.fa,1
The processed files may be any other type of file, and there must be at least one processed file for each sample in the sample info file, e.g. continuing from above example:
A1,wig,na,na,A1.wig.gz,na,na
A1,csv,na,na,raw_counts.csv,na,na
A2,wig,na,na,A2.wig.gz,na,na
A2,csv,na,na,raw_counts.csv,na,na
Note the same file raw_counts.csv is provided for both samples, since the raw counts matrix often contains processed data for all samples. The CSV file must have column names in the first row and all of the following columns (optional ones are noted):
- Sample name: unique name for this sample, corresponds to sample info
- rectype:
  - for RAW files: either "PE fastq" or "SE fastq"
  - for PROCESSED files: anything appropriate for the file (e.g. csv, txt, wig, etc)
- file type:
  - for RAW files: one of a controlled vocabulary, listed below
  - for PROCESSED files: value ignored
- instrument model:
  - for RAW files: one of a controlled vocabulary, listed below; required
  - for PROCESSED files: value ignored
- path: the relative or absolute path to the file on your local system
- ref_fa (optional):
  - for RAW files only: a local path or URL to a fasta reference sequence that can be used to compute average insert size and standard deviation for paired-end datasets
  - supports local paths and URLs (http://, https://, ftp://)
  - URLs are automatically downloaded and cached in .cache/references/
  - example local path: /path/to/reference.fa
  - example URL: https://ftp.ensembl.org/pub/.../genome.fa.gz
- end: required only for rectype == "PE fastq": either 1 or 2, indicating the end of the fastq file
- alias (optional): a clean filename to use in the GEO submission directory instead of the original on-disk filename. When provided and non-empty, the symlink (or copy) created in the output directory will use the alias name, and the alias will appear as the file name in the GEO metadata. This is useful when raw pipeline filenames contain run-specific tokens that are irrelevant to the submission. Example: if the file on disk is DH-WT_IBH1-1_UNUSEFUL_INFO_R1_concat.fa.gz, you can set alias to IBH1-1_R1_concat.fa.gz and it will be uploaded under that name. Alias values must be unique within the submission.
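A minimal sketch of how the alias might be applied when staging a file. The function name stage_file is hypothetical; gut's actual implementation may differ:

```python
import os
import shutil

def stage_file(src, staging_dir, alias=None, copy=False):
    """Link (or copy) src into staging_dir, using alias as the name if given."""
    name = alias if alias else os.path.basename(src)
    dest = os.path.join(staging_dir, name)
    if copy:
        shutil.copy2(src, dest)
    else:
        # symlink to an absolute path so the link resolves from staging_dir
        os.symlink(os.path.abspath(src), dest)
    return dest
```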
Any additional columns in the file info file are silently ignored.
Validate
With the above CSVs prepared, you can validate them to make sure everything lines up as expected:
gut validate -o my_geo_submission sample_info.csv file_info.csv
The -o argument is the name of the directory that will be created to stage
the files (GEO requires the directory be named the same as your email). The
validation logic checks to make sure everything lines up between your samples
and files, e.g. make sure all samples are in both files, each sample has both
raw and processed files, etc.
Build
Once you have fixed all the problems and validation is successful, you can build the staged directory:
gut build -o my_geo_submission sample_info.csv file_info.csv
This will do the following:
- Symlink (or copy, with --copy) all of the raw and processed files into the staging directory
- Compute md5 checksums on all files
- Identify read length and single- or paired-endedness for any fastq files
- Calculate the average insert size and standard deviation for paired-end fastq files, using STAR and either --ref-fa=FA or the ref_fa file_info column to specify the reference sequence
- Construct a metadata file with SAMPLES, PROCESSED DATA FILES, RAW FILES, and PAIRED-END EXPERIMENTS sections filled out appropriately, saved as an excel file in the staging directory
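For instance, the md5 checksum step can be done with Python's hashlib; a straightforward sketch, not necessarily gut's exact code:

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Reading in chunks keeps memory flat even for multi-gigabyte fastq files.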
If all went well, the file metadata_TOFILL.xlsx should exist inside the
staging directory. As the TOFILL part suggests, you need to fill it out
some more, as the other sections (e.g. SERIES) are not yet complete, unless
you provided the other sections with the --addnl CLI flag (see below). I
suggest you create a copy named metadata_complete.xlsx or something in the
staging directory and fill that out. Be on the lookout for errors and blank
fields; I surely didn't think to check for every possible mistake.
If you wish to automate the whole honking process, you may also provide a
CSV file with the SERIES, PROTOCOLS, and DATA PROCESSING PIPELINE sections
filled out. The file should contain ONLY these sections, with fields taken
from the v2.1 template. There is an example file in the root of this repo
you may use as a starting point. Once the file is completed, you may provide
it to gut with the --addnl CLI option. The resulting metadata_TOFILL.xlsx
will have these fields incorporated, and if you were thorough, you might
not need to edit it at all. As per below, the metadata files created by gut
do not upload by default, so you will still have to copy or rename the
metadata file (e.g. to metadata.xlsx) for gut to know to upload it.
Using Reference Genomes from URLs
gut now supports downloading reference FASTA files directly from URLs, eliminating the need to manually download large genome files. This is particularly useful for paired-end insert size calculation.
Supported URL types:
- HTTP: http://example.com/genome.fa
- HTTPS: https://ftp.ensembl.org/pub/.../genome.fa.gz
- FTP: ftp://ftp.ncbi.nlm.nih.gov/.../genome.fa
How it works:
- URLs are automatically detected when you provide them via the --ref-fa flag or in the ref_fa column
- Files are downloaded once and cached in the .cache/references/ directory
- Subsequent builds reuse the cached file (no re-download)
- Both local paths and URLs work interchangeably
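The detection and cache-validation logic might look roughly like this. The helper names and cache filename scheme are invented for illustration; only the .url sidecar convention comes from the docs:

```python
import hashlib
import os
from urllib.parse import urlparse

def is_url(ref):
    """True if ref looks like an http(s) or ftp URL rather than a local path."""
    return urlparse(ref).scheme in ("http", "https", "ftp")

def cache_path(url, cache_dir=".cache/references"):
    """Deterministic local path for a downloaded reference, keyed on the URL."""
    name = os.path.basename(urlparse(url).path)
    key = hashlib.sha1(url.encode()).hexdigest()[:8]
    return os.path.join(cache_dir, f"{key}_{name}")

def is_cached(url, cache_dir=".cache/references"):
    """A cached file is valid only if its .url sidecar matches the URL."""
    path = cache_path(url, cache_dir)
    sidecar = path + ".url"
    if not (os.path.exists(path) and os.path.exists(sidecar)):
        return False
    with open(sidecar) as fh:
        return fh.read().strip() == url
```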
Example with CLI flag:
gut build -o my_geo_submission \
--ref-fa=https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
sample_info.csv file_info.csv
Example with per-sample URLs in file_info.csv:
Sample name,rectype,file type,instrument model,path,ref_fa,end
A1,PE fastq,fastq,Illumina HiSeq 2000,A1_R1.fq.gz,https://example.com/hg38.fa.gz,1
A1,PE fastq,fastq,Illumina HiSeq 2000,A1_R2.fq.gz,https://example.com/hg38.fa.gz,2
A2,PE fastq,fastq,Illumina HiSeq 2000,A2_R1.fq.gz,ftp://ftp.ensembl.org/.../mm10.fa,1
A2,PE fastq,fastq,Illumina HiSeq 2000,A2_R2.fq.gz,ftp://ftp.ensembl.org/.../mm10.fa,2
Cache management:
- Cached files are stored in
.cache/references/within your output directory - Each cached file has a
.urlsidecar file storing the original URL - To clean cache, simply delete files from
.cache/references/ - Cache is verified before reuse (URL must match)
Troubleshooting:
- Network timeout: Large files (2-3GB) may take 15-30 minutes to download
- 404 error: Verify the URL is correct and accessible in your browser
- Firewall issues: HTTP/HTTPS URLs are generally more reliable than FTP
- Manual fallback: You can always download the file manually and provide a local path
Upload
Once you have filled in the missing metadata and put the new file into the
staging directory, you are ready to upload. You will have to initiate the
upload process from the GEO website and receive an upload directory, e.g.
uploads/your@email.edu_mXoLeWqE. Python includes a built-in FTP client, and
gut uses it to upload just the staged files.
Credential Management
For security, gut no longer accepts passwords as command-line arguments. Instead, credentials can be provided through multiple secure methods:
Method 1: Environment Variables (Recommended for CI/CD)
export GEO_FTP_USER=geousername
export GEO_FTP_PASSWORD=geopassword
gut upload my_geo_submission uploads/your@email.edu_mXoLeWqE
Method 2: .env File (Recommended for Local Development)
Create a .env file in your project directory:
# .env file
GEO_FTP_USER=geousername
GEO_FTP_PASSWORD=geopassword
Make sure to add .env to your .gitignore to avoid committing credentials!
Then run the upload command:
gut upload my_geo_submission uploads/your@email.edu_mXoLeWqE
Method 3: Command Line Flag + Interactive Prompt (Most Secure)
gut upload --user geousername my_geo_submission uploads/your@email.edu_mXoLeWqE
# Password: [you will be prompted securely]
Method 4: Fully Interactive
gut upload my_geo_submission uploads/your@email.edu_mXoLeWqE
# FTP Username: [enter username]
# FTP Password: [enter password securely]
You can get the geousername and geopassword from the GEO website upon initiating an upload. Your submission should be done in a matter of hours to days, depending on how big your data are. Then the iteration begins.
Security Notes:
- Passwords are never displayed in logs or terminal output
- Interactive prompts use hidden input (getpass)
- Environment variables from .env files are never committed to version control
- Avoid including credentials in shell scripts that may be shared
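The credential resolution described above can be sketched as follows. The variable names GEO_FTP_USER and GEO_FTP_PASSWORD come from the docs, but the function, the hand-rolled .env parser, and the precedence (environment over .env, prompts last) are assumptions about a plausible implementation:

```python
import os
from getpass import getpass

def load_dotenv(path=".env"):
    """Minimal .env parser: KEY=VALUE lines, '#' comment lines ignored."""
    env = {}
    if os.path.exists(path):
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, val = line.partition("=")
                    env[key.strip()] = val.strip()
    return env

def resolve_credentials(cli_user=None, dotenv_path=".env"):
    """Resolve FTP credentials: CLI/env first, then .env, then prompts."""
    dotenv = load_dotenv(dotenv_path)
    user = cli_user or os.environ.get("GEO_FTP_USER") or dotenv.get("GEO_FTP_USER")
    password = os.environ.get("GEO_FTP_PASSWORD") or dotenv.get("GEO_FTP_PASSWORD")
    if user is None:
        user = input("FTP Username: ")
    if password is None:
        password = getpass("FTP Password: ")  # hidden input, never echoed
    return user, password
```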
NB: gut will upload everything in the staging directory except:
- files with TOFILL in the name
- the .cache directory, which contains a bunch of stuff gut made for processing the files
You can put other things in there you want to upload if you so desire.
Sometimes upload can fail part way through, especially when uploading many
large files. To avoid unnecessary re-uploads, the upload routine checks
for the presence of each file on the server before uploading, and if the
remote and local file sizes are the same, upload is skipped. You can turn
this behavior off and force upload every time by supplying --no-cache to the
upload command.
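The size-compare skip can be approximated with ftplib's SIZE command; this is a sketch of the idea, not gut's actual upload routine:

```python
import ftplib
import os

def should_upload(ftp, local_path, remote_name):
    """Upload only if the remote file is missing or its size differs."""
    try:
        remote_size = ftp.size(remote_name)  # SIZE command
    except ftplib.error_perm:
        return True  # file not on server yet
    return remote_size != os.path.getsize(local_path)

def upload_file(ftp, local_path, remote_name, no_cache=False):
    """Returns True if the file was uploaded, False if skipped."""
    if no_cache or should_upload(ftp, local_path, remote_name):
        with open(local_path, "rb") as fh:
            ftp.storbinary(f"STOR {remote_name}", fh)
        return True
    return False  # skipped: same size already on server
```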
Development
Setup
This project uses uv for dependency management and packaging.
Set up development environment:
# Clone the repository
git clone https://bitbucket.org/bucab/gut
cd gut
# Install dependencies (including dev dependencies)
uv sync --extra dev
# Install with optional bioinformatics dependencies (pysam for BAM/CRAM support)
uv sync --all-extras
Development Workflow
Running tests:
uv run pytest
Code quality checks:
# Format code with Black
uv run black .
# Lint with Ruff
uv run ruff check .
Pre-commit hooks:
This project uses pre-commit hooks to ensure code quality. Install them with:
uv run pre-commit install
The hooks will automatically run on staged files before each commit. To bypass the hooks (not recommended), use:
git commit --no-verify
Build & Release
Build distribution packages:
uv build
This creates both wheel (.whl) and source distribution (.tar.gz) in the dist/ directory.
Detailed Documentation
TODO
Controlled Field Values
molecule
If seq_template_v2.1.xls is to be believed, molecule must be precisely
one of:
- total RNA
- polyA RNA
- cytoplasmic RNA
- nuclear RNA
- genomic DNA
- protein
- other
rectype
These values are gut-specific, and used to help figure out what to do with the files. The files that end up in the RAW FILES section are:
- PE fastq
- SE fastq
Anything else ends up in the PROCESSED DATA FILES section (e.g. csv, txt, peak, wig, bed, gff, etc).
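In code terms, the classification reduces to a membership test (illustrative sketch):

```python
RAW_RECTYPES = {"PE fastq", "SE fastq"}

def is_raw(rectype):
    """RAW FILES section if rectype is a fastq record type, else PROCESSED."""
    return rectype in RAW_RECTYPES
```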
file type
These are the accepted filetype values:
- fastq
- Illumina_native_qseq
- Illumina_native
- SOLiD_native_csfasta
- SOLiD_native_qual
- sff
- 454_native_seq
- 454_native_qual
- Helicos_native
- srf
- PacBio_HDF5
instrument model
According to seq_template_v2.1.xls, instrument model must be one of:
- Illumina Genome Analyzer
- Illumina Genome Analyzer II
- Illumina Genome Analyzer IIx
- Illumina HiSeq 2000
- Illumina HiSeq 1000
- Illumina MiSeq
- Illumina NextSeq
- AB SOLiD System
- AB SOLiD System 2.0
- AB SOLiD System 3.0
- AB SOLiD 4 System
- AB SOLiD 4hq System
- AB SOLiD PI System
- AB 5500xl Genetic Analyzer
- AB 5500 Genetic Analyzer
- 454 GS
- 454 GS 20
- 454 GS FLX
- 454 GS Junior
- 454 GS FLX Titanium
- Helicos HeliScope
- PacBio RS
- PacBio RS II
- Complete Genomics
- Ion Torrent PGM