Collect information from NCBI for the https://github.com/HelikarLab/FastqToGeneCounts project

GEO Collector

Description

GEOcollector is a Python package for collecting metadata about gene expression datasets from the NCBI Gene Expression Omnibus (GEO) database. It converts a list of GSM accession numbers and cell types into the information that FastqToGeneCounts requires to process the raw RNA-seq data.

Long story short, given an input file like this:

GSM,cell_type
GSM3785334,baso
GSM3898581,baso

GEOcollector will output a file like this (columns are aligned here for readability; the actual output is not padded):

GSE       ,GSM        ,SRR        ,Rename    ,Strand ,Prep Method ,Platform Code ,Platform Name                      ,Source           ,Cell Characteristics                                                                                                                                                                                                                        ,Replicate Name                                                        ,Strategy ,Publication ,Extra Notes
GSE131525 ,GSM3785334 ,SRR9097791 ,baso_S1R1 ,SE     ,total       ,GPL16791      ,Illumina HiSeq 2500 (Homo sapiens) ,B                ,subject - disease status: Screened Healthy Control;subject: HC3;age at draw: 55;Sex: Female;median cv coverage: 0.763618;fastq total reads: 6241803;unpaired reads examined: 5663490;unpaired read duplicates: 1597507;primary race: White; ,lib3945                                                               ,RNA-Seq  ,31671072    ,
GSE133028 ,GSM3898581 ,SRR9328889 ,baso_S2R1 ,PE     ,total       ,GPL20301      ,Illumina HiSeq 4000 (Homo sapiens) ,peripheral blood ,cell type: peripheral blood B cells;                                                                                                                                                                                                        ,Patient 2 IgD-CD27- double negative B cells from the peripheral blood ,RNA-Seq  ,32859762    ,
GSE133028 ,GSM3898591 ,SRR9328899 ,baso_S2R2 ,PE     ,total       ,GPL20301      ,Illumina HiSeq 4000 (Homo sapiens) ,peripheral blood ,cell type: peripheral blood B cells;                                                                                                                                                                                                        ,Patient 3 IgD-CD27- double negative B cells from the peripheral blood ,RNA-Seq  ,32859762    ,
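
To work with the output downstream, the file can be read with Python's standard csv module. The following is a minimal sketch, assuming the output is a plain comma-separated file with the header shown above and no commas inside fields. The function name runs_by_series is hypothetical and not part of GEOcollector, and the sample text is truncated to the first four columns of the example.

```python
import csv
import io
from collections import defaultdict


def runs_by_series(output_text: str) -> dict[str, list[str]]:
    """Group SRR run accessions by their GSE series accession."""
    reader = csv.reader(io.StringIO(output_text))
    # Strip padding in case the columns were aligned for display.
    header = [name.strip() for name in next(reader)]
    grouped = defaultdict(list)
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(header, (cell.strip() for cell in row)))
        grouped[record["GSE"]].append(record["SRR"])
    return dict(grouped)


# First four columns of the example output above (assumed unpadded).
sample = (
    "GSE,GSM,SRR,Rename\n"
    "GSE131525,GSM3785334,SRR9097791,baso_S1R1\n"
    "GSE133028,GSM3898581,SRR9328889,baso_S2R1\n"
    "GSE133028,GSM3898591,SRR9328899,baso_S2R2\n"
)
print(runs_by_series(sample))
```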

Installation

To install GEOcollector, you can use pip:

pip install GEOcollector

Usage

The following sections describe the command-line parameters of GEOcollector.

Command Line Interface

To execute GEOcollector, call it from the command line with the relevant parameters:

geocollector --api-key APIKEY --input-file /home/user/input.csv --verbose
geocollector --input-file /home/user/input.csv --quiet
geocollector --api-key APIKEY --input-file /home/user/input.csv
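
If GEOcollector needs to be launched from a Python script, the same commands can be assembled and run with subprocess. This is a sketch only: the flags are the ones shown above, while build_command and run_geocollector are hypothetical helpers, and rejecting --verbose together with --quiet is an assumption rather than documented behavior.

```python
import subprocess


def build_command(input_file, api_key=None, verbose=False, quiet=False):
    """Assemble the geocollector argument list from the documented flags."""
    if verbose and quiet:
        # Assumption: the two verbosity flags are not meant to be combined.
        raise ValueError("choose either verbose or quiet, not both")
    command = ["geocollector", "--input-file", str(input_file)]
    if api_key:
        command += ["--api-key", api_key]
    if verbose:
        command.append("--verbose")
    if quiet:
        command.append("--quiet")
    return command


def run_geocollector(input_file, **options):
    """Run geocollector and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(input_file, **options), check=True)


print(build_command("/home/user/input.csv", api_key="APIKEY", verbose=True))
```

Passing the arguments as a list avoids shell quoting problems when the input path contains spaces.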

To view help for GEOcollector, run the following command:

geocollector --help

API Key

Without an API key, NCBI limits requests to 3 per second. With an API key, the limit rises to 10 requests per second. To obtain an API key, follow the steps below:

  1. Access NCBI's website
  2. Click "Log In" in the top right corner
    1. If you do not have an account, create one now
  3. Click your username in the top right corner
  4. Click "Account settings" in the dropdown menu
  5. Scroll down to the "API Key" section
  6. Click "Create API Key"
  7. Copy the API key that has been created
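
To illustrate what the two rate limits mean in practice, here is a minimal throttle sketch that spaces requests at least 1/3 s apart without a key and 1/10 s apart with one. It is not GEOcollector's implementation; the Throttle class and its parameters are hypothetical, and the clock and sleep functions are injectable only to make the behavior easy to check.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, has_api_key, clock=time.monotonic, sleep=time.sleep):
        # NCBI limits stated above: 3 requests/s without a key, 10 with one.
        requests_per_second = 10 if has_api_key else 3
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(has_api_key=True)
for accession in ["GSM3785334", "GSM3898581"]:
    throttle.wait()  # call before each request to NCBI
    print("would request", accession)
```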

Input file

The input file should be a CSV file in the following format. Multiple GSMs can be associated with a single cell type.

GSM,cell_type
GSM_1,cell_type_1
GSM_2,cell_type_1
GSM_3,cell_type_1
GSM_4,cell_type_2
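
An input file in this format can also be generated programmatically. The sketch below uses only the standard library; build_input_csv is a hypothetical helper, and the mapping reuses the two accessions from the example in the Description.

```python
import csv
import io


def build_input_csv(samples: dict[str, list[str]]) -> str:
    """Return CSV text with a GSM,cell_type header and one row per GSM."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["GSM", "cell_type"])
    for cell_type, gsm_ids in samples.items():
        for gsm in gsm_ids:
            writer.writerow([gsm, cell_type])
    return buffer.getvalue()


# One cell type mapped to several GSM accessions.
csv_text = build_input_csv({"baso": ["GSM3785334", "GSM3898581"]})
print(csv_text, end="")
```

The returned text can be written to a file and passed to geocollector with --input-file.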

Verbosity

To show debug information on the command line, pass the --verbose flag. To silence all output except warnings, pass the --quiet flag. If neither flag is passed, standard "info" messages are shown.
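
The three verbosity settings correspond naturally to Python's standard logging levels. The following sketch shows one way such flags could be mapped; it is an illustration under that assumption, not GEOcollector's actual code, and parse_log_level is a hypothetical name.

```python
import argparse
import logging


def parse_log_level(argv):
    """Map --verbose / --quiet to a logging level (default: INFO)."""
    parser = argparse.ArgumentParser(prog="verbosity-demo")
    # Assumption: the two flags are treated as mutually exclusive.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--verbose", action="store_true")
    group.add_argument("--quiet", action="store_true")
    args = parser.parse_args(argv)
    if args.verbose:
        return logging.DEBUG  # debug information and everything above
    if args.quiet:
        return logging.WARNING  # warnings and errors only
    return logging.INFO  # standard "info" messages


logging.basicConfig(level=parse_log_level(["--quiet"]))
logging.info("hidden at this level")
logging.warning("still shown with --quiet")
```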


If you have problems, please create a new issue
