ncbi_counts

Download the NCBI-generated RNA-seq count data by specifying the Series accession number(s), and the regular expression of the Sample attributes.

If you just need a count matrix covering all samples (GSM) in a series (GSE), you do not need this library. It is useful when, for each GSE, you need count matrices restricted to specific control-group and treatment-group samples.

Installation

From PyPI:

pip install ncbi-counts

Usage

python -m ncbi_counts [-h] [-n NORM] [-a ANNOT_VER] [-k [KEEP_ANNOT ...]] [-s SRC_DIR] [-o OUTPUT] [-q] [-S SEP] [-y GSM_YAML] [-c] FILE

Options

positional arguments:
  FILE                  Path to input file (.yaml, .yml) which represents each GSE accession number(s) which contains a sequence of maps with two keys: 'control' and 'treatment'. Each of these maps further contains key(s) (e.g., 'title', 'characteristics_ch1').

options:
  -h, --help            show this help message and exit
  -n NORM, --norm-type NORM
                        Normalization type of counts (choices: None, fpkm, tpm, default: None)
  -a ANNOT_VER, --annot-ver ANNOT_VER
                        Annotation version of counts (default: GRCh38.p13)
  -k [KEEP_ANNOT ...], --keep-annot [KEEP_ANNOT ...]
                        Annotation column(s) to keep (choices: Symbol, Description, Synonyms, GeneType, EnsemblGeneID, Status, ChrAcc, ChrStart, ChrStop, Orientation, Length, GOFunctionID, GOProcessID, GOComponentID, GOFunction, GOProcess, GOComponent, default: None)
  -s SRC_DIR, --src-dir SRC_DIR
                        A directory to save the source obtained from NCBI (default: ./)
  -o OUTPUT, --output OUTPUT
                        A directory to save the count matrix (or matrices) (default: ./)
  -q, --silent          If True, suppress warnings (default: False)
  -S SEP, --sep SEP     Separator between group and GSM in column (default: -)
  -y GSM_YAML, --yaml GSM_YAML
                        Path to save YAML file which contains GSMs (default: None)
  -c, --cleanup         If True, remove source files (default: False)
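
When the command is launched from a script, the argument list can be assembled in Python first. The sketch below is only an illustration: `build_command` is a hypothetical helper (not part of ncbi_counts), and it uses only flags listed in the options above. It assumes the CLI parses arguments in the usual argparse fashion, so the positional FILE is placed before the variadic `-k` option to keep `-k` from consuming the path.

```python
import sys


def build_command(yaml_path, keep_annot=(), norm=None, cleanup=False):
    """Hypothetical helper: build an argument list from the options above."""
    # FILE goes first so that the variadic -k option cannot swallow it.
    cmd = [sys.executable, "-m", "ncbi_counts", str(yaml_path)]
    if norm is not None:
        cmd += ["-n", norm]          # -n/--norm-type (fpkm or tpm)
    if keep_annot:
        cmd += ["-k", *keep_annot]   # -k/--keep-annot takes one or more columns
    if cleanup:
        cmd.append("-c")             # -c/--cleanup removes source files
    return cmd


cmd = build_command("sample_regex.yaml", keep_annot=["Symbol"], cleanup=True)
print(cmd[1:])
# To actually run it (requires ncbi-counts to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

The printed list omits the interpreter path, which differs between machines.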

Command-line Example

[!NOTE] The acceptable keys for Sample attributes (such as 'title' and 'characteristics_ch1') are listed in the Sample Attributes table and the SOFT download section of the SOFT submission instructions page. Use the values in the 'Label' column of the table as keys in the YAML file, without the leading '!Sample_'.

For a comprehensive list of attributes for all samples in a series, the GEOparse library is useful:
 import GEOparse
 GEOparse.get_GEO("GSExxxxx").phenotype_data

To create a mock vs. CoV2 comparison pair for each tissue in GSE164073, prepare the following YAML file (the words beginning with "!!" are type hints and can be omitted):

GSE164073: !!seq
- control: !!map
    title: !!str Cornea
    characteristics_ch1: !!str mock
  treatment: !!map
    title: !!str Cornea
    characteristics_ch1: !!str SARS-CoV-2
- control: !!map
    title: !!str Limbus
    characteristics_ch1: !!str mock
  treatment: !!map
    title: !!str Limbus
    characteristics_ch1: !!str SARS-CoV-2
- control: !!map
    title: !!str Sclera
    characteristics_ch1: !!str mock
  treatment: !!map
    title: !!str Sclera
    characteristics_ch1: !!str SARS-CoV-2

Alternatively, to specify the GSMs directly, prepare the following YAML file:

GSE164073: !!seq
- control: !!map
    geo_accession: !!str ^GSM4996084$|^GSM4996085$|^GSM4996086$
  treatment: !!map
    geo_accession: !!str ^GSM4996087$|^GSM4996088$|^GSM4996089$
- control: !!map
    geo_accession: !!str ^GSM4996090$|^GSM4996091$|^GSM4996092$
  treatment: !!map
    geo_accession: !!str ^GSM4996093$|^GSM4996094$|^GSM4996095$
- control: !!map
    geo_accession: !!str ^GSM4996096$|^GSM4996097$|^GSM4996098$
  treatment: !!map
    geo_accession: !!str ^GSM4996099$|^GSM4996100$|^GSM4996101$

Then run the following command (the "Symbol" column is kept in this example):

python -m ncbi_counts sample_regex.yaml -k Symbol -c

This produces the following files:

GSE164073-1.tsv

GeneID	Symbol	control-GSM4996084	control-GSM4996085	control-GSM4996086	treatment-GSM4996088	treatment-GSM4996087	treatment-GSM4996089
1	A1BG	144	197	157	156	133	122
2	A2M	254	276	262	178	153	178
3	A2MP1	1	0	2	0	0	0
9	NAT1	97	133	103	83	93	88
...	...	...	...	...	...	...	...

GSE164073-2.tsv

GeneID	Symbol	control-GSM4996092	control-GSM4996091	control-GSM4996090	treatment-GSM4996095	treatment-GSM4996094	treatment-GSM4996093
1	A1BG	175	167	203	143	145	145
2	A2M	261	158	427	215	145	169
3	A2MP1	0	0	0	0	0	2
9	NAT1	122	100	133	90	78	80
...	...	...	...	...	...	...	...

GSE164073-3.tsv

GeneID	Symbol	control-GSM4996098	control-GSM4996097	control-GSM4996096	treatment-GSM4996099	treatment-GSM4996100	treatment-GSM4996101
1	A1BG	158	115	140	136	124	145
2	A2M	3337	2261	2536	1524	1288	1807
3	A2MP1	0	0	0	0	0	0
9	NAT1	83	64	68	65	52	79
...	...	...	...	...	...	...	...
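
The sample columns in these files are named `<group><SEP><GSM>`, with `-` as the default separator (`-S`). As a minimal sketch using only the standard library, the header can be split back into groups as follows. The inline text reproduces the first rows of GSE164073-1.tsv shown above; the per-group sum is a toy summary chosen for illustration, not something the library computes.

```python
import csv
import io

SEP = "-"  # default value of -S/--sep

# First rows of GSE164073-1.tsv as shown above (tab-separated).
tsv_text = (
    "GeneID\tSymbol\tcontrol-GSM4996084\tcontrol-GSM4996085\tcontrol-GSM4996086\t"
    "treatment-GSM4996088\ttreatment-GSM4996087\ttreatment-GSM4996089\n"
    "1\tA1BG\t144\t197\t157\t156\t133\t122\n"
    "2\tA2M\t254\t276\t262\t178\t153\t178\n"
)

reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
header = next(reader)

# Map each group name to the positions of its sample columns.
group_columns = {}
for index, name in enumerate(header):
    group, sep, gsm = name.partition(SEP)
    if sep and gsm.startswith("GSM"):
        group_columns.setdefault(group, []).append(index)

# Sum the raw counts per group for each gene symbol.
totals = {}
for row in reader:
    totals[row[1]] = {
        group: sum(int(row[i]) for i in cols)
        for group, cols in group_columns.items()
    }

print(group_columns)  # {'control': [2, 3, 4], 'treatment': [5, 6, 7]}
print(totals["A1BG"])  # {'control': 498, 'treatment': 411}
```

To read a real output file instead, replace `io.StringIO(tsv_text)` with an open file handle.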

The -c (--cleanup) option in the command above removes the source files obtained from NCBI. Omit it to keep them in the directory given by -s (--src-dir).

Example in Python

To get the output as a pandas DataFrame, please refer to the following code:

from ncbi_counts import Series

series = Series(
    "GSE164073",
    [
        {
            "control": {"title": "Cornea", "characteristics_ch1": "mock"},
            "treatment": {"title": "Cornea", "characteristics_ch1": "SARS-CoV-2"},
        },
        {
            "control": {"title": "Limbus", "characteristics_ch1": "mock"},
            "treatment": {"title": "Limbus", "characteristics_ch1": "SARS-CoV-2"},
        },
        {
            "control": {"geo_accession": "^GSM499609[6-8]$"},
            "treatment": {"geo_accession": "^GSM4996099$|^GSM4996100$|^GSM4996101$"},
        },
    ],
    keep_annot=["Symbol"],
    save_to=None,
)
series.generate_pair_matrix()
# series.cleanup()  # remove source files
series.pair_count_list[0]  # Corresponds to GSE164073-1.tsv
series.pair_count_list[1]  # Corresponds to GSE164073-2.tsv
series.pair_count_list[2]  # Corresponds to GSE164073-3.tsv
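
The pair specifications above repeat the same structure for each tissue, and the third pair mixes a character class with an explicit alternation. The sketch below builds the same list programmatically and checks that the two regex styles select the same accessions. `attribute_pair` and `accession_regex` are hypothetical helper names introduced here for illustration; they are not part of ncbi_counts, and the check only exercises Python's `re` module, not the library's own matching.

```python
import re


def attribute_pair(tissue):
    """Control/treatment spec keyed by sample attributes, as in the example above."""
    return {
        "control": {"title": tissue, "characteristics_ch1": "mock"},
        "treatment": {"title": tissue, "characteristics_ch1": "SARS-CoV-2"},
    }


def accession_regex(accessions):
    """Anchored alternation that matches exactly the given GSM accessions."""
    return "|".join(f"^{re.escape(a)}$" for a in accessions)


pairs = [attribute_pair(tissue) for tissue in ("Cornea", "Limbus")]
pairs.append(
    {
        "control": {
            "geo_accession": accession_regex(["GSM4996096", "GSM4996097", "GSM4996098"])
        },
        "treatment": {
            "geo_accession": accession_regex(["GSM4996099", "GSM4996100", "GSM4996101"])
        },
    }
)

# The 18 sample accessions that appear in the YAML example.
accessions = [f"GSM{n}" for n in range(4996084, 4996102)]

compact = re.compile(r"^GSM499609[6-8]$")
explicit = re.compile(pairs[2]["control"]["geo_accession"])

compact_hits = [a for a in accessions if compact.search(a)]
explicit_hits = [a for a in accessions if explicit.search(a)]

print(compact_hits)  # ['GSM4996096', 'GSM4996097', 'GSM4996098']
print(compact_hits == explicit_hits)  # True
```

The resulting `pairs` list has the same shape as the second argument passed to `Series` above.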

License

ncbi_counts is released under an MIT license.

Release history

0.2.0 (this version): Jun 4, 2024
0.1.0: Dec 5, 2023
