geo-alchemy
A Python library and command line tool for turning GEO data into gold.
why geo-alchemy
GEO is like a gold mine full of ore, but refining that ore (GEO series) into gold (expression matrices, clinical data) is not easy:
- How do you map microarray probes to genes?
- What do you do when multiple probes map to the same gene?
- How do you get the clinical data?
- ...
geo-alchemy was built to handle these problems.
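To make the multiple-probes question concrete, here is a minimal sketch of one common approach: average the probe rows that share a gene symbol. It is an illustration only. The source does not say which strategy geo-alchemy applies, and `collapse_probes`, the probe IDs, and the values below are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes(expression, probe_to_gene):
    """Average probe-level rows that map to the same gene.

    expression: {probe_id: [value per sample]}
    probe_to_gene: {probe_id: gene symbol}; unmapped probes are dropped.
    """
    rows_by_gene = defaultdict(list)
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene:  # skip probes that have no gene annotation
            rows_by_gene[gene].append(values)
    # zip(*rows) groups the values of each sample across probes
    return {
        gene: [mean(column) for column in zip(*rows)]
        for gene, rows in rows_by_gene.items()
    }


# Hypothetical probe-level data for two samples
expression = {
    "probe_a": [2.0, 4.0],
    "probe_b": [4.0, 6.0],
    "probe_c": [1.0, 1.0],
    "probe_d": [9.0, 9.0],  # no gene mapping, so it is dropped
}
probe_to_gene = {"probe_a": "TP53", "probe_b": "TP53", "probe_c": "EGFR"}

gene_level = collapse_probes(expression, probe_to_gene)
print(gene_level)
```

Other strategies, such as keeping the probe with the highest mean signal, fit the same shape: only the reduction inside the final dict comprehension changes.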
installation
If you only want to use it as a Python library:
pip install geo-alchemy
If you also want to use it as a command line tool:
pip install 'geo-alchemy[cmd]'
use as Python library
parse metadata from GEO
parse platform
from geo_alchemy import PlatformParser
parser = PlatformParser.from_accession('GPL570')
platform1 = parser.parse()
# or
platform2 = PlatformParser.from_accession('GPL570').parse()
print(platform1 == platform2)
# get platform annotation data
platform = PlatformParser.from_accession('GPL570', view='full').parse()
print(platform.internal_data)
parse sample
from geo_alchemy import SampleParser
parser = SampleParser.from_accession('GSM1885279')
sample1 = parser.parse()
# or
sample2 = SampleParser.from_accession('GSM1885279').parse()
print(sample1 == sample2)
parse series
from geo_alchemy import SeriesParser
parser = SeriesParser.from_accession('GSE73091')
series1 = parser.parse()
# or
series2 = SeriesParser.from_accession('GSE73091').parse()
print(series1 == series2)
print(series1.platforms)
print(series1.samples)
print(series1.organisms)
serialization and deserialization
To make saving easy, every object in geo-alchemy can be converted to a dict, and that dict can be written directly to a file as JSON.
geo-alchemy also provides methods to convert these dicts back into objects.
from geo_alchemy import SeriesParser
series1 = SeriesParser.from_accession('GSE73091').parse()
data = series1.to_dict()
series2 = SeriesParser.parse_dict(data)
print(series1 == series2)
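The file half of that round trip can be sketched with the standard library alone. `save_dict` and `load_dict` are hypothetical helper names, and the dict below is a placeholder rather than real geo-alchemy output.

```python
import json
import os
import tempfile


def save_dict(data, path):
    """Write a plain dict to a JSON file."""
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(data, fp, indent=2)


def load_dict(path):
    """Read a JSON file back into a dict."""
    with open(path, encoding="utf-8") as fp:
        return json.load(fp)


# Placeholder for the dict returned by to_dict(); the keys are made up
data = {"accession": "GSE73091", "title": "placeholder"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "series.json")
    save_dict(data, path)
    restored = load_dict(path)

print(restored == data)
```

With geo-alchemy, `data` would come from `series1.to_dict()` and `restored` would be passed to `SeriesParser.parse_dict`, as in the example above.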
use as command line software
using OCM
OCM (object command mapping) is a Python framework that maps Python objects to command line software. It can capture the intermediate results of a command. Enable OCM output like this:
geo-alchemy xxx --ocmir
probe reannotation
Prerequisites:
- NCBI BLAST must be installed.
- A BLAST index must be generated.
For more details, refer to this page.
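A quick way to check the first prerequisite from Python is to look for a BLAST+ executable on `PATH`. This is a sketch under one assumption: the source does not say which BLAST program geo-alchemy invokes, so `blastn` is a guess, and `find_blast` is a hypothetical helper.

```python
import shutil


def find_blast(executable="blastn"):
    """Return the full path of a BLAST+ executable, or None if it is not on PATH."""
    return shutil.which(executable)


path = find_blast()
if path is None:
    print("BLAST not found on PATH; install NCBI BLAST+ first")
else:
    print(f"BLAST found at {path}")
```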
geo-alchemy -d reanno -p GPL15303 -s 9 -d /Users/dev/Data/blast-indexes/GRCh38.p13/GRCh38.p13
- -p GPL15303: probe reannotation for GPL15303
- -s 9: the 9th column of the platform annotation file is the probe sequence
- -d xxx: BLAST index location
If your reference sequences were downloaded from GENCODE, enable --gencode so that the gene symbol can be extracted from the gene ID:
geo-alchemy -d reanno -p GPL15303 -s 9 -d /Users/dev/Data/blast-indexes/GRCh38.p13/GRCh38.p13 --gencode
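When the reannotation step is scripted, the argument list can be assembled in Python. The sketch below uses only the flags described above (-p, -s, -d, --gencode). `build_reanno_command` is a hypothetical helper, and it follows the plain `geo-alchemy reanno ...` form, omitting the leading `-d` that appears before `reanno` in the commands above.

```python
import shlex


def build_reanno_command(platform, seq_column, blast_index, gencode=False):
    """Assemble the argument list for `geo-alchemy reanno`."""
    command = [
        "geo-alchemy", "reanno",
        "-p", platform,          # platform accession
        "-s", str(seq_column),   # column that holds the probe sequence
        "-d", blast_index,       # BLAST index location
    ]
    if gencode:
        command.append("--gencode")
    return command


command = build_reanno_command(
    "GPL15303",
    9,
    "/Users/dev/Data/blast-indexes/GRCh38.p13/GRCh38.p13",
    gencode=True,
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once geo-alchemy and BLAST are installed.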
preprocessing (microarray series only)
download metadata over the network:
geo-alchemy pp -s GSE174772 -p GPL570 -g 11
- -s GSE174772: preprocess GSE174772
- -p GPL570: preprocess only the samples of GSE174772 that use GPL570
- -g 11: column 11 of the GPL570 annotation file is the gene
This command generates 2 files in the current directory:
- clinical file: GSE174772_clinical.txt
- gene expression file: GSE174772_expression.txt
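The two output files can then be loaded for downstream analysis. The sketch below assumes they are tab-delimited text with a header row, which the source does not confirm. `read_table` is a hypothetical helper, and the sample IDs and columns are made up.

```python
import csv
import io


def read_table(fp, delimiter="\t"):
    """Read a delimited text table with a header row into a list of dicts."""
    return list(csv.DictReader(fp, delimiter=delimiter))


# Hypothetical stand-in for the contents of a clinical file
sample_text = "sample\tage\nGSM0000001\t54\nGSM0000002\t61\n"
rows = read_table(io.StringIO(sample_text))
print(rows[0]["age"])
```

For a real file, open it with `open("GSE174772_clinical.txt", newline="")` and pass the file object to `read_table`. Every value comes back as a string, so convert numeric columns explicitly.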
use existing series metadata:
import json
from geo_alchemy import SeriesParser
series = SeriesParser.from_accession('GSE174772').parse()
data = series.to_dict()
with open('GSE174772.json', 'w') as fp:
    json.dump(data, fp)
geo-alchemy pp -sf GSE174772.json -g 11
use an existing probe-gene mapping file:
Probe reannotation with geo-alchemy reanno produces a probe-gene mapping file, which you can then pass to geo-alchemy pp with -m:
geo-alchemy reanno -p GPL6480 -s 17 -d /Users/dev/Data/blast-indexes/GRCh38.p13/GRCh38.p13 --gencode
geo-alchemy pp -s GSE12435 -m GPL6480_reanno.txt
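Because the second command depends on the file written by the first, a script should stop if reannotation fails. The sketch below runs the two commands above in order. `run_steps` is a hypothetical helper, and its `runner` parameter exists only so the sequence can be dry-run without geo-alchemy installed.

```python
import subprocess


def run_steps(steps, runner=subprocess.run):
    """Run commands in order, stopping at the first failure.

    With the default runner, check=True makes subprocess.run raise
    CalledProcessError on a non-zero exit, so later steps are skipped.
    """
    for step in steps:
        runner(step, check=True)


steps = [
    ["geo-alchemy", "reanno", "-p", "GPL6480", "-s", "17",
     "-d", "/Users/dev/Data/blast-indexes/GRCh38.p13/GRCh38.p13", "--gencode"],
    ["geo-alchemy", "pp", "-s", "GSE12435", "-m", "GPL6480_reanno.txt"],
]

# Dry run: record the commands instead of executing them
recorded = []
run_steps(steps, runner=lambda cmd, check: recorded.append(cmd))
print(len(recorded))
```

Calling `run_steps(steps)` without a custom runner executes the commands for real.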