The OmicIDX project collects, reprocesses, and then republishes metadata from multiple public genomics repositories. Included are the NCBI SRA, Biosample, and GEO databases. Publication is via the cloud data warehouse platform Bigquery, a set of performant search and retrieval APIs, and a set of json-format files for easy incorporation into other projects.
Project description
New process
Steps
- Download xml
- Create basic json
- Upload json to s3
- munge basic json to parquet
- munge parquet to
- experiment joined
- sample joined
- run joined
- study with aggregates
- Include aggs in spark jobs:
- number of samples, experiments, runs
- sample, experiment, and run accessions (as array)
- Save munged spark data (json, parquet)
- Create elasticsearch index mappings
- Drop existing elasticsearch mappings
- Load elasticsearch index mappings
lambda
zip lambdas.zip lambda_handlers.py sra_parsers.py
aws lambda create-function --function-name sra_to_json --zip-file fileb://lambdas.zip --handler lambda_handlers.lambda_return_full_experiment_json --runtime python3.6 --role arn:aws:iam::377200973048:role/lambda_s3_exec_role
Invoke
aws lambda invoke --function-name sra_to_json --log-type Tail --payload '{"accession":"SRX000273"}' /tmp/abc.txt
Concurrency
1000 total, reserve for certain functions to limit, etc.
aws lambda put-function-concurrency --function-name sra_to_json --reserved-concurrent-executions 20
timeout and memory
aws lambda update-function-configuration --function-name sra_to_json --timeout 15
logging
https://github.com/jorgebastida/awslogs
awslogs get /aws/lambda/sra_to_json ALL --watch
dynamodb
aws dynamodb scan --table-name sra_experiment --select "COUNT"
GEO
python -m omicidx.geometa --gse=GSE10
Will print json, one "line" per entity to stdout.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.