ZORP: A helpful GWAS parser

Why?

ZORP is intended to abstract away differences in file formats, and help you work with GWAS data from many different sources.

Provides a single unified interface to read text, gzip, or tabixed data
Separates reading from parsing (with parsers that can handle the most common file formats)
Includes helpers to auto-detect data format and filter for variants of interest
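
The split between reading and parsing described above can be illustrated with a minimal sketch. This is not zorp's implementation: the names Variant, parse_line, open_text, and SimpleReader are hypothetical, and the three-column tab-delimited layout is an assumption made for the example.

```python
import gzip
from typing import Callable, Iterable, Iterator, List, NamedTuple, Tuple


class Variant(NamedTuple):
    """Hypothetical container for one parsed row."""
    chrom: str
    pos: int
    neg_log_pvalue: float


def parse_line(line: str) -> Variant:
    """Parser: turn one tab-delimited line into a structured record."""
    chrom, pos, neg_log_p = line.rstrip("\n").split("\t")
    return Variant(chrom, int(pos), float(neg_log_p))


def open_text(path: str):
    """Reader helper: open plain or gzip-compressed text the same way."""
    opener = gzip.open if path.endswith(".gz") else open
    return opener(path, "rt")


class SimpleReader:
    """Reader: iterates over lines and delegates parsing to the given parser."""

    def __init__(self, lines: Iterable[str], parser: Callable[[str], Variant],
                 skip_rows: int = 0) -> None:
        self._lines = lines
        self._parser = parser
        self._skip_rows = skip_rows
        self._filters: List[Callable[[Variant], bool]] = []
        self.errors: List[Tuple[int, str, str]] = []

    def add_filter(self, predicate: Callable[[Variant], bool]) -> None:
        self._filters.append(predicate)

    def __iter__(self) -> Iterator[Variant]:
        self.errors = []
        for number, line in enumerate(self._lines, start=1):
            if number <= self._skip_rows:
                continue
            try:
                row = self._parser(line)
            except (ValueError, TypeError) as exc:
                # Record the failure and keep going instead of stopping
                self.errors.append((number, str(exc), line))
                continue
            if all(keep(row) for keep in self._filters):
                yield row
```

Because the reader only needs an iterable of lines, the same parser can be reused whether the lines come from a list, a plain text file, or a gzip file opened with open_text.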

Why not?

ZORP provides a high-level abstraction. This means that it is convenient, at the expense of speed.

For GWAS files, ZORP does not sort the data for you, because doing so in Python would be quite slow. You will still need to do some basic data preparation before using it.

Installation

By default, zorp installs with as few Python dependencies as practical. For better performance, and to use special features, install the additional dependencies as follows:

$ pip install zorp[perf,lookups]

The snp-to-rsid lookup requires a very large file in order to work efficiently. You can download the pre-generated file using the zorp-assets command line script, as follows. (Use "--no-update" to skip warnings about already having the latest version.)

$ zorp-assets download --type snp_to_rsid --tag genome_build GRCh37  --no-update
$ zorp-assets download --type snp_to_rsid --tag genome_build GRCh37

Or build it manually (which may require first downloading a large source file):

$ zorp-assets build --type snp_to_rsid --tag genome_build GRCh37

Assets will be downloaded to the least user-specific location available, which may be overridden by setting the environment variable ZORP_ASSETS_DIR. Run zorp-assets show --all to see the currently selected asset directory.
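
If you drive zorp-assets from a Python script, the override can be passed per command rather than set globally. The following is a small sketch under stated assumptions: the helper names assets_env and show_assets and the directory path are hypothetical, and the command is only run if zorp-assets is found on the PATH.

```python
import os
import shutil
import subprocess
from typing import Dict, Optional


def assets_env(assets_dir: str) -> Dict[str, str]:
    """Return a copy of the current environment with ZORP_ASSETS_DIR set."""
    env = dict(os.environ)
    env["ZORP_ASSETS_DIR"] = assets_dir
    return env


def show_assets(assets_dir: str) -> Optional[str]:
    """Run `zorp-assets show --all` with the override applied.

    Returns the command output, or None if zorp-assets is not installed.
    """
    if shutil.which("zorp-assets") is None:
        return None
    result = subprocess.run(
        ["zorp-assets", "show", "--all"],
        env=assets_env(assets_dir),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # "/data/zorp_assets" is a placeholder path chosen for illustration
    print(show_assets("/data/zorp_assets"))
```

Copying the environment leaves the parent process untouched, so other tools started from the same script are not affected by the override.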

A note on rsID lookups

When developing on your laptop, you may not wish to download the 16 GB of data that each rsID lookup requires. A much smaller "test" dataset is available, which contains rsID data for a handful of pre-selected genes of known biological functionality.

$ zorp-assets download --type snp_to_rsid_test --tag genome_build GRCh37

To use it in your Python script, simply add an argument to the SnpToRsid constructor:

rsid_finder = lookups.SnpToRsid('GRCh37', test=True)

If you have generated your own lookup using the code in this repo (make_rsid_lookup.py), you may also replace the genome build with a hardcoded path to the LMDB file of lookup data. This use case is fairly uncommon, however.

Usage

Python

from zorp import lookups, readers, parsers

# Create a parser and a reader instance. This example specifies each option for clarity, but sniffers are provided to
#   auto-detect common format options.
sample_parser = parsers.GenericGwasLineParser(marker_col=1, pvalue_col=2, is_neg_log_pvalue=True,
                                              delimiter='\t')
reader = readers.TabixReader('input.bgz', parser=sample_parser, skip_rows=1, skip_errors=True)

# After parsing the data, values of pre-defined fields can be used to perform lookups for the value of one field
#  Lookups can be reusable functions with no dependence on zorp
rsid_finder = lookups.SnpToRsid('GRCh37')

reader.add_lookup('rsid', lambda variant: rsid_finder(variant.chrom, variant.pos, variant.ref, variant.alt))

# Sometimes a more powerful syntax is needed: the ability to look up several fields at once, or clean up parsed data
#   in some way unique to this dataset. (mutate_entire_variant is a placeholder for a function that you define.)
reader.add_transform(lambda variant: mutate_entire_variant(variant))

# We can filter data to the variants of interest. If you use a domain specific parser, columns can be referenced by name
reader.add_filter('chrom', '19')  # This row must have the specified value for the "chrom" field
reader.add_filter(lambda row: row.neg_log_pvalue > 7.301)  # Provide a function that can operate on all parsed fields
reader.add_filter('neg_log_pvalue')  # Exclude values with missing data for the named field  

# Iteration returns containers of cleaned, parsed data (with fields accessible by name).
for row in reader:
    print(row.chrom)

# Tabix files support iterating over all or part of the file
for row in reader.fetch('X', 500_000, 1_000_000):
    print(row)

# Write a compressed, tabix-indexed file containing the subset of variants that match filters, choosing only specific 
#   columns. The data written out will be cleaned and standardized by the parser into a well-defined format. 
out_fn = reader.write('outfile.txt', columns=['chrom', 'pos', 'pvalue'], make_tabix=True)

# Real data is often messy. If a line fails to parse, the problem will be recorded.
for number, message, raw_line in reader.errors:
    print('Line {} failed to parse: {}'.format(number, message))
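
The filter threshold of 7.301 used above is on the -log10(p) scale, and appears to correspond to the conventional genome-wide significance cutoff of p = 5e-8. The conversion can be checked with a short sketch; the helper names neg_log10 and from_neg_log10 are hypothetical and are not part of zorp.

```python
import math


def neg_log10(pvalue: float) -> float:
    """Convert a p-value in (0, 1] to the -log10(p) scale."""
    if not 0 < pvalue <= 1:
        raise ValueError("p-value must be in the interval (0, 1]")
    return -math.log10(pvalue)


def from_neg_log10(value: float) -> float:
    """Convert a -log10(p) value back to a p-value."""
    return 10 ** -value


threshold = neg_log10(5e-8)
print(round(threshold, 3))  # 7.301
```

Larger values on this scale mean smaller p-values, which is why the filter keeps rows where neg_log_pvalue is greater than the threshold.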

Command line file conversion

The file conversion feature of zorp is also available as a command line utility. See zorp-convert --help for details and the full list of supported options.

This utility is currently in beta; please inspect the results carefully.

To auto-detect columns based on a library of commonly known file formats:

$ zorp-convert --auto infile.txt --dest outfile.txt --compress

Or specify your data columns exactly:

$ zorp-convert infile.txt --dest outfile.txt --index --skip-rows 1 --chrom_col 1 --pos_col 2 --ref_col 3 --alt_col 4 --pvalue_col 5 --beta_col 6 --stderr_beta_col 7 --allele_freq_col 8

The --index option requires that your file be sorted first. If it is not, you can sort, compress, and tabix the standard output format manually as follows.

$ (head -n 1 <filename.txt> && tail -n +2 <filename.txt> | sort -k1,1 -k 2,2n) | bgzip > <filename.sorted.gz>
$ tabix <filename.sorted.gz> -p vcf
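
Before using --index, it can help to confirm whether a file is already sorted. The sketch below is one way to check, and is not how zorp or tabix perform the check. It assumes a tab-delimited file with one header row and 1-based column numbers (as in the zorp-convert options above); the function name first_unsorted_line is hypothetical.

```python
from typing import Iterable, Optional, Set


def first_unsorted_line(lines: Iterable[str], chrom_col: int = 1, pos_col: int = 2,
                        skip_rows: int = 1, delimiter: str = "\t") -> Optional[int]:
    """Return the 1-based number of the first out-of-order line, or None.

    A file counts as sorted here if each chromosome appears in one contiguous
    block and positions never decrease within that block.
    """
    seen: Set[str] = set()
    current = None
    last_pos = -1
    for number, line in enumerate(lines, start=1):
        if number <= skip_rows or not line.strip():
            continue
        fields = line.rstrip("\n").split(delimiter)
        chrom = fields[chrom_col - 1]
        pos = int(fields[pos_col - 1])
        if chrom != current:
            if chrom in seen:
                return number  # chromosome block was split
            seen.add(chrom)
            current, last_pos = chrom, pos
        elif pos < last_pos:
            return number  # position went backwards
        else:
            last_pos = pos
    return None
```

The check streams through the file once and only remembers the chromosomes seen so far, so it stays cheap even for large inputs. Pass it an open file handle, for example first_unsorted_line(open('outfile.txt')).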

Development

To install dependencies and run in development mode:

$ pip install -e '.[test,perf,lookups]'

To run linting, type checks, and unit tests, use:

$ flake8 zorp
$ mypy zorp
$ pytest tests/

Release history

0.3.8 (this version): Jul 18, 2022
0.3.7: Jun 20, 2022
0.3.6: May 28, 2022
0.3.5: May 27, 2022
0.3.4: Dec 17, 2021
0.3.3: Jul 13, 2020
0.3.2: Jun 5, 2020
0.3.1: Jun 5, 2020
0.3.0: Apr 17, 2020
0.2.0: Jan 20, 2020
0.1.1: Dec 18, 2019
0.1.0: Oct 7, 2019
