MVTest - GWAS Analysis
- Install with PIP
- Manual Installation
- System Requirements
- Running Unit Tests
- Virtual Env
- What is MVtest?
- Command-Line Arguments
- mvmany Helper script
- The Default Template
- Command Line Arguments
- Development Notes
- MVtest authors
- Change Log
MVtest requires python 2.7.x as well as the following libraries:
- NumPy (version 1.7.2 or later) www.numpy.org
- SciPy (version 0.13.2 or later) www.scipy.org
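As a rough illustration of checking the minimum versions listed above, the sketch below compares dotted version strings numerically. The helper names `parse_version` and `meets_minimum` are hypothetical and are not part of MVtest; only the minimum version numbers come from the list above.

```python
import re

# Minimum versions taken from the requirements list above.
MINIMUMS = {"numpy": "1.7.2", "scipy": "0.13.2"}


def parse_version(text):
    """Turn a dotted version string such as '1.7.2' into a tuple of ints.

    Parsing stops at the first piece that does not start with a digit,
    so '1.10.0rc1' becomes (1, 10, 0).
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)
```

With NumPy installed, a check might look like `meets_minimum(numpy.__version__, MINIMUMS["numpy"])`. Comparing tuples of integers avoids the mistake of treating "1.10.0" as older than "1.7.2", which a plain string comparison would do.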
MVtest’s installation will attempt to install these required components for you; however, it requires that you have write permission to the installation directory. If you are using a shared system and lack the necessary privileges to install libraries and software yourself, please see the Miniconda or Virtual Env sections below for options for setting up your own Python environment, which will exist entirely under your own control.
Installation can be done in two ways:
Install with PIP
To install using python’s package manager, pip, simply use the following command:
$ pip install MVtest
If you have proper permission to install packages, this will attempt to download and install all dependencies along with MVtest itself.
For users who do not use pip or wish to run the bundled tests as well as have a local copy of the manuals, manual installation is almost as easy.
For users with Git installed, you can simply clone the sources using the following command:
$ git clone https://github.com/edwards-lab/MVtest
Or you may visit the website and download the tarball directly from github: https://github.com/edwards-lab/MVtest
Once you have downloaded the software, simply extract the contents and run the following command to install it:
$ python setup.py install
If no errors are reported, it should be installed and ready to use.
Regarding Python 3: I began the process of updating the code to work with both Python versions 2 and 3; however, there are some real issues with library support for version 3 that are discouraging. Until those have been resolved, I have no plans to invest further time toward support for Python 3.
Aside from the library dependencies, MVtest’s requirements depend largely on the number of SNPs and individuals being analyzed as well as the data format being used. In general, GWAS-sized datasets will require several gigabytes of memory when using the traditional pedigree format; however, even tens of thousands of subjects can be analyzed with less than 1 gigabyte of RAM when the data is formatted as transposed pedigree or PLINK’s default bed format.
Otherwise, it is recommended that MVtest be run on a Unix-like system such as Linux or OS X, but it should work under Windows as well (we can’t offer support for running MVtest under Windows).
Running Unit Tests
MVTest comes with a unit test suite which can be run prior to installation. To run the tests, simply run the following command from within the root directory of the extracted archive’s contents:
$ python setup.py test
If no errors are reported, then mvtest should run correctly on your system.
Virtual Env is a powerful tool for Python programmers and end users alike, as it allows users to deploy different versions of Python applications without the need for root access to the machine.
Because MVtest requires version 2.7, you’ll need to ensure that your machine’s Python version is in compliance. Virtual Env basically uses the system version of Python, but creates a user-owned environment wrapper allowing users to install libraries easily without administrative rights to the machine.
For a helpful introduction to VirtualEnv, please have a look at the tutorial: http://www.simononsoftware.com/virtualenv-tutorial/
Miniconda is a minimal version of the package manager used by the Anaconda python distribution. It makes it easy to create local installations of python with the latest versions of the common scientific libraries for users who don’t have root access to their target machines. Basically, when you use miniconda, you’ll be installing your own version of Python into a directory under your control which allows you to install anything else you need without having to submit a helpdesk ticket for administrative assistance.
Unlike pip, the folks behind the conda distributions provide binary downloads of its selected library components. As such, only the most popular libraries, such as pip, NumPy and SciPy, are supported by conda itself. However, these do not require compilation and may be easier to install than when using pip alone. I have experienced difficulty installing SciPy through pip and setuptools on our cluster here at Vanderbilt due to non-standard paths for certain required components, but Miniconda always comes through.
Firstly, download and install the appropriate version of miniconda at the project website. Please be sure to choose the Python 2 version: http://conda.pydata.org/miniconda.html
While it is doing the installation, please allow it to update your PATH information. If you prefer not to always use this version of Python in the future, simply tell it not to update your .bashrc file and note the instructions for loading and unloading your new Python environment. Please note that even if you choose to update your .bashrc file, you will need to follow the directions for loading the changes into your current shell.
Once those changes have taken effect, install pip and SciPy:
$ conda install pip scipy
Installing SciPy will also force the installation of NumPy, which is also required for running MVtest. (setuptools includes easy_install).
Once that has been completed successfully, you should be ready to follow the standard instructions for installing mvtest.
What is MVtest?
TODO: Write some background information about the application and its scientific basis.
Documentation for MVtest is still under construction. However, the application provides reasonable inline help using standard unix help arguments:
> mvtest.py -h
> mvtest.py --help
In general, overlapping functionality should mimic that of PLINK.
Command line arguments used by MVtest often mimic those used by PLINK, except where there is no matching functionality (or the functionality differs significantly).
For the parameters listed below, when a parameter requires a value, the value must follow the argument with a single space separating the two (no ‘=’ signs.) For flags with no specified value, passing the flag indicates that condition is to be “activated”.
When there is no value listed in the “Type” column, the arguments are off by default and on when the argument is present (i.e. by default, compression is turned off except when the flag, --compressed, has been provided).
|-h, --help||Show this help message and exit.|
|-v||Print version number|
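The argument convention described above (a value follows its flag as a separate, space-delimited token with no ‘=’ sign, and bare flags switch a feature on) can be sketched as a small helper that assembles an argument list. The function `build_command` is hypothetical and only illustrates the convention; the flags used in the usage note are ones documented elsewhere in this manual.

```python
def build_command(program, options):
    """Assemble an argument list following the convention described above.

    options is a list of (flag, value) pairs:
      * value True         -> the bare flag is emitted (feature switched on)
      * value False / None -> the flag is omitted (feature stays off)
      * anything else      -> flag and value are emitted as two separate
                              tokens, never joined with '='
    """
    argv = [program]
    for flag, value in options:
        if value is True:
            argv.append(flag)
        elif value is False or value is None:
            continue
        else:
            argv.extend([flag, str(value)])
    return argv
```

For example, `build_command("mvtest.py", [("--maf", 0.05), ("--verbose", True), ("--compressed", False)])` yields `["mvtest.py", "--maf", "0.05", "--verbose"]`, which could then be handed to `subprocess.call`.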
MVtest attempts to mimic the interface for PLINK where appropriate.
All input files should be whitespace delimited. For text based allelic annotations, 1|2 and A|C|G|T annotation is sufficient. All data must be expressed as alleles, not as genotypes (except for IMPUTE output, which is a specialized format that is very different from the other forms).
For Pedigree, Transposed Pedigree and PLINK binary pedigree files, using the PREFIX arguments is sufficient and recommended if your files follow the standard naming conventions.
Pedigree data is fully supported, however it is not recommended. When loading pedigree data, MVtest must load the entire dataset into memory prior to analysis, which can result in a substantial amount of memory overhead that is unnecessary.
Flags like --no-pheno and --no-sex can be used in any combination, creating MAP files with highly flexible header structures.
|(filename prefix) Prefix for .ped and .map files|
|PLINK compatible .ped file|
|PLINK compatible .map file|
|--map3||Map file has only 3 columns|
|--no-sex||Pedigree file doesn’t have column 5 (sex)|
|--no-parents||Pedigree file doesn’t have columns 3 and 4 (parents)|
|--no-fid||Pedigree file doesn’t have column 1 (family ID)|
|--no-pheno||Pedigree file doesn’t have column 6 (phenotype)|
|--liability||Pedigree file has column 7 (liability)|
Transposed Pedigree Data
Transposed Pedigree data is similar to standard pedigree except that the data is arranged with SNPs as rows instead of individuals. This allows MVtest to run its analysis without loading the entire dataset into memory.
|Prefix for .tped and .tfam files|
|Transposed Pedigree file (.tped)|
|Transposed Pedigree Family file (.tfam)|
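To illustrate why the SNP-per-row layout keeps memory use low, here is a minimal sketch that streams a transposed pedigree file one locus at a time. It assumes the standard PLINK .tped column order (chromosome, SNP ID, genetic distance, base-pair position, then two alleles per individual). The generator `iter_tped` is hypothetical and is not MVtest’s own parser.

```python
def iter_tped(lines):
    """Yield (chromosome, snp_id, position, genotypes) one locus at a time.

    lines may be any iterable of text lines, such as an open file handle,
    so only the current row is ever held in memory.
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chrom, snp_id, _distance, position = fields[:4]
        alleles = fields[4:]
        if len(alleles) % 2 != 0:
            raise ValueError("odd number of alleles for %s" % snp_id)
        # Pair consecutive alleles: one (allele1, allele2) tuple per person.
        genotypes = list(zip(alleles[0::2], alleles[1::2]))
        yield chrom, snp_id, int(position), genotypes
```

Passing an open file object, for instance `iter_tped(open("study.tped"))` with a hypothetical file name, processes each SNP as soon as it is read instead of building the whole matrix first.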
Pedigree/Transposed Pedigree Common Flags
By default, Pedigree and Transposed Pedigree data is assumed to be uncompressed. However, MVtest can directly use gzipped data files if they have the extension .tgz with the addition of the --compressed argument.
|--compressed||Indicate that ped/tped files have been compressed with gzip and are named with extensions such as .ped.tgz and .tped.tgz|
MVtest doesn’t call genotypes when performing analysis and allows users to define which model to use when analyzing the data. Due to the fact that there is no specific location for chromosome within the input files, MVtest requires that users provide chromosome, impute input file and the corresponding .info file for each imputed output.
Due to the huge number of expected loci, MVtest allows users to specify an offset and file count for analysis. This is to allow users to run multiple jobs simultaneously on a cluster and work individually on separate impute region files. Users can segment those regions even further using standard MVtest region selection as well.
By default, all imputed data is assumed to be compressed using gzip.
Default naming convention is for impute data files to end in .gen.gz and the info files to have the same name except for the end being replaced by .info.
|File containing list of impute output for analysis|
|File containing family details for impute data|
|Impute file index (1 based) to begin analysis|
|Number of impute files to process (for this node). Defaults to all remaining.|
|Indicate that the impute input is not gzipped, but plain text|
|(additive, dominant or recessive) Genetic model to be used when analyzing imputed data.|
|Portion of filename that denotes info file|
|Portion of filename that denotes gen file|
|Threshold for filtering imputed SNPs with poor ‘info’ values|
IMPUTE File Input
When performing an analysis on IMPUTE output, users must provide a single file which lists each of the gen files to be analyzed. This plain text file contains 2 (or optionally 3) columns for each gen file:
|Chromosome||Gen File||.info <filename> (optional)|
The 3rd column is only required if your .info files and .gen files are not the same except for the extension.
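The list-file layout described above can be sketched as a small parser. It assumes the default naming convention mentioned earlier (gen files ending in .gen.gz, info files sharing the same name with that ending replaced by .info). The function `parse_impute_list` and its parameter names are hypothetical and do not reflect MVtest’s internal code.

```python
def parse_impute_list(lines, gen_ext=".gen.gz", info_ext=".info"):
    """Parse lines of 'chromosome gen_file [info_file]'.

    Returns a list of (chromosome, gen_file, info_file) tuples. When the
    optional third column is absent, the info filename is derived by
    swapping the gen extension for the info extension.
    """
    entries = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (2, 3):
            raise ValueError("expected 2 or 3 columns: %r" % line)
        chrom, gen_file = fields[0], fields[1]
        if len(fields) == 3:
            info_file = fields[2]
        elif gen_file.endswith(gen_ext):
            info_file = gen_file[:-len(gen_ext)] + info_ext
        else:
            raise ValueError("cannot derive info filename for %s" % gen_file)
        entries.append((chrom, gen_file, info_file))
    return entries
```

A two-column line such as `1 chr1.gen.gz` resolves to the info file `chr1.info`, while a three-column line keeps whatever info filename it names explicitly.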
Users can analyze data imputed with MACH. Because most situations require many files, the format is a single file which contains either pairs of dosage/info files, or, if the two files share the same filename except for extensions, one dosage file per line.
Important: MACH doesn’t provide anywhere to store chromosome and positions. Users may wish to embed this information into the first column inside the .info file. Doing so will allow MVtest to recognize those values and populate the corresponding fields in the report.
To use this feature, users must use the --mach-chrpos flag, and the ID column inside the .info file must be formatted in the following way:
chr:pos (optionally :rsid)
When the --mach-chrpos flag is used, MVtest will fail when it encounters IDs that aren’t in this format; there must be at least 2 ‘fields’ (i.e. there must be at least one “:” character).
When processing MACH imputed data without this special encoding of IDs, MVtest will be unable to recognize positions. As a result, unless the --mach-chrpos flag is present, MVtest will exit with an error if the user attempts to use positional filters such as --from-bp, --chr, etc.
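The ID format expected with --mach-chrpos can be illustrated with a short parser. This is only a sketch: the function `parse_mach_id` is hypothetical, and returning `None` when the optional rsid is missing is an assumption, not documented MVtest behavior.

```python
def parse_mach_id(snp_id):
    """Split an ID of the form chr:pos or chr:pos:rsid.

    Raises ValueError when fewer than two ':'-separated fields are present
    or when the position is not an integer.
    """
    fields = snp_id.split(":")
    if len(fields) < 2:
        raise ValueError("ID %r is not in chr:pos[:rsid] format" % snp_id)
    chrom = fields[0]
    position = int(fields[1])  # int() raises ValueError for non-numeric text
    # Assumption: a missing rsid is reported as None.
    rsid = fields[2] if len(fields) > 2 else None
    return chrom, position, rsid
```

Here `parse_mach_id("1:12345:rs99")` gives `("1", 12345, "rs99")`, whereas a plain `rs99` is rejected because it contains no “:” character.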
When running MVtest using MACH dosage on a cluster, users can instruct a given job to analyze data from a portion of the files contained within the MACH dosage file list by changing the --mach-offset and --mach-count arguments. By default, the offset starts with 1 (the first file in the dosage list) and runs all it finds. However, to split the work so that each job analyzes three dosage files, one might set those values to --mach-offset 1 --mach-count 3 for the first job and --mach-offset 4 --mach-count 3 for the second.
In order to minimize memory requirements, MACH dosage files can be loaded incrementally such that only N loci are stored in memory at a time. This can be controlled using the --mach-chunk-size argument. The larger this number is, the faster MVtest will run (fewer times reading from file) but the more memory is required.
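The offset and count arithmetic for dividing a dosage list among cluster jobs can be sketched as follows. The helper `split_jobs` is hypothetical; it only computes the 1-based pairs that would be passed to --mach-offset and --mach-count.

```python
def split_jobs(total_files, files_per_job):
    """Return a list of (offset, count) pairs covering total_files.

    Offsets are 1-based, matching the file index convention described
    above. The final job may receive fewer files than files_per_job.
    """
    if total_files < 0 or files_per_job < 1:
        raise ValueError("total_files must be >= 0 and files_per_job >= 1")
    jobs = []
    offset = 1
    while offset <= total_files:
        count = min(files_per_job, total_files - offset + 1)
        jobs.append((offset, count))
        offset += count
    return jobs
```

With seven dosage files and three files per job, `split_jobs(7, 3)` gives `[(1, 3), (4, 3), (7, 1)]`; each pair could be formatted as `"--mach-offset {0} --mach-count {1}".format(offset, count)`.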
|File containing list of dosages, one per line. Optionally, lines may contain the info names as well (separated by whitespace) if the two filenames do not share a common base name.|
|Index into the MACH file to begin analyzing|
|Number of dosage files to analyze|
|By default, MACH input is expected to be gzip compressed. If data is plain text, add this flag. It should be noted that dosage and info files should be either both compressed or both uncompressed.|
|Due to the individual orientation of the data, large dosage files are parsed in chunks in order to minimize excessive memory during loading|
|Indicate the extension used by the MACH info files|
|Indicate the extension used by the MACH dosage files|
|Indicate the minimum threshold for the rsquared value from the .info files required for analysis.|
|--mach-chrpos||When set, MVtest expects IDs from the .info file to be in the format chr:pos:rsid (rsid is optional). This will allow the report to contain positional details, otherwise, only the RSID column will have a value which will be the contents of the first column from the .info file|
MACH File Input
When running an analysis on MACH output, users must provide a single file which lists each dosage file and (optionally) the matching .info file. This file is a simple text file with either 1 column (the dosage filename) or 2 (dosage filename followed by the info filename separated by whitespace).
The 2nd column is only required if the filenames aren’t identical except for the extension.
|Col 1 (dosage <filename>)||Col 2 (optional info <filename>)|
Phenotypes and Covariate data can be found inside either the standard pedigree headers or within special PLINK style covariate files. Users can specify phenotypes and covariates using either header names (if a header exists in the file) or by 1 based column indices. An index of 1 actually means the first variable column, not the first column. In general, this will be the 3rd column, since columns 1 and 2 reference FID and IID.
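The 1-based variable indexing described above can be made concrete with a small sketch. The names `variable_to_column` and `select_variables` are hypothetical; the only fact taken from the text is that the first two columns hold FID and IID, so variable 1 is the third column of the file.

```python
ID_COLUMNS = 2  # FID and IID come before the first variable


def variable_to_column(index):
    """Map a 1-based variable index to a 0-based position in a split row."""
    if index < 1:
        raise ValueError("variable indices start at 1")
    return ID_COLUMNS + index - 1


def select_variables(row, indices):
    """Pick the requested variables from one whitespace-delimited row."""
    fields = row.split()
    return [fields[variable_to_column(i)] for i in indices]
```

For the row `FAM1 IND1 23.5 1.2 0`, asking for variables 1 and 3 returns `["23.5", "0"]`, skipping the two ID columns.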
|File containing phenotypes. Unless --all-pheno is present, user must provide either index(s) or label(s) of the phenotypes to be analyzed.|
|--mphenos LIST||Column number(s) for phenotype to be analyzed if number of columns > 1. Comma separated list if more than one is to be used.|
|Name for phenotype(s) to be analyzed (must be in --pheno file). Comma separated list if more than one is to be used.|
|File containing covariates|
|Comma-separated list of covariate indices|
|Comma-separated list of covariate names|
|--sex||Use sex from the pedigree file as a covariate|
|Encoding for missing phenotypes as can be found in the data.|
|--all-pheno||When present, MVtest will run each phenotype found inside the phenotype file.|
Restricting regions for analysis
When specifying a range of positions for analysis, a chromosome must be present. If a chromosome is specified but is not accompanied by a range, the entire chromosome will be used. Only one range can be specified per run.
In general, when specifying region limits, --chr must be defined unless using generic MACH input (which defines neither a chromosome number nor a position, in which case positional restrictions do not apply).
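The rule that a positional range needs a chromosome, together with the base-pair, kilobase and megabase variants of the range flags (--from-bp, --from-kb, --from-mb and their --to counterparts), can be sketched as below. The function `resolve_region` is hypothetical, and converting every boundary to base pairs is an assumption about how such ranges might be normalized rather than a description of MVtest’s code.

```python
# Multipliers for converting each unit to base pairs.
UNITS = {"bp": 1, "kb": 1000, "mb": 1000000}


def resolve_region(chrom=None, start=None, end=None, unit="bp"):
    """Validate a region request and express its bounds in base pairs.

    A start or end without a chromosome is rejected. A chromosome with no
    bounds selects the whole chromosome, signalled here by None bounds.
    """
    if (start is not None or end is not None) and chrom is None:
        raise ValueError("a position range requires a chromosome")
    scale = UNITS[unit]
    low = None if start is None else int(start * scale)
    high = None if end is None else int(end * scale)
    if low is not None and high is not None and low > high:
        raise ValueError("range start is beyond range end")
    return chrom, low, high
```

For instance, `resolve_region("6", 25, 35, "mb")` returns `("6", 25000000, 35000000)`, and `resolve_region("6")` returns `("6", None, None)` to indicate the entire chromosome.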
|--snps LIST||Comma-delimited list of SNP(s): rs1,rs2,rs3-rs6|
|Select Chromosome. If not selected, all chromosomes are to be analyzed.|
|SNP range start|
|SNP range end|
|SNP range start|
|SNP range end|
|SNP range start|
|SNP range end|
|--exclude LIST||Comma-delimited list of rsids to be excluded|
|Comma-delimited list of individuals to be removed from analysis. This must be in the form of family_id:individual_id|
|--maf <float>||Minimum MAF allowed for analysis|
|MAX MAF allowed for analysis|
|MAX per-SNP missing for analysis|
|MAX per-person missing|
|--verbose||Output additional data details in final report|
mvmany Helper script
In addition to the analysis program, mvtest.py, a helper script, mvmany.py is also included and can be used to split large jobs into smaller ones suitable for running on a compute cluster. Users simply run mvmany.py just like they would run mvtest.py but with a few additional parameters, and mvmany.py will build multiple job scripts to run the jobs on multiple nodes. It records most arguments passed to it and will write them to the scripts that are produced.
It is important to note that mvmany.py simply generates cluster scripts and does not submit them.
The Default Template
When mvmany.py is first run, it will generate a copy of the default template inside the user’s home directory named .mv-many.template. This template is used to define the job details that will be written to each of the job scripts. By default, the template is configured for the SLURM cluster software, but can easily be changed to work with any cluster software that works similarly to the SLURM job manager, such as TORQUE/PBS or sungrid.
In addition to being able to replace the preprocessor definitions to work with different cluster manager software, the user can also add user specific definitions, such as email notifications or account specification, giving the user the options necessary to run the software under many different system configurations.
Example Template (SLURM)
An example template might look like the following
#!/bin/bash
#SBATCH --job-name=$jobname
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=$memory
#SBATCH --time=$walltime
#SBATCH --error $logpath/$jobname.e
#SBATCH --output $respath/$jobname.txt
It is important to note that this block of text contains a mix of SLURM preprocessor settings (such as #SBATCH --job-name) as well as variables which will be replaced with appropriate values (such as $jobname being replaced with a string of text which is unique to that particular job). Each cluster type has its own syntax for setting the necessary variables and it is assumed that the user will know how to correctly edit the default template to suit their needs.
Example TORQUE Template
For instance, to use these scripts on a TORQUE based cluster, one might update ~/.mvmany.template to the following
#!/bin/bash
#PBS -N $jobname
#PBS -l nodes=1
#PBS -l ppn=1
#PBS -l mem=$memory
#PBS -l walltime=$walltime
#PBS -e $logpath/$jobname.e
#PBS -o $respath/$jobname.txt
Please note that not all SLURM settings have a direct mapping to PBS settings and that it is up to the user to understand how to properly configure their cluster job headers.
In general, the user should ensure that each of the variables are properly defined so that the corresponding values will be written to the final job scripts. The following variables are replaced based on the job that is being performed and the parameters passed to the program by the user (or their default values):
|$jobname||Unique name for the current job|
|$memory (2G)||Amount of memory to provide each job.|
|$walltime (3:00:00)||Define amount of time to be assigned to jobs|
|$logpath||Directory specified for writing logs|
|$respath||Directory specified for writing results|
|$pwd||current working dir when mvmany is run|
|$body||Statements of execution|
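To show how the variables listed above end up in a job script, the sketch below fills a shortened SLURM-style template using Python’s standard `string.Template`, which happens to use the same `$name` syntax. Whether mvmany.py performs its substitution this way is not stated here, so treat this purely as an illustration. The trailing `cd $pwd` and `$body` lines, the function `render_job`, and all of the values are assumptions.

```python
from string import Template

# Shortened template for illustration. Placing "cd $pwd" and "$body" at the
# end is an assumed layout, not a copy of the default template.
JOB_TEMPLATE = """#!/bin/bash
#SBATCH --job-name=$jobname
#SBATCH --mem=$memory
#SBATCH --time=$walltime
#SBATCH --error $logpath/$jobname.e
#SBATCH --output $respath/$jobname.txt
cd $pwd
$body
"""


def render_job(template_text, values):
    """Replace every $variable in the template; a missing one raises KeyError."""
    return Template(template_text).substitute(values)


# Hypothetical values for a single job.
values = {
    "jobname": "mvtest-chr1",
    "memory": "2G",
    "walltime": "3:00:00",
    "logpath": "logs",
    "respath": "results",
    "pwd": "/home/user/project",
    "body": "mvtest.py --chr 1",
}
script = render_job(JOB_TEMPLATE, values)
```

Because `substitute` raises `KeyError` for any variable left undefined, a typo in an edited template shows up immediately rather than producing a job script that still contains a literal `$variable`.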
Command Line Arguments
mvmany.py exposes the following additional arguments for use when running the script.
|--mv-path PATH||Set path to mvtest.py if it’s not in PATH|
|--logpath PATH||Path to location of job’s error output|
|--res-path PATH||Path to location of job’s results|
|Path for writing script files|
|Specify a template other than the default|
|Specify the number of SNPs to be run at one time|
|--mem STRING||Specify the amount of memory to be requested for each job|
|--wall-time||Specify amount of time to be requested for each job|
The --mem option depends on the type of input being used as well as the configurable options in use. The user should perform basic test runs to determine proper settings for their jobs. By default, 2G is used, which is generally more than adequate for binary pedigrees, IMPUTE and transposed pedigrees. Others will vary greatly based on the size of the dataset and the settings being used.
The --wall-time option is largely machine dependent but will vary based on the actual dataset’s size and completeness of the data. Users should perform spot tests to determine reasonable values. By default, the requested wall-time is 3 days, which is sufficient for a GWAS dataset but probably not sufficient for an entire whole exome dataset; the time required will depend on just how many SNPs are being analyzed by any given node.
In general, mvmany.py accepts all arguments that mvtest.py accepts, with the exception of those that are more appropriately defined by mvmany.py itself. These include the following arguments
--chr --snps --from-bp --to-bp --from-kb --to-kb --from-mb --to-mb
To see a comprehensive list of the arguments that mvmany.py can use simply ask the program itself
Users can have mvmany split certain types of jobs up into pieces and can specify how many independent commands to be run per job. At this time, mvmany.py assumes that imputation data is already split into fragments and doesn’t support running parts of a single file on multiple nodes.
The results generated can be manually merged once all nodes have completed execution.
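One way to merge results by hand is sketched below. It assumes, without confirmation from this manual, that every result file begins with the same single header line followed by one row per locus. The function `merge_reports` is hypothetical.

```python
def merge_reports(reports):
    """Concatenate result files, keeping the header line only once.

    reports is an iterable of line lists, one list per result file.
    Assumption: each non-empty report starts with an identical header.
    """
    merged = []
    header = None
    for lines in reports:
        lines = [line for line in lines if line.strip()]
        if not lines:
            continue  # an empty result file contributes nothing
        if header is None:
            header = lines[0]
            merged.append(header)
        elif lines[0] != header:
            raise ValueError("report header does not match the first report")
        merged.extend(lines[1:])
    return merged
```

In practice each list could come from `open(path).readlines()` for the files written under the results directory, and the merged lines could be written back out with `writelines`.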
- mvtest.py: 1.0.4
- Fixed a bug associated with running more than one phenotype at once.
- mvtest.py: 1.0.3
- Removed special requirements for MACH input (chr:pos) and made that optional.
- mvtest.py: 1.0.2
- added an exception when using improperly formatted MACH info file(s)
- updated documentation to draw attention to the additional MACH info file requirements
- mvtest.py: 1.0.1 released
- changes to the setup.cfg and setup.py to accommodate changes made to work with gh-pages.
- mvtest.py: 1.0.0 released