A Docker based workflow for performing a Plink/fastStructure analysis on DArTseq SNP data, read from an Excel file.
This software seeks to reduce the manual labour involved in preparing DArTseq SNP data in 1 row format for analysis with Plink and fastStructure. LAA is designed specifically for SNP data sets generated by DArTseq, in 1 row format. As such, the input data must use the following encoding provided by DArTseq: “0” = reference allele homozygote, “1” = SNP allele homozygote, “2” = heterozygote, and “-” = double null/null allele homozygote (absence of fragment with SNP in genomic representation). LAA first converts these data into ped and map files for Plink analysis.
Most of the work, besides the mentioned external packages, is done with a Python script. The primary operations performed by the script are:
- Duplicating the input data.
- Performing a substitution on certain characters in both sets of data, in order to create Plink compatible characters (i.e. “-” to “0”).
- Independently indexing both sets of data.
- Combining both sets of data.
- Sorting on the combined index.
- Transposing the combined data.
- Outputting to Plink compatible files.
Whereas these steps would previously have been carried out manually using various software packages, they are now performed automatically.
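As a rough illustration of the steps listed above, the following sketch performs them on a small in-memory data set using only the Python standard library. It is not the script shipped with LAA: the function name `to_allele_rows` and the two substitution tables are assumptions made for this example (only the “-” to “0” substitution is stated above), and the real script may encode alleles differently.

```python
# Hypothetical substitution tables, one per copy of the data.
# Only "-" -> "0" (Plink's missing-allele code) comes from the text above;
# the other codes are illustrative assumptions.
FIRST_ALLELE = {"0": "1", "1": "2", "2": "1", "-": "0"}
SECOND_ALLELE = {"0": "1", "1": "2", "2": "2", "-": "0"}


def to_allele_rows(snp_rows):
    """Turn SNP rows (one row per SNP, one column per individual)
    into one row per individual with two allele columns per SNP."""
    # Duplicate the input and substitute characters in each copy.
    first = [[FIRST_ALLELE[code] for code in row] for row in snp_rows]
    second = [[SECOND_ALLELE[code] for code in row] for row in snp_rows]

    # Index each copy independently as (SNP number, copy number).
    indexed = [((i, 0), row) for i, row in enumerate(first)]
    indexed += [((i, 1), row) for i, row in enumerate(second)]

    # Combine and sort on the index so both alleles of a SNP sit together.
    indexed.sort(key=lambda item: item[0])
    combined = [row for _, row in indexed]

    # Transpose: SNPs become columns, individuals become rows.
    return [list(column) for column in zip(*combined)]


if __name__ == "__main__":
    snps = [["0", "1", "2", "-"], ["2", "0", "-", "1"]]
    for alleles in to_allele_rows(snps):
        print(" ".join(alleles))
```

Each printed line holds the allele columns for one individual, two per SNP, which is the shape the genotype part of a PED line takes.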
In addition to the conversion operation, there are functions to run Plink and fastStructure, passing the data files between the two programs automatically. LAA runs Plink on the generated ped and map files, and the resulting bed, bim and fam files are then passed on to and analysed with fastStructure. The user can choose a maximum K (number of populations) to be analysed by fastStructure. Output files include the meanQ value for each individual, defining the mean probability of belonging to each of the populations K1 to Kx.
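To make the meanQ output more concrete, here is a small sketch that reads such a file and reports the most probable population for each individual. It assumes the file holds one whitespace-separated row per individual with one column per population; the helper names `parse_meanq` and `most_likely_population` and the sample values are made up for illustration.

```python
def parse_meanq(text):
    """Parse meanQ text into a list of rows of floats (one row per individual)."""
    return [
        [float(value) for value in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]


def most_likely_population(row):
    """Return the 1-based population number with the highest mean probability."""
    return max(range(len(row)), key=lambda k: row[k]) + 1


if __name__ == "__main__":
    # Hypothetical values for two individuals and K = 2.
    sample = "0.90  0.10\n0.25  0.75\n"
    for q in parse_meanq(sample):
        print(most_likely_population(q), q)
```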
## Design Decisions
### Why Docker?
Plink is written for Linux based operating systems. As such, on a Linux system all operations could be performed directly, without the need for any kind of virtualisation layer. However, in order to support researchers using Windows based operating systems, the decision was made to leverage Docker virtualisation.
Docker provides a light-weight virtualisation layer enabling Linux software to run on Windows with (relative) ease. It also has the added benefit of providing a cloud based mechanism for disseminating software “images” to users. The advantages of Docker over other systems, like VirtualBox or VMWare, are:
- cloud based distribution of prebuilt images,
- future releases will allow native Docker containers, and
- easy to replicate virtual image creation.
### Why Python?
Python is a powerful and expressive scripting language. It comes with many diverse packages, and has excellent support from developers (for example, fastStructure is written in Python).
When installing on any platform there are a number of requisite dependencies:
If you happen to be installing on Windows, then there are a couple of extra requirements:
- Visual Studio Python compiler
We’ve found that Docker has issues when running on Windows, resulting in faulty data transformation. While you may be able to install LAA on a Windows system, the accuracy of results is likely to be compromised.
To install on Windows, we recommend using a virtual machine running an Ubuntu installation, e.g. VMWare. All steps detailed below under Installation will have to be performed through the virtual machine, including installing Docker.
Begin by installing all of the dependencies for your operating system as listed above.
Once complete, open a system terminal (please see the subsection on system terminals
From an open system terminal, install the LAA Python interface with:
```bash
pip install lizards-are-awesome
```
Next, from a system terminal, download and prepare the laa Docker image. This image contains Plink, fastStructure, and the conversion scripts, all built into a light-weight Alpine Linux image:
```bash
laa init
```
LAA is currently run directly from your operating system terminal. On Linux like operating systems (including Mac OS X), use the system terminal emulator. On Windows operating systems, use the Docker Quickstart Terminal.
### Input Format
LAA accepts XLSX Excel formats and CSV. Unfortunately, XLSX is extremely slow to parse using open source utilities. As such, we recommend converting your Excel data to CSV before use with LAA (simply open the file and save it as CSV using Microsoft Office or open source spreadsheet tools, like LibreOffice).
The data sheet should contain only columns with DArTseq SNP data (i.e. 0, 1, 2 and -); all other columns have to be removed. The first row should contain the name of the population each individual belongs to (e.g. species), and the second row should contain the ID of each individual. All following rows contain the SNP data.
A short, fictitious, example:
<table class="table table-bordered table-hover table-condensed">
<tbody>
<tr><td>Pminima</td> <td>Pminima</td> <td>Pminor</td> <td>Pminima</td> <td>Pminor</td> <td>Pminima</td></tr>
<tr><td>lizard1</td> <td>lizard2</td> <td>lizard15</td> <td>lizard39</td> <td>lizard40</td> <td>lizard44</td></tr>
<tr><td>0</td> <td>1</td> <td>1</td> <td>2</td> <td>1</td> <td>1</td></tr>
<tr><td>0</td> <td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>0</td></tr>
<tr><td>1</td> <td>-</td> <td>1</td> <td>0</td> <td>1</td> <td>1</td></tr>
<tr><td>0</td> <td>0</td> <td>1</td> <td>0</td> <td>-</td> <td>0</td></tr>
<tr><td>2</td> <td>2</td> <td>1</td> <td>1</td> <td>1</td> <td>2</td></tr>
<tr><td>2</td> <td>2</td> <td>1</td> <td>2</td> <td>1</td> <td>0</td></tr>
<tr><td>1</td> <td>1</td> <td>2</td> <td>1</td> <td>2</td> <td>1</td></tr>
<tr><td>1</td> <td>1</td> <td>1</td> <td>2</td> <td>0</td> <td>1</td></tr>
<tr><td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td> <td>0</td></tr>
<tr><td>-</td> <td>1</td> <td>2</td> <td>1</td> <td>1</td> <td>1</td></tr>
</tbody>
</table>
And, in CSV format:
```csv
Pminima,Pminima,Pminor,Pminima,Pminor,Pminima
lizard1,lizard2,lizard15,lizard39,lizard40,lizard44
0,1,1,2,1,1
0,0,0,1,0,0
1,-,1,0,1,1
0,0,1,0,-,0
2,2,1,1,1,2
2,2,1,2,1,0
1,1,2,1,2,1
1,1,1,2,0,1
0,0,0,0,0,0
-,1,2,1,1,1
```
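The layout described above can be checked before running LAA. The following sketch is not part of LAA; `read_dartseq_csv` is a hypothetical helper that splits the sheet into population names, individual IDs and SNP rows, and rejects any value other than 0, 1, 2 and -.

```python
import csv
import io

VALID_CODES = {"0", "1", "2", "-"}


def read_dartseq_csv(text):
    """Split CSV text into (populations, ids, snp_rows) and validate it."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if len(rows) < 3:
        raise ValueError("expected a population row, an ID row and SNP rows")

    populations, ids, snp_rows = rows[0], rows[1], rows[2:]
    if len(populations) != len(ids):
        raise ValueError("population and ID rows differ in length")

    # SNP data starts on the third non-empty row of the sheet.
    for number, row in enumerate(snp_rows, start=3):
        if len(row) != len(ids):
            raise ValueError(f"row {number}: expected {len(ids)} columns")
        unexpected = set(row) - VALID_CODES
        if unexpected:
            raise ValueError(f"row {number}: unexpected {sorted(unexpected)}")

    return populations, ids, snp_rows


if __name__ == "__main__":
    sample = "Pminima,Pminor\nlizard1,lizard15\n0,1\n-,2\n"
    print(read_dartseq_csv(sample))
```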
All LAA commands must be run from the same directory you have your CSV input file in. For the purpose of the examples, let’s say we have an input file, `input.csv`, in the directory `/c/workspace/data`:
```bash
cd /c/workspace/data
```
To perform the complete process, including conversion, Plink, fastStructure and analysing for K values, you can just run:
```bash
laa all input.csv --maxk=5
```
`--maxk=5` may be replaced with a suitable value for the maximum K value to be analysed.
This will produce a range of files in the current working directory corresponding to the outputs of the conversion, Plink, and fastStructure.
Converting the input data will perform recombination, transposition, output to a PED file, and also generation of a suitable mapping file:
```bash
laa convert input.csv output.ped
```
This will generate two files: `output.ped` and `output.map`. These files are suitable for use with Plink.
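For readers unfamiliar with these formats, the sketch below shows what lines in such files look like under the standard Plink column layout. How LAA fills the non-genotype columns is not stated here, so the choices in this example (population as family ID, unknown parents and sex, missing phenotype, chromosome 0, generated SNP names and positions) and the helper names `ped_line` and `map_lines` are assumptions.

```python
def ped_line(population, individual, alleles):
    """Build one PED line: six leading columns, then two alleles per SNP."""
    # Family ID, individual ID, paternal ID, maternal ID, sex, phenotype.
    # "0" means unknown parent or sex; "-9" marks a missing phenotype.
    leading = [population, individual, "0", "0", "0", "-9"]
    return " ".join(leading + list(alleles))


def map_lines(n_snps):
    """Build one MAP line per SNP:
    chromosome, SNP identifier, genetic distance, base-pair position."""
    # Chromosome 0 means "unplaced"; names and positions are placeholders.
    return [f"0 snp{i} 0 {i}" for i in range(1, n_snps + 1)]


if __name__ == "__main__":
    print(ped_line("Pminima", "lizard1", ["1", "1", "1", "2"]))
    print("\n".join(map_lines(2)))
```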
To process the converted input files with Plink, run:
```bash
laa plink output.ped
```
To process the Plink outputs with fastStructure, run:
```bash
laa fast output
```
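If you prefer to drive the individual steps from a script rather than typing them one by one, something along these lines may work. It only wraps the three commands shown above with Python's `subprocess` module; the function names are hypothetical, and it assumes `laa` is on your `PATH` and that the working directory contains the input file.

```python
import subprocess


def pipeline_commands(input_csv, prefix):
    """Return the three laa commands shown above, in the order they must run."""
    return [
        ["laa", "convert", input_csv, f"{prefix}.ped"],
        ["laa", "plink", f"{prefix}.ped"],
        ["laa", "fast", prefix],
    ]


def run_pipeline(input_csv, prefix, cwd="."):
    """Run each step in turn; check=True stops at the first failing step."""
    for command in pipeline_commands(input_csv, prefix):
        subprocess.run(command, cwd=cwd, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them.
    for command in pipeline_commands("input.csv", "output"):
        print(" ".join(command))
```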
### K Choice
To run fastStructure a number of times, and then choose an appropriate K value, run:
```bash
laa choosek output --maxk=5
```
`--maxk=5` may be replaced with a suitable value for the maximum K value to be analysed.
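One common criterion for choosing K is to prefer the run with the highest marginal likelihood. The sketch below illustrates only that idea; it is not necessarily the criterion `laa choosek` applies, and the function name `best_k` and the likelihood values are hypothetical.

```python
def best_k(marginal_likelihoods):
    """Return the K whose run has the highest marginal likelihood.

    Ties go to the smaller K, i.e. the simpler model.
    """
    if not marginal_likelihoods:
        raise ValueError("no runs to choose from")
    return min(
        marginal_likelihoods,
        key=lambda k: (-marginal_likelihoods[k], k),
    )


if __name__ == "__main__":
    # Hypothetical marginal likelihoods for K = 1..4.
    runs = {1: -0.95, 2: -0.80, 3: -0.80, 4: -0.85}
    print(best_k(runs))
```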
## Getting Help
Help is always available from the command-line. To get a printout of available commands, run:
```bash
laa -h
```
You may also get help for a specific command with something like:
```bash
laa convert -h
```
`convert` may be replaced with the command for which help is sought.