Pastrami is a novel, scalable computational algorithm for rapid human ancestry estimation at population-, subcontinental- and continental-levels. Pastrami works on two key methodologies: exact haplotype matching and non-negative least square (NNLS) optimization.
Project description
Pastrami
Pastrami is a novel, scalable computational algorithm for rapid human ancestry estimation at population-, subcontinental- and continental-levels. Pastrami works on two key methodologies: exact haplotype matching and non-negative least square (NNLS) optimization.
Codebase stage: development
Developers and maintainers: Andrew Conley, Lavanya Rishishwar, Shivam Sharma, Sonali Gupta
Testers: Lavanya Rishishwar, Shivam Sharma, Emily Norris
Recommended installation methods
conda install -c bioconda pastrami
pip install pastrami
Git based installation (not recommended)
Plink2 installation
Plink2 installation is quite straightforward. Select the appropriate binary for your processor architecture from this page:https://www.cog-genomics.org/plink/2.0/. An example installation is shown below for Intel 64-bit architecture:
# Download the zip file
wget http://s3.amazonaws.com/plink2-assets/plink2_linux_x86_64_20210203.zip
# Unzip the zip file
unzip plink2_linux_x86_64_20210203.zip
# Optionally, create a local bin and add it to your PATH (ignore these steps if you know what you are doing)
mkdir -p ~/bin
echo "PATH=$HOME/bin:$PATH" >> ~/.bashrc
source ~/.bashrc
# Move the file to your local bin
mv plink2 ~/bin
# Make sure there is an executable by the name plink
ln -s ~/bin/plink2 ~/bin/plink
# Create a conda environment with Python version = 3.8
conda create -n pastrami python=3.8
# Activate the newly created conda environment
conda activate pastrami
# Install core packages
conda install -c conda-forge pathos numpy pandas
or
pip install pathos numpy pandas
# Download Pastrami
git clone https://github.com/healthdisparities/pastrami
# Give it executable permissions
cd pastrami
chmod +x pastrami.py
# Run Pastrami
./pastrami.py
# Deactivate environment, when not using pastrami
conda deactivate pastrami
Dependencies
Pastrami requires the following OS/programs/modules to run:
- Linux/Mac OS - tested on RedHat OS, expected to run on Mac environments, will possibly work on WSL though Mac and WSL haven't been tested so far
- Plink2 (recommended over Plink, but both will work) - https://www.cog-genomics.org/plink/2.0/; should be accessibly through $PATH
- Python version 3.8 or higher
- Python libraries = Pathos, numpy, scipy, pandas
Quickstart guide
This section will be populated in near future with small example datasets and commands for analyzing them.
Basic usage
General program structure overview
Pastrami is a five step process that can be run as a single command or using a set of sequential commands. Each of the five step within Pastrami can be accessed through the following subcommands:
- hapmake: Create haplotype map file
- build: Create a database from reference genotype file (in TPED/TFAM format)
- query: Query the database using a query genotype file (in TPED/TFAM format)
- coanc: Create copying fraction matrix by performing pairwise individual comparisons
- aggregate: Estimate ancestral fractions and aggregate them into fine-grain and sub-continental ancestry
usage: pastrami.py [--help] {all,hapmake,build,query,coanc,aggregate} ...
pastrami.py - population scale haplotype copying
optional arguments:
--help, -h, --h
pastrami.py commands:
{all,hapmake,build,query,coanc,aggregate}
all Perform full analysis
hapmake Create haplotypes
build Build reference set
query Query reference set
coanc Individual v. individual co-ancestry
aggregate Aggregate results and estimate ancestries
These steps are all grouped within the all subcommand.
Input file format
The genotype and individual information is expected in TPED/TFAM format. These formats are described at length within Plink's documentation and will not be reiterated here for brevity's sake.
- TPED format description: https://www.cog-genomics.org/plink/1.9/formats#tped
- TFAM format description: https://www.cog-genomics.org/plink/1.9/formats#tfam
If your files are in VCF format, you can convert them into TPED/TFAM format using the following plink command:
plink --vcf input.vcf.gz --recode transpose --out output_prefix
A typical run
The all subcommand is the one that you will use if you have a set of reference individuals and a set of query individuals (presumably with some admixture) and you are interested in performing a one-off ancestry estimation analysis:
# Consolidated workflow:
./pastrami.py all --reference-prefix reference --query-prefix query --pop-group pop2group.txt --haplotype chrom.hap --out-prefix full_run -v --threads 20
The full description of each commandline option is provided below and can be access from within the script by running the "all" subcommand without any arguments:
usage: pastrami.py all [-h] [--reference-prefix <PREFIX>] [--query-prefix <TSV>] [--haplotypes <TSV>]
[--pop-group pop2group.txt] [--out-prefix out-prefix] [--map-dir maps/] [--min-snps MinSNPs]
[--max-snps MaxSNPs] [--max-rate MaxRate] [--per-individual] [--log-file run.log] [--threads N]
[--verbose]
optional arguments:
-h, --help show this help message and exit
Required Input options:
--reference-prefix <PREFIX> Prefix for the reference TPED/TFAM input files
--query-prefix <TSV> Prefix for the query TPED/TFAM input files
--haplotypes <TSV> File of haplotype positions
--pop-group pop2group.txt File containing population to group (e.g., tribes to region) mapping
Output options:
--out-prefix out-prefix Output prefix (default: pastrami) for creating following sets of files.
<prefix>.pickle, <prefix>_query.tsv, <prefix>.tsv, <prefix>.fam,
<prefix>_fractions.Q, <prefix>_paintings.Q, <prefix>_estimates.Q,
<prefix>_fine_grain_estimates.Q
Optional arguments for hapmake stage:
--map-dir maps/ Directory containing genetic maps: chr1.map, chr2.map, etc
--min-snps MinSNPs Minimum number of SNPs in a haplotype block (default: 7)
--max-snps MaxSNPs Maximum number of SNPs in a haplotype block (default: 20)
--max-rate MaxRate Maximum recombination rate (default: 0.3)
Runtime options:
--per-individual Generate per-individual copying rather than per-population copying
--log-file run.log, -l run.log, --l run.log File containing log information (default: run.log)
--threads N, -t N Number of concurrent threads (default: 4)
--verbose, -v Print program progress information on screen
Consider an example scenario where you want to assess the European ancestry estimates of a group of query individuals. If your European reference files are named euro_ref.tped, euro_ref.tfam and your query individual files are named my_query.tped, my_query.tfam, and your population to group mapping is stored within the pop2group.txt file (described more below), your all subcommand will look as follows:
./pastrami.py all --reference-prefix euro_ref --query-prefix my_query --pop-group pop2group.txt --haplotype chrom.hap --out-prefix full_run -v --threads 20
I prefer to run pastrami the the verbose flag and my hardware can support 20 threads. Feel free to turn off the verbose flag if you so desire, and change the number of threads in accordance with your hardware capability. This command will create the following outputs (listed in order of importance):
- full_run_estimates.Q: ADMIXTURE-style ancestry fraction estimates at group level. This is the output results that you are interested in.
- full_run_fine_grain_estimates.Q: ADMIXTURE-style ancestry fraction estimates at population level. Population level estimates are generally harder to tease apart and are likely to be less reliable. These results are aggregated in full_run_estimates.Q file.
- full_run.pickle: this is a Python pickle object that stores the processed reference file (consider this as the "database"). Save this file if you'd like to continue query against the same database in future.
- full_run_query.tsv: this is your processed query file that can be used. This file is not currently being used in any process and will be removed in future.
- full_run.tsv: this is the copying fractions file that contains all the reference and query individuals. This file is not currently being used in any process and will be removed in future.
- full_run.fam: these are the individuals that were processed as part of this run
- full_run_fractions.Q: these are the ancestry fractions (not estimates) from which the final estimates are calculated from. This file is not currently being used in any process and will be removed in future.
- full_run_paintings.Q: these are the ancestry paintings which are used alongside ancestry fractions to calculate ancestry estimates. This file is not currently being used in any process and will be removed in future.
Stepwise execution
Pastrami is designed in a way that the reference files can be processed once and utilized many times in future. For the following use-case, consider that your reference individuals are African in origin and are stored in afro_ref.tped and afro_ref.tfam files. Your query individuals are still stored in my_query.tped and my_query.tfam files. Your population-to-group mapping is stored in a file named pop2group.txt. The following series of commands will allow you the pipeline in steps and save the intermediate outputs.
# step 1: build a haplotype file
./pastrami.py hapmake --map-dir map_data --haplotypes african.hap -v
# This step shouldn't take more than a few seconds, the remaining step will be time consuming in nature
# step 2: build database (pickle file)
./pastrami.py build --reference-pickle african.pickle --reference-prefix afro_ref --haplotypes african.hap -v --threads 20
# african.pickle is the output file
# step 3: querying the database
./pastrami.py query --reference-pickle african.pickle --query-prefix my_query --combined-out african.tsv --query-out african_query.tsv -v --threads 20
# step 4: aggregate the results
## combine the individual list for aggregate subcommand
cat afro_ref.tfam my_query.tfam > input.tfam
## run the subcommand
./pastrami.py aggregate --pop-group pop2group.txt --pastrami-output african.tsv --pastrami-fam input.tfam --out-prefix aggregate_output -v --threads 20
Pop-group mapping
Based on general expectation and our experience in this area, deconvoluting population-level ancestry is very hard (and at times not possible) when populations are genetically extremely close. In these cases, we strongly recommend grouping closely related populations into groups. E.g., grouping African tribes into subcontinental African groups. This information can be supplied to Pastrami in a simple tab-separated file (pop2group.txt file in the previous examples). The structure of the file looks like this:
#Population Group
GWD Western
MSL Western
Yacouba Western
Ahizi Western
Fon Nigerian
Bariba Nigerian
Yoruba Nigerian
The whitespace separating the column is tab (shown as spaces here for formatting reasons). Any lines starting with # are ignored.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file pastrami-0.9.6.tar.gz
.
File metadata
- Download URL: pastrami-0.9.6.tar.gz
- Upload date:
- Size: 35.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.13
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | c170a12b80bc947858697eafd04bfe8aa643cd4a6655843e937174c4aea56a4f |
|
MD5 | dabb82fc92be1c32c13137703e706af7 |
|
BLAKE2b-256 | 88c8342450bfd539535f7a1e14b0d4eaa5bcb2eb9343200190173a3370d8d593 |
File details
Details for the file pastrami-0.9.6-py3-none-any.whl
.
File metadata
- Download URL: pastrami-0.9.6-py3-none-any.whl
- Upload date:
- Size: 35.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.13
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | b5ab02f9cef0b15f0bc072d7a7ce14251db5009d6bf80f5f048d4438a10dd828 |
|
MD5 | b842757e1987d1c3226ea63a0e84f197 |
|
BLAKE2b-256 | 8ec22d47b93a184a85815d27d9f5dbb62a2843bf32fe0b1de4c507db9964d64d |