Skip to main content

Find introgressed segments

Project description

Introgression detection

These are the scripts needed to infere archaic introgression in modern populations using an unadmixed outgroup.

Installation

Run the following to install:

conda create --name HMM -c bioconda python=3.7 numpy numba scipy vcftools bcftools

Overview of model

The way the model works is by removing variation found in an outgroup population and then using the remaining variants to group the genome into regions of different variant density. If the model works well we would expect that introgressed regions have higher variant density than non-introgressed - because they have spend more time accumulation variation that is not found in the outgroup.

An example on simulated data is provided below:

het_vs_archaic

In this example we zoom in on 1 Mb of simulated data for a haploid genome. The top panel shows the coalescence times with the outgroup across the region and the green segment is an archaic introgressed segment. Notice how much more deeper the coalescence time with the outgroup is. The second panel shows that probability of being in the archaic state. We can see that the probability is much higher in the archaic segment, demonstrating that in this toy example the model is working like we would hope. The next panel is the snp density if you dont remove all snps found in the outgroup. By looking at this one cant tell where the archaic segments begins and ends, or even if there is one. The bottom panel is the snp density when all variation in the outgroup is removed. Notice that now it is much clearer where the archaic segment begins and ends!

The method is now published in PlosGenetics and can be found here: Detecting archaic introgression using an unadmixed outgroup This paper is describing and evaluating the method.

Usage

Script for identifying introgressed archaic segments

Turorial:
myHMM make_test_data 
myHMM train  -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=Initialguesses.json -out=trained.json 
myHMM decode -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=trained.json


Turorial with 1000 genomes data:
myHMM create_outgroup -ind=individuals.json -vcf=*.bcf -weights=strickmask.bed -out=outgroup.txt -ancestral=homo_sapiens_ancestor_GRCh37_e71/homo_sapiens_ancestor_*.fa
myHMM mutation_rate -outgroup=outgroup.txt  -weights=strickmask.bed -window_size=1000000 -out mutationrate.bed
myHMM create_ingroup  -ind=individuals.json -vcf=*.bcf -weights=strickmask.bed -out=obs -outgroup=outgroup.txt -ancestral=homo_sapiens_ancestor_GRCh37_e71/homo_sapiens_ancestor_*.fa

myHMM train  -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -out=trained.HG00096.json 
myHMM decode -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -param=trained.HG00096.json 

Quick tutorial

Lets make some test data and start using the program.

> myHMM make_test_data
making test data...
creating 2 chromosomes with 50 Mb of test data (100K bins) with the following parameters..

State names: ['Human' 'Archaic']
Starting_probabilities: [0.98 0.02]
Transition matrix: 
[[9.999e-01 1.000e-04]
 [2.000e-02 9.800e-01]]
Emission values: [0.04 0.4 ]

This will generate 4 files, obs.txt, weights.bed, mutrates.bed and Initialguesses.json.

obs.txt. These are the mutation that are left after removing variants which are found in the outgroup.

chrom  pos     ancestral_base  genotype
chr1   5212    A               AG
chr1   32198   A               AG
chr1   65251   C               CG
chr1   117853  A               AG
chr1   122518  T               TC
chr1   142322  T               TC
chr1   144695  C               CG
chr1   206370  T               TG
chr1   218969  A               AT

weights.bed. This is the parts of the genome that we can accurately map to - in this case we have simulated the data and can accurately access the entire genome.

chr1	1	50000000
chr2	1	50000000

mutrates.bed. This is the normalized mutation rate across the genome (in bins of 1 Mb).

chr1  0        1000000   1
chr1  1000000  2000000   1
chr1  2000000  3000000   1
chr1  3000000  4000000   1
chr1  4000000  5000000   1
chr1  5000000  6000000   1
chr1  6000000  7000000   1
chr1  7000000  8000000   1
chr1  8000000  9000000   1
chr1  9000000  10000000  1

Initialguesses.json. This is our initial guesses when training the model - note these are different from those we simulated from.

{
  "state_names": ["Human","Archaic"],
  "starting_probabilities": [0.5,0.5],
  "transitions": [[0.99,0.01],[0.02,0.98]],
  "emissions": [0.03,0.3]
}

We can find the best fitting parameters using BaumWelsch training - note you can try to ommit the weights and mutrates arguments. Since this is simulated data the mutation is constant across the genome and we can asses the entire genome. Also notice how the parameters approach the parameters the data was generated from (jubii).

> myHMM train  -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=Initialguesses.json -out=trained.json

iteration    loglikelihood  start1  start2  emis1   emis2   trans1_1  trans2_2
0            -18123.4475    0.5     0.5     0.03    0.3     0.99      0.98
1            -17506.0219    0.96    0.04    0.035   0.2202  0.9969    0.9242
2            -17487.797     0.971   0.029   0.0369  0.2235  0.9974    0.9141
3            -17477.1367    0.976   0.024   0.0375  0.2404  0.9978    0.9105
4            -17466.6961    0.98    0.02    0.0379  0.2627  0.9982    0.9102
5            -17456.1508    0.983   0.017   0.0382  0.2877  0.9985    0.9123
6            -17445.5098    0.986   0.014   0.0385  0.3146  0.9988    0.9172
7            -17434.9006    0.988   0.012   0.0388  0.3426  0.9991    0.9248
8            -17424.6966    0.99    0.01    0.039   0.3705  0.9993    0.9348
9            -17415.623     0.991   0.009   0.0393  0.3968  0.9995    0.9464
10           -17408.6263    0.992   0.008   0.0395  0.4195  0.9997    0.9579
11           -17404.3022    0.993   0.007   0.0396  0.4367  0.9998    0.9673
12           -17402.299     0.994   0.006   0.0397  0.4477  0.9998    0.9738
13           -17401.6146    0.994   0.006   0.0398  0.4537  0.9999    0.9774
14           -17401.4336    0.994   0.006   0.0398  0.4566  0.9999    0.9793
15           -17401.3933    0.994   0.006   0.0398  0.4578  0.9999    0.9802
16           -17401.3851    0.994   0.006   0.0398  0.4584  0.9999    0.9806
17           -17401.3835    0.994   0.006   0.0398  0.4586  0.9999    0.9807
18           -17401.3832    0.994   0.006   0.0398  0.4587  0.9999    0.9808
19           -17401.3832    0.994   0.006   0.0398  0.4588  0.9999    0.9808


# run without mutrate and weights (only do this for simulated data)
> myHMM train  -obs=obs.txt -param=Initialguesses.json -out=trained.json

We can now decode the data with the best parameters that maximize the likelihood and find the archaic segments:

> myHMM decode -obs=obs.txt -weights=weights.bed -mutrates=mutrates.bed -param=trained.json

chrom        start     end       length    state    snps  mean_prob
chr1         0         7233000   7234000   Human    287   0.9995
chr1         7234000   7246000   13000     Archaic  9     0.90427
chr1         7247000   21618000  14372000  Human    610   0.99946
chr1         21619000  21673000  55000     Archaic  22    0.9697
chr1         21674000  26859000  5186000   Human    204   0.99878
chr1         26860000  26941000  82000     Archaic  36    0.971
chr1         26942000  49989000  23048000  Human    863   0.99982
chr2         0         6793000   6794000   Human    237   0.99972
chr2         6794000   6822000   29000     Archaic  14    0.95461
chr2         6823000   12646000  5824000   Human    244   0.99927
chr2         12647000  12745000  99000     Archaic  55    0.97413
chr2         12746000  15461000  2716000   Human    125   0.99881
chr2         15462000  15547000  86000     Archaic  38    0.93728
chr2         15548000  32626000  17079000  Human    709   0.99951
chr2         32627000  32695000  69000     Archaic  31    0.98305
chr2         32696000  41087000  8392000   Human    360   0.9995
chr2         41088000  41178000  91000     Archaic  43    0.96093
chr2         41179000  49952000  8774000   Human    328   0.9979
chr2         49953000  49977000  25000     Archaic  13    0.98501

# Again here you could ommit weights and mutationrates. Actually one could also ommit trained.json because then the model defaults to using the parameters we used the generated the data
> myHMM decode -obs=obs.txt

Example with 1000 genomes data

I thought it would be nice to have an entire reproduceble example of how to use this model. From a common starting point such as a VCF file (well a BCF file in this case) to the final output. The reason for using BCF files is because it is MUCH faster to extract data for each individual. You can convert a vcf file to a bcf file like this:

bcftools view file.vcf -l 1 -O b > file.bcf
bcftools index file.bcf

In this example I will analyse an individual (HG00096) from the 1000 genomes project phase 3.

First we will need to know which 1) bases can be called in the genome and 2) which variants are found in the outgroup. So let's start out by downloading the files from the following directories. To download callability regions, ancestral alleles information, ingroup outgroup information call this command:

# bcffiles (hg19)
ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/bcf_files/

# callability (remember to remove chr in the beginning of each line to make it compatible with hg19 e.g. chr1 > 1)
ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/accessible_genome_masks/20141020.strict_mask.whole_genome.bed
sed 's/^chr\|%$//g' 20141020.strict_mask.whole_genome.bed | awk '{print $1"\t"$2"\t"$3}' > strickmask.bed

# outgroup information
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel

# Ancestral information
ftp://ftp.ensembl.org/pub/release-74/fasta/ancestral_alleles/homo_sapiens_ancestor_GRCh37_e71.tar.bz2

For this example we will use all individuals from 'YRI','MSL' and 'ESN' as outgroup individuals. While we will only be decoding hG00096 in this example you can add as many individuals as you want to the ingroup.

{
  "ingroup": [
    "HG00096",
    "HG00097"
  ],
  "outgroup": [
    "HG02922",
    "HG02923",
    ...
    "HG02944",
    "HG02946"]
}

This is how we would call archaic segments in an individual. First we need to find a set of variants found in the outgroup. We can use the wildcard character to loop through all bcf files. If you dont have ancestral information you can skip the ancestral argument.

(took an hour) > myHMM create_outgroup -ind=individuals.json -vcf=*.bcf -weights=strickmask.bed -out=outgroup.txt -ancestral=homo_sapiens_ancestor_GRCh37_e71/homo_sapiens_ancestor_*.fa
(took 30 sec) > myHMM mutation_rate -outgroup=outgroup.txt  -weights=strickmask.bed -window_size=1000000 -out mutationrate.bed

Keep variants that are not found to be derived in the outgroup for each individual in ingroup. You can also speficy a single individual or a comma separated list of individuals.

(took 20 min) > myHMM create_ingroup  -ind=individuals.json -vcf=*.bcf -weights=strickmask.bed -out=obs -outgroup=outgroup.txt -ancestral=homo_sapiens_ancestor_GRCh37_e71/homo_sapiens_ancestor_*.fa

(took 20 min) > myHMM create_ingroup  -ind=HG00096 -vcf=*.bcf -weights=strickmask.bed -out=obs -outgroup=outgroup.txt -ancestral=homo_sapiens_ancestor_GRCh37_e71/homo_sapiens_ancestor_*.fa

(took 20 min) > myHMM create_ingroup  -ind=HG00096,HG00097 -vcf=*.bcf -weights=strickmask.bed -out=obs -outgroup=outgroup.txt -ancestral=homo_sapiens_ancestor_GRCh37_e71/homo_sapiens_ancestor_*.fa

Now for training the HMM parameters and decoding

(took 3 min) > myHMM train  -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -out=trained.HG00096.json 

iteration  loglikelihood  start1  start2  emis1   emis2   trans1_1  trans2_2
0          -538494.225    0.98    0.02    0.04    0.4     0.9999    0.98
1          -531761.7479   0.957   0.043   0.0543  0.385   0.9994    0.9859
2          -531645.8243   0.955   0.045   0.0545  0.3911  0.9992    0.9836
3          -531598.8719   0.955   0.045   0.0544  0.3945  0.9991    0.9818
4          -531577.0133   0.954   0.046   0.0542  0.3961  0.9991    0.9806
5          -531566.2204   0.954   0.046   0.0541  0.3968  0.999     0.9797
6          -531560.6752   0.954   0.046   0.054   0.3972  0.999     0.979
7          -531557.7466   0.954   0.046   0.054   0.3973  0.999     0.9786
8          -531556.1697   0.954   0.046   0.0539  0.3973  0.9989    0.9782
9          -531555.3088   0.954   0.046   0.0539  0.3972  0.9989    0.978
10         -531554.8342   0.953   0.047   0.0539  0.3971  0.9989    0.9778
11         -531554.5707   0.953   0.047   0.0539  0.3971  0.9989    0.9777
12         -531554.4236   0.953   0.047   0.0538  0.397   0.9989    0.9776
13         -531554.3412   0.953   0.047   0.0538  0.3969  0.9989    0.9775
14         -531554.295    0.953   0.047   0.0538  0.3969  0.9989    0.9774
15         -531554.2689   0.953   0.047   0.0538  0.3969  0.9989    0.9774
16         -531554.2542   0.953   0.047   0.0538  0.3968  0.9989    0.9774
17         -531554.2459   0.953   0.047   0.0538  0.3968  0.9989    0.9773
18         -531554.2412   0.953   0.047   0.0538  0.3968  0.9989    0.9773
19         -531554.2386   0.953   0.047   0.0538  0.3968  0.9989    0.9773
20         -531554.2371   0.953   0.047   0.0538  0.3968  0.9989    0.9773
21         -531554.2363   0.953   0.047   0.0538  0.3968  0.9989    0.9773
22         -531554.2358   0.953   0.047   0.0538  0.3968  0.9989    0.9773
23         -531554.2355   0.953   0.047   0.0538  0.3968  0.9989    0.9773
24         -531554.2354   0.953   0.047   0.0538  0.3968  0.9989    0.9773
25         -531554.2353   0.953   0.047   0.0538  0.3968  0.9989    0.9773
(took 30 sec) > myHMM decode -obs=obs.HG00096.txt -weights=strickmask.bed -mutrates=mutationrate.bed -param=trained.HG00096.json 

chrom  start    end      length   state    snps  mean_prob
1      0        2788000  2789000  Human    97    0.9816
1      2789000  2793000  5000     Archaic  3     0.52798
1      2794000  2988000  195000   Human    11    0.96136
1      2989000  2996000  8000     Archaic  6     0.62345
1      2997000  3424000  428000   Human    33    0.99159
1      3425000  3450000  26000    Archaic  22    0.97091
1      3451000  4267000  817000   Human    42    0.98602
1      4268000  4269000  2000     Archaic  2     0.50461
1      4270000  4301000  32000    Human    2     0.80669
1      4302000  4359000  58000    Archaic  20    0.7938
1      4360000  4499000  140000   Human    5     0.97461
1      4500000  4509000  10000    Archaic  7     0.8114
1      4510000  5338000  829000   Human    44    0.98449
1      5339000  5348000  10000    Archaic  6     0.71886
1      5349000  9319000  3971000  Human    173   0.99088

The first Archaic segment with a high mean posterior probability is from 3,425,000 to 3,450,000. And this segment has also been found be other methods:

# Conditional random field (S. Sankararaman - 2014)
chr1     3,421,722-3,450,286 HG00096
# Sstar (B. Vernot - 2016)
chr1     3,427,298-3,461,813 HG00096
# Sprime (S. Browning - 2018)
chr1     3,418,794-3,457,377 HG00096
# HMM (L. Skov 2018)
chr1     3,425,000-3,450,000 HG00096

And that is it! Now you have run the model and gotten a set of parameters that you can interpret biologically (see my paper) and you have a list of segments that belong to the human and Archaic state.

If you have any questions about the use of the scripts, if you find errors or if you have feedback you can contact my here (make an issue) or write to:

lauritsskov2@gmail.com

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

myHmm-0.0.1.tar.gz (33.4 kB view hashes)

Uploaded Source

Built Distribution

myHmm-0.0.1-py3-none-any.whl (29.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page