Haplotype-based inference of recent effective population size in modern and ancient DNA samples

HapNe

Summary

  1. Prerequisites
  2. HapNe-LD
  3. HapNe-IBD
  4. Analyses of ancient samples
  5. How to cite

1. Prerequisites

Some pre-processing features require plink1.9 and plink2 to be installed. HapNe assumes that the commands plink and plink2 work in the terminal.
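A quick way to confirm that both commands are reachable before starting a run is to look them up on the PATH. The sketch below uses only the Python standard library; the helper name missing_tools is illustrative and not part of HapNe.

```python
import shutil


def missing_tools(tools=("plink", "plink2")):
    """Return the commands from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("plink and plink2 are both available")
```

An empty list means both commands resolve in the current shell environment, which is what HapNe assumes.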

All functionalities have been tested on macOS and Linux within the following conda environment:

name: HapNe
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python
  - pytest
  - numpy
  - pandas
  - plink
  - plink2
  - flake8
  - numba

We strongly encourage installing HapNe within this environment by running: conda env create --file conda_environment.yml

2. HapNe-LD

HapNe-LD can be run by adapting the following config file:

[CONFIG]
vcf_file=data
keep=data.keep
map=genetic_map_chr@_combined_b37.txt
pseudo_diploid=False
output_folder=HapNe/data
population_name=POP
genome_build=grch37

  • vcf_file: path to the vcf file (without the .vcf.gz extension)
  • keep (optional): samples to keep, useful to filter out relatives
  • map: path to the genetic maps
  • pseudo_diploid: False for modern data, True for ancient data
  • output_folder: folder where the results will be saved
  • population_name: name of the analysis
  • genome_build: genome build used (grch37 (default) or grch38)
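Before launching a long run, it can help to check the config file with the same ConfigParser class the scripts use. The following is a minimal sketch, not HapNe's own validation: the helper load_settings and the choice of which keys to treat as required are assumptions based on the list above (keep is optional, genome_build defaults to grch37).

```python
from configparser import ConfigParser

# Assumption: every key except `keep` and `genome_build` is treated as required.
REQUIRED_KEYS = ("vcf_file", "map", "pseudo_diploid", "output_folder", "population_name")

EXAMPLE = """
[CONFIG]
vcf_file=data
keep=data.keep
map=genetic_map_chr@_combined_b37.txt
pseudo_diploid=False
output_folder=HapNe/data
population_name=POP
genome_build=grch37
"""


def load_settings(text):
    """Parse a HapNe-style config string and return the [CONFIG] values as a dict."""
    config = ConfigParser()
    config.read_string(text)
    section = config["CONFIG"]
    missing = [key for key in REQUIRED_KEYS if key not in section]
    if missing:
        raise KeyError("missing config keys: " + ", ".join(missing))
    build = section.get("genome_build", "grch37")  # grch37 is the documented default
    if build not in ("grch37", "grch38"):
        raise ValueError("genome_build must be grch37 or grch38, got " + build)
    return {
        "vcf_file": section["vcf_file"],
        "keep": section.get("keep"),  # optional, None when absent
        "map": section["map"],
        "pseudo_diploid": section.getboolean("pseudo_diploid"),
        "output_folder": section["output_folder"],
        "population_name": section["population_name"],
        "genome_build": build,
    }


if __name__ == "__main__":
    print(load_settings(EXAMPLE))
```

Using getboolean for pseudo_diploid means values such as False, false, or no are all read as a Python False rather than as a non-empty string.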

The analysis can be run using a script like this one:

from configparser import ConfigParser
import argparse

from hapne.convert.tools import split_convert_vcf
from hapne.ld import compute_ld, compute_ccld
from hapne import hapne_ld


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='HapNe-LD pipeline')
    parser.add_argument('--config_file',
                        help='path to the config file')
    args = parser.parse_args()

    config = ConfigParser()
    config.read(args.config_file)
    # Stage 1: split the vcf file and convert it
    print("Starting stage 1")
    split_convert_vcf(config)
    # Stage 2: compute the LD statistics
    print("Starting stage 2")
    compute_ld(config)
    compute_ccld(config)
    # Stage 3: run HapNe-LD
    print("Starting stage 3")
    hapne_ld(config)

3. HapNe-IBD

Starting from a vcf file, HapNe first splits the file into genomic regions and converts them into FastSMC's input format:

from configparser import ConfigParser
import argparse

from hapne.convert.tools import vcf2fastsmc

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='HapNe-IBD preprocessing pipeline')
    parser.add_argument('--config_file',
                        help='path to the config file')
    args = parser.parse_args()

    config = ConfigParser()
    config.read(args.config_file)
    # Stage 1: split the vcf file and convert it into FastSMC's input format
    print("Starting stage 1")
    vcf2fastsmc(config)

FastSMC can then be run on the 39 .haps files generated by the previous command.
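Since FastSMC has to be run on each of these files, a small check that all 39 are present can catch an incomplete conversion early. This is only a sketch: the folder containing the .haps files is passed in by the caller because its location is not specified here, the helper names are hypothetical, and the pattern assumes the files end in exactly .haps.

```python
from pathlib import Path


def list_haps_files(folder):
    """Return the .haps files found directly inside `folder`, sorted by name."""
    return sorted(Path(folder).glob("*.haps"))


def check_haps_count(folder, expected=39):
    """Raise if `folder` does not contain exactly `expected` .haps files."""
    found = list_haps_files(folder)
    if len(found) != expected:
        raise RuntimeError(
            f"expected {expected} .haps files in {folder}, found {len(found)}"
        )
    return found
```

The returned list can then be looped over to launch one FastSMC job per file.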

The location of the FastSMC output must then be added to the config file (ibd_files):

[CONFIG]
vcf_file=data
keep=data.keep
map=genetic_map_chr@_combined_b37.txt
pseudo_diploid=False
output_folder=HapNe/data
population_name=POP
ibd_files=FASTSMC_OUTPUT_FOLDER
# genome_build can be grch37 or grch38
genome_build=grch37
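One detail worth knowing when annotating config files: Python's ConfigParser, as created in these scripts with ConfigParser(), does not strip inline comments by default, so a trailing "# ..." on the same line as a value becomes part of that value. The short sketch below illustrates the standard-library behaviour; it does not show how HapNe itself handles the value.

```python
from configparser import ConfigParser

TEXT = "[CONFIG]\ngenome_build=grch37 # or grch38\n"

# Default parser: the trailing comment is kept as part of the value.
default_parser = ConfigParser()
default_parser.read_string(TEXT)
raw_value = default_parser["CONFIG"]["genome_build"]

# Parser with inline comments enabled: the trailing comment is dropped.
tolerant_parser = ConfigParser(inline_comment_prefixes=("#",))
tolerant_parser.read_string(TEXT)
clean_value = tolerant_parser["CONFIG"]["genome_build"]

print(repr(raw_value))    # 'grch37 # or grch38'
print(repr(clean_value))  # 'grch37'
```

Keeping comments on their own line avoids the issue without changing how the parser is constructed.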

Using this config file, HapNe-IBD can be run using the following script:

from configparser import ConfigParser
import argparse

from hapne.ibd import build_hist
from hapne import hapne_ibd  # assumed to be exposed next to hapne_ld

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='HapNe-IBD pipeline')
    parser.add_argument('--config_file',
                        help='path to the config file')
    args = parser.parse_args()

    config = ConfigParser()
    config.read(args.config_file)
    # Build the histograms from the FastSMC output, then run HapNe-IBD
    build_hist(config)
    hapne_ibd(config)

4. Analyses of ancient samples

HapNe provides a pipeline to easily study samples from the "Allen Ancient DNA Resource" data set. After downloading the data, HapNe can take a file with the indices of samples to study as input (Caribbean_Ceramic_recent.keep in the following example).

Note that it is assumed that samples present in the keep file are unrelated. Kinship information is usually present in the anno file.

To perform the analysis, create the following configuration file:

[CONFIG]
eigen_root=DATA/v50.0_1240k_public
anno_file=DATA/v50.0_1240k_public.anno
keep=CONFIG/Caribbean_Ceramic_recent.keep
pseudo_diploid=True
output_folder=RESULTS/Caribbean_Ceramic_recent
population_name=Caribbean_Ceramic_recent

eigen_root gives the location of the main data set, anno_file points to the annotation file, and keep refers to a file containing the indices of the individuals to study (one index per row).

The output will be written to the folder given by output_folder. pseudo_diploid must be set to True when studying ancient data. Finally, population_name will be used to name the output files.
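The keep file and the configuration file can also be generated from Python, which is convenient when looping over several populations. The sketch below is illustrative only: the helpers write_keep_file and write_adna_config are hypothetical names, and the sample indices are whatever the caller has selected from the annotation file.

```python
from configparser import ConfigParser
from pathlib import Path


def write_keep_file(indices, path):
    """Write one sample index per row, as described for the keep file."""
    Path(path).write_text("".join(f"{index}\n" for index in indices))


def write_adna_config(path, eigen_root, anno_file, keep, output_folder, population_name):
    """Write a [CONFIG] section with the keys used in the aDNA example."""
    config = ConfigParser()
    config["CONFIG"] = {
        "eigen_root": eigen_root,
        "anno_file": anno_file,
        "keep": keep,
        "pseudo_diploid": "True",  # must be True for ancient data
        "output_folder": output_folder,
        "population_name": population_name,
    }
    with open(path, "w") as handle:
        # Write key=value without spaces, matching the examples in this README
        config.write(handle, space_around_delimiters=False)
```

All values are passed as strings because ConfigParser only stores strings; paths built with pathlib should be converted with str() first.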

Next, save the following script as pipeline.py and run it with python pipeline.py --config_file config.ini:

from configparser import ConfigParser
import argparse

from hapne.convert.eigenstrat2vcf import eigenstrat2vcf
from hapne.convert.tools import split_convert_vcf
from hapne.ld import compute_ld, compute_ccld
from hapne.utils import get_age_from_anno
from hapne import hapne_ld


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='HapNe-LD pipeline')
    parser.add_argument('--config_file',
                        help='path to the config file')
    args = parser.parse_args()

    config = ConfigParser()
    config.read(args.config_file)
    # Stage 1: convert the eigenstrat data to vcf
    print("Starting stage 1")
    eigenstrat2vcf(config)
    # Stage 2: split the vcf file and convert it
    print("Starting stage 2")
    split_convert_vcf(config)
    # Stage 3: compute the LD statistics
    print("Starting stage 3")
    compute_ld(config)
    compute_ccld(config)
    # Stage 4: read the sample ages from the anno file and run HapNe-LD
    print("Starting stage 4")
    get_age_from_anno(config)
    hapne_ld(config)

5. How to cite

If you use this software, please cite:

R. Fournier, D. Reich, P. Palamara. Haplotype-based inference of recent effective population size in modern and ancient DNA samples. (preprint) bioRxiv, 2022.

Acknowledgments

Two scripts of the convert module were downloaded from the following repositories and edited to fit into this package:
