Software package to filter variants, SNPs and INDELs, that are present in heterozygous form in phased genomes.

License: MIT

PHASEfilter

PHASEfilter is a software package to filter variants, SNPs and INDELs, that are present in heterozygous form in phased genomes. It is an easily implementable tool that provides a simple approach to detect and filter heterozygous SNPs and INDELs in diploid species based on a phased genome assembly.

Installation

These installation instructions target Linux distributions.

Install directly

$ pip3 install PHASEfilter

Install with virtualenv

$ virtualenv PHASEfilter --python=python3 --prompt "(PHASEfilter 1.1.0) "
$ . PHASEfilter/bin/activate
(PHASEfilter 1.1.0) $ pip install PHASEfilter

## install all software dependencies of PHASEfilter
(PHASEfilter 1.1.0) $ cd PHASEfilter/bin/
(PHASEfilter 1.1.0) $ ./install_phasefilter_dependencies.sh

Install with conda

$ wget https://raw.githubusercontent.com/ibigen/PHASEfilter/main/conda/conda_phasefilter_env.yml -O conda_phasefilter_env.yml
$ conda env create -f conda_phasefilter_env.yml
$ conda activate PHASEfilter

The following software must be available on your computer:

All software available
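
As a quick sanity check after any of the installation routes above, the sketch below looks up the command-line tools named in this README on the current PATH. It relies only on Python's standard `shutil.which`. The `minimap2` entry is an assumption based on the aligner this README mentions, and the helper name `check_tools` is hypothetical, not part of PHASEfilter.

```python
import shutil

# Command names taken from this README; "minimap2" is an assumed
# executable name for the Minimap2 aligner.
TOOLS = [
    "phasefilter",
    "phasefilter_single",
    "synchronize_genomes",
    "make_alignment",
    "reference_statistics",
    "copy_raw_data_example_phasefilter",
    "minimap2",
]


def check_tools(names=TOOLS):
    """Return a dict mapping each command name to True if it is on PATH."""
    return {name: shutil.which(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_tools().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Inside a virtualenv or conda environment, run it with that environment activated so the environment's `bin` directory is on PATH.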

Filter variants in phased genomes

This tool identifies heterozygous positions between two phased references. It starts by aligning each pair of diploid chromosomes with the Minimap2 aligner. Once the two references are synchronized, the position of a variant can be located in both chromosomes of a pair, so variants can be removed if they meet the established criteria. To classify variants, two VCF files must be supplied, one for each reference phase. PHASEfilter then goes through the variants called against reference A and checks whether there is a homologous variant among those called against reference B. For each variant called against reference A, one of three situations can occur: 1) both references are equal at the position under analysis, and the variant is valid; 2) the position is heterozygous between the references and the variant reflects it, so the variant is removed; 3) the position is heterozygous between the references and the variant is homozygous, in which case it goes to the valid variants file and also to the Loss Of Heterozygosity (LOH) file. The VCF file under analysis is always the one passed in the '--vcf1' parameter.
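
The three situations can be summarized as a small decision function. This is only an illustrative sketch, not PHASEfilter's actual code: the names `Outcome`, `classify` and `DESTINATIONS` are hypothetical, and it assumes that "the variant reflects it" in situation 2 means the variant is not homozygous.

```python
from enum import Enum


class Outcome(Enum):
    VALID = "valid"
    REMOVED = "removed"
    VALID_AND_LOH = "valid+LOH"


# Output files each outcome is written to, following the description above.
DESTINATIONS = {
    Outcome.VALID: ("valid",),
    Outcome.REMOVED: ("removed",),
    Outcome.VALID_AND_LOH: ("valid", "LOH"),
}


def classify(ref_a_allele, ref_b_allele, variant_is_homozygous):
    """Classify a variant called against reference A.

    ref_a_allele / ref_b_allele: sequence of each reference at the
    synchronized position. variant_is_homozygous: genotype of the call.
    """
    if ref_a_allele == ref_b_allele:
        # Situation 1: the references agree, so the variant is kept.
        return Outcome.VALID
    if variant_is_homozygous:
        # Situation 3: references differ but the call is homozygous (LOH).
        return Outcome.VALID_AND_LOH
    # Situation 2: the call only mirrors the difference between the phases.
    return Outcome.REMOVED


print(classify("A", "G", variant_is_homozygous=False).value)
```

A variant at a position where the two references differ is therefore removed unless it is homozygous, in which case it is kept and also reported as LOH.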

$ phasefilter --help
## You can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ phasefilter --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --vcf1 temp_raw_data/T1_Fluc_7A_snps.vcf.gz --vcf2 temp_raw_data/T1_Fluc_7B_snps.vcf.gz --out output_dir

## you can also supply chain files if they exist
$ phasefilter --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --vcf1 temp_raw_data/T1_Fluc_7A_snps.vcf.gz --vcf2 temp_raw_data/T1_Fluc_7B_snps.vcf.gz --out output_dir --chain_A_B temp_raw_data/Assembly22_hapA_To_Assembly22_hapB.over.chain --chain_B_A temp_raw_data/Assembly22_hapB_To_Assembly22_hapA.over.chain

Up to eight files are created when the command finishes. The outputs cover both directions: from reference A (ref1) to reference B (ref2), and from reference B (ref2) to reference A (ref1).

  • report_[A]to[B].txt - has the statistics about the analysis;

  • valid_[A]to[B].vcf.gz - has all variants that are not heterozygous between two references;

  • removed_[A]to[B].vcf.gz - has all heterozygous variants;

  • LOH_[A]to[B].vcf.gz - has all variants that show loss of heterozygosity between the two references. These variants are also in the valid_[A]to[B].vcf.gz file.

  • report_[B]to[A].txt - has the statistics about the analysis;

  • valid_[B]to[A].vcf.gz - has all variants that are not heterozygous between two references;

  • removed_[B]to[A].vcf.gz - has all heterozygous variants;

  • LOH_[B]to[A].vcf.gz - has all variants that show loss of heterozygosity between the two references. These variants are also in the valid_[B]to[A].vcf.gz file.
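
To see which of these files a run actually produced, the output directory can be grouped by file-name prefix. The sketch below is a minimal example under assumptions: it relies only on the `report_`, `valid_`, `removed_` and `LOH_` prefixes listed above, makes no assumption about how `[A]` and `[B]` are filled in, and the helper name `group_outputs` is hypothetical.

```python
from pathlib import Path

# Prefixes taken from the file list above.
PREFIXES = ("report_", "valid_", "removed_", "LOH_")


def group_outputs(out_dir):
    """Group the files in out_dir by the output kind their name starts with."""
    groups = {prefix.rstrip("_"): [] for prefix in PREFIXES}
    for path in sorted(Path(out_dir).iterdir()):
        if not path.is_file():
            continue
        for prefix in PREFIXES:
            if path.name.startswith(prefix):
                groups[prefix.rstrip("_")].append(path.name)
    return groups
```

Calling `group_outputs("output_dir")` after the `phasefilter` command above returns a dictionary with the keys `report`, `valid`, `removed` and `LOH`; an empty list means that no file of that kind was written.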

Description of the headings in the report files:

  • Heterozygous (Removed): Heterozygous variants that were identified; they go to the removed_[YYY]_to[XXX].vcf.gz file.
  • Keep alleles: Alleles present in the valid_[YYY]_to[XXX].vcf.gz file.
  • LOH alleles: Loss of heterozygosity alleles. They are in the valid_[YYY]_to[XXX].vcf.gz and LOH_[YYY]_to[XXX].vcf.gz files.
  • Other than SNP: Other variants that are not SNPs or INDELs; they go to the valid_[YYY]_to[XXX].vcf.gz file.
  • Don't have hit position: Variants that don’t have a position in the hit (ref B) genome; they go to the valid_[YYY]_to[XXX].vcf.gz file.
  • Could Not Fetch VCF Record on Hit: Variants that are present in the source VCF file but not in the hit VCF file. They go to the valid_[YYY]_to[XXX].vcf.gz file.
  • Total alleles: All the alleles present in the source VCF file, i.e. the analyzed alleles.
  • Total Alleles new Source VCF: Total alleles that are in the valid_[YYY]_to[XXX].vcf.gz file.
  • Method: Alignment method.
  • Alignment %: Percentage of alignment.

Note: You can copy some example data to test the commands.

Filter variants in phased genomes but only one direction

This tool does the same as the previous one, but only analyzes from reference A (ref1) to reference B (ref2).

$ phasefilter_single --help
$ phasefilter_single --ref1 Ca22chr1A_C_albicans_SC5314.fasta --ref2 Ca22chr1B_C_albicans_SC5314.fasta --vcf1 A-M_S4_chrA_filtered_snps.vcf.gz --vcf2 A-M_S4_chrB_filtered_snps.vcf.gz --out_vcf out_result.vcf.gz

## with chain
$ phasefilter_single --ref1 Ca22chr1A_C_albicans_SC5314.fasta --ref2 Ca22chr1B_C_albicans_SC5314.fasta --vcf1 A-M_S4_chrA_filtered_snps.vcf.gz --vcf2 A-M_S4_chrB_filtered_snps.vcf.gz --out_vcf out_result.vcf.gz --chain temp_raw_data/Assembly22_hapA_To_Assembly22_hapB.over.chain

Synchronize annotation genomes

Synchronizes genome annotations by adapting the annotations of reference 1 to reference 2, adding the tags 'StartHit' and 'EndHit' to the result file. For VCF files, only the 'StartHit' tag is added, in the INFO field. The annotations (input file) need to be in VCF or GFF3 format and belong to reference 1.

$ synchronize_genomes --help
## You can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ synchronize_genomes --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --gff temp_raw_data/T1_Fluc_7A_snps.gff3.gz --out T1_Fluc_7A_snps.sync.gff3.gz
$ synchronize_genomes --ref1 S288C_reference.fna --ref2 S01.assembly.final.fa --gff S288C_reference.gff3 --out result.gff3 --pass_chr chrmt
$ synchronize_genomes --ref1 S288C_reference.fna --ref2 S01.assembly.final.fa --vcf S288C_reference.vcf.gz --out result.vcf.gz
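
The added tags can be read back with a few lines of Python. This is a hedged sketch: it assumes the tags are written as `StartHit=<position>` and `EndHit=<position>` entries in the semicolon-separated GFF3 attributes column (column 9) or VCF INFO column (column 8). The example records and the helper names `parse_fields` and `hit_coordinates` are hypothetical.

```python
def parse_fields(field_text):
    """Parse 'key=value;key=value' text into a dict; bare flags map to True."""
    fields = {}
    for item in field_text.strip().split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def hit_coordinates(line, column):
    """Return (StartHit, EndHit) from the given 0-based tab-separated column.

    Missing tags are returned as None.
    """
    fields = parse_fields(line.rstrip("\n").split("\t")[column])
    return fields.get("StartHit"), fields.get("EndHit")


# Hypothetical GFF3 record; attributes are column 9 (index 8).
gff_line = "chr7A\tsrc\tgene\t100\t200\t.\t+\t.\tID=gene1;StartHit=105;EndHit=205"
print(hit_coordinates(gff_line, 8))

# Hypothetical VCF record; INFO is column 8 (index 7).
vcf_line = "chr7A\t150\t.\tA\tG\t50\tPASS\tDP=30;StartHit=155"
print(hit_coordinates(vcf_line, 7))
```

Values are returned as strings, so convert them with `int()` before doing coordinate arithmetic.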

Make alignments

Obtain the percentage of the minimap2 alignment between chromosomes and create an output in ClustalX format.

$ make_alignment --help
## You can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ make_alignment --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --out report.txt
$ make_alignment --ref1 Ca22chr1A_C_albicans_SC5314.fasta --ref2 Ca22chr1B_C_albicans_SC5314.fasta --out report.txt --pass_chr chrmt --out_alignment syncronizationSacharo

Reference Statistics

This application reports the number of nucleotides per chromosome.

$ reference_statistics --help
## You can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ reference_statistics --ref temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --out report_stats.txt
$ reference_statistics --ref Ca22chr1A_C_albicans_SC5314.fasta.gz --out report.txt
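
For intuition, counting nucleotides per chromosome in a FASTA file takes only a few lines of Python. This is a minimal sketch, not the implementation of `reference_statistics`: it counts every character on the sequence lines (including `N`), treats files ending in `.gz` as gzip-compressed, and the function name `count_nucleotides` is hypothetical.

```python
import gzip


def count_nucleotides(fasta_path):
    """Return {sequence name: number of sequence characters} for a FASTA file."""
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    counts = {}
    name = None
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first word of the header as the sequence name.
                words = line[1:].split()
                name = words[0] if words else ""
                counts[name] = 0
            elif name is not None:
                counts[name] += len(line)
    return counts
```

For example, `count_nucleotides("temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta")` returns one entry per sequence in the example reference.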

Copy some example data to test all tools

You can copy some example data to test the available tools.

$ copy_raw_data_example_phasefilter --help
$ copy_raw_data_example_phasefilter --out temp_dir

Documentation

PHASEfilter documentation is available in ReadTheDocs: PHASEfilter
