Software package to filter variants, SNPs and INDELs, that are present in heterozygous form in phased genomes.

PHASEfilter

PHASEfilter is a software package to filter variants, SNPs and INDELs, that are present in heterozygous form in phased genomes. It is an easily implementable tool that provides a simple approach to detect and filter heterozygous SNPs and INDELs in diploid species based on a phased genome assembly.

Installation

These installation instructions target Linux distributions.

Install directly

$ pip3 install PHASEfilter

Install with virtualenv

$ virtualenv PHASEfilter --python=python3 --prompt "(PHASEfilter 1.1.0) "
$ . PHASEfilter/bin/activate
(PHASEfilter 1.1.0) $ pip install PHASEfilter

## Install all software dependencies of PHASEfilter
(PHASEfilter 1.1.0) $ cd PHASEfilter/bin/
(PHASEfilter 1.1.0) $ ./install_phasefilter_dependencies.sh

Install with conda

$ wget https://raw.githubusercontent.com/ibigen/PHASEfilter/main/conda/conda_phasefilter_env.yml -O conda_phasefilter_env.yml
$ conda env create -f conda_phasefilter_env.yml
$ conda activate PHASEfilter

The following software must be available on your computer:

minimap2 v2.22 or later
bcftools v1.3 or later
samtools v1.3 or later
htslib v1.3 or later
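
Before running any of the tools, it can help to confirm that these programs are on the PATH. The following is a minimal sketch and is not part of PHASEfilter: the function name `check_tools` is hypothetical, and it assumes that htslib was installed together with its `bgzip` and `tabix` command-line programs. It checks presence only, not version numbers.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with PHASEfilter): report which of the
# given programs can be found on the PATH. Returns 1 if any is missing.
check_tools() {
    check_tools_missing=0
    for check_tools_name in "$@"; do
        if command -v "$check_tools_name" >/dev/null 2>&1; then
            echo "found:   $check_tools_name"
        else
            echo "missing: $check_tools_name"
            check_tools_missing=1
        fi
    done
    return "$check_tools_missing"
}

# Assumption: bgzip and tabix stand in for an htslib installation.
check_tools minimap2 bcftools samtools bgzip tabix \
    || echo "Install the missing tools before running PHASEfilter."
```

Versions can then be inspected by hand, for example with `samtools --version` or `minimap2 --version`.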

All software available

Filter variants in phased genomes

PHASEfilter identifies heterozygous positions between two phased references. It starts by aligning each pair of diploid chromosomes with the Minimap2 aligner. Once the chromosomes are synchronized, the position of a variant can be located in both chromosomes of the pair, which allows variants to be removed if they meet the established criteria. To classify variants, two VCF files must be supplied, one for each reference phase. PHASEfilter then goes through the variants called against reference A and checks whether there is a homologous variant among those called against reference B. For each variant called against reference A, one of three situations can occur:

1) both references are equal at the position under analysis, and the variant is valid;
2) the position is heterozygous between the references and the variant reflects it, so the variant is removed;
3) the position is heterozygous between the references and the variant is homozygous; it goes to the valid variants file and also to the Loss Of Heterozygosity (LOH) file.

The VCF file under analysis is always the one passed in the '--vcf1' parameter.
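
The three situations can be read as a small decision table. The sketch below is a simplification for illustration only, not PHASEfilter's actual code: the function `classify_variant` and its two input labels (`same`/`diff` for the references, `het`/`hom` for the variant) are hypothetical, and the real tool derives this information from the alignment and the two VCF files.

```shell
#!/bin/sh
# Hypothetical decision table for one variant called against reference A.
#   $1: "same" if references A and B agree at the position, "diff" otherwise
#   $2: "het" if the variant reflects the difference between the references,
#       "hom" if the variant is homozygous
# Prints the output file(s) that would receive the variant.
classify_variant() {
    case "$1:$2" in
        same:*)   echo "valid" ;;       # situation 1: references are equal
        diff:het) echo "removed" ;;     # situation 2: heterozygous, filtered out
        diff:hom) echo "valid LOH" ;;   # situation 3: loss of heterozygosity
        *)        echo "unknown"; return 1 ;;
    esac
}

classify_variant same het   # prints "valid"
classify_variant diff het   # prints "removed"
classify_variant diff hom   # prints "valid LOH"
```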

$ phasefilter --help
## You can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ phasefilter --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --vcf1 temp_raw_data/T1_Fluc_7A_snps.vcf.gz --vcf2 temp_raw_data/T1_Fluc_7B_snps.vcf.gz --out output_dir

## You can also use chain files, if they exist
$ phasefilter --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --vcf1 temp_raw_data/T1_Fluc_7A_snps.vcf.gz --vcf2 temp_raw_data/T1_Fluc_7B_snps.vcf.gz --out output_dir --chain_A_B temp_raw_data/Assembly22_hapA_To_Assembly22_hapB.over.chain --chain_B_A temp_raw_data/Assembly22_hapB_To_Assembly22_hapA.over.chain

Up to eight files are created when the command ends. The outputs cover both directions: from reference A (ref1) to reference B (ref2), and from reference B (ref2) to reference A (ref1).

report_[A]_to_[B].txt - has the statistics about the analysis;
valid_[A]_to_[B].vcf.gz - has all variants that are not heterozygous between the two references;
removed_[A]_to_[B].vcf.gz - has all heterozygous variants;
LOH_[A]_to_[B].vcf.gz - has all variants that show loss of heterozygosity between the two references. These variants are also in the valid_[A]_to_[B].vcf.gz file;
report_[B]_to_[A].txt - has the statistics about the analysis;
valid_[B]_to_[A].vcf.gz - has all variants that are not heterozygous between the two references;
removed_[B]_to_[A].vcf.gz - has all heterozygous variants;
LOH_[B]_to_[A].vcf.gz - has all variants that show loss of heterozygosity between the two references. These variants are also in the valid_[B]_to_[A].vcf.gz file.
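
To get a quick overview of a run, the records in each output VCF can be counted. This is a minimal sketch under stated assumptions: the helper names `count_vcf_records` and `summarize_outputs` are hypothetical, `output_dir` is the directory given to `--out` in the commands above, and the files are read with `gzip -dc`, which also handles bgzip-compressed files.

```shell
#!/bin/sh
# Hypothetical helper: count the data records (non-header lines) in a
# gzip- or bgzip-compressed VCF file.
count_vcf_records() {
    # grep -c exits with status 1 when the count is 0, hence "|| true".
    gzip -dc "$1" | grep -vc '^#' || true
}

# Hypothetical helper: print "<file name><TAB><record count>" for every
# *.vcf.gz file in the given directory (default: output_dir).
summarize_outputs() {
    for vcf_file in "${1:-output_dir}"/*.vcf.gz; do
        [ -e "$vcf_file" ] || continue   # the glob matched nothing
        printf '%s\t%s\n' "$(basename "$vcf_file")" "$(count_vcf_records "$vcf_file")"
    done
}

summarize_outputs output_dir
```

The counts can be compared with the totals in the matching report file.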

Headings description in report files:

Heterozygous (Removed) - heterozygous variants identified; they go to the removed_[YYY]_to_[XXX].vcf.gz file;
Keep alleles - alleles present in the valid_[YYY]_to_[XXX].vcf.gz file;
LOH alleles - loss of heterozygosity; they are in the valid_[YYY]_to_[XXX].vcf.gz and LOH_[YYY]_to_[XXX].vcf.gz files;
Other than SNP - other variants that are not SNPs or INDELs; they go to the valid_[YYY]_to_[XXX].vcf.gz file;
Don't have hit position - variants that do not have a position in the hit (ref B) genome; they go to the valid_[YYY]_to_[XXX].vcf.gz file;
Could Not Fetch VCF Record on Hit - variants that are present in the source file but not in the hit VCF file; they go to the valid_[YYY]_to_[XXX].vcf.gz file;
Total alleles - all the alleles present in the source VCF file (analyzed alleles);
Total Alleles new Source VCF - total alleles that are in the valid_[YYY]_to_[XXX].vcf.gz file;
Method - alignment method;
Alignment % - percentage of alignment.

Note: You can copy some example data to test the commands.

Filter variants in phased genomes but only one direction

This tool does the same as the previous one, but only analyzes from reference A (ref1) to reference B (ref2).

$ phasefilter_single --help
$ phasefilter_single --ref1 Ca22chr1A_C_albicans_SC5314.fasta --ref2 Ca22chr1B_C_albicans_SC5314.fasta --vcf1 A-M_S4_chrA_filtered_snps.vcf.gz --vcf2 A-M_S4_chrB_filtered_snps.vcf.gz --out_vcf out_result.vcf.gz

## with chain
$ phasefilter_single --ref1 Ca22chr1A_C_albicans_SC5314.fasta --ref2 Ca22chr1B_C_albicans_SC5314.fasta --vcf1 A-M_S4_chrA_filtered_snps.vcf.gz --vcf2 A-M_S4_chrB_filtered_snps.vcf.gz --out_vcf out_result.vcf.gz --chain temp_raw_data/Assembly22_hapA_To_Assembly22_hapB.over.chain
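
When each chromosome pair is stored in its own files, the single-direction command can be repeated in a loop. The sketch below is only an illustration: the functions `run` and `run_per_chromosome` and the `sample_chr...` VCF names are hypothetical placeholders, the reference names follow the pattern of the example above, and only flags shown in this section are used. By default the commands are printed rather than executed; set `DRY_RUN=0` to run them.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN is unset or "1",
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One phasefilter_single call per chromosome identifier given as argument.
# The VCF and output file names are placeholders; adjust them to your data.
run_per_chromosome() {
    for chr in "$@"; do
        run phasefilter_single \
            --ref1 "Ca22chr${chr}A_C_albicans_SC5314.fasta" \
            --ref2 "Ca22chr${chr}B_C_albicans_SC5314.fasta" \
            --vcf1 "sample_chr${chr}A_snps.vcf.gz" \
            --vcf2 "sample_chr${chr}B_snps.vcf.gz" \
            --out_vcf "sample_chr${chr}_AtoB.vcf.gz"
    done
}

run_per_chromosome 1 2 3
```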

Synchronize annotation genomes

Synchronize the annotations of two genomes by adapting the annotations of reference 1 to reference 2, adding the tags 'StartHit' and 'EndHit' to the result file. In VCF files, only the 'StartHit' tag is added, in the INFO field. The annotations (input file) need to be in VCF or GFF3 format and belong to reference 1.

$ synchronize_genomes --help
## You can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ synchronize_genomes --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --gff temp_raw_data/T1_Fluc_7A_snps.gff3.gz --out T1_Fluc_7A_snps.sync.gff3.gz
$ synchronize_genomes --ref1 S288C_reference.fna --ref2 S01.assembly.final.fa --gff S288C_reference.gff3 --out result.gff3 --pass_chr chrmt
$ synchronize_genomes --ref1 S288C_reference.fna --ref2 S01.assembly.final.fa --vcf S288C_reference.vcf.gz --out result.vcf.gz
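
The synchronized coordinates can be pulled out of a GFF3 result for inspection. This is a sketch that rests on an assumption: that 'StartHit' and 'EndHit' are written as `key=value` pairs in the attributes column (column 9), as is usual for GFF3; the exact layout should be checked against a real result file. The function name `extract_hits` and the sample record are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: for each GFF3 feature, print the sequence ID, the
# original start and end, and the StartHit/EndHit values ("NA" if absent).
# Assumption: the tags appear as StartHit=<pos> and EndHit=<pos> in column 9.
extract_hits() {
    awk '
        BEGIN { FS = OFS = "\t" }
        /^#/ { next }                          # skip comments and directives
        NF >= 9 {
            start_hit = "NA"; end_hit = "NA"
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                split(attrs[i], kv, "=")
                if (kv[1] == "StartHit") start_hit = kv[2]
                if (kv[1] == "EndHit")   end_hit = kv[2]
            }
            print $1, $4, $5, start_hit, end_hit
        }
    ' "$@"
}

# Made-up record, only to show the expected input shape.
printf 'Ca22chr7A\tsrc\tgene\t100\t200\t.\t+\t.\tID=g1;StartHit=98;EndHit=198\n' \
    | extract_hits
```

For a compressed result, the file can be piped in, for example `gzip -dc T1_Fluc_7A_snps.sync.gff3.gz | extract_hits`.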

Make alignments

Obtain the percentage of the minimap2 alignment between chromosomes and create an output in ClustalX format.

$ make_alignment --help
## You can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ make_alignment --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --out report.txt
$ make_alignment --ref1 Ca22chr1A_C_albicans_SC5314.fasta --ref2 Ca22chr1B_C_albicans_SC5314.fasta --out report.txt --pass_chr chrmt --out_alignment syncronizationSacharo

Reference Statistics

With this application it is possible to obtain the number of nucleotides by chromosome.

$ reference_statistics --help
## You can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ reference_statistics --ref temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --out report_stats.txt
$ reference_statistics --ref Ca22chr1A_C_albicans_SC5314.fasta.gz --out report.txt
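
As a cross-check, the number of nucleotides per chromosome can also be computed directly from a FASTA file. The sketch below is an independent illustration, not how `reference_statistics` is implemented; the function name `count_nucleotides` is hypothetical, and it counts every sequence character, including N and other ambiguity codes.

```shell
#!/bin/sh
# Hypothetical helper: print "<sequence ID><TAB><length>" for each record
# of a FASTA file (or of standard input when no file is given).
count_nucleotides() {
    awk '
        /^>/ {
            if (name != "") print name "\t" total
            name = substr($1, 2)    # ID: header text up to the first space
            total = 0
            next
        }
        { gsub(/[ \t\r]/, ""); total += length($0) }   # ignore whitespace
        END { if (name != "") print name "\t" total }
    ' "$@"
}

# Tiny made-up FASTA, only to show the output shape.
printf '>chrA demo\nACGT\nACGT\n>chrB\nAC\n' | count_nucleotides
```

A compressed reference can be piped in with `gzip -dc Ca22chr1A_C_albicans_SC5314.fasta.gz | count_nucleotides`.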

Copy some example data to test all tools

It is possible to copy some example data and test the available tools.

$ copy_raw_data_example_phasefilter --help
$ copy_raw_data_example_phasefilter --out temp_dir

Documentation

PHASEfilter documentation is available in ReadTheDocs: PHASEfilter

Release history

1.1.0 - May 5, 2023 (this version)
0.3.7 - Jan 7, 2022
0.3.6 - Dec 30, 2021
0.3.5 - Dec 21, 2021
0.3.4 - Sep 22, 2021
0.3.3 - Sep 20, 2021
0.3.2 - Sep 20, 2021
0.0.1a1 - Jul 7, 2021 (pre-release, yanked; reason: several bugs)
