Search calcyanin in a set of amino acid sequences

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

PCALF - Retrieve Calcyanin protein and ccyA gene from genomes.

PCALF stand for Python CALcyanin Finder.

Benzerara, K., Duprat, E., Bitard-Feildel, T., Caumes, G., Cassier-Chauvat, C., Chauvat, F., ... & Callebaut, I. (2022). A new gene family diagnostic for intracellular biomineralization of amorphous Ca carbonates by cyanobacteria. Genome Biology and Evolution, 14(3), evac026.

The ccyA gene is searched at the protein level following a simple decision tree based on the specific composition of the C-ter extremity of this protein. We use a weighted HMM profile specific of the glycine zipper triplication (aka GlyX3) to detect sequences with a putative glycine triplication. Sequences with at least one hit against this profile are annotated with three HMM profiles specific of each Glycine zipper : Gly1, Gly2 and Gly3. A set of known calcyanins is used to infere the type of the N-ter extremity (X, Z, Y, CoBaHMA or ?). Finally, a flag is assign to each sequence depending on its N-terminus type and its C-terminus modular organization.

Decision tree :

Installation :

mamba create -n pcalf -c k2sohigh pcalf;

pip3 install pcalf;

Dependencies :

python==3.9
pyhmmer==0.7.1
biopython==1.81
numpy>=1.21
pyyaml>=6
snakemake==7.22
pandas==1.5.3
tqdm==4.64.1
plotly==5.11.0
python-igraph==0.10.4

External dependency :

blast

Usage :

Pcalf is composed of several commands :

pcalf
pcalf-datasets-workflow
pcalf-annotate-workflow
pcalf-report

pcalf :

This command can be used to look quickly for the presence of calcyanin in a set of amino acid sequences. It take one or more fasta files as input and output several files including a summary, a list of features and a list of raw hits produced during the search. It also produce updated HMMs and updated MSAs with newly detected calcyanin tagged as Calcyanin with known N-ter.

pcalf -i proteins.fasta 
      -o pcalf_results 
      --iterative-search
      --iterative-update 
      --gly1-msa custom_gly1_msa.fasta

--iterative-search : If True, the search will be performed with profiles produced by the previous iteration until there is no new sequence detected or if --max-iteration is reached.

--iterative-update : True calcyanins (if any) will be added to the HMMs profiles iteratively starting with the best sequence (based on feature' score).

--gly1-msa : Use another MSA instead of the default one. A HMM profile will be built from the given MSA and Glycine weight will be increased.

Thresholds:

By default, coverage and E-value threshold are infered from the HMM profiles that will be used for the search. To define the thresholds, a MSA is converted into a simple fasta file by deleting all gaps and aligned against its own HMM profile. A soft filtering is made using a coverage threshold of 0.5. The maximum E-values and the minimum coverage values are used to define the thresholds as follows : max(E-value)*10, min(coverage)-0.1

pcalf-datasets-workflow :

pcalf-datasets-workflow can be used to retrieve genomes from NCBI databases such as RefSeq and GenBank based on accession (e.g GCA_012769535.1) or TaxID.

Genomes and annotations (genes and CDS) will be downloaded using the new command line tools from NCBI. If annotations does not exists for a genome, then genes and CDS will be predicted using Prodigal.

A yaml file is also produced and can be used as input for pcalf-annotate-workflow.

GCA_012769535.1:
  genome : /path/to/genome.gz
  cds_faa: /path/to/cds_faa.gz
  cds_fna: /path/to/cds_fna.gz
GCF_012769535.1:
...

example :

pcalf-datasets-workflow -t 1117 
                        -o pcalf_datasets_results 
                        -a file_with_accession.txt 
                        -e file_with_accession_to_ignore.txt

The command above will download all cyanobacteria genome (taxid : 1117) and genomes specified in file_with_accession.txt. If any of them are specified in file_with_accession_to_ignore.txt, then they will be ignored.

pcalf-annotate-workflow :

pcalf-annotate-workflow is actually the workflow that will help you recover calcyanin proteins and ccyA genes from a set of genomes. This workflow is composed of multiple steps :

genome taxonomic classification using GTDB-TK V2
genome quality assessment using CheckM
calcyanin detection using pcalf
calcyanin / ccyA linking using a specific python script.
NCBI metadatas recovery using NCBI

Note, that GTDB-TK and checkM requires external databases, respectively GTDB and CheckM datas. In addition, it's advised to run GTDB-TK and CheckM on a computer cluster. Because pcalf-annotate-workflow rely on snakemake you can easily provide a snakemake profile through the --snakargs option to run it on your favorite cluster. On the other hand, you can skip the genome taxonomic classification and the quality assessment with the --quick flag.

pcalf-annotate-workflow take as input a yaml file with a specific format, see pcalf-datasets-workflow for details.

example :

pcalf-annotate-workflow -i input_file.yaml 
                        -o pcalf_annotate_results 
                        --db db_file_from_another_run.sqlite3 
                        --snakargs "--profile my_slurm_profile --use-conda --jobs 50" 
                        --gtdb my_gtdb_directory 
                        --checkm my_checkm_datas_directory

The command above will process all the genome specified in input_file.yaml through the pcalf-annotate-workflow including checkm and gtdb-tk steps. The sqlite3 file produced will be merged with db_file_from_another_run.sqlite3. The workflow will be ran on your computer cluster with 50 jobs at a time. See snakemake documentation for details about cluster execution.

Several output files for each step will be produced but the final output is a sqlite3 database storing multiple tables:

- genome          # NCBI metadatas
- gtdbtk          # GTDB-TK classification results
- checkm          # Checkm Results 
- summary         # PCALF summary table
- features        # PCALF features table
- hits            # PCALF hits table
- ccyA            # ccyA table
- gly1            # Gly1 MSA
- gly2            # Gly2 MSA
- gly3            # Gly3 MSA
- glyx3           # Glyx3 MSA
- nterdb          # N-ter table

You can use the sqlite3 database from another run as a basis for a new one. In this case, MSAs stored in the sqlite3 file will be used to generate HMM profiles for pcalf.

pcalf-report :

This command produce an HTML report from a sqlite3 database given by pcalf annotate workflow.

example :

pcalf-report --db sqlite3_file.sqlite3 --out report.html

Workflow :

Workflow description

Project details

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

This version

1.2.6

Sep 24, 2023

1.2.5

Sep 24, 2023

1.2.4

Sep 24, 2023

1.2.3

Sep 24, 2023

1.2.2

Aug 3, 2023

1.2.1

Jul 26, 2023

1.2.0

Jul 26, 2023

1.1.0

Jun 12, 2023

0.0.3

Apr 13, 2023

0.0.2

Apr 13, 2023

0.0.1

Apr 13, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pcalf-1.2.6.tar.gz (356.4 kB view details)

Uploaded Sep 24, 2023 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pcalf-1.2.6-py3-none-any.whl (385.6 kB view details)

Uploaded Sep 24, 2023 Python 3

File details

Details for the file pcalf-1.2.6.tar.gz.

File metadata

Download URL: pcalf-1.2.6.tar.gz
Upload date: Sep 24, 2023
Size: 356.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for pcalf-1.2.6.tar.gz
Algorithm	Hash digest
SHA256	`54e7c3d733cdefe78da1d35bdc7b5c7d420585a59dd00e4afc3f37d9429a2e4d`
MD5	`cef483fb14f67a3d25184fb93a57bd43`
BLAKE2b-256	`04f482034f2d672f5c4110cdd5f56ea0f9360a336b9388ce87375ba46d049fa2`

See more details on using hashes here.

File details

Details for the file pcalf-1.2.6-py3-none-any.whl.

File metadata

Download URL: pcalf-1.2.6-py3-none-any.whl
Upload date: Sep 24, 2023
Size: 385.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for pcalf-1.2.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c15649ecc4d8ecb4e46b59d54631a57aeb05bbc88a0b1487261440cbe356f86a`
MD5	`0af139bed16ee7eb77d7f2224c49ace8`
BLAKE2b-256	`928a588a0d518fc9d09d4f143e1a0fee08c78d3604a84559f0793c303cd6c3b9`

See more details on using hashes here.

pcalf 1.2.6

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

PCALF - Retrieve Calcyanin protein and ccyA gene from genomes.

Table of Contents :

Calcyanin detection :

Decision tree :

Installation :

Dependencies :

External dependency :

Usage :

pcalf :

Thresholds:

pcalf-datasets-workflow :

pcalf-annotate-workflow :

pcalf-report :

Workflow :

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes