Skip to main content

Search calcyanin in a set of amino acid sequences

Project description

PCALF - Retrieve Calcyanin protein and ccyA gene from genomes.

PCALF stand for Python CALcyanin Finder.

Benzerara, K., Duprat, E., Bitard-Feildel, T., Caumes, G., Cassier-Chauvat, C., Chauvat, F., ... & Callebaut, I. (2022). A new gene family diagnostic for intracellular biomineralization of amorphous Ca carbonates by cyanobacteria. Genome Biology and Evolution, 14(3), evac026.



Table of Contents :



Calcyanin detection :

The ccyA gene is searched at the protein level following a simple decision tree based on the specific composition of the C-ter extremity of this protein. We use a weighted HMM profile specific of the glycine zipper triplication (aka GlyX3) to detect sequences with a putative glycine triplication. Sequences with at least one hit against this profile are annotated with three HMM profiles specific of each Glycine zipper : Gly1, Gly2 and Gly3. A set of known calcyanins is used to infere the type of the N-ter extremity (X, Z, Y, CoBaHMA or ?). Finally, a flag is assign to each sequence depending on its N-terminus type and its C-terminus modular organization.



Decision tree :

decision_tree

Installation :

mamba create -n pcalf -c k2sohigh pcalf;

or

pip3 install pcalf;

Dependencies :

python==3.9
pyhmmer==0.7.1
biopython==1.81
numpy>=1.21
pyyaml>=6
snakemake==7.22
pandas==1.5.3
tqdm==4.64.1
plotly==5.11.0
python-igraph==0.10.4

External dependency :

blast



Usage :

Pcalf is composed of several commands :


pcalf :

This command can be used to look quickly for the presence of calcyanin in a set of amino acid sequences. It take one or more fasta files as input and output several files including a summary, a list of features and a list of raw hits produced during the search. It also produce updated HMMs and updated MSAs with newly detected calcyanin tagged as Calcyanin with known N-ter.

pcalf -i proteins.fasta 
      -o pcalf_results 
      --iterative-search
      --iterative-update 
      --gly1-msa custom_gly1_msa.fasta

--iterative-search : If True, the search will be performed with profiles produced by the previous iteration until there is no new sequence detected or if --max-iteration is reached.

--iterative-update : True calcyanins (if any) will be added to the HMMs profiles iteratively starting with the best sequence (based on feature' score).

--gly1-msa : Use another MSA instead of the default one. A HMM profile will be built from the given MSA and Glycine weight will be increased.

Thresholds:

By default, coverage and E-value threshold are infered from the HMM profiles that will be used for the search. To define the thresholds, a MSA is converted into a simple fasta file by deleting all gaps and aligned against its own HMM profile. A soft filtering is made using a coverage threshold of 0.5. The maximum E-values ​​and the minimum coverage values ​​are used to define the thresholds as follows : max(E-value)*10, min(coverage)-0.1


pcalf-datasets-workflow :

pcalf-datasets-workflow can be used to retrieve genomes from NCBI databases such as RefSeq and GenBank based on accession (e.g GCA_012769535.1) or TaxID.

Genomes and annotations (genes and CDS) will be downloaded using the new command line tools from NCBI. If annotations does not exists for a genome, then genes and CDS will be predicted using Prodigal.

A yaml file is also produced and can be used as input for pcalf-annotate-workflow.

GCA_012769535.1:
  genome : /path/to/genome.gz
  cds_faa: /path/to/cds_faa.gz
  cds_fna: /path/to/cds_fna.gz
GCF_012769535.1:
...

example :

pcalf-datasets-workflow -t 1117 
                        -o pcalf_datasets_results 
                        -a file_with_accession.txt 
                        -e file_with_accession_to_ignore.txt

The command above will download all cyanobacteria genome (taxid : 1117) and genomes specified in file_with_accession.txt. If any of them are specified in file_with_accession_to_ignore.txt, then they will be ignored.


pcalf-annotate-workflow :

pcalf-annotate-workflow is actually the workflow that will help you recover calcyanin proteins and ccyA genes from a set of genomes. This workflow is composed of multiple steps :

  • genome taxonomic classification using GTDB-TK V2
  • genome quality assessment using CheckM
  • calcyanin detection using pcalf
  • calcyanin / ccyA linking using a specific python script.
  • NCBI metadatas recovery using NCBI

Note, that GTDB-TK and checkM requires external databases, respectively GTDB and CheckM datas. In addition, it's advised to run GTDB-TK and CheckM on a computer cluster. Because pcalf-annotate-workflow rely on snakemake you can easily provide a snakemake profile through the --snakargs option to run it on your favorite cluster. On the other hand, you can skip the genome taxonomic classification and the quality assessment with the --quick flag.

pcalf-annotate-workflow take as input a yaml file with a specific format, see pcalf-datasets-workflow for details.

example :

pcalf-annotate-workflow -i input_file.yaml 
                        -o pcalf_annotate_results 
                        --db db_file_from_another_run.sqlite3 
                        --snakargs "--profile my_slurm_profile --use-conda --jobs 50" 
                        --gtdb my_gtdb_directory 
                        --checkm my_checkm_datas_directory

The command above will process all the genome specified in input_file.yaml through the pcalf-annotate-workflow including checkm and gtdb-tk steps. The sqlite3 file produced will be merged with db_file_from_another_run.sqlite3. The workflow will be ran on your computer cluster with 50 jobs at a time. See snakemake documentation for details about cluster execution.

Several output files for each step will be produced but the final output is a sqlite3 database storing multiple tables:

- genome          # NCBI metadatas
- gtdbtk          # GTDB-TK classification results
- checkm          # Checkm Results 
- summary         # PCALF summary table
- features        # PCALF features table
- hits            # PCALF hits table
- ccyA            # ccyA table
- gly1            # Gly1 MSA
- gly2            # Gly2 MSA
- gly3            # Gly3 MSA
- glyx3           # Glyx3 MSA
- nterdb          # N-ter table

You can use the sqlite3 database from another run as a basis for a new one. In this case, MSAs stored in the sqlite3 file will be used to generate HMM profiles for pcalf.


pcalf-report :

This command produce an HTML report from a sqlite3 database given by pcalf annotate workflow.

example :

pcalf-report --db sqlite3_file.sqlite3 --out report.html



Workflow :

Workflow description

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pcalf-1.2.6.tar.gz (356.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pcalf-1.2.6-py3-none-any.whl (385.6 kB view details)

Uploaded Python 3

File details

Details for the file pcalf-1.2.6.tar.gz.

File metadata

  • Download URL: pcalf-1.2.6.tar.gz
  • Upload date:
  • Size: 356.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for pcalf-1.2.6.tar.gz
Algorithm Hash digest
SHA256 54e7c3d733cdefe78da1d35bdc7b5c7d420585a59dd00e4afc3f37d9429a2e4d
MD5 cef483fb14f67a3d25184fb93a57bd43
BLAKE2b-256 04f482034f2d672f5c4110cdd5f56ea0f9360a336b9388ce87375ba46d049fa2

See more details on using hashes here.

File details

Details for the file pcalf-1.2.6-py3-none-any.whl.

File metadata

  • Download URL: pcalf-1.2.6-py3-none-any.whl
  • Upload date:
  • Size: 385.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for pcalf-1.2.6-py3-none-any.whl
Algorithm Hash digest
SHA256 c15649ecc4d8ecb4e46b59d54631a57aeb05bbc88a0b1487261440cbe356f86a
MD5 0af139bed16ee7eb77d7f2224c49ace8
BLAKE2b-256 928a588a0d518fc9d09d4f143e1a0fee08c78d3604a84559f0793c303cd6c3b9

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page