Skip to main content

A Python tool for comprehensive codon usage analysis and gene alignment extraction.

Project description

PyCodon Analyzer (pycodon_analyzer)

A Python tool for comprehensive codon usage analysis and gene alignment extraction.

PyPI version License: MIT

Overview

pycodon_analyzer is a command-line tool with two main functionalities:

  1. extract: Extracts individual gene alignments from a whole genome multiple sequence alignment (MSA) based on a reference annotation file.
  2. analyze: Performs codon usage analysis on a directory of pre-extracted and aligned gene FASTA files. This can be enhanced by providing an optional metadata file to generate more contextual and stratified analyses.

The analyze command calculates a wide range of codon usage indices and sequence properties for each gene, as well as for concatenated sequences ("complete" coding sequence per original genome ID). It aggregates results, performs Correspondence Analysis (CA), computes statistics, and generates various plots. When metadata is provided, it can generate additional sets of plots colored by specified metadata categories.

The tool leverages multiprocessing for faster analysis and uses Python's standard logging module (enhanced with rich for better console output and progress bars) for informative feedback.

Features

Common Features

  • Logging: Provides informative console output using Python's standard logging module, enhanced with rich for better readability and progress bars (use -v for DEBUG level).
  • Code Quality: Includes type hints, structured error handling, and refactored modules for better maintainability.
  • Robust Input Handling: Improved parsing for reference files (delimiter detection) and more resilient processing pipelines.

extract Subcommand

  • Input:
    • Whole Genome Multiple Sequence Alignment (FASTA format).
    • Reference Annotation File: Currently supports a multi-FASTA format where sequence headers contain GenBank-style feature tags like [gene=GENE_NAME] or [locus_tag=LOCUS_TAG] and [location=START..END].
    • ID of the reference sequence within the alignment.
  • Processing:
    • Parses gene coordinates (start, end, strand) from the annotation file.
    • Maps these ungapped coordinates to the gapped positions in the aligned reference sequence.
    • Extracts the alignment columns corresponding to each gene for all sequences in the MSA.
    • Handles reverse-complementation for genes on the negative strand.
  • Output:
    • Writes a separate FASTA alignment file for each successfully extracted gene, in the format gene_GENENAME.fasta. These files are suitable for direct input into the analyze subcommand.

analyze Subcommand

  • Input:
    • Reads pre-extracted gene alignments from multiple FASTA files within a directory (requires gene_GENENAME.fasta naming convention).
    • Optional Metadata: Accepts a CSV/TSV file (--metadata) to associate sequences with external categorical or numerical data. Sequence identifiers in the metadata (--metadata_id_col) must match original FASTA sequence IDs.
  • Sequence Cleaning: Performs robust cleaning before analysis:
    • Removes gap characters (-).
    • Validates sequence length (multiple of 3 after gap removal).
    • Conditionally removes standard START (ATG) and STOP (TAA, TAG, TGA) codons.
    • Replaces IUPAC ambiguous DNA characters with 'N'.
    • Filters sequences exceeding a defined ambiguity threshold (default: 15% 'N', configurable).
  • Calculated Metrics: Computes the following for each gene and for concatenated "complete" sequences:
    • Codon Counts & Frequencies (aggregate and per-sequence).
    • RSCU (Relative Synonymous Codon Usage) (aggregate and per-sequence for CA input).
    • GC Content: Overall GC%, GC1, GC2, GC3, GC12.
    • ENC (Effective Number of Codons).
    • CAI (Codon Adaptation Index) - Requires reference file (--ref).
    • RCDI (Relative Codon Deoptimization Index) - Requires reference file (--ref).
    • Fop (Frequency of Optimal Codons) - Requires reference file (--ref).
    • Protein Properties: GRAVY (Grand Average of Hydropathicity) & Aromaticity %.
    • Nucleotide & Dinucleotide Frequencies (aggregate per gene/complete set, and per-sequence for metadata-driven dinucleotide plots).
    • Relative Dinucleotide Abundance (O/E ratios, aggregate per gene/complete set, and per-sequence for metadata-driven plots).
  • Metadata Integration:
    • Merges provided metadata with per-sequence analysis results.
    • Allows for generation of plots colored by a specified metadata column (--color_by_metadata).
    • Limits the number of categories shown in metadata-colored plots (--metadata_max_categories, default 15, others grouped as "Other").
  • Statistical Analysis:
    • Performs Kruskal-Wallis H-test (default) or ANOVA to compare key metrics between different genes.
  • Multivariate Analysis:
    • Performs Correspondence Analysis (CA) on combined RSCU data from all genes.
    • Performs Correspondence Analysis (CA) on RSCU data for each gene individually if metadata-based coloring is requested.
    • Generates a correlation heatmap between CA axes (Dim1, Dim2 of combined CA) and other features.
  • Output Tables (CSV): (All tables are linked and viewable within the HTML report)
    • per_sequence_metrics_all_genes.csv: Comprehensive metrics for every valid sequence, including merged metadata if provided.
    • mean_features_per_gene.csv: Average values for key metrics per gene.
    • gene_sequence_summary.csv: Summary of sequence counts and lengths per gene.
    • per_sequence_rscu_wide.csv: RSCU value for every codon for every sequence (input for combined CA).
    • gene_comparison_stats.csv: Results of statistical tests between genes.
    • ca_row_coordinates.csv, ca_col_coordinates.csv, ca_col_contributions.csv, ca_eigenvalues.csv: Detailed results from the combined CA.
    • ca_axes_vs_metadata_correlation.csv: (If implemented) Correlations between CA axes and numerical metadata.
  • Output Plots (Default and Metadata-Driven): (All plots are embedded within the HTML report)
    • Standard Combined Plots (in main output directory):
      • RSCU_boxplot_GENENAME.(fmt): RSCU distribution per codon for each gene/complete set.
      • gc_means_barplot_by_Gene.(fmt): Mean GC values grouped by gene.
      • neutrality_plot_grouped_by_Gene.(fmt): GC12 vs GC3, colored by gene.
      • enc_vs_gc3_plot_grouped_by_Gene.(fmt): ENC vs GC3, colored by gene, with Wright's curve.
      • relative_dinucleotide_abundance.(fmt): Aggregate O/E ratio for dinucleotides, lines colored by gene.
      • ca_biplot_compXvY_combined_by_gene.(fmt): Combined CA biplot, points colored by gene.
      • CA diagnostics: ca_variance_explained_topN.(fmt), ca_contribution_dimX_topN.(fmt).
      • feature_correlation_heatmap_METHOD.(fmt): Correlation between calculated metrics.
      • ca_axes_feature_corr_METHOD.(fmt): Correlation between combined CA axes and other features.
    • Per-Gene Plots Colored by Metadata (if --color_by_metadata <COL> is used):
      • Saved in: output_dir/images/<METADATA_COL>_per_gene_plots/<GENE_NAME>/
      • For each gene (and "complete" set):
        • enc_vs_gc3_plot_<GENE>_by_<META_COL>.(fmt): Sequences colored by metadata categories.
        • neutrality_plot_<GENE>_by_<META_COL>.(fmt): Sequences colored by metadata categories.
        • ca_biplot_compXvY_<GENE>_by_<META_COL>.(fmt): CA on sequences of this gene only, points colored by metadata.
        • dinucl_abundance_<GENE>_by_<META_COL>.(fmt): Mean per-sequence dinucleotide O/E ratios, lines colored by metadata categories.
  • Performance: Uses multiprocessing to process gene files in parallel.

Prerequisites

  • Python 3.8 or higher.
  • Git (for cloning).

Dependencies

The tool requires the following Python libraries:

  • biopython >= 1.79
  • pandas >= 1.3.0
  • matplotlib >= 3.4.0
  • seaborn >= 0.11.0
  • numpy >= 1.21.0
  • scipy >= 1.6.0
  • prince >= 0.12.1
  • scikit-learn >= 1.0 (Used by prince for CA functionalities)
  • adjustText >= 0.8 (Recommended for better label placement in plots)
  • rich >= 10.0 (For enhanced console logging and progress bars)
  • importlib-resources >= 1.0 ; python_version<"3.9" (For package data access in older Python versions)

These will be installed automatically when using pip install ..

Installation

  1. Clone the repository:
    git clone https://github.com/GabrielFalque/pycodon_analyzer.git
    cd pycodon_analyzer
    
  2. (Recommended) Create and activate a virtual environment:
    # Linux/macOS
    python3 -m venv venv
    source venv/bin/activate
    
    # Windows (cmd/powershell)
    python -m venv .venv
    .\.venv\Scripts\activate
    
  3. Install the tool and its dependencies:
    pip install .
    
  4. (Optional) Install development dependencies (for running tests, linting, building):
    pip install -e ".[dev]"
    
    (Note: The -e installs in editable mode, useful for development. Quotes around .[dev] can be necessary in some shells like zsh.)

Usage

pycodon_analyzer now operates using subcommands: extract and analyze.

pycodon_analyzer <subcommand> --help

This will show help for the specific subcommand.

1. extract Subcommand

Use this command to extract individual gene alignments from a whole genome multiple sequence alignment (MSA) based on a reference annotation file.

Synopsis:

pycodon_analyzer extract --annotations <PATH_TO_ANNOTATION_FILE> \
                         --alignment <PATH_TO_MSA_FILE> \
                         --ref_id <REFERENCE_SEQUENCE_ID> \
                         --output_dir <OUTPUT_DIRECTORY>

Key Arguments for extract:

  • -a, --annotations FILE: Path to the reference gene annotation file (Required). Expected format is multi-FASTA where sequence headers contain [gene=NAME] or [locus_tag=TAG] and [location=START..END] tags.
  • -g, --alignment FILE: Path to the whole genome multiple sequence alignment file (FASTA format) (Required).
  • -r, --ref_id ID: Sequence ID of the reference genome as it appears in the alignment file (Required). This sequence is used for coordinate mapping.
  • -o, --output_dir DIR: Output directory where extracted gene alignment FASTA files (e.g., gene_GENENAME.fasta) will be saved (Required).
  • -v, --verbose: Increase output verbosity to DEBUG level for console and file logs.
  • (Run pycodon_analyzer extract --help for all options.)

Example for extract:

# Extract gene alignments from a whole genome MSA
pycodon_analyzer extract \
    -a my_annotations.fasta \
    -g whole_genome_alignment.fasta \
    -r NC_000913.3 \
    -o extracted_gene_alignments \
    -v

2. analyze Subcommand

Use this command to perform codon usage analysis on a directory of pre-extracted gene alignment files.

Synopsis:

pycodon_analyzer analyze --directory <PATH_TO_GENE_FASTA_DIR> \
                         --output <RESULTS_OUTPUT_DIR> \
                         [OPTIONS]

Key Arguments for analyze:

  • -d, --directory DIR: Path to the input directory containing gene_GENENAME.fasta files (Required).
  • -o, --output DIR: Path to the output directory for analysis results (Default: codon_analysis_results).
  • --ref FILE | human | none: Path to codon usage reference table. (Default: 'human' using bundled file).
  • --ref_delimiter DELIM: Delimiter for the reference file (e.g., ',', '\t'). Auto-detects if not provided.
  • -t, --threads INT: Number of processes for parallel gene file analysis (Default: 1, 0 or negative for all cores).
  • --max_ambiguity FLOAT: Max allowed 'N' percentage per sequence (0-100, Default: 15.0).
  • --metadata FILE: Optional path to a metadata file (CSV or TSV).
  • --metadata_id_col NAME: Column name in metadata for sequence IDs (Default: "seq_id").
  • --metadata_delimiter DELIM: Optional delimiter for metadata file. Auto-detects if not provided.
  • --color_by_metadata NAME: Metadata column to use for coloring per-gene plots.
  • --metadata_max_categories INT: Max metadata categories for plotting (Default: 15).
  • --plot_formats FMT [FMT ...]: Output plot format(s) (Default: png; choices: png, svg, pdf, jpg).
  • --skip_plots: Flag to disable all plot generation.
  • --skip_ca: Flag to disable combined Correspondence Analysis.
  • --ca_dims X Y: Indices for CA components in combined plot (Default: 0 1).
  • -v, --verbose: Increase output verbosity (DEBUG level logging).
  • --no-html-report: Flag to disable the generation of the comprehensive HTML report. (Default: report is generated).
  • (Run pycodon_analyzer analyze --help for all options.)

Example for analyze:

# Basic analysis using 4 cores and human reference
pycodon_analyzer analyze \
    -d extracted_gene_alignments/ \
    -o codon_analysis_output \
    --ref human \
    -t 4 \
    -v

# Analysis with metadata, generating per-gene plots colored by 'Clade'
pycodon_analyzer analyze \
    -d viral_genes/ \
    -o viral_analysis_with_clades \
    --metadata virus_metadata.csv \
    --metadata_id_col Sequence_Accession \
    --color_by_metadata Clade \
    -t 0 \
    -v

Output Directory Structure

When running the analyze command, results are organized into a dedicated output directory (specified by -o or --output). This directory will contain:

  1. report.html (if not disabled by --no-html-report) The main interactive HTML report. It provides an overview, run parameters, and navigation to all detailed sections, embedding plots and linking to data tables.

  2. data/ (Subdirectory) Contains all data tables generated during the analysis (primarily CSV format). Key files include:

    • per_sequence_metrics_all_genes.csv: Comprehensive metrics for each valid sequence. If metadata provided, it's merged here.
    • mean_features_per_gene.csv: Average values for key metrics per gene.
    • gene_sequence_summary.csv: Sequence counts and length statistics per gene.
    • gene_comparison_stats.csv: Results of statistical tests between genes.
    • per_sequence_rscu_wide.csv: RSCU values (wide format) for combined CA.
    • ca_*.csv: Files related to combined CA (coordinates, contributions, eigenvalues).
  3. images/ (Subdirectory, if plots not skipped by --skip-plots) Contains all plot images generated.

    • Combined Plots: Directly in images/ (e.g., overall GC, ENC vs GC3, combined CA).
    • Per-Gene RSCU Boxplots: e.g., RSCU_boxplot_GENENAME.<fmt>.
    • Metadata-Specific Plots (if --color_by_metadata is used): Organized into images/<METADATA_COLUMN_NAME>_per_gene_plots/<GENE_NAME>/. Includes ENC vs GC3, Neutrality, CA biplots, Dinucleotide abundance plots, all colored by metadata categories.
  4. html/ (Subdirectory, if HTML report generated) Contains secondary HTML pages that make up the interactive report.

  5. pycodon_analyzer.log (or custom name specified by --log-file) The detailed log file for this analysis run, located in the main output directory.

Workflow Example

  1. Prepare Annotations: Ensure your reference annotation file is in the expected multi-FASTA format with [gene=...] and [location=...] tags, or adapt the extraction.py module to parse your format (e.g., GenBank, GFF3).
  2. Extract Gene Alignments:
    pycodon_analyzer extract -a my_ref_annotations.gb.fasta -g my_genome_msa.fasta -r ref_genome_id -o ./gene_alignments
    
  3. Run Codon Analysis:
    pycodon_analyzer analyze -d ./gene_alignments -o ./codon_analysis_results --ref human -t 0 -v
    

Reference File Format (--ref for analyze)

Required for CAI, Fop, RCDI calculations. Should be CSV or TSV with columns for 'Codon' and one of 'Frequency', 'Count', 'RSCU', 'Freq', or 'Frequency (per thousand)'. The tool prioritizes finding an 'RSCU' column if present. For meaningful CAI/RCDI interpretation, using a reference set based on highly expressed genes of the target organism is recommended.

Development

  • Running Tests:
    pip install -e .[dev]  # Ensure dev dependencies like pytest are installed
    pytest
    
  • Type Checking:
    mypy src
    
  • Linting/Formatting (Example using Ruff):
    ruff check src tests
    ruff format src tests
    
  • Building:
    python -m build
    

TODO / Future Improvements

  • extract subcommand:
    • Add support for standard GenBank and GFF3/GTF annotation file formats.
    • Option to directly process unaligned CDS files (perform alignment if requested).
  • analyze subcommand:
    • Implement tAI calculation (requires tRNA data input).
    • Further metadata integration:
      • Allow statistical comparisons grouped by metadata categories.
      • Correlate CA axes with numerical metadata columns.
      • Option to filter analysis based on metadata.
    • Add more statistical comparison options (e.g., pairwise tests).
  • General:
    • Implement CI/CD pipeline (e.g., GitHub Actions).
    • Consider interactive plots (Plotly/Bokeh).
    • Publish package to PyPI.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pycodon_analyzer-1.0.1.tar.gz (160.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pycodon_analyzer-1.0.1-py3-none-any.whl (126.7 kB view details)

Uploaded Python 3

File details

Details for the file pycodon_analyzer-1.0.1.tar.gz.

File metadata

  • Download URL: pycodon_analyzer-1.0.1.tar.gz
  • Upload date:
  • Size: 160.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for pycodon_analyzer-1.0.1.tar.gz
Algorithm Hash digest
SHA256 9f55f2bea72a07af18dbe35ec9746050e77a05deee42e5945fc98b1776b85575
MD5 a802e597a93e577bd49803e51bbb8586
BLAKE2b-256 7dca0369888b0d735eb589e480c188abee77e7197f5ec9ff3a64b2741e13bdc4

See more details on using hashes here.

File details

Details for the file pycodon_analyzer-1.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for pycodon_analyzer-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 606897aaccff2432b1b4f9ab13db14d9b0e9da80d4810522d2edeb5f2f10acd2
MD5 929cb07b0e68371816cb1e5a12b3af78
BLAKE2b-256 81c15506fae367792c7b8e50f677d5f92de46acca24275cf03bbcf2aa5e0ee41

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page