PyCodon Analyzer (pycodon_analyzer)

A Python tool for comprehensive codon usage analysis and gene alignment extraction.

Overview

pycodon_analyzer is a command-line tool with two main functionalities:

extract: Extracts individual gene alignments from a whole genome multiple sequence alignment (MSA) based on a reference annotation file.
analyze: Performs codon usage analysis on a directory of pre-extracted and aligned gene FASTA files. This can be enhanced by providing an optional metadata file to generate more contextual and stratified analyses.

The analyze command calculates a wide range of codon usage indices and sequence properties for each gene, as well as for concatenated sequences ("complete" coding sequence per original genome ID). It aggregates results, performs Correspondence Analysis (CA), computes statistics, and generates various plots. When metadata is provided, it can generate additional sets of plots colored by specified metadata categories.

The tool leverages multiprocessing for faster analysis and uses Python's standard logging module (enhanced with rich for better console output and progress bars) for informative feedback.

Features

Common Features

Logging: Provides informative console output using Python's standard logging module, enhanced with rich for better readability and progress bars (use -v for DEBUG level).
Code Quality: Includes type hints, structured error handling, and refactored modules for better maintainability.
Robust Input Handling: Improved parsing for reference files (delimiter detection) and more resilient processing pipelines.

`extract` Subcommand

Input:
- Whole Genome Multiple Sequence Alignment (FASTA format).
- Reference Annotation File: Currently supports a multi-FASTA format where sequence headers contain GenBank-style feature tags like [gene=GENE_NAME] or [locus_tag=LOCUS_TAG] and [location=START..END].
- ID of the reference sequence within the alignment.
Processing:
- Parses gene coordinates (start, end, strand) from the annotation file.
- Maps these ungapped coordinates to the gapped positions in the aligned reference sequence.
- Extracts the alignment columns corresponding to each gene for all sequences in the MSA.
- Handles reverse-complementation for genes on the negative strand.
Output:
- Writes a separate FASTA alignment file for each successfully extracted gene, in the format gene_GENENAME.fasta. These files are suitable for direct input into the analyze subcommand.
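
The coordinate mapping described under Processing can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function names, the in-memory `msa` dictionary, and the 1-based inclusive coordinate convention are all assumptions made for the example.

```python
from __future__ import annotations

# Translation table for complementing DNA; gaps and other symbols pass through.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def ungapped_to_gapped(aligned_ref: str, gap: str = "-") -> list[int]:
    """For each ungapped reference position, return its column in the alignment."""
    return [col for col, base in enumerate(aligned_ref) if base != gap]


def reverse_complement(seq: str) -> str:
    """Reverse-complement a (possibly gapped) DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_gene(msa: dict[str, str], ref_id: str,
                 start: int, end: int, strand: int = 1) -> dict[str, str]:
    """Slice the alignment columns spanning one gene out of every sequence.

    start and end are assumed to be 1-based, inclusive, ungapped
    coordinates on the reference sequence.
    """
    col_of = ungapped_to_gapped(msa[ref_id])
    first_col, last_col = col_of[start - 1], col_of[end - 1]
    gene = {seq_id: seq[first_col:last_col + 1] for seq_id, seq in msa.items()}
    if strand < 0:
        gene = {seq_id: reverse_complement(seq) for seq_id, seq in gene.items()}
    return gene


# Toy alignment: the reference has gaps at columns 2, 7 and 8.
msa = {
    "ref": "AT-GAAA--TAA",
    "s1": "ATCGAAACCTAA",
}
forward = extract_gene(msa, "ref", 3, 7)
reverse = extract_gene(msa, "ref", 3, 7, strand=-1)
```

Because the slice is taken in alignment columns, every sequence keeps the same length and stays aligned, including any gaps that fall inside the gene.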

`analyze` Subcommand

Input:
- Reads pre-extracted gene alignments from multiple FASTA files within a directory (requires gene_GENENAME.fasta naming convention).
- Optional Metadata: Accepts a CSV/TSV file (--metadata) to associate sequences with external categorical or numerical data. Sequence identifiers in the metadata (--metadata_id_col) must match original FASTA sequence IDs.
Sequence Cleaning: Performs robust cleaning before analysis:
- Removes gap characters (-).
- Validates sequence length (multiple of 3 after gap removal).
- Conditionally removes standard START (ATG) and STOP (TAA, TAG, TGA) codons.
- Replaces IUPAC ambiguous DNA characters with 'N'.
- Filters sequences exceeding a defined ambiguity threshold (default: 15% 'N', configurable).
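
As a rough sketch of the cleaning steps listed above (an illustration only; the function name is hypothetical, and it assumes START and STOP codons are trimmed whenever they are present, which may differ from the tool's actual conditions):

```python
from __future__ import annotations

START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def clean_sequence(seq: str, max_ambiguity: float = 15.0) -> str | None:
    """Return the cleaned coding sequence, or None if it should be discarded."""
    seq = seq.upper().replace("-", "")        # remove gap characters
    if not seq or len(seq) % 3 != 0:          # length must be a multiple of 3
        return None
    if seq[:3] == START_CODON:                # trim a leading START codon
        seq = seq[3:]
    if seq[-3:] in STOP_CODONS:               # trim a trailing STOP codon
        seq = seq[:-3]
    # Replace every non-ACGT character (IUPAC ambiguity codes) with 'N'.
    seq = "".join(base if base in "ACGT" else "N" for base in seq)
    if not seq:
        return None
    n_percent = 100.0 * seq.count("N") / len(seq)
    return seq if n_percent <= max_ambiguity else None
```

For example, `clean_sequence("ATG-GCT---GCATAA")` yields `"GCTGCA"`, while a sequence whose ungapped length is not a multiple of 3 is rejected.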
Calculated Metrics: Computes the following for each gene and for concatenated "complete" sequences:
- Codon Counts & Frequencies (aggregate and per-sequence).
- RSCU (Relative Synonymous Codon Usage) (aggregate and per-sequence for CA input).
- GC Content: Overall GC%, GC1, GC2, GC3, GC12.
- ENC (Effective Number of Codons).
- CAI (Codon Adaptation Index) - Requires reference file (--ref).
- RCDI (Relative Codon Deoptimization Index) - Requires reference file (--ref).
- Fop (Frequency of Optimal Codons) - Requires reference file (--ref).
- Protein Properties: GRAVY (Grand Average of Hydropathicity) & Aromaticity %.
- Nucleotide & Dinucleotide Frequencies (aggregate per gene/complete set, and per-sequence for metadata-driven dinucleotide plots).
- Relative Dinucleotide Abundance (O/E ratios, aggregate per gene/complete set, and per-sequence for metadata-driven plots).
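
To make one of these metrics concrete, RSCU is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below uses the standard genetic code and hypothetical names; how the tool treats single-codon amino acids or absent amino acids is not stated here, so those choices are assumptions.

```python
from __future__ import annotations

from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group sense codons by the amino acid they encode.
SYNONYMOUS: dict[str, list[str]] = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    if _aa != "*":
        SYNONYMOUS[_aa].append(_codon)


def rscu(seq: str) -> dict[str, float]:
    """Compute RSCU for a cleaned, in-frame coding sequence.

    Codons containing 'N' and stop codons are ignored. Amino acids that
    do not occur in the sequence are left out of the result (assumption).
    """
    usable = len(seq) - len(seq) % 3
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")
    values: dict[str, float] = {}
    for family in SYNONYMOUS.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue
        expected = total / len(family)  # count under uniform synonymous usage
        for codon in family:
            values[codon] = counts[codon] / expected
    return values


values = rscu("GCTGCTGCCAAA")  # Ala x3 (GCT, GCT, GCC) and Lys x1 (AAA)
```

With this definition, a value of 1.0 means a codon is used exactly as often as expected under uniform usage, and single-codon amino acids (Met, Trp) always score 1.0 when present.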
Metadata Integration:
- Merges provided metadata with per-sequence analysis results.
- Allows for generation of plots colored by a specified metadata column (--color_by_metadata).
- Limits the number of categories shown in metadata-colored plots (--metadata_max_categories, default 15, others grouped as "Other").
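
One plausible way to cap the number of plotted categories is to keep the most frequent ones and relabel the rest. This is only a sketch with a hypothetical function name; whether the tool ranks categories by frequency, and whether "Other" counts toward the limit, are assumptions.

```python
from __future__ import annotations

from collections import Counter


def limit_categories(labels, max_categories: int = 15,
                     other_label: str = "Other") -> list[str]:
    """Keep the most frequent categories and relabel the remainder."""
    labels = list(labels)
    counts = Counter(labels)
    if len(counts) <= max_categories:
        return labels
    keep = {category for category, _ in counts.most_common(max_categories)}
    return [label if label in keep else other_label for label in labels]


clades = ["A", "A", "A", "B", "B", "C", "D"]
limited = limit_categories(clades, max_categories=2)
```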
Statistical Analysis:
- Performs Kruskal-Wallis H-test (default) or ANOVA to compare key metrics between different genes.
Multivariate Analysis:
- Performs Correspondence Analysis (CA) on combined RSCU data from all genes.
- Performs Correspondence Analysis (CA) on RSCU data for each gene individually if metadata-based coloring is requested.
- Generates a correlation heatmap between CA axes (Dim1, Dim2 of combined CA) and other features.
Output Tables (CSV): (All tables are linked and viewable within the HTML report)
- per_sequence_metrics_all_genes.csv: Comprehensive metrics for every valid sequence, including merged metadata if provided.
- mean_features_per_gene.csv: Average values for key metrics per gene.
- gene_sequence_summary.csv: Summary of sequence counts and lengths per gene.
- per_sequence_rscu_wide.csv: RSCU value for every codon for every sequence (input for combined CA).
- gene_comparison_stats.csv: Results of statistical tests between genes.
- ca_row_coordinates.csv, ca_col_coordinates.csv, ca_col_contributions.csv, ca_eigenvalues.csv: Detailed results from the combined CA.
- ca_axes_vs_metadata_correlation.csv: (If implemented) Correlations between CA axes and numerical metadata.
Output Plots (Default and Metadata-Driven): (All plots are embedded within the HTML report)
- Standard Combined Plots (in main output directory):
  - RSCU_boxplot_GENENAME.(fmt): RSCU distribution per codon for each gene/complete set.
  - gc_means_barplot_by_Gene.(fmt): Mean GC values grouped by gene.
  - neutrality_plot_grouped_by_Gene.(fmt): GC12 vs GC3, colored by gene.
  - enc_vs_gc3_plot_grouped_by_Gene.(fmt): ENC vs GC3, colored by gene, with Wright's curve.
  - relative_dinucleotide_abundance.(fmt): Aggregate O/E ratio for dinucleotides, lines colored by gene.
  - ca_biplot_compXvY_combined_by_gene.(fmt): Combined CA biplot, points colored by gene.
  - CA diagnostics: ca_variance_explained_topN.(fmt), ca_contribution_dimX_topN.(fmt).
  - feature_correlation_heatmap_METHOD.(fmt): Correlation between calculated metrics.
  - ca_axes_feature_corr_METHOD.(fmt): Correlation between combined CA axes and other features.
- Per-Gene Plots Colored by Metadata (if --color_by_metadata <COL> is used):
  - Saved in: output_dir/images/<METADATA_COL>_per_gene_plots/<GENE_NAME>/
  - For each gene (and "complete" set):
    - enc_vs_gc3_plot_<GENE>_by_<META_COL>.(fmt): Sequences colored by metadata categories.
    - neutrality_plot_<GENE>_by_<META_COL>.(fmt): Sequences colored by metadata categories.
    - ca_biplot_compXvY_<GENE>_by_<META_COL>.(fmt): CA on sequences of this gene only, points colored by metadata.
    - dinucl_abundance_<GENE>_by_<META_COL>.(fmt): Mean per-sequence dinucleotide O/E ratios, lines colored by metadata categories.
Performance: Uses multiprocessing to process gene files in parallel.
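
For reference, the Wright's curve drawn on the ENC vs GC3 plots is the ENC expected when codon usage is shaped by GC3 composition alone. A minimal sketch of the standard formula follows; the function name is hypothetical, and GC3 is taken as a fraction between 0 and 1 rather than a percentage.

```python
def wright_expected_enc(gc3: float) -> float:
    """Expected ENC under mutational bias only (Wright, 1990).

    gc3 is the GC fraction at third codon positions, in [0, 1].
    """
    if not 0.0 <= gc3 <= 1.0:
        raise ValueError("gc3 must be a fraction between 0 and 1")
    return 2.0 + gc3 + 29.0 / (gc3 ** 2 + (1.0 - gc3) ** 2)


# Points along the curve, e.g. for overlaying on a scatter plot.
curve = [(step / 100, wright_expected_enc(step / 100)) for step in range(101)]
```

Genes that fall well below this curve use fewer codons than GC3 composition alone would predict, which is usually read as a sign of additional constraints such as selection.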

Prerequisites

Python 3.8 or higher.
Git (for cloning).

Dependencies

The tool requires the following Python libraries:

biopython >= 1.79
pandas >= 1.3.0
matplotlib >= 3.4.0
seaborn >= 0.11.0
numpy >= 1.21.0
scipy >= 1.6.0
prince >= 0.12.1
scikit-learn >= 1.0 (Used by prince for CA functionalities)
adjustText >= 0.8 (Recommended for better label placement in plots)
rich >= 10.0 (For enhanced console logging and progress bars)
importlib-resources >= 1.0 ; python_version<"3.9" (For package data access in older Python versions)

These are installed automatically when you run `pip install .`.

Installation

Clone the repository:

```
git clone https://github.com/GabrielFalque/pycodon_analyzer.git
cd pycodon_analyzer
```

(Recommended) Create and activate a virtual environment:

```
# Linux/macOS
python3 -m venv venv
source venv/bin/activate

# Windows (cmd/powershell)
python -m venv .venv
.\.venv\Scripts\activate
```

Install the tool and its dependencies:
```
pip install .
```
(Optional) Install development dependencies (for running tests, linting, building):
```
pip install -e ".[dev]"
```
(Note: The -e installs in editable mode, useful for development. Quotes around .[dev] can be necessary in some shells like zsh.)

Usage

pycodon_analyzer operates through two subcommands: extract and analyze. To show the help for a specific subcommand, run:

```
pycodon_analyzer <subcommand> --help
```

1. `extract` Subcommand

Use this command to extract individual gene alignments from a whole genome multiple sequence alignment (MSA) based on a reference annotation file.

Synopsis:

```
pycodon_analyzer extract --annotations <PATH_TO_ANNOTATION_FILE> \
                         --alignment <PATH_TO_MSA_FILE> \
                         --ref_id <REFERENCE_SEQUENCE_ID> \
                         --output_dir <OUTPUT_DIRECTORY>
```

Key Arguments for extract:

-a, --annotations FILE: Path to the reference gene annotation file (Required). Expected format is multi-FASTA where sequence headers contain [gene=NAME] or [locus_tag=TAG] and [location=START..END] tags.
-g, --alignment FILE: Path to the whole genome multiple sequence alignment file (FASTA format) (Required).
-r, --ref_id ID: Sequence ID of the reference genome as it appears in the alignment file (Required). This sequence is used for coordinate mapping.
-o, --output_dir DIR: Output directory where extracted gene alignment FASTA files (e.g., gene_GENENAME.fasta) will be saved (Required).
-v, --verbose: Increase output verbosity to DEBUG level for console and file logs.
(Run pycodon_analyzer extract --help for all options.)
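
To illustrate the expected header format, the bracketed tags can be pulled out with a regular expression. This is a hedged sketch, not the tool's parser: the function name is hypothetical, and treating a `complement(...)` wrapper as the negative-strand marker is an assumption based on GenBank-style location strings. Joined or multi-part locations are not handled; only the first START..END range is read.

```python
from __future__ import annotations

import re

TAG_RE = re.compile(r"\[(\w+)=([^\]]+)\]")
RANGE_RE = re.compile(r"(\d+)\.\.(\d+)")


def parse_annotation_header(header: str) -> dict | None:
    """Extract gene name, coordinates and strand from one FASTA header."""
    tags = dict(TAG_RE.findall(header))
    name = tags.get("gene") or tags.get("locus_tag")  # prefer [gene=...]
    location = tags.get("location", "")
    match = RANGE_RE.search(location)
    if name is None or match is None:
        return None  # header lacks the tags needed for extraction
    strand = -1 if location.startswith("complement") else 1  # assumption
    return {
        "name": name,
        "start": int(match.group(1)),
        "end": int(match.group(2)),
        "strand": strand,
    }


# Illustrative headers, not taken from a real annotation file.
plus = parse_annotation_header(">cds_1 [gene=geneA] [locus_tag=X0001] [location=190..255]")
minus = parse_annotation_header(">cds_2 [locus_tag=X0005] [location=complement(5683..6459)]")
```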

Example for extract:

```
# Extract gene alignments from a whole genome MSA
pycodon_analyzer extract \
    -a my_annotations.fasta \
    -g whole_genome_alignment.fasta \
    -r NC_000913.3 \
    -o extracted_gene_alignments \
    -v
```

2. `analyze` Subcommand

Use this command to perform codon usage analysis on a directory of pre-extracted gene alignment files.

Synopsis:

```
pycodon_analyzer analyze --directory <PATH_TO_GENE_FASTA_DIR> \
                         --output <RESULTS_OUTPUT_DIR> \
                         [OPTIONS]
```

Key Arguments for analyze:

-d, --directory DIR: Path to the input directory containing gene_GENENAME.fasta files (Required).
-o, --output DIR: Path to the output directory for analysis results (Default: codon_analysis_results).
--ref FILE | human | none: Path to codon usage reference table. (Default: 'human' using bundled file).
--ref_delimiter DELIM: Delimiter for the reference file (e.g., ',', '\t'). Auto-detects if not provided.
-t, --threads INT: Number of processes for parallel gene file analysis (Default: 1; use 0 or a negative value to use all available cores).
--max_ambiguity FLOAT: Max allowed 'N' percentage per sequence (0-100, Default: 15.0).
--metadata FILE: Optional path to a metadata file (CSV or TSV).
--metadata_id_col NAME: Column name in metadata for sequence IDs (Default: "seq_id").
--metadata_delimiter DELIM: Optional delimiter for metadata file. Auto-detects if not provided.
--color_by_metadata NAME: Metadata column to use for coloring per-gene plots.
--metadata_max_categories INT: Max metadata categories for plotting (Default: 15).
--plot_formats FMT [FMT ...]: Output plot format(s) (Default: png; choices: png, svg, pdf, jpg).
--skip_plots: Flag to disable all plot generation.
--skip_ca: Flag to disable combined Correspondence Analysis.
--ca_dims X Y: Indices for CA components in combined plot (Default: 0 1).
-v, --verbose: Increase output verbosity (DEBUG level logging).
--no-html-report: Flag to disable the generation of the comprehensive HTML report. (Default: report is generated).
(Run pycodon_analyzer analyze --help for all options.)
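
The --threads convention (0 or a negative value meaning all cores) could be resolved along these lines. The helper name is hypothetical and this is only a sketch of the documented behavior, not the tool's implementation.

```python
import os


def resolve_process_count(threads: int = 1) -> int:
    """Translate the --threads value into a concrete process count."""
    if threads <= 0:
        # os.cpu_count() may return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    return threads
```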

Example for analyze:

```
# Basic analysis using 4 cores and human reference
pycodon_analyzer analyze \
    -d extracted_gene_alignments/ \
    -o codon_analysis_output \
    --ref human \
    -t 4 \
    -v

# Analysis with metadata, generating per-gene plots colored by 'Clade'
pycodon_analyzer analyze \
    -d viral_genes/ \
    -o viral_analysis_with_clades \
    --metadata virus_metadata.csv \
    --metadata_id_col Sequence_Accession \
    --color_by_metadata Clade \
    -t 0 \
    -v
```

Output Directory Structure

When running the analyze command, results are organized into a dedicated output directory (specified by -o or --output). This directory will contain:

- report.html (unless disabled by --no-html-report): The main interactive HTML report. It provides an overview, the run parameters, and navigation to all detailed sections, with plots embedded and data tables linked.
- data/ (subdirectory): Contains all data tables generated during the analysis (primarily CSV format). Key files include:
  - per_sequence_metrics_all_genes.csv: Comprehensive metrics for each valid sequence. If metadata is provided, it is merged here.
  - mean_features_per_gene.csv: Average values for key metrics per gene.
  - gene_sequence_summary.csv: Sequence counts and length statistics per gene.
  - gene_comparison_stats.csv: Results of statistical tests between genes.
  - per_sequence_rscu_wide.csv: RSCU values (wide format) for combined CA.
  - ca_*.csv: Files related to combined CA (coordinates, contributions, eigenvalues).
- images/ (subdirectory, unless plots are skipped with --skip_plots): Contains all generated plot images.
  - Combined Plots: Directly in images/ (e.g., overall GC, ENC vs GC3, combined CA).
  - Per-Gene RSCU Boxplots: e.g., RSCU_boxplot_GENENAME.<fmt>.
  - Metadata-Specific Plots (if --color_by_metadata is used): Organized into images/<METADATA_COLUMN_NAME>_per_gene_plots/<GENE_NAME>/. Includes ENC vs GC3, neutrality, CA biplot, and dinucleotide abundance plots, all colored by metadata categories.
- html/ (subdirectory, if the HTML report is generated): Contains the secondary HTML pages that make up the interactive report.
- pycodon_analyzer.log (or the custom name given by --log-file): The detailed log file for the analysis run, located in the main output directory.
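
For downstream work, the tables in data/ can be inspected with a few lines of standard-library Python. The sketch below assumes only the layout described above (CSV files with a header row inside a data/ subdirectory); the function name is hypothetical, and no particular column names are assumed.

```python
from __future__ import annotations

import csv
from pathlib import Path


def summarize_tables(output_dir: str) -> dict[str, tuple[int, list[str]]]:
    """Map each CSV in <output_dir>/data to (row count, column names)."""
    summary: dict[str, tuple[int, list[str]]] = {}
    for path in sorted(Path(output_dir, "data").glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])        # first row holds column names
            n_rows = sum(1 for _ in reader)  # remaining rows are data
        summary[path.name] = (n_rows, header)
    return summary
```

Calling `summarize_tables("codon_analysis_results")` after a run gives a quick check that the expected tables were written and are not empty.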

Workflow Example

Prepare Annotations: Ensure your reference annotation file is in the expected multi-FASTA format with [gene=...] and [location=...] tags, or adapt the extraction.py module to parse your format (e.g., GenBank, GFF3).

Extract Gene Alignments:

```
pycodon_analyzer extract -a my_ref_annotations.gb.fasta -g my_genome_msa.fasta -r ref_genome_id -o ./gene_alignments
```

Run Codon Analysis:

```
pycodon_analyzer analyze -d ./gene_alignments -o ./codon_analysis_results --ref human -t 0 -v
```

Reference File Format (`--ref` for `analyze`)

Required for CAI, Fop, RCDI calculations. Should be CSV or TSV with columns for 'Codon' and one of 'Frequency', 'Count', 'RSCU', 'Freq', or 'Frequency (per thousand)'. The tool prioritizes finding an 'RSCU' column if present. For meaningful CAI/RCDI interpretation, using a reference set based on highly expressed genes of the target organism is recommended.
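
As an illustration of this format, the sketch below reads a small reference table, guesses its delimiter from the header row, and picks the value column with 'RSCU' taking priority. The function names are hypothetical, and the ordering of the columns after 'RSCU' is an assumption; only the 'RSCU' priority is stated above.

```python
from __future__ import annotations

import csv

# 'RSCU' first, as documented; the order of the remaining names is assumed.
VALUE_COLUMNS = ["RSCU", "Frequency", "Freq", "Count", "Frequency (per thousand)"]


def detect_delimiter(header_line: str, candidates=("\t", ",", ";")) -> str:
    """Pick the candidate delimiter that occurs most often in the header."""
    return max(candidates, key=header_line.count)


def read_reference(text: str) -> tuple[str, dict[str, float]]:
    """Return (value column used, {codon: value}) from reference table text."""
    lines = text.strip().splitlines()
    reader = csv.DictReader(lines, delimiter=detect_delimiter(lines[0]))
    fieldnames = reader.fieldnames or []
    if "Codon" not in fieldnames:
        raise ValueError("reference table needs a 'Codon' column")
    value_col = next((c for c in VALUE_COLUMNS if c in fieldnames), None)
    if value_col is None:
        raise ValueError("no recognized value column found")
    table = {row["Codon"].upper(): float(row[value_col]) for row in reader}
    return value_col, table


tsv_text = "Codon\tCount\tRSCU\nGCT\t10\t1.5\nGCC\t5\t0.5\n"
column, table = read_reference(tsv_text)
```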

Development

Running Tests:

```
pip install -e ".[dev]"  # Ensure dev dependencies like pytest are installed
pytest
```

Type Checking:
```
mypy src
```

Linting/Formatting (Example using Ruff):

```
ruff check src tests
ruff format src tests
```

Building:
```
python -m build
```

TODO / Future Improvements

extract subcommand:
- Add support for standard GenBank and GFF3/GTF annotation file formats.
- Option to directly process unaligned CDS files (perform alignment if requested).
analyze subcommand:
- Implement tAI calculation (requires tRNA data input).
- Further metadata integration:
  - Allow statistical comparisons grouped by metadata categories.
  - Correlate CA axes with numerical metadata columns.
  - Option to filter analysis based on metadata.
- Add more statistical comparison options (e.g., pairwise tests).
General:
- Implement CI/CD pipeline (e.g., GitHub Actions).
- Consider interactive plots (Plotly/Bokeh).
- Publish package to PyPI.

Project details

Version 1.0.1, released Jun 10, 2025.
