A Python tool for comprehensive codon usage analysis and gene alignment extraction.
Project description
PyCodon Analyzer (pycodon_analyzer)
A Python tool for comprehensive codon usage analysis and gene alignment extraction.
Overview
pycodon_analyzer is a command-line tool with two main functionalities:
extract: Extracts individual gene alignments from a whole genome multiple sequence alignment (MSA) based on a reference annotation file.analyze: Performs codon usage analysis on a directory of pre-extracted and aligned gene FASTA files. This can be enhanced by providing an optional metadata file to generate more contextual and stratified analyses.
The analyze command calculates a wide range of codon usage indices and sequence properties for each gene, as well as for concatenated sequences ("complete" coding sequence per original genome ID). It aggregates results, performs Correspondence Analysis (CA), computes statistics, and generates various plots. When metadata is provided, it can generate additional sets of plots colored by specified metadata categories.
The tool leverages multiprocessing for faster analysis and uses Python's standard logging module (enhanced with rich for better console output and progress bars) for informative feedback.
Features
Common Features
- Logging: Provides informative console output using Python's standard
loggingmodule, enhanced withrichfor better readability and progress bars (use-vfor DEBUG level). - Code Quality: Includes type hints, structured error handling, and refactored modules for better maintainability.
- Robust Input Handling: Improved parsing for reference files (delimiter detection) and more resilient processing pipelines.
extract Subcommand
- Input:
- Whole Genome Multiple Sequence Alignment (FASTA format).
- Reference Annotation File: Currently supports a multi-FASTA format where sequence headers contain GenBank-style feature tags like
[gene=GENE_NAME]or[locus_tag=LOCUS_TAG]and[location=START..END]. - ID of the reference sequence within the alignment.
- Processing:
- Parses gene coordinates (start, end, strand) from the annotation file.
- Maps these ungapped coordinates to the gapped positions in the aligned reference sequence.
- Extracts the alignment columns corresponding to each gene for all sequences in the MSA.
- Handles reverse-complementation for genes on the negative strand.
- Output:
- Writes a separate FASTA alignment file for each successfully extracted gene, in the format
gene_GENENAME.fasta. These files are suitable for direct input into theanalyzesubcommand.
- Writes a separate FASTA alignment file for each successfully extracted gene, in the format
analyze Subcommand
- Input:
- Reads pre-extracted gene alignments from multiple FASTA files within a directory (requires
gene_GENENAME.fastanaming convention). - Optional Metadata: Accepts a CSV/TSV file (
--metadata) to associate sequences with external categorical or numerical data. Sequence identifiers in the metadata (--metadata_id_col) must match original FASTA sequence IDs.
- Reads pre-extracted gene alignments from multiple FASTA files within a directory (requires
- Sequence Cleaning: Performs robust cleaning before analysis:
- Removes gap characters (
-). - Validates sequence length (multiple of 3 after gap removal).
- Conditionally removes standard START (
ATG) and STOP (TAA,TAG,TGA) codons. - Replaces IUPAC ambiguous DNA characters with 'N'.
- Filters sequences exceeding a defined ambiguity threshold (default: 15% 'N', configurable).
- Removes gap characters (
- Calculated Metrics: Computes the following for each gene and for concatenated "complete" sequences:
- Codon Counts & Frequencies (aggregate and per-sequence).
- RSCU (Relative Synonymous Codon Usage) (aggregate and per-sequence for CA input).
- GC Content: Overall GC%, GC1, GC2, GC3, GC12.
- ENC (Effective Number of Codons).
- CAI (Codon Adaptation Index) - Requires reference file (
--ref). - RCDI (Relative Codon Deoptimization Index) - Requires reference file (
--ref). - Fop (Frequency of Optimal Codons) - Requires reference file (
--ref). - Protein Properties: GRAVY (Grand Average of Hydropathicity) & Aromaticity %.
- Nucleotide & Dinucleotide Frequencies (aggregate per gene/complete set, and per-sequence for metadata-driven dinucleotide plots).
- Relative Dinucleotide Abundance (O/E ratios, aggregate per gene/complete set, and per-sequence for metadata-driven plots).
- Metadata Integration:
- Merges provided metadata with per-sequence analysis results.
- Allows for generation of plots colored by a specified metadata column (
--color_by_metadata). - Limits the number of categories shown in metadata-colored plots (
--metadata_max_categories, default 15, others grouped as "Other").
- Statistical Analysis:
- Performs Kruskal-Wallis H-test (default) or ANOVA to compare key metrics between different genes.
- Multivariate Analysis:
- Performs Correspondence Analysis (CA) on combined RSCU data from all genes.
- Performs Correspondence Analysis (CA) on RSCU data for each gene individually if metadata-based coloring is requested.
- Generates a correlation heatmap between CA axes (Dim1, Dim2 of combined CA) and other features.
- Output Tables (CSV): (All tables are linked and viewable within the HTML report)
per_sequence_metrics_all_genes.csv: Comprehensive metrics for every valid sequence, including merged metadata if provided.mean_features_per_gene.csv: Average values for key metrics per gene.gene_sequence_summary.csv: Summary of sequence counts and lengths per gene.per_sequence_rscu_wide.csv: RSCU value for every codon for every sequence (input for combined CA).gene_comparison_stats.csv: Results of statistical tests between genes.ca_row_coordinates.csv,ca_col_coordinates.csv,ca_col_contributions.csv,ca_eigenvalues.csv: Detailed results from the combined CA.ca_axes_vs_metadata_correlation.csv: (If implemented) Correlations between CA axes and numerical metadata.
- Output Plots (Default and Metadata-Driven): (All plots are embedded within the HTML report)
- Standard Combined Plots (in main output directory):
RSCU_boxplot_GENENAME.(fmt): RSCU distribution per codon for each gene/complete set.gc_means_barplot_by_Gene.(fmt): Mean GC values grouped by gene.neutrality_plot_grouped_by_Gene.(fmt): GC12 vs GC3, colored by gene.enc_vs_gc3_plot_grouped_by_Gene.(fmt): ENC vs GC3, colored by gene, with Wright's curve.relative_dinucleotide_abundance.(fmt): Aggregate O/E ratio for dinucleotides, lines colored by gene.ca_biplot_compXvY_combined_by_gene.(fmt): Combined CA biplot, points colored by gene.- CA diagnostics:
ca_variance_explained_topN.(fmt),ca_contribution_dimX_topN.(fmt). feature_correlation_heatmap_METHOD.(fmt): Correlation between calculated metrics.ca_axes_feature_corr_METHOD.(fmt): Correlation between combined CA axes and other features.
- Per-Gene Plots Colored by Metadata (if
--color_by_metadata <COL>is used):- Saved in:
output_dir/images/<METADATA_COL>_per_gene_plots/<GENE_NAME>/ - For each gene (and "complete" set):
enc_vs_gc3_plot_<GENE>_by_<META_COL>.(fmt): Sequences colored by metadata categories.neutrality_plot_<GENE>_by_<META_COL>.(fmt): Sequences colored by metadata categories.ca_biplot_compXvY_<GENE>_by_<META_COL>.(fmt): CA on sequences of this gene only, points colored by metadata.dinucl_abundance_<GENE>_by_<META_COL>.(fmt): Mean per-sequence dinucleotide O/E ratios, lines colored by metadata categories.
- Saved in:
- Standard Combined Plots (in main output directory):
- Performance: Uses multiprocessing to process gene files in parallel.
Prerequisites
- Python 3.8 or higher.
- Git (for cloning).
Dependencies
The tool requires the following Python libraries:
biopython >= 1.79pandas >= 1.3.0matplotlib >= 3.4.0seaborn >= 0.11.0numpy >= 1.21.0scipy >= 1.6.0prince >= 0.12.1scikit-learn >= 1.0(Used byprincefor CA functionalities)adjustText >= 0.8(Recommended for better label placement in plots)rich >= 10.0(For enhanced console logging and progress bars)importlib-resources >= 1.0 ; python_version<"3.9"(For package data access in older Python versions)
These will be installed automatically when using pip install ..
Installation
- Clone the repository:
git clone https://github.com/GabrielFalque/pycodon_analyzer.git cd pycodon_analyzer
- (Recommended) Create and activate a virtual environment:
# Linux/macOS python3 -m venv venv source venv/bin/activate # Windows (cmd/powershell) python -m venv .venv .\.venv\Scripts\activate
- Install the tool and its dependencies:
pip install .
- (Optional) Install development dependencies (for running tests, linting, building):
pip install -e ".[dev]"
(Note: The-einstalls in editable mode, useful for development. Quotes around.[dev]can be necessary in some shells like zsh.)
Usage
pycodon_analyzer now operates using subcommands: extract and analyze.
pycodon_analyzer <subcommand> --help
This will show help for the specific subcommand.
1. extract Subcommand
Use this command to extract individual gene alignments from a whole genome multiple sequence alignment (MSA) based on a reference annotation file.
Synopsis:
pycodon_analyzer extract --annotations <PATH_TO_ANNOTATION_FILE> \
--alignment <PATH_TO_MSA_FILE> \
--ref_id <REFERENCE_SEQUENCE_ID> \
--output_dir <OUTPUT_DIRECTORY>
Key Arguments for extract:
-a, --annotations FILE: Path to the reference gene annotation file (Required). Expected format is multi-FASTA where sequence headers contain[gene=NAME]or[locus_tag=TAG]and[location=START..END]tags.-g, --alignment FILE: Path to the whole genome multiple sequence alignment file (FASTA format) (Required).-r, --ref_id ID: Sequence ID of the reference genome as it appears in the alignment file (Required). This sequence is used for coordinate mapping.-o, --output_dir DIR: Output directory where extracted gene alignment FASTA files (e.g.,gene_GENENAME.fasta) will be saved (Required).-v, --verbose: Increase output verbosity to DEBUG level for console and file logs.- (Run
pycodon_analyzer extract --helpfor all options.)
Example for extract:
# Extract gene alignments from a whole genome MSA
pycodon_analyzer extract \
-a my_annotations.fasta \
-g whole_genome_alignment.fasta \
-r NC_000913.3 \
-o extracted_gene_alignments \
-v
2. analyze Subcommand
Use this command to perform codon usage analysis on a directory of pre-extracted gene alignment files.
Synopsis:
pycodon_analyzer analyze --directory <PATH_TO_GENE_FASTA_DIR> \
--output <RESULTS_OUTPUT_DIR> \
[OPTIONS]
Key Arguments for analyze:
-d, --directory DIR: Path to the input directory containinggene_GENENAME.fastafiles (Required).-o, --output DIR: Path to the output directory for analysis results (Default:codon_analysis_results).--ref FILE | human | none: Path to codon usage reference table. (Default: 'human' using bundled file).--ref_delimiter DELIM: Delimiter for the reference file (e.g., ',', '\t'). Auto-detects if not provided.-t, --threads INT: Number of processes for parallel gene file analysis (Default: 1, 0 or negative for all cores).--max_ambiguity FLOAT: Max allowed 'N' percentage per sequence (0-100, Default: 15.0).--metadata FILE: Optional path to a metadata file (CSV or TSV).--metadata_id_col NAME: Column name in metadata for sequence IDs (Default: "seq_id").--metadata_delimiter DELIM: Optional delimiter for metadata file. Auto-detects if not provided.--color_by_metadata NAME: Metadata column to use for coloring per-gene plots.--metadata_max_categories INT: Max metadata categories for plotting (Default: 15).--plot_formats FMT [FMT ...]: Output plot format(s) (Default: png; choices: png, svg, pdf, jpg).--skip_plots: Flag to disable all plot generation.--skip_ca: Flag to disable combined Correspondence Analysis.--ca_dims X Y: Indices for CA components in combined plot (Default: 0 1).-v, --verbose: Increase output verbosity (DEBUG level logging).--no-html-report: Flag to disable the generation of the comprehensive HTML report. (Default: report is generated).- (Run
pycodon_analyzer analyze --helpfor all options.)
Example for analyze:
# Basic analysis using 4 cores and human reference
pycodon_analyzer analyze \
-d extracted_gene_alignments/ \
-o codon_analysis_output \
--ref human \
-t 4 \
-v
# Analysis with metadata, generating per-gene plots colored by 'Clade'
pycodon_analyzer analyze \
-d viral_genes/ \
-o viral_analysis_with_clades \
--metadata virus_metadata.csv \
--metadata_id_col Sequence_Accession \
--color_by_metadata Clade \
-t 0 \
-v
Output Directory Structure
When running the analyze command, results are organized into a dedicated output directory (specified by -o or --output). This directory will contain:
-
report.html(if not disabled by--no-html-report) The main interactive HTML report. It provides an overview, run parameters, and navigation to all detailed sections, embedding plots and linking to data tables. -
data/(Subdirectory) Contains all data tables generated during the analysis (primarily CSV format). Key files include:per_sequence_metrics_all_genes.csv: Comprehensive metrics for each valid sequence. If metadata provided, it's merged here.mean_features_per_gene.csv: Average values for key metrics per gene.gene_sequence_summary.csv: Sequence counts and length statistics per gene.gene_comparison_stats.csv: Results of statistical tests between genes.per_sequence_rscu_wide.csv: RSCU values (wide format) for combined CA.ca_*.csv: Files related to combined CA (coordinates, contributions, eigenvalues).
-
images/(Subdirectory, if plots not skipped by--skip-plots) Contains all plot images generated.- Combined Plots: Directly in
images/(e.g., overall GC, ENC vs GC3, combined CA). - Per-Gene RSCU Boxplots: e.g.,
RSCU_boxplot_GENENAME.<fmt>. - Metadata-Specific Plots (if
--color_by_metadatais used): Organized intoimages/<METADATA_COLUMN_NAME>_per_gene_plots/<GENE_NAME>/. Includes ENC vs GC3, Neutrality, CA biplots, Dinucleotide abundance plots, all colored by metadata categories.
- Combined Plots: Directly in
-
html/(Subdirectory, if HTML report generated) Contains secondary HTML pages that make up the interactive report. -
pycodon_analyzer.log(or custom name specified by--log-file) The detailed log file for this analysis run, located in the main output directory.
Workflow Example
- Prepare Annotations: Ensure your reference annotation file is in the expected multi-FASTA format with
[gene=...]and[location=...]tags, or adapt theextraction.pymodule to parse your format (e.g., GenBank, GFF3). - Extract Gene Alignments:
pycodon_analyzer extract -a my_ref_annotations.gb.fasta -g my_genome_msa.fasta -r ref_genome_id -o ./gene_alignments
- Run Codon Analysis:
pycodon_analyzer analyze -d ./gene_alignments -o ./codon_analysis_results --ref human -t 0 -v
Reference File Format (--ref for analyze)
Required for CAI, Fop, RCDI calculations. Should be CSV or TSV with columns for 'Codon' and one of 'Frequency', 'Count', 'RSCU', 'Freq', or 'Frequency (per thousand)'. The tool prioritizes finding an 'RSCU' column if present. For meaningful CAI/RCDI interpretation, using a reference set based on highly expressed genes of the target organism is recommended.
Development
- Running Tests:
pip install -e .[dev] # Ensure dev dependencies like pytest are installed pytest
- Type Checking:
mypy src - Linting/Formatting (Example using Ruff):
ruff check src tests ruff format src tests
- Building:
python -m build
TODO / Future Improvements
extractsubcommand:- Add support for standard GenBank and GFF3/GTF annotation file formats.
- Option to directly process unaligned CDS files (perform alignment if requested).
analyzesubcommand:- Implement tAI calculation (requires tRNA data input).
- Further metadata integration:
- Allow statistical comparisons grouped by metadata categories.
- Correlate CA axes with numerical metadata columns.
- Option to filter analysis based on metadata.
- Add more statistical comparison options (e.g., pairwise tests).
- General:
- Implement CI/CD pipeline (e.g., GitHub Actions).
- Consider interactive plots (Plotly/Bokeh).
- Publish package to PyPI.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pycodon_analyzer-1.0.1.tar.gz.
File metadata
- Download URL: pycodon_analyzer-1.0.1.tar.gz
- Upload date:
- Size: 160.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9f55f2bea72a07af18dbe35ec9746050e77a05deee42e5945fc98b1776b85575
|
|
| MD5 |
a802e597a93e577bd49803e51bbb8586
|
|
| BLAKE2b-256 |
7dca0369888b0d735eb589e480c188abee77e7197f5ec9ff3a64b2741e13bdc4
|
File details
Details for the file pycodon_analyzer-1.0.1-py3-none-any.whl.
File metadata
- Download URL: pycodon_analyzer-1.0.1-py3-none-any.whl
- Upload date:
- Size: 126.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
606897aaccff2432b1b4f9ab13db14d9b0e9da80d4810522d2edeb5f2f10acd2
|
|
| MD5 |
929cb07b0e68371816cb1e5a12b3af78
|
|
| BLAKE2b-256 |
81c15506fae367792c7b8e50f677d5f92de46acca24275cf03bbcf2aa5e0ee41
|