
aProfiler: MSA Statistics & Visualization Toolkit


aProfiler examines Multiple Sequence Alignments (MSAs) and emits useful statistics, publication-grade plots, codon-aware metrics (RSCU and essential amino acid summaries), embeddings, and CSV tables from a single command-line call.



Installation

From a source checkout (editable install):

pip install -e .

From PyPI (released version):

pip install aprofiler


CLI Usage

aprofiler --input alignment.fasta --mode auto --report

Examples on test data (Windows-style paths; on Linux or macOS use ./data/... instead):
aprofiler --input .\data\test-TP53-nt.fasta --report
aprofiler --input .\data\test-TP53-nt.fasta --report --mode codon
aprofiler --input .\data\test-TP53-aa.fasta --report
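
To profile several alignments in one go, the CLI call above can be wrapped in a small script. The following is a minimal sketch, not part of aProfiler: `build_command` and `run_batch` are hypothetical helper names, and the only flags used are those documented in this README (`--input`, `--mode`, `--report`, `--report-format`).

```python
import subprocess
from pathlib import Path


def build_command(alignment, mode="auto", report=True, report_format="md"):
    """Return the argument list for one aprofiler run (nothing is executed here)."""
    cmd = ["aprofiler", "--input", str(alignment), "--mode", mode]
    if report:
        cmd += ["--report", "--report-format", report_format]
    return cmd


def run_batch(data_dir, mode="auto"):
    """Run aprofiler once per *.fasta file in data_dir (hypothetical helper)."""
    for fasta in sorted(Path(data_dir).glob("*.fasta")):
        # check=True raises CalledProcessError if a run exits with a non-zero status
        subprocess.run(build_command(fasta, mode=mode), check=True)
```

Calling `run_batch("data")` would then profile every FASTA file in the data directory with the default `auto` mode. Codon analysis still has to be requested with `mode="codon"`.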

Modes

| Mode | Description |
| --- | --- |
| nt | Nucleotide MSA (DNA/RNA, IUPAC tolerated) |
| aa | Amino acid MSA (protein residues) |
| codon | Coding sequence MSA analyzed at codon level (standard genetic code by default) |
| auto | Auto-detect between NT and AA (codon is never auto-selected and must be explicit) |
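
This README does not document how `auto` decides between NT and AA. One common heuristic is to measure the fraction of non-gap characters that look like nucleotides. The sketch below illustrates that idea only; `guess_mode`, the character sets, and the 0.9 threshold are assumptions, not aProfiler's actual rule.

```python
NT_CHARS = set("ACGTUN")  # core nucleotide symbols plus N (assumed set)
GAP_CHARS = set("-.")


def guess_mode(sequences, threshold=0.9):
    """Return 'nt' or 'aa' based on the fraction of nucleotide-like residues."""
    total = 0
    nt_like = 0
    for seq in sequences:
        for ch in seq.upper():
            if ch in GAP_CHARS:
                continue  # gaps carry no alphabet information
            total += 1
            if ch in NT_CHARS:
                nt_like += 1
    if total == 0:
        raise ValueError("alignment contains no residues")
    return "nt" if nt_like / total >= threshold else "aa"
```

Like the `auto` mode described above, this heuristic never returns `codon`: a coding alignment is indistinguishable from any other nucleotide alignment by alphabet alone, which is why codon analysis has to be requested explicitly.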

Common Flags

| Flag | Purpose |
| --- | --- |
| --input | Input MSA file (FASTA MSA, A3M, or fixed-column alignment) |
| --mode | nt, aa, codon, or auto |
| --report | Generate a summary report (.md or .html) |
| --report-format | md or html (default: md) |
| --no-plots | Skip plots, output CSV tables only |

Outputs (Saved Automatically)

All results are saved under:

./results/{alignment_name}/

CSV Tables

| Output | File |
| --- | --- |
| Global NT or AA frequencies | *_global_freqs.csv |
| Per-site NT stats + entropy + GC% | *_nt_per_site.csv |
| PCA embeddings | *_pca_embedding.csv |
| UMAP embeddings | *_umap_embedding.csv |
| Codon usage table | *_codon_global.csv |
| Relative Synonymous Codon Usage (RSCU) | *_codon_rscu.csv |
| Amino acid usage derived from codons | *_aa_from_codons.csv |
| Essential vs non-essential AA summary (from codons) | *_aa_essential_summary.csv |
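
For readers unfamiliar with RSCU: it is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. A value of 1.0 means no bias. The following is a minimal sketch of that textbook definition under the standard genetic code. The function name `rscu` and the handling of gapped or ambiguous codons (skipped here) are assumptions and may differ from what aProfiler writes to `*_codon_rscu.csv`.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(sequences):
    """Return {codon: RSCU} for amino acids observed at least once."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips gapped or ambiguous codons
                counts[codon] += 1

    # Group codons into synonymous families by amino acid
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    result = {}
    for aa, codons in families.items():
        total = sum(counts[c] for c in codons)
        if aa == "*" or total == 0:
            continue  # stop codons and unobserved amino acids are left out
        for c in codons:
            # observed count / (family total / number of synonymous codons)
            result[c] = counts[c] * len(codons) / total
    return result
```

For the toy input `["GCTGCTGCCAAA"]`, alanine is encoded three times (GCT twice, GCC once) across a four-codon family, so GCT scores 2 × 4 / 3 ≈ 2.67 and the unused GCA scores 0.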

Plots (On by default unless disabled)

| Plot | Purpose |
| --- | --- |
| Nucleotide logo plot | Position-wise base enrichment |
| GC% per-site | GC landscape across MSA |
| Entropy per-site | Conservation skyline |
| AA/NT per-site heatmaps | Residue/base prevalence |
| PCA scatter | Sequence-space clustering |
| UMAP scatter | Similarity-space embedding |
| Pairwise identity histogram | Sequence similarity distribution |
| Gap fraction histogram | Alignment completeness QC |
| Codon usage barplot | Most frequent codons |
| RSCU heatmap | Synonymous codon bias by AA |
| Essential AA barplot | Essential vs non-essential AA trends (codon mode only) |
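
The per-site quantities behind the entropy, GC%, and gap-fraction plots are simple column statistics. The sketch below shows one conventional way to compute them. It assumes Shannon entropy in bits over non-gap characters and treats `-` and `.` as gaps; `column_stats` is a hypothetical name, and aProfiler's exact definitions (for example, how ambiguous bases are counted) may differ.

```python
import math
from collections import Counter

GAPS = "-."


def column_stats(sequences):
    """Return a list of (entropy_bits, gc_fraction, gap_fraction), one per column.

    Assumes all sequences have equal length; zip() would silently truncate otherwise.
    """
    stats = []
    for column in zip(*sequences):
        residues = [ch.upper() for ch in column if ch not in GAPS]
        n = len(residues)
        gap_fraction = 1 - n / len(column)
        counts = Counter(residues)
        # Shannon entropy over observed residues; 0.0 for an all-gap column
        entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
        gc = (counts["G"] + counts["C"]) / n if n else 0.0
        stats.append((entropy, gc, gap_fraction))
    return stats
```

A fully conserved column has entropy 0, which is why the per-site entropy plot reads as a conservation profile: low values mark conserved positions and high values mark variable ones.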

Alignment Format Compatibility and Constraints

| Input alignment type | Supported? | Notes |
| --- | --- | --- |
| FASTA MSA | Yes | Sequences must be equal length |
| A3M | Yes | Lowercase letters denote inserted columns |
| ALN/Clustal | Yes | Must be in fixed columns or converted first |
| Codon FASTA | Yes | Requires explicit --mode codon |
| Mixed NT+AA alphabets | No | Alphabet must be uniform per file |

All alignments are treated as fixed, rectangular matrices; sequences must have equal alignment length.
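
To check an input before running aProfiler, the rectangular-matrix constraint can be verified with a few lines of Python. This is a minimal sketch with hypothetical helper names (`read_fasta`, `strip_a3m_insertions`, `check_rectangular`). It assumes the usual A3M convention that lowercase letters are insertions and `.` marks gaps aligned to insertions, so dropping both leaves only the shared match columns.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def strip_a3m_insertions(seq):
    """Drop lowercase insertions and '.' so A3M rows share the match columns."""
    return "".join(ch for ch in seq if not ch.islower() and ch != ".")


def check_rectangular(records):
    """Return the common alignment length, or raise if the rows differ in length."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        raise ValueError(f"unequal sequence lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else 0
```

For an A3M file, apply `strip_a3m_insertions` to every sequence before calling `check_rectangular`; for a plain FASTA MSA the length check alone is enough.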


Output Guarantees

All profiling artifacts are written into results/ without overwriting unrelated files. Ambiguous input characters are tolerated but tracked, not silently discarded. Codon mode metrics are only computed when explicitly requested.


Example Output Directory Tree

results/
  TP53_alignment/
    TP53_global_freqs.csv
    TP53_nt_per_site.csv
    TP53_pca_embedding.csv
    TP53_umap_embedding.csv
    TP53_report.html
    plots/
      TP53_nt_logo.png
      TP53_entropy.png
      TP53_gc.png
      TP53_pca.png
      TP53_umap.png
      TP53_pairwise_identity.png
      TP53_gap_fraction.png
      TP53_rscu_heatmap.png
      TP53_aa_essential_bar.png

Scalability Notes

Optimized for MSAs up to ~20k sequences × 10k columns on standard hardware. Larger inputs may require downsampling for logos/heatmaps (--max-seqs, --max-sites).
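
How `--max-seqs` and `--max-sites` pick what to keep is not specified here. One plausible strategy is a seeded random subset of sequences combined with evenly spaced columns, sketched below. `downsample` and its parameters are illustrative assumptions, not aProfiler's implementation.

```python
import random


def downsample(sequences, max_seqs=None, max_sites=None, seed=0):
    """Return a smaller alignment: random rows (seeded) and evenly spaced columns."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    rows = list(sequences)

    if max_seqs is not None and len(rows) > max_seqs:
        # Sample row indices, then sort to preserve the original sequence order
        keep = sorted(rng.sample(range(len(rows)), max_seqs))
        rows = [rows[i] for i in keep]

    if rows and max_sites is not None and len(rows[0]) > max_sites:
        n = len(rows[0])
        # Evenly spaced column indices across the full alignment width
        cols = [i * n // max_sites for i in range(max_sites)]
        rows = ["".join(row[c] for c in cols) for row in rows]

    return rows
```

Because the same columns are taken from every row, the result is still a rectangular alignment, and a fixed `seed` makes the chosen subset repeatable across runs.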


Reproducibility

When --seed is passed, all embeddings (PCA/UMAP) and stochastic plots use a fixed random_state, so repeated runs on the same input produce the same results.


Testing

pip install pytest
pytest -q

Tests validate:

  • equal-length enforcement
  • stable fallback embedding behavior
  • non-empty CSV outputs
  • plot and artifact creation without crashes

Python API Example

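from aprofiler.profiler import AlignmentProfiler

# Configure the profiler; mode="auto" chooses between NT and AA (codon must be explicit)
prof = AlignmentProfiler("alignment.fasta", mode="auto", out_dir="results")

# Load the alignment, then run the full set of analyses
prof.load_alignment()
outputs = prof.run_full_profile()

# Write a summary report; fmt may be "md" or "html"
report_path = prof.generate_report(outputs, fmt="md")

print("Outputs generated:", outputs)
print("Report saved to:", report_path)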

Citation

If you use aProfiler in a publication, please cite:

Lucaci, Alexander G., *aProfiler: MSA Statistics & Visualization Toolkit*, 2025.

Once a DOI is available (for example, from Zenodo), cite that instead for a formal, reproducible reference.


Contributing

Issues, discussions, and pull requests are welcome. Ensure contributions are:

  • statistically useful
  • plot-rich by design
  • free of silent failures
  • non-destructive to unrelated files
  • aligned with package philosophy and constraints
