aProfiler: MSA Statistics & Visualization Toolkit
aProfiler examines Multiple Sequence Alignments (MSAs) and delivers useful statistics, publication-grade plots, codon-aware metrics (RSCU and essential amino acid summaries), embeddings, and CSV tables from a single command-line call.
Installation
python -m pip install --index-url https://test.pypi.org/simple --extra-index-url https://pypi.org/simple aprofiler
CLI Usage
aprofiler --input {alignment.fasta} --mode auto --report
Examples on test data
aprofiler --input .\data\test-TP53-nt.fasta
aprofiler --input .\data\test-TP53-nt.fasta --report
aprofiler --input .\data\test-TP53-nt.fasta --report --mode codon
aprofiler --input .\data\test-TP53-aa.fasta --report --report-format md
aprofiler --input .\data\test-TP53-aa.fasta --no-plots --mode aa
Modes
| Mode | Description |
|---|---|
| nt | Nucleotide MSA (DNA/RNA, IUPAC tolerated) |
| aa | Amino acid MSA (protein residues) |
| codon | Coding sequence MSA analyzed at codon level (standard genetic code by default) |
| auto | Auto-detect between NT and AA (codon is never auto-selected and must be explicit) |
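To illustrate what auto-detection between NT and AA can look like, here is a minimal sketch. It is not aProfiler's actual detection logic: the helper name `guess_alphabet`, the character sets, and the 0.9 threshold are all assumptions chosen for illustration. Like the `auto` mode described above, it never returns `codon`.

```python
# Hypothetical helper, not part of the aProfiler API.
NT_CHARS = set("ACGTUN")   # assumed "nucleotide-like" characters
GAP_CHARS = set("-.")      # assumed gap symbols


def guess_alphabet(sequences, threshold=0.9):
    """Return 'nt' or 'aa' from the fraction of nucleotide-like residues."""
    residues = [c for seq in sequences for c in seq.upper() if c not in GAP_CHARS]
    if not residues:
        raise ValueError("alignment contains no residues")
    nt_fraction = sum(c in NT_CHARS for c in residues) / len(residues)
    # The threshold is an assumed heuristic, not a value taken from aProfiler.
    return "nt" if nt_fraction >= threshold else "aa"


print(guess_alphabet(["ACGT-ACGT", "ACGTNACGA"]))
print(guess_alphabet(["MEEPQSDPSV", "MEEPQSDLSI"]))
```

A heuristic of this kind cannot tell a coding sequence from any other nucleotide alignment, which is one reason a codon-level analysis has to be requested explicitly.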
Common Flags
| Flag | Purpose |
|---|---|
| --input | Input MSA file (FASTA MSA, A3M, or fixed-column alignment) |
| --mode | nt, aa, codon, or auto |
| --report | Generate a summary report (.md or .html) |
| --report-format | md or html (default: html) |
| --no-plots | Skip plots, output CSV tables only |
| --seed | Set seed for PCA/UMAP reproducibility |
Outputs (Saved Automatically)
All results are saved under:
./results/{alignment_name}/
CSV Tables
| Output | File |
|---|---|
| Global NT or AA frequencies | *_global_freqs.csv |
| Per-site NT stats + entropy + GC% | *_nt_per_site.csv |
| PCA embeddings | *_pca_embedding.csv |
| UMAP embeddings | *_umap_embedding.csv |
| Codon usage table | *_codon_global.csv |
| Relative Synonymous Codon Usage (RSCU) | *_codon_rscu.csv |
| Amino acid usage derived from codons | *_aa_from_codons.csv |
| Essential vs non-essential AA summary (from codons) | *_aa_essential_summary.csv |
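For readers unfamiliar with RSCU, the sketch below shows the usual definition: a codon's observed count divided by the count expected if all synonymous codons for that amino acid were used equally. It is an illustration under stated assumptions, not aProfiler's implementation. The function names `count_codons` and `rscu` are hypothetical, the standard genetic code is assumed, and codons containing gaps or ambiguity codes are simply skipped here.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codons(sequences):
    """Count in-frame codons; skip triplets with gaps or ambiguous bases."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1
    return counts


def rscu(counts):
    """RSCU = count * (number of synonymous codons) / (family total)."""
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":  # stop codons are left out in this sketch
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid not observed, RSCU undefined
        for c in codons:
            values[c] = counts.get(c, 0) * len(codons) / total
    return values


usage = rscu(count_codons(["GCTGCTGCCAAA"]))
print(round(usage["GCT"], 3), usage["AAA"])
```

An RSCU of 1.0 means a codon is used exactly as often as expected under uniform synonymous usage; values above 1.0 indicate preference.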
Plots (On by default unless disabled)
| Plot | Purpose |
|---|---|
| Nucleotide logo plot | Position-wise base enrichment |
| GC% per-site | GC landscape across MSA |
| Entropy per-site | Conservation skyline |
| AA/NT per-site heatmaps | Residue/base prevalence |
| PCA scatter | Sequence-space clustering |
| UMAP scatter | Similarity-space embedding |
| Pairwise identity histogram | Sequence similarity distribution |
| Gap fraction histogram | Alignment completeness QC |
| Codon usage barplot | Most frequent codons |
| RSCU heatmap | Synonymous codon bias by AA |
| Essential AA barplot | Essential vs non-essential AA trends (codon mode only) |
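The per-site entropy and GC% plots are built from simple column statistics. The following sketch shows one common way to compute them (Shannon entropy in bits, with gaps excluded from each column). The function name `column_stats` and the gap handling are assumptions for illustration and may differ from what aProfiler does internally.

```python
import math
from collections import Counter


def column_stats(sequences, gap_chars="-."):
    """Per-column Shannon entropy (bits) and GC fraction, ignoring gaps."""
    n_cols = len(sequences[0])
    stats = []
    for i in range(n_cols):
        column = [s[i].upper() for s in sequences if s[i] not in gap_chars]
        counts = Counter(column)
        total = len(column)
        if total == 0:
            # All-gap column: nothing to measure.
            stats.append({"site": i + 1, "entropy": 0.0, "gc": None})
            continue
        probs = [n / total for n in counts.values()]
        entropy = sum(-p * math.log2(p) for p in probs)
        gc = (counts["G"] + counts["C"]) / total
        stats.append({"site": i + 1, "entropy": entropy, "gc": gc})
    return stats


for row in column_stats(["ACGT", "ACGA", "AC-A"]):
    print(row)
```

Low entropy marks conserved columns, and an entropy of 0 means every non-gap residue in the column is identical.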
Alignment Format Compatibility and Constraints
| Input alignment type | Supported? | Notes |
|---|---|---|
| FASTA MSA | Yes | Sequences must be equal length |
| A3M | Yes | Lowercase letters denote inserted columns |
| ALN/Clustal | Yes | Must be in fixed columns or converted first |
| Codon FASTA | Yes | Requires explicit --mode codon |
| Mixed NT+AA alphabets | No | Alphabet must be uniform per file |
All alignments are treated as fixed, rectangular matrices; sequences must have equal alignment length.
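The rectangular-matrix requirement can be checked before running the tool. The sketch below is a standalone illustration, not aProfiler code: `read_fasta` and `check_rectangular` are hypothetical helpers, and the parser handles only plain FASTA text.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}. Duplicate names overwrite."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def check_rectangular(records):
    """Return the shared alignment length, or raise if lengths differ."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"expected one alignment length, found {sorted(lengths)}")
    return lengths.pop()


alignment = read_fasta(">seq1\nACG-\nTT\n>seq2\nACGATT\n")
print(check_rectangular(alignment))
```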
Output Guarantees
- All profiling artifacts are written into results/ without overwriting unrelated files.
- Ambiguous input characters are tolerated but tracked, not silently discarded.
- Codon mode metrics are only computed when explicitly requested.
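As an illustration of what tracking ambiguous characters rather than discarding them can mean for a nucleotide alignment, the sketch below tallies every symbol that is neither an unambiguous base nor a gap. The helper name `tally_ambiguous` and the character sets are assumptions, and how aProfiler itself records these characters is not specified here.

```python
from collections import Counter

UNAMBIGUOUS_NT = set("ACGTU")  # assumed unambiguous bases
GAP_CHARS = set("-.")          # assumed gap symbols


def tally_ambiguous(sequences):
    """Count symbols that are neither unambiguous bases nor gaps."""
    counts = Counter()
    for seq in sequences:
        for c in seq.upper():
            if c not in UNAMBIGUOUS_NT and c not in GAP_CHARS:
                counts[c] += 1
    return counts


print(tally_ambiguous(["ACGTN", "ACRT-", "acgyn"]))
```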
Example Output Directory Tree
results/
  TP53_alignment/
    TP53_global_freqs.csv
    TP53_nt_per_site.csv
    TP53_pca_embedding.csv
    TP53_umap_embedding.csv
    TP53_report.html
    TP53_nt_logo.png
    TP53_entropy.png
    TP53_gc.png
    TP53_pca.png
    TP53_umap.png
    TP53_pairwise_identity.png
    TP53_gap_fraction.png
    TP53_rscu_heatmap.png
    TP53_aa_essential_bar.png
Scalability Notes
Optimized for MSAs up to ~20k sequences × 10k columns on standard hardware. Larger inputs may require downsampling for logos and heatmaps.
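One simple way to downsample a large alignment before profiling is a seeded random subset of sequences, sketched below. The function `downsample` is a hypothetical helper, not an aProfiler feature. It keeps the original sequence order and returns the input unchanged when it is already small enough.

```python
import random


def downsample(names, max_sequences, seed=0):
    """Pick a reproducible random subset, preserving the original order."""
    if len(names) <= max_sequences:
        return list(names)
    rng = random.Random(seed)  # local generator, global state untouched
    keep = set(rng.sample(range(len(names)), max_sequences))
    return [name for i, name in enumerate(names) if i in keep]


all_names = [f"seq{i}" for i in range(1000)]
subset = downsample(all_names, 50, seed=42)
print(len(subset))
```

Fixing the seed makes the subset repeatable across runs, so downstream plots can be regenerated from the same sequences.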
Testing
pip install pytest
pytest -q
Tests validate:
- equal-length enforcement
- stable fallback embedding behavior
- non-empty CSV outputs
- plot and artifact creation without crashes
Python API Example
from aprofiler.profiler import AlignmentProfiler
prof = AlignmentProfiler("alignment.fasta", mode="auto", out_dir="results")
prof.load_alignment()
outputs = prof.run_full_profile()
report_path = prof.generate_report(outputs, fmt="md")
print("Outputs generated:", outputs)
print("Report saved to:", report_path)
Citation
If you use aProfiler in a publication, please cite:
TBD
Contributing
Issues, discussions, and pull requests are welcome. Ensure contributions are:
- statistically useful
- plot-rich by design
- free of silent failures
- non-destructive to unrelated files
- aligned with package philosophy and constraints
File details
Details for the file aprofiler-0.1.2.tar.gz.
File metadata
- Download URL: aprofiler-0.1.2.tar.gz
- Upload date:
- Size: 17.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 83b1e60e63fbe32835adb07bf6b866c66862ca3d72d26ed1d0f59aaec12091e9 |
| MD5 | 00f0d87f1002daaa1b9c82b458d5e8fc |
| BLAKE2b-256 | 95be405fb4faed022f545cf43906820b91506a9e5ef9772055d5720cd585cba1 |
File details
Details for the file aprofiler-0.1.2-py3-none-any.whl.
File metadata
- Download URL: aprofiler-0.1.2-py3-none-any.whl
- Upload date:
- Size: 17.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2f52705cea9cbd9e3fcd25df016ecb50cc8a159e2ac8375d433f3a435cccdb0a |
| MD5 | 449bb58d35baeba9f23a11f4017fe4a7 |
| BLAKE2b-256 | 6a1901da0beb572eaa3a5efa27dd152237c4a52caed75bb32d93cf772c42f906 |