
Lossless evolutionary-aware multiple sequence alignment compressor

Evolution-informed lossless compression of multiple-sequence alignments (MSAs).


Installation

From PyPI (recommended for users):

# create and activate a virtual environment
python -m venv venv
source venv/bin/activate

# install ecomp
pip install ecomp

CLI Quickstart

All commands are exposed through the ecomp entry point.

# Compress an alignment (produces example.ecomp, optional JSON sidecar)
ecomp zip example.fasta --metadata example.json

# Decompress (writes FASTA by default)
ecomp unzip example.ecomp --alignment-output restored.fasta

# Inspect metadata (summary or JSON)
ecomp inspect example.ecomp --summary

# Diagnostics (PhyKIT-style short aliases shown in the trailing comments)
ecomp consensus_sequence example.ecomp                  # con_seq
ecomp column_base_counts example.ecomp                  # col_counts
ecomp gap_fraction example.ecomp                        # gap_frac
ecomp shannon_entropy example.ecomp                     # entropy
ecomp parsimony_informative_sites example.ecomp         # parsimony
ecomp constant_columns example.ecomp                    # const_cols
ecomp pairwise_identity example.ecomp                   # pid
ecomp alignment_length_excluding_gaps example.ecomp     # len_no_gaps
ecomp alignment_length example.ecomp                    # len_total
ecomp variable_sites example.ecomp                      # var_sites
ecomp percentage_identity example.ecomp                 # pct_id
ecomp relative_composition_variability example.ecomp    # rcv
ecomp distance_tree example.ecomp                       # dist_tree

To benchmark ecomp against general-purpose codecs, time each tool on the same input file:

/usr/bin/time -p ecomp zip data/fixtures/small_phylo.fasta --output out.ecomp
/usr/bin/time -p gzip  -k data/fixtures/small_phylo.fasta
/usr/bin/time -p bzip2 -k data/fixtures/small_phylo.fasta
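
A size comparison for the general-purpose baselines can also be approximated from Python with the standard-library codecs. The sketch below is illustrative only. It builds a small, made-up alignment in memory and reports compressed sizes for gzip, bzip2 and xz. It does not call ecomp, and the helper name `baseline_sizes` is hypothetical.

```python
import bz2
import gzip
import lzma


def baseline_sizes(data: bytes) -> dict[str, int]:
    """Return the compressed size in bytes for each general-purpose codec."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data, compresslevel=9)),
        "bzip2": len(bz2.compress(data, compresslevel=9)),
        "xz": len(lzma.compress(data)),
    }


# Toy alignment (made-up data): 50 near-identical sequences of 200 columns.
records = []
for i in range(50):
    seq = list("ACGT" * 50)
    seq[i] = "-"  # one gap per sequence so the rows are not all identical
    records.append(f">seq{i}\n{''.join(seq)}\n")
fasta_bytes = "".join(records).encode("ascii")

sizes = baseline_sizes(fasta_bytes)
for name, size in sizes.items():
    print(f"{name:>5}: {size} bytes")
```

To compare against an ecomp archive, the size of the file produced by `ecomp zip` could be read with `os.path.getsize` and added to the same table.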

Python API

Everything the CLI does is also available as functions exported from the top-level ecomp package.

from ecomp import ezip, eunzip, read_alignment, percentage_identity, column_base_counts

# File-based workflow
archive_path, metadata_path = ezip(
    "data/example.fasta",
    metadata_path="data/example.json",  # optional JSON copy
)
restored_path = eunzip(archive_path, output_path="data/restored.fasta")

# Diagnostics on an AlignmentFrame
frame = read_alignment("data/example.fasta")
pct_identity = percentage_identity(frame)
base_counts = column_base_counts(frame)

print(f"Mean pairwise identity: {pct_identity:.2f}%")
print("Column 1 counts:", base_counts[0])

In-memory usage (no intermediate files):

from ecomp import AlignmentFrame, compress_alignment, decompress_alignment

frame = AlignmentFrame(
    ids=["s1", "s2"],
    sequences=["ACGT", "ACGA"],
    alphabet=["A", "C", "G", "T"],
)
compressed = compress_alignment(frame)
restored = decompress_alignment(compressed.payload, compressed.metadata)
assert restored.sequences == frame.sequences

Available functions

Compression & I/O: ezip, eunzip, compress_file, decompress_file, compress_alignment, decompress_alignment, read_alignment, write_alignment, alignment_from_sequences, alignment_checksum

Diagnostics & metrics: column_base_counts, column_gap_fraction, column_shannon_entropy, parsimony_informative_columns, parsimony_informative_site_count, constant_columns, majority_rule_consensus, alignment_length, alignment_length_excluding_gaps, variable_site_count, percentage_identity, relative_composition_variability, pairwise_identity_matrix

Phylogenetics: infer_distance_tree, infer_distance_tree_from_frame, tree_to_newick

Supporting types: AlignmentFrame, CompressedAlignment, PairwiseIdentityResult, __version__
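
For readers unfamiliar with the column-level diagnostics, the following pure-Python sketch shows the conventional textbook definitions of gap fraction, Shannon entropy and parsimony-informative sites on a toy alignment. It is not ecomp's implementation. The helper names are hypothetical, and choices such as excluding gaps from the entropy calculation are assumptions that may differ from what ecomp does.

```python
import math
from collections import Counter

GAP = "-"


def columns(sequences):
    """Yield each alignment column as a string (rows must share one length)."""
    return ("".join(col) for col in zip(*sequences))


def gap_fraction_of(column):
    """Fraction of rows that hold a gap character in this column."""
    return column.count(GAP) / len(column)


def entropy_bits(column):
    """Shannon entropy in bits over the non-gap characters of a column."""
    counts = Counter(ch for ch in column if ch != GAP)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def is_parsimony_informative(column):
    """True if at least two distinct characters each occur in two or more rows."""
    counts = Counter(ch for ch in column if ch != GAP)
    return sum(1 for n in counts.values() if n >= 2) >= 2


# Toy alignment (made-up data): four sequences, five columns.
alignment = ["ACGT-", "ACGA-", "ATGAC", "ATGTC"]
for index, col in enumerate(columns(alignment)):
    print(index, col, gap_fraction_of(col), round(entropy_bits(col), 3),
          is_parsimony_informative(col))
```

A constant column has zero entropy and is never parsimony-informative. A column split evenly between two characters carries one bit of entropy.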


Development

make test.fast        # unit + non-slow integration tests
make test             # full test matrix
make lint             # lint checks (ruff, black, isort)
make format           # auto-formatting
mypy ecomp            # optional type checking

Build docs locally:

make docs
open docs/_build/html/index.html   # macOS; on Linux use xdg-open

Build and publish distributions:

pip install build twine
python -m build
python -m twine check dist/*
python -m twine upload dist/*

Benchmarking eComp vs. PhyKIT

python scripts/benchmark_metrics.py data/example.ecomp \
    --operations consensus shannon_entropy variable_sites \
    --repeat 5 --warmup 1 --json results.json --csv results.csv

The script runs each metric through the ecomp CLI (on the compressed archive) and through the corresponding phykit command (on a decompressed alignment), then reports the average and best runtimes. Add --json and/or --csv to write machine-readable output.
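
The warmup-then-repeat pattern behind options like --repeat and --warmup can be sketched in a few lines. This is a minimal illustration and not the code in scripts/benchmark_metrics.py. The function name `time_call` and its return keys are hypothetical, and it times a Python callable rather than a subprocess.

```python
import statistics
import time


def time_call(func, repeat=5, warmup=1):
    """Run func() `warmup` times untimed, then `repeat` times timed."""
    for _ in range(warmup):
        func()
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(samples),
        "best": min(samples),
        "runs": len(samples),
    }


# Stand-in workload; `calls` records how often the workload actually ran.
calls = []
result = time_call(lambda: calls.append(sum(range(10_000))), repeat=5, warmup=1)
print(f"mean {result['mean']:.6f}s, best {result['best']:.6f}s "
      f"over {result['runs']} runs")
```

Discarding the warmup runs keeps one-off costs such as imports and cold file caches out of the reported averages.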


License

eComp is released under the MIT License. See LICENSE.
