Evolutionary Walsh-Hadamard transform and compressed sensing for protein fitness landscapes

Project description

eWHT: Evolutionary Walsh-Hadamard Transform for Fitness Landscapes

ewht is a Python package for analyzing combinatorial fitness landscapes using the evolutionary Walsh-Hadamard transform (eWHT). It provides:

  • Fast O(N log N) forward and inverse eWHT, where N = 2^L is the number of genotypes over L sites
  • Evolutionary mutation probabilities (ps) estimated from multiple sequence alignments (MSAs) or ESM2-650M
  • Data preprocessing helpers (genotype encoding, evolutionary subsampling)
  • Compressed sensing with LASSO on eWHT/WHT bases

Installation

ewht supports Python 3.9 and above. Install from PyPI:

pip install ewht

Optional extras:

pip install "ewht[esm]"   # ESM2-650M ps estimation (requires torch + transformers); quotes keep shells such as zsh from expanding the brackets

Quickstart

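The package bundles the example CR6261-H1 dataset from the paper. The excerpt below loads it, estimates ps from an MSA, computes the eWHT, and runs compressed sensing. It omits most imports and the constants OUTPUT_DIR, MAX_ORDER, and TRAIN_N; see the full script, example_ewht.py, for those: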

import ewht

# Load data and preprocess
raw = ewht.load_example()
print(raw.head())
# Output:
#        mutant                                   mutated_sequence  fitness  estimated_fitness
# 0          WT  QVQLVESGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGPE...      7.0                  0
# 1       L104V  QVQLVESGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGPE...      7.0                  0
# 2        A79V  QVQLVESGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGPE...      7.0                  0
# 3  A79V;L104V  QVQLVESGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGPE...      7.0                  0
# 4        S77G  QVQLVESGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGPE...      7.0                  0

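POSITIONS = [28, 30, 58, 59, 62, 74, 75, 76, 77, 79, 104]  # 1-based positions in the WT sequence
MUTANTS = ["P", "R", "T", "K", "P", "D", "F", "A", "G", "V", "V"]  # mutant residue at each position
WT_SEQUENCE = (
    "QVQLVESGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGPEWMGGIIPIFGTANYAQKFQGRVTITADKSTSTAYMELSSLRSEDTAMYYCAKHMGYQLRETMDVWGQGTTVTVSS"
)
L = len(POSITIONS)

# Build the binary genotype column (assumed here to return a DataFrame with a "genotype" column)
df = ewht.genotypes_from_dataframe(raw, POSITIONS, WT_SEQUENCE, MUTANTS)
print(df.head())
print(f"{df['genotype'].nunique()} unique genotypes, L={L}")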
# Output:
#        mutant                                   mutated_sequence  fitness  estimated_fitness     genotype
# 0          WT  QVQLVESGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGPE...      7.0                  0  00000000000
# 1       L104V  QVQLVESGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGPE...      7.0                  0  00000000001
# 2        A79V  QVQLVESGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGPE...      7.0                  0  00000000010
# 3  A79V;L104V  QVQLVESGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGPE...      7.0                  0  00000000011
# 4        S77G  QVQLVESGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGPE...      7.0                  0  00000000100
# 2048 unique genotypes, L=11

with example_msa() as msa_path:
    # Compute ps from MSA
    ps = get_ps(WT_SEQUENCE, POSITIONS, MUTANTS, msa=msa_path)
    plot_ps(ps, OUTPUT_DIR / "ps_from_msa.png")

    # Compute eWHT
    coeffs, center = efwht_from_dataframe(df, ps, basis="eWHT")
    plot_ewht_spectrum(coeffs, L, OUTPUT_DIR / "ewht_spectrum.png", max_order=MAX_ORDER)

    # Sample evolutionary sequences for compressed sensing
    train, test = sample_evolutionary_sequences(
        df,
        ps,
        msa=msa_path,
        positions=POSITIONS,
        wt_sequence=WT_SEQUENCE,
        mutants=MUTANTS,
        fraction=0.75,
        train_n=TRAIN_N,
        random_state=0,
    )
    print(f"train={len(train)}, test={len(test)}")
    # Output: train=100, test=162

    # Run compressed sensing experiment
    result = run_cs_experiment(train, test, ps, basis="eWHT", center_by_ps=True, random_state=0)
    print(f"best lambda: {result.best_lambda}")
    print(f"train R²: {result.train_metrics['r2']:.4f}")
    print(f"test R²:  {result.test_metrics['r2']:.4f}")
    # Output:
    # best lambda: 0.005
    # train R²: 0.9662
    # test R²:  0.8282

print(f"Figures in {OUTPUT_DIR.resolve()}/")

Run the full example:

python example_ewht.py

Evolutionary mutation probabilities

get_ps estimates per-site mutation probabilities from an MSA or, if no MSA is given, from ESM2-650M:

[Figure: per-site mutation probabilities from MSA]
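
As a rough illustration of what a per-site mutation probability can mean, the sketch below counts how often the mutant residue appears in the matching alignment column. It is a naive frequency estimate on a made-up alignment, not necessarily the estimator that get_ps implements. The helper name naive_ps, the pseudocount, and the toy sequences are all hypothetical. Positions are 1-based, as in the Quickstart.

```python
from collections import Counter


def naive_ps(msa_rows, positions, mutants, pseudocount=1.0):
    """Smoothed fraction of non-gap residues equal to the mutant at each 1-based position.

    Hypothetical helper for illustration only; not part of the ewht API.
    """
    ps = []
    for pos, mut in zip(positions, mutants):
        # Collect the alignment column, skipping gap characters.
        column = [row[pos - 1] for row in msa_rows if row[pos - 1] != "-"]
        counts = Counter(column)
        # Additive smoothing keeps every estimate strictly between 0 and 1.
        ps.append((counts[mut] + pseudocount) / (len(column) + 2 * pseudocount))
    return ps


# Toy alignment: four aligned sequences of length 5 (invented for this example).
toy_msa = ["ACDEF", "ACDEV", "GCD-V", "ACDEV"]
print(naive_ps(toy_msa, positions=[1, 5], mutants=["G", "V"]))
```

Keeping the estimates away from exactly 0 or 1 is a design choice of this sketch: a site that never (or always) mutates would otherwise carry no variance to normalize by.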

eWHT spectrum

The forward transform decomposes the centered landscape into Walsh coefficients grouped by interaction order:

[Figure: eWHT coefficient spectrum, orders 1-5]
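
To make "Walsh coefficients grouped by interaction order" concrete, here is a minimal sketch of the standard, unweighted fast Walsh-Hadamard transform. It does not reproduce the ps-dependent weighting of the eWHT, and the function names (fwht, forward, inverse, energy_by_order) and the 1/N normalization are choices made for this example rather than the package's API. The interaction order of a coefficient is taken to be the number of set bits in its index.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform of a length-2^L sequence."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine entries whose indices differ in one bit.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def forward(y):
    """Landscape -> coefficients, scaled by 1/N."""
    n = len(y)
    return [c / n for c in fwht(y)]


def inverse(coeffs):
    """Coefficients -> landscape; undoes forward()."""
    return fwht(coeffs)


def energy_by_order(coeffs):
    """Sum of squared coefficients per interaction order (popcount of the index)."""
    n_sites = len(coeffs).bit_length() - 1
    energy = [0.0] * (n_sites + 1)
    for k, c in enumerate(coeffs):
        energy[bin(k).count("1")] += c * c
    return energy


# Toy landscape on L = 2 sites, indexed by genotype 00, 01, 10, 11.
y = [1.0, 2.0, 3.0, 6.0]
coeffs = forward(y)
print(coeffs)                   # [3.0, -1.0, -1.5, 0.5]
print(inverse(coeffs))          # [1.0, 2.0, 3.0, 6.0]
print(energy_by_order(coeffs))  # [9.0, 3.25, 0.25]
```

Each of the L butterfly passes touches all N = 2^L entries once, which is where the O(N log N) cost comes from.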

Core API

Function                                                       Description
efwht_from_dataframe(df, ps)                                   Forward eWHT from a preprocessed DataFrame
efwht(y, ps)                                                   Forward eWHT on a length-2^L landscape vector
iefwht(coeffs, ps)                                             Inverse eWHT (exact round-trip with matching norm)
get_ps(sequence, positions, mutants, msa=...)                  Per-site mutation probabilities
genotypes_from_dataframe(df, positions, wt_sequence, mutants)  Build binary genotype column from sequences
sample_evolutionary_sequences(df, ps, ...)                     Evolutionary subsampling with optional MSA mask
run_cs_experiment(train, test, ps)                             Lasso compressed sensing with CV on train
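
The idea behind compressed sensing on a Walsh basis can be sketched on synthetic data: if the spectrum is sparse, an L1-penalized fit on a subset of genotypes can recover it and predict the held-out genotypes. The sketch below uses the plain WHT basis and a small proximal-gradient (ISTA) solver as a dependency-light stand-in for scikit-learn's Lasso. It skips cross-validation and ps-based centering, and it is not the implementation behind run_cs_experiment. All names, the toy spectrum, and the value of lam are assumptions made for illustration.

```python
import numpy as np


def walsh_design(genotypes, n_sites):
    """Rows: sampled genotypes (as integers); columns: all 2^L Walsh functions."""
    ks = np.arange(2 ** n_sites)
    overlap = genotypes[:, None] & ks[None, :]
    # The parity of the shared set bits decides the sign of each +/-1 entry.
    parity = np.zeros_like(overlap)
    for bit in range(n_sites):
        parity ^= (overlap >> bit) & 1
    return 1.0 - 2.0 * parity


def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize (1/2m)||y - Xc||^2 + lam * ||c||_1 by proximal gradient (ISTA)."""
    m = X.shape[0]
    step = m / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    c = np.zeros(X.shape[1])
    for _ in range(n_iter):
        c = c + step * X.T @ (y - X @ c) / m                      # gradient step
        c = np.sign(c) * np.maximum(np.abs(c) - step * lam, 0.0)  # soft threshold
    return c


def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


n_sites = 6
rng = np.random.default_rng(0)
true_coeffs = np.zeros(2 ** n_sites)
true_coeffs[[0, 1, 4, 5]] = [1.0, 0.8, -0.6, 0.5]  # sparse, low-order toy spectrum

# Random 40/24 split of the 64 genotypes (no evolutionary weighting here).
all_genotypes = rng.permutation(2 ** n_sites)
train_g, test_g = all_genotypes[:40], all_genotypes[40:]
X_train, X_test = walsh_design(train_g, n_sites), walsh_design(test_g, n_sites)
y_train, y_test = X_train @ true_coeffs, X_test @ true_coeffs

fit = lasso_ista(X_train, y_train, lam=0.01)
test_r2 = r2(y_test, X_test @ fit)
print(f"coefficients above 1e-3: {np.count_nonzero(np.abs(fit) > 1e-3)}, test R2: {test_r2:.3f}")
```

In practice the penalty strength would be chosen by cross-validation on the training set, as the table entry for run_cs_experiment indicates.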

Genotype encodings

ewht accepts genotypes as:

  • Binary strings: "00101" (0 = WT, 1 = mutant)
  • Pseudoboolean strings: "1-1-11" (1 = WT, -1 = mutant)
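
The two encodings carry the same information, so one can be converted into the other. A minimal sketch, assuming pseudoboolean strings are read left to right as a sequence of "1" and "-1" tokens; the helper names are hypothetical and not part of ewht:

```python
import re


def binary_to_pseudoboolean(genotype):
    """'0110' -> '1-1-11' (0 = WT becomes 1, 1 = mutant becomes -1)."""
    if set(genotype) - {"0", "1"}:
        raise ValueError(f"not a binary string: {genotype!r}")
    return "".join("1" if bit == "0" else "-1" for bit in genotype)


def pseudoboolean_to_binary(genotype):
    """'1-1-11' -> '0110'; tokens are '1' or '-1'."""
    tokens = re.findall(r"-?1", genotype)
    # Reject anything that is not a clean concatenation of '1' and '-1'.
    if "".join(tokens) != genotype:
        raise ValueError(f"not a pseudoboolean string: {genotype!r}")
    return "".join("1" if tok == "-1" else "0" for tok in tokens)


print(pseudoboolean_to_binary("1-1-11"))   # 0110
print(binary_to_pseudoboolean("00101"))    # 11-11-1
```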

For custom mappings, add a genotype column directly instead of using genotypes_from_dataframe.
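
As one way to build such a column by hand, the sketch below maps mutant labels like "A79V;L104V" to binary strings, one bit per entry of POSITIONS. It assumes labels follow the WT / semicolon-separated pattern shown in the Quickstart output. The function name is hypothetical, and it does not check that the substituted residue matches MUTANTS.

```python
POSITIONS = [28, 30, 58, 59, 62, 74, 75, 76, 77, 79, 104]


def genotype_from_label(label, positions=POSITIONS):
    """Map 'A79V;L104V' to a binary string with one bit per entry of positions."""
    bits = ["0"] * len(positions)
    if label != "WT":
        for mutation in label.split(";"):
            site = int(mutation[1:-1])  # 'A79V' -> 79 (1-based position)
            bits[positions.index(site)] = "1"
    return "".join(bits)


for label in ["WT", "L104V", "A79V", "A79V;L104V", "S77G"]:
    print(label, genotype_from_label(label))
```

With a pandas DataFrame that has a mutant column, the same function could be applied as df["genotype"] = df["mutant"].map(genotype_from_label).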

Optional dependencies

Extra       Packages                             Use case
(default)   numpy, pandas, scipy, scikit-learn   transforms, MSA-based ps, CS
ewht[esm]   torch, transformers                  ps from ESM2-650M when no MSA is available

Publishing to PyPI

From a clean checkout of the repository:

# Install build tools
pip install build twine

# Build sdist + wheel (includes bundled example_data/)
python -m build

# Upload to TestPyPI first (recommended)
twine upload --repository testpypi dist/*

# Verify install
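# (--extra-index-url lets pip fetch dependencies such as numpy from the main index)
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ ewht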

# Upload to PyPI
twine upload dist/*

Before the first upload:

  1. Create accounts on PyPI and TestPyPI.
  2. Configure an API token, either in ~/.pypirc or through the environment variables TWINE_USERNAME=__token__ and TWINE_PASSWORD=pypi-....
  3. Ensure the package name ewht is available on PyPI (or change name in pyproject.toml).
  4. Bump version in pyproject.toml and ewht/__init__.py for each release.
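
Because the version lives in two files, a small pre-release check can catch a forgotten bump. The sketch below assumes pyproject.toml declares a static version = "..." line and that ewht/__init__.py defines __version__ = "..."; neither is confirmed here, and the helper name is hypothetical.

```python
import re


def declared_versions(pyproject_text, init_text):
    """Return (version from pyproject.toml, __version__ from __init__.py); None where missing."""
    pyproject = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, re.MULTILINE)
    init = re.search(r'^__version__\s*=\s*["\']([^"\']+)["\']', init_text, re.MULTILINE)
    return (
        pyproject.group(1) if pyproject else None,
        init.group(1) if init else None,
    )


# Inline strings stand in for the two files' contents.
pyproject_text = '[project]\nname = "ewht"\nversion = "0.0.1"\n'
init_text = '__version__ = "0.0.1"\n'
toml_version, init_version = declared_versions(pyproject_text, init_text)
print(toml_version == init_version)  # True
```

In a real checkout the two strings would come from reading pyproject.toml and ewht/__init__.py, for example with pathlib.Path.read_text.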

Development

pip install -e ".[esm]"
pytest tests/ -v -m "not slow"
python example_ewht.py

Download files

Source Distribution

ewht-0.0.1.tar.gz (3.1 MB)

Uploaded Source

Built Distribution

ewht-0.0.1-py3-none-any.whl (3.2 MB)

Uploaded Python 3

File details

Details for the file ewht-0.0.1.tar.gz.

File metadata

  • Download URL: ewht-0.0.1.tar.gz
  • Upload date:
  • Size: 3.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.14

File hashes

Hashes for ewht-0.0.1.tar.gz
Algorithm Hash digest
SHA256 db3cdd4bb9cf30d2965ed7dfd564eb4eac91bbb4225461c20241b627461d4afc
MD5 dd56222b6e4bb94e8536114aefb43a8b
BLAKE2b-256 d4156a93fdb833c1ca1f208ac84b656285ada30cf193d725c7788c3c23b5e9f6

File details

Details for the file ewht-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: ewht-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 3.2 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.14

File hashes

Hashes for ewht-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 3c9038a879b5ae8e6ae8e2070cddd41968bddbb737317fec6fbaf33670134c71
MD5 cdb70df98ad6fc6f008364c59d4910ad
BLAKE2b-256 27d5c1ce8d57601f4834b26218864d8f97845b5829fcf9417a9259794c754e42
