
CLI and helpers for scoring Drosophila proboscis responses from envelope or raw coordinate data.


FlyPCA

FlyPCA provides a reproducible, event-aligned lag-embedded PCA workflow for Drosophila proboscis-distance time series. The package smooths and baseline-normalizes traces, performs Hankel (time-delay) embedding, learns compact principal components, derives interpretable behavioral features, and clusters trials into reaction vs. non-reaction cohorts.
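
The baseline normalization mentioned above can be illustrated with a minimal sketch. It assumes the baseline is simply every sample before odor onset; the function name `baseline_zscore` and the `eps` guard are hypothetical and are not taken from the package.

```python
from statistics import fmean, pstdev


def baseline_zscore(distance, odor_on_idx, eps=1e-9):
    """Z-score a trace using only the pre-odor samples as the baseline."""
    baseline = distance[:odor_on_idx]
    if len(baseline) < 2:
        raise ValueError("need at least two pre-odor samples for a baseline")
    mu = fmean(baseline)
    sigma = pstdev(baseline)
    # eps (an assumption of this sketch) guards against a flat baseline.
    return [(x - mu) / (sigma + eps) for x in distance]


trace = [1.0, 1.2, 0.8, 1.0, 3.0, 5.0]
normalized = baseline_zscore(trace, odor_on_idx=4)
```

After this step the pre-odor segment has roughly zero mean and unit variance, so post-odor values read as deviations from that trial's own baseline.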

Pipeline Overview

  1. Ingest trial CSVs or manifests (trial_id, fly_id, distance, odor indices).
  2. Preprocess each trial with Savitzky–Golay smoothing, optional low-pass filtering, and pre-odor z-scoring.
  3. Lag Embed & PCA using Hankel matrices to preserve local temporal structure; fit PCA or IncrementalPCA.
  4. Project trials into PC trajectories aligned to odor onset.
  5. Engineer Features capturing temporal dynamics, velocity, Hilbert envelope, frequency bands, and PC-space summaries.
  6. Cluster & Evaluate with GMM or HDBSCAN and compute silhouette, Calinski–Harabasz, AUROC, and AUPRC (leave-one-fly-out).
  7. Visualize & Report scree plots, loadings, trajectories, cluster scatter, violin plots, and markdown reports.
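
Step 3 can be made concrete with a small sketch of Hankel (time-delay) embedding. This is an illustration of the general technique in plain Python, not the package's implementation; `hankel_embed` and `n_lags` are hypothetical names, and the real lag count and stride are set by the package config.

```python
def hankel_embed(series, n_lags):
    """Return the Hankel (time-delay) matrix of a 1-D series.

    Row i holds series[i : i + n_lags], so consecutive rows overlap by
    n_lags - 1 samples and each row describes a short local window.
    """
    if not 1 <= n_lags <= len(series):
        raise ValueError("n_lags must be between 1 and len(series)")
    return [list(series[i:i + n_lags]) for i in range(len(series) - n_lags + 1)]


trace = [0.0, 0.1, 0.4, 0.9, 1.6, 2.5]
embedded = hankel_embed(trace, n_lags=3)
# embedded has len(trace) - n_lags + 1 = 4 rows of 3 samples each
for row in embedded:
    print(row)
```

PCA would then be fit on the rows stacked across trials, so each principal component is a short temporal motif rather than a single-sample weight.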

Quickstart

make venv
source .venv/bin/activate
make install
make test

Generate a synthetic demo dataset and full report:

make demo

Running on Real Data

  1. Assemble a manifest or wide CSV describing each trial.

    • Stacked format: one row per timepoint with columns trial_id, fly_id, distance, odor_on_idx, optional odor_off_idx, optional time, and optional fps.
    • Wide format: one row per trial where the time series samples occupy columns with a consistent prefix (e.g., dir_val_0, dir_val_1, …). Provide metadata columns for trial identity, fly identity, odor indices, and fps.
  2. Map column names in the config. Copy configs/default.yaml and update the io section to match your data. Example for a wide-format file:

    io:
      format: wide
      read_csv:
        low_memory: false
        dtype:
          trial_label: str
      wide:
        trial_id_column: trial_label
        trial_id_template: "{fly}_{trial_label}"
        fly_id_column: fly
        fps_column: fps
        odor_on_value: 1230
        odor_off_value: 2430
        time_columns:
          prefix: dir_val_
    

    Declaring dtype for columns that mix numbers and text (here trial_label), together with low_memory: false, keeps pandas from emitting mixed-type warnings. For stacked data, adjust io.stacked.distance_column, io.stacked.time_column, etc., instead.

  3. Verify indices: odor_on_idx and odor_off_idx are frame indices (0-based). They must be within [0, n_frames) and odor_on_idx < odor_off_idx. Ensure the time column is strictly increasing if present; for wide data the loader generates time stamps using fps.

  4. Run the CLI pipeline. The commands below fit the lag-embedded PCA model, project each trial, engineer features, cluster reactions, and generate a Markdown report with key plots.

flypca fit-lag-pca \
  --data data/manifest.csv \
  --config configs/default.yaml \
  --out artifacts/models/lagpca.joblib

flypca project \
  --model artifacts/models/lagpca.joblib \
  --data data/manifest.csv \
  --out artifacts/projections/

flypca features \
  --data data/manifest.csv \
  --config configs/default.yaml \
  --model artifacts/models/lagpca.joblib \
  --projections artifacts/projections/ \
  --out artifacts/features.parquet

flypca cluster \
  --features artifacts/features.parquet \
  --config configs/default.yaml \
  --projections-dir artifacts/projections/ \
  --method gmm \
  --out artifacts/cluster.csv \
  --labels-path data/labels.csv \
  --labels-column-name user_score_odor \
  --label-column user_score_odor

flypca report \
  --features artifacts/features.parquet \
  --clusters artifacts/cluster.csv \
  --model artifacts/models/lagpca.joblib \
  --projections artifacts/projections/ \
  --out-dir artifacts/

Outputs are written under artifacts/ by default: the trained PCA model (models/), projected PC trajectories (projections/), engineered features (features.parquet), clustering assignments, summary figures (figures/), and a Markdown report describing variance explained, cluster metrics, and representative trajectories.
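
The index rules from step 3 can be checked before running the pipeline. The sketch below is a standalone pre-flight check, not part of the flypca API; `validate_trial` and `timestamps_from_fps` are hypothetical helper names, and the evenly spaced time stamps assume a constant frame rate.

```python
def validate_trial(n_frames, odor_on_idx, odor_off_idx=None, time=None):
    """Raise ValueError if a trial violates the index rules from step 3."""
    if not 0 <= odor_on_idx < n_frames:
        raise ValueError(f"odor_on_idx {odor_on_idx} outside [0, {n_frames})")
    if odor_off_idx is not None:
        if not 0 <= odor_off_idx < n_frames:
            raise ValueError(f"odor_off_idx {odor_off_idx} outside [0, {n_frames})")
        if odor_on_idx >= odor_off_idx:
            raise ValueError("odor_on_idx must be smaller than odor_off_idx")
    if time is not None:
        if len(time) != n_frames:
            raise ValueError("time column length does not match n_frames")
        if any(b <= a for a, b in zip(time, time[1:])):
            raise ValueError("time column is not strictly increasing")


def timestamps_from_fps(n_frames, fps):
    """Evenly spaced time stamps in seconds, one per frame."""
    return [i / fps for i in range(n_frames)]


time = timestamps_from_fps(n_frames=200, fps=40)
validate_trial(n_frames=200, odor_on_idx=80, odor_off_idx=120, time=time)
```

The call returns silently for a valid trial and raises a ValueError naming the first rule that fails.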

CLI entry points (Typer-based):

flypca fit-lag-pca --data data/manifest.csv --config configs/default.yaml --out artifacts/models/lagpca.joblib
flypca project --model artifacts/models/lagpca.joblib --data data/manifest.csv --out artifacts/projections/
flypca features --data data/manifest.csv --config configs/default.yaml --model artifacts/models/lagpca.joblib --projections artifacts/projections/ --out artifacts/features.parquet
flypca cluster --features artifacts/features.parquet --config configs/default.yaml --projections-dir artifacts/projections/ --method gmm --out artifacts/cluster.csv --label-column reaction

# cluster with label CSV
flypca cluster \
  --features artifacts/features.parquet \
  --config configs/default.yaml \
  --projections-dir artifacts/projections/ \
  --labels-path data/labels.csv \
  --labels-column-name user_score_odor \
  --out artifacts/cluster.csv

flypca report --features artifacts/features.parquet --clusters artifacts/cluster.csv --model artifacts/models/lagpca.joblib --projections artifacts/projections/ --out-dir artifacts/
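
For scripted runs, the five stages can be chained from Python. The sketch below only assembles the command lines already shown in this README (the GMM clustering variant without labels); `build_pipeline_commands` and `run_pipeline` are hypothetical helpers, not part of flypca.

```python
import subprocess


def build_pipeline_commands(data, config, out_dir):
    """Return the five flypca invocations from this README as argv lists."""
    model = f"{out_dir}/models/lagpca.joblib"
    projections = f"{out_dir}/projections/"
    features = f"{out_dir}/features.parquet"
    clusters = f"{out_dir}/cluster.csv"
    return [
        ["flypca", "fit-lag-pca", "--data", data, "--config", config, "--out", model],
        ["flypca", "project", "--model", model, "--data", data, "--out", projections],
        ["flypca", "features", "--data", data, "--config", config,
         "--model", model, "--projections", projections, "--out", features],
        ["flypca", "cluster", "--features", features, "--config", config,
         "--projections-dir", projections, "--method", "gmm", "--out", clusters],
        ["flypca", "report", "--features", features, "--clusters", clusters,
         "--model", model, "--projections", projections, "--out-dir", f"{out_dir}/"],
    ]


def run_pipeline(data, config, out_dir="artifacts"):
    """Run each stage in order, stopping at the first non-zero exit code."""
    for argv in build_pipeline_commands(data, config, out_dir):
        subprocess.run(argv, check=True)


# Print the commands without executing them.
for argv in build_pipeline_commands("data/manifest.csv", "configs/default.yaml", "artifacts"):
    print(" ".join(argv))
```

Calling run_pipeline("data/manifest.csv", "configs/default.yaml") would execute the stages, assuming flypca is installed and on the PATH.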

Clustering configuration

  • standardize: z-score the feature/projection matrix before fitting the mixture model (enabled by default).
  • min_variance: drop near-constant columns prior to clustering to prevent degeneracy.
  • component_range: sweep a range of Gaussian mixture sizes (inclusive) and pick the lowest-BIC model with a valid silhouette.
  • covariance_types: evaluate multiple covariance structures (full, diag, etc.) during the sweep.
  • use_projections: auto by default; if projections are supplied they are incorporated automatically, otherwise the feature table alone is clustered. Set to true or false to force behaviour.
  • combine_with_features: auto by default; when projections are used they are concatenated with engineered features unless explicitly disabled.
  • projection_components / projection_timepoints: cap how many PCs and aligned samples are flattened from the NPZ files.
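
To illustrate what min_variance and standardize do to the matrix before the mixture model is fit, here is a minimal sketch in plain Python. It is not the package's code; the function name and the default threshold of 1e-8 are assumptions made for the example.

```python
from statistics import fmean, pvariance


def filter_and_standardize(rows, min_variance=1e-8):
    """Drop near-constant columns, then z-score the remaining ones.

    rows is a list of equal-length feature vectors (one per trial).
    Returns (kept_column_indices, standardized_rows).
    """
    columns = list(zip(*rows))
    kept = [j for j, col in enumerate(columns) if pvariance(col) > min_variance]
    standardized_cols = []
    for j in kept:
        col = columns[j]
        mu = fmean(col)
        sigma = pvariance(col) ** 0.5
        standardized_cols.append([(x - mu) / sigma for x in col])
    # Transpose back to one row per trial.
    return kept, [list(r) for r in zip(*standardized_cols)]


features = [
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.4],
    [3.0, 5.0, 0.9],
]
kept, z = filter_and_standardize(features)
print(kept)  # → [0, 2]
```

The constant middle column is removed because a zero-variance feature cannot be z-scored and would make the covariance estimate degenerate.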

Label CSVs can be merged on-the-fly using --labels-path and --labels-column-name. The helper derives trial_id values by applying the configured template (e.g. {fly}_{trial_label}) or, if absent, by combining fly and trial_label columns. The merged column is available for clustering diagnostics and supervised AUROC/AUPRC evaluation.
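
The trial_id derivation described above amounts to string templating over the label CSV's columns. The sketch below assumes each row is available as a dict; `derive_trial_id` is a hypothetical name, and the underscore separator in the fallback is an assumption, since the text only says the two columns are combined.

```python
def derive_trial_id(row, template=None):
    """Build a trial_id from a label-CSV row (a dict of column -> value)."""
    if template:
        # str.format raises KeyError if the template names a missing column.
        return template.format(**row)
    # Fallback: join the fly and trial_label columns (separator assumed).
    return f"{row['fly']}_{row['trial_label']}"


row = {"fly": "flyA", "trial_label": "3", "user_score_odor": "1"}
print(derive_trial_id(row, "{fly}_{trial_label}"))  # → flyA_3
print(derive_trial_id(row))                         # → flyA_3

# Build a trial_id -> label lookup that could be joined onto the feature table.
labels = {derive_trial_id(r, "{fly}_{trial_label}"): r["user_score_odor"] for r in [row]}
```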

When use_projections is enabled the CLI expects projections/manifest.csv (written by flypca project) so trial IDs can be matched automatically.

Expected data layout for manifests:

manifest.csv:
path,trial_id,fly_id,odor_on_idx,odor_off_idx,fps
trial001.csv,tr1,flyA,80,120,40
...

trial001.csv:
frame,time,distance
0,0.00,1.23
...
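
A manifest in this layout can be inspected with the standard csv module before handing it to the CLI. The sketch below reads from in-memory strings so it runs as-is; the second sample row of the trial file is made up for illustration, and `read_manifest` and `read_trial` are hypothetical helpers rather than flypca functions.

```python
import csv
import io

MANIFEST = """path,trial_id,fly_id,odor_on_idx,odor_off_idx,fps
trial001.csv,tr1,flyA,80,120,40
"""

# Second data row is illustrative only (one frame later at 40 fps).
TRIAL001 = """frame,time,distance
0,0.00,1.23
1,0.025,1.25
"""


def read_manifest(handle):
    """Parse manifest rows, converting index and fps columns to numbers."""
    trials = []
    for row in csv.DictReader(handle):
        row["odor_on_idx"] = int(row["odor_on_idx"])
        row["odor_off_idx"] = int(row["odor_off_idx"])
        row["fps"] = float(row["fps"])
        trials.append(row)
    return trials


def read_trial(handle):
    """Return (time, distance) lists from a per-trial CSV."""
    time, distance = [], []
    for row in csv.DictReader(handle):
        time.append(float(row["time"]))
        distance.append(float(row["distance"]))
    return time, distance


trials = read_manifest(io.StringIO(MANIFEST))
time, distance = read_trial(io.StringIO(TRIAL001))
print(trials[0]["trial_id"], trials[0]["odor_on_idx"], distance[0])  # → tr1 80 1.23
```

With real files, replace the io.StringIO objects with open(path, newline="") handles.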

Testing & Quality

  • Type-annotated, vectorized preprocessing and feature routines.
  • Deterministic seeds; logging records parameter settings and array shapes.
  • Pytest suite covers preprocessing, PCA embedding, feature extraction, and end-to-end synthetic performance (AUROC > 0.8).
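
For readers unfamiliar with the AUROC threshold quoted above, the metric can be computed directly from its pairwise definition. This is a reference sketch of the standard Mann-Whitney formulation, not the evaluation code used by the test suite; the labels and scores are invented for the example.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    Counts, over all (positive, negative) pairs, how often the positive
    trial scores higher; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
# 8 of the 9 positive/negative pairs are ranked correctly.
print(auroc(labels, scores))
```

A value of 0.5 corresponds to chance ranking and 1.0 to perfect separation of reaction and non-reaction trials.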

Interpreting PCs

  • PC1 typically correlates with response amplitude and integrates the rising phase post-odor.
  • PC2 captures latency and decay kinetics when present.
  • Time-aligned PC trajectories and feature table outputs (parquet) enable downstream classifiers or visualization in standard tools.

Make Targets

  • make venv: create .venv using Python 3.11.
  • make install: install flypca in editable mode with requirements.
  • make test: run unit tests (pytest -q).
  • make demo: synthesize data, run the full CLI pipeline, and emit artifacts (models, projections, features, clusters, figures, report).

Refer to examples/01_synthetic_demo.ipynb for a notebook walkthrough replicating the pipeline with code and inline commentary.
