
Command-line interface for Olmsted data processing


olmsted-cli

Command-line interface and data processing utilities for the Olmsted web app. The Olmsted web application can be launched locally from its git repository and is also available at https://www.olmstedviz.org.

See also:

  • FORMATS.md: Input/output format specifications, field mapping, validation
  • ARCHITECTURE.md: System architecture, data flow, design decisions
  • DEVELOPMENT.md: Development guide, testing, contributing
  • CLAUDE.md: AI assistant guidance, code quality rules

Overview

olmsted-cli is a Python package that converts immunological data in AIRR or PCP format into the Olmsted JSON format for visualization in the Olmsted web application. It handles sequencing data, reconstructs phylogenetic trees, and can compute metrics for clonal family analysis (LBI, LBR, affinity, and mutation frequency).

Typical Workflow

  1. Process your data: Use olmsted-cli to convert your AIRR or PCP format files into Olmsted JSON format
  2. Open Olmsted web app: Launch the application locally or visit https://www.olmstedviz.org
  3. Load your processed files: Upload the Olmsted JSON file(s)
  4. Visualize: Explore your data with interactive visualizations

Example:

# Convert your PCP data to Olmsted format
olmsted process -i pcp.csv --tree trees.csv -o olmsted_data.json --compute-metrics

Supported Formats

  • AIRR (Adaptive Immune Receptor Repertoire): JSON format following AIRR Community standards
  • PCP (Parent-Child Pair): CSV file of parent-child pairs, accompanied by a separate trees CSV file containing Newick strings

Output Formats

  • Consolidated (default): Single JSON file containing all data. Recommended for most workflows.
  • Unbundled (--unbundle): Separates data into component files (datasets.json, clones..json, tree..json) for backwards compatibility with older Olmsted versions.


Installation

Recommended (pipx)

Install using pipx for isolated environment:

pipx install olmsted-cli

Standard Installation

Install using pip:

pip install olmsted-cli

From source (development / latest)

Install the latest unreleased code straight from GitHub:

pipx install git+https://github.com/matsengrp/olmsted-cli.git
# or, for a local checkout you intend to edit:
pip install -e ".[dev]"

Quick Start

# Process AIRR format data (auto-detected)
olmsted process -i data.json -o output/olmsted_data.json

# Process PCP format data with phylogenetic metrics
olmsted process -i sequences.csv --tree trees.csv -o output/data.json --compute-metrics

Available Commands

Overview

Command | Purpose
process | Primary command: converts AIRR or PCP input data into Olmsted-readable JSON
tag | Add field metadata to existing Olmsted JSON files
merge | Merge external mutation-level CSV data into existing Olmsted JSON files
build-config | Generate a YAML config from your data for editing
validate | Verify that data files conform to the Olmsted schema
summary | Generate a statistics and metadata report for processed data
split | Divide large consolidated files into smaller chunks for performance

Commands

process - Process Data Files

Convert AIRR or PCP format data into Olmsted JSON format.

Basic Usage

# Auto-detect format
olmsted process -i input.json -o output.json

# Explicitly specify format
olmsted process -i input.csv -f pcp -o output.json

Input/Output Options

Option | Description
-i, --inputs FILES | Input file(s). For AIRR: one or more JSON files. For PCP: CSV file
-o, --output FILE | Output file path for consolidated JSON
--unbundle DIR | Unbundle output into separate component files (datasets.json, clones..json, tree..json) for backwards compatibility with older versions of the Olmsted web app
-f, --format {airr,pcp,auto} | Input format (default: auto-detect)
-t, --tree FILE | Trees file for PCP format (optional, can be gzipped)
--mutations FILE | Mutation-level CSV file to merge into tree nodes after processing (see merge command)
-c, --config FILE | YAML configuration file (CLI arguments override config values)

Processing Options

Option | Description
-n, --name NAME | Optional dataset name (stored in metadata)
--validate | Validate output against schemas before writing
--strict-validation | Exit with error if validation fails
--allow-duplicate-ids | Downgrade duplicate-*_id errors to warnings and pass data through unchanged. Without this flag, processing fails when dataset_id, clone_id, tree_id, sample_id, or subject_id collide within their natural uniqueness scope.
--seed INT | Random seed for deterministic UUID generation
--batch-size N | Clonal families per streaming batch (default: 50). Bounds peak memory by spooling each batch's clones/trees to disk and stream-stitching the final consolidated JSON. Pass 0 to disable streaming and use the legacy one-shot path. Small inputs (n_families ≤ batch_size) skip the spool automatically.
-v, --verbose {0,1,2,3} | Verbosity: 0=quiet, 1=normal (default), 2=verbose, 3=debug
-q, --quiet | Quiet mode: only show errors (equivalent to -v 0)
-w, --warnings | Show warnings when tree and PCP data disagree (PCP only)
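
The streaming behaviour described for --batch-size can be pictured with a minimal Python sketch. It is not olmsted-cli's implementation: the helper name write_families, the JSON-lines spool files, and the flat list-of-families output are all assumptions made for illustration, and the input is already held in memory here purely to keep the example short.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Sequence


def write_families(families: Sequence[dict], out_path: Path, batch_size: int = 50) -> bool:
    """Write families as one JSON array; return True if the disk spool was used.

    Hypothetical helper for illustration, not part of olmsted-cli.
    """
    # batch_size == 0 disables streaming; small inputs skip the spool as well.
    if batch_size == 0 or len(families) <= batch_size:
        out_path.write_text(json.dumps(list(families)))
        return False

    with tempfile.TemporaryDirectory() as spool_dir:
        spool_files: List[Path] = []
        # Spool each batch to its own file, one JSON object per line.
        for start in range(0, len(families), batch_size):
            spool = Path(spool_dir) / f"batch_{len(spool_files)}.jsonl"
            with spool.open("w") as fh:
                for family in families[start:start + batch_size]:
                    fh.write(json.dumps(family) + "\n")
            spool_files.append(spool)

        # Stream-stitch: copy spooled lines into a single JSON array,
        # one record at a time, instead of re-serializing everything at once.
        with out_path.open("w") as out:
            out.write("[")
            first = True
            for spool in spool_files:
                with spool.open() as fh:
                    for line in fh:
                        if not first:
                            out.write(",")
                        out.write(line.rstrip("\n"))
                        first = False
            out.write("]")
    return True
```

The point of the design is that the final file is assembled by copying already-serialized records, so peak memory during the write is bounded by one record rather than by the whole dataset.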

PCP-Specific Options

Option | Description
--compute-metrics | Compute LBI, LBR, affinity, and mutation frequency for all nodes
--lbi-tau FLOAT | Time scale parameter for LBI calculation (default: 0.0125)
--standardize-names | Rename nodes to standard format: naive (root), Node1, Node2, ...

AIRR-Specific Options

Option | Description
--naive-name NAME | Name of naive/root node for tree rooting (default: "naive")
-r, --root-trees | Root trees using naive node

Examples

# Auto-detect AIRR format and process multiple input files
olmsted process -i dataset1.json dataset2.json -o combined.json

# Process PCP format with separate trees file and compute metrics
olmsted process -i sequences.csv --tree trees.csv -o output.json --compute-metrics

Input Formats

PCP CSV Format

Expected columns in the PCP CSV file:

Column | Description
sample_id | Sample identifier
family | Clonal family identifier
parent_name | Parent node name (use "naive" for root)
parent_heavy | Parent heavy chain sequence
child_name | Child node name
child_heavy | Child heavy chain sequence
branch_length | Branch length between parent and child
depth | Depth in tree
distance | Distance from root
v_gene_heavy | V gene assignment
j_gene_heavy | J gene assignment
cdr1_codon_start_heavy | CDR1 start position
cdr1_codon_end_heavy | CDR1 end position
cdr2_codon_start_heavy | CDR2 start position
cdr2_codon_end_heavy | CDR2 end position
cdr3_codon_start_heavy | CDR3 start position
cdr3_codon_end_heavy | CDR3 end position
parent_is_naive | Boolean indicating if parent is naive/root
child_is_leaf | Boolean indicating if child is a leaf node
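
Before running olmsted process, it can help to check a PCP file against the column list above. The following Python sketch does that and groups the parent-child edges per family. The helper names missing_columns and edges_by_family are hypothetical, and treating every listed column as required is an assumption; the tool itself may be more lenient.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple

# Columns listed in the PCP CSV table.
PCP_COLUMNS = [
    "sample_id", "family", "parent_name", "parent_heavy", "child_name",
    "child_heavy", "branch_length", "depth", "distance", "v_gene_heavy",
    "j_gene_heavy", "cdr1_codon_start_heavy", "cdr1_codon_end_heavy",
    "cdr2_codon_start_heavy", "cdr2_codon_end_heavy",
    "cdr3_codon_start_heavy", "cdr3_codon_end_heavy",
    "parent_is_naive", "child_is_leaf",
]

Edge = Tuple[str, str, float]


def missing_columns(header: List[str]) -> List[str]:
    """Return the expected PCP columns that are absent from a CSV header."""
    present = set(header)
    return [col for col in PCP_COLUMNS if col not in present]


def edges_by_family(csv_text: str) -> Dict[Tuple[str, str], List[Edge]]:
    """Group (parent, child, branch_length) edges by (sample_id, family)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = missing_columns(list(reader.fieldnames or []))
    if missing:
        raise ValueError(f"missing PCP columns: {missing}")
    edges: Dict[Tuple[str, str], List[Edge]] = defaultdict(list)
    for row in reader:
        key = (row["sample_id"], row["family"])
        edges[key].append(
            (row["parent_name"], row["child_name"], float(row["branch_length"]))
        )
    return dict(edges)
```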

Trees CSV Format

Expected columns in the trees file:

Column | Description
family_name | Clonal family identifier (must match family in PCP CSV)
sample_id | Sample identifier (must match sample_id in PCP CSV)
newick_tree | Newick format tree string for the family
tree_id (optional) | Stable per-tree identifier. Required to disambiguate multiple rows per (family_name, sample_id), i.e. alternate phylogenetic reconstructions of the same clonal family. Synthesized as tree-{family_id} when absent.
reconstruction_method (optional) | Label for the method that built this tree (e.g. "dnapars", "raxml_ng"). Written to tree.reconstruction_method on the output; left unset when the column is absent.

Multiple rows with the same (family_name, sample_id) produce multiple entries in clone.trees[] (one per alternate reconstruction). Duplicate tree_id values within a clone fail the output uniqueness check; see --allow-duplicate-ids to opt out.
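
The grouping and uniqueness rules above can be sketched in a few lines of Python. This is an illustration under assumptions, not the package's code: group_trees is a hypothetical helper, the family_name value stands in for the family_id used in the synthesized identifier, and an empty tree_id cell is treated the same as an absent column.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_trees(rows: List[dict]) -> Dict[Tuple[str, str], List[dict]]:
    """Group trees-CSV rows by (family_name, sample_id), checking tree_id uniqueness."""
    grouped: Dict[Tuple[str, str], List[dict]] = defaultdict(list)
    for row in rows:
        # Fall back to a synthesized ID when the optional column is absent or empty.
        tree_id = row.get("tree_id") or f"tree-{row['family_name']}"
        key = (row["family_name"], row["sample_id"])
        if any(tree["tree_id"] == tree_id for tree in grouped[key]):
            raise ValueError(f"duplicate tree_id {tree_id!r} for {key}")
        grouped[key].append({"tree_id": tree_id, "newick": row["newick_tree"]})
    return dict(grouped)
```

Two rows for the same family without a tree_id both receive the same synthesized identifier and therefore collide, which is why the column is needed when supplying alternate reconstructions.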


tag - Add Field Metadata to Existing Files

Add field_metadata to pre-built Olmsted JSON files. This is useful for data produced outside the standard process pipeline.

Basic Usage

# Introspect fields and add metadata
olmsted tag -i data.json -o tagged.json

# With custom field declarations
olmsted tag -i data.json -o tagged.json -c config.yaml

# Modify file in place
olmsted tag -i data.json --in-place -c config.yaml

Options

Option | Description
-i, --input FILE | Input Olmsted JSON file
-o, --output FILE | Output file path (required unless --in-place)
--in-place | Modify the input file in place
-c, --config FILE | YAML config with custom field declarations
--json-format {pretty,compact} | JSON output format (default: pretty)
-v, --verbose | Show detailed output

merge - Merge External Mutation Data into Olmsted JSON

Attach mutation-level annotations (e.g., surprise scores, selection contributions) from an external CSV onto an existing Olmsted JSON file. For each tree node, mutations are derived from parent/child amino acid sequence diffs and matched against CSV rows by (family, site, parent_aa, child_aa). Matching CSV columns are merged onto the mutation records, and field_metadata is regenerated so the new fields appear in the web app's controls.

The same logic is also available during initial processing via olmsted process --mutations — see the process section.

merge also backfills per-node length/distance from each tree's newick string when the branch lengths are present there but missing on the nodes (the common case for a hand-built base JSON). Without this, the webapp's "evolutionary distance from naive" branch-length mode silently falls back to topological depth. Values already on the nodes are left untouched.

Basic Usage

# Merge mutation scores into an existing Olmsted JSON
olmsted merge -i base.json --mutations scores.csv -o output.json

# In-place modification
olmsted merge -i base.json --mutations scores.csv --in-place

# With a config file (preserves custom_fields declarations)
olmsted merge -i base.json --mutations scores.csv -c config.yaml -o output.json

Options

Option | Description
-i, --input FILE | Input Olmsted JSON file
--mutations FILE | Mutations CSV file (see format below)
--mutations-use-depth | Use the CSV's depth column as a match-key participant or integrity check (opt-in)
--mutations-allow-mismatch | Downgrade integrity mismatches from a hard failure to a warning
--mutations-listed-only | Treat the CSV as authoritative: drop derived mutations on CSV-matched trees that don't appear in the CSV
-o, --output FILE | Output file path (required unless --in-place)
--in-place | Modify the input file in place (refused if zero trees match)
-c, --config FILE | YAML config with custom field declarations
--json-format {pretty,compact} | JSON output format (default: pretty)
-v, --verbose {0,1,2,3} | Verbosity level (use -v 2 for per-family unmatched detail)

Mutations CSV Format

Required columns:

Column | Description
family | Clonal family identifier (joined against clone_id in the Olmsted JSON)
site | Integer amino acid position (0-based)
parent_aa | Single-character parent amino acid
child_aa | Single-character child amino acid

Any additional columns become mutation-level fields on matching nodes (e.g., surprise_mutsel, selection_contribution, log_selection_factor). These known structural columns are recognized but not added to the output: sample_id, pcp_index, depth.
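
As a rough picture of the matching described for merge, the Python sketch below derives substitutions from a parent/child amino acid pair and copies the extra CSV columns onto the records that match on (family, site, parent_aa, child_aa). The function names are hypothetical, the sequences are assumed to be aligned and of equal length, and CSV values are left as strings; the real command may differ on all three points.

```python
from typing import Dict, List, Tuple

# Structural columns named in the text: recognized but not copied to the output.
STRUCTURAL = {"sample_id", "pcp_index", "depth"}
KEY_COLUMNS = ("family", "site", "parent_aa", "child_aa")


def derive_mutations(parent_aa: str, child_aa: str) -> List[dict]:
    """List substitutions between two equal-length sequences (0-based sites)."""
    return [
        {"site": i, "parent_aa": p, "child_aa": c}
        for i, (p, c) in enumerate(zip(parent_aa, child_aa))
        if p != c
    ]


def merge_annotations(family: str, mutations: List[dict], csv_rows: List[dict]) -> int:
    """Copy extra CSV columns onto matching mutation records; return the match count."""
    index: Dict[Tuple[str, int, str, str], dict] = {}
    for row in csv_rows:
        key = (row["family"], int(row["site"]), row["parent_aa"], row["child_aa"])
        index[key] = row
    matched = 0
    for mut in mutations:
        row = index.get((family, mut["site"], mut["parent_aa"], mut["child_aa"]))
        if row is None:
            continue  # derived mutation with no CSV annotation
        for col, value in row.items():
            if col not in KEY_COLUMNS and col not in STRUCTURAL:
                mut[col] = value
        matched += 1
    return matched
```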

Reporting

At normal verbosity, merge reports:

  • Total CSV rows loaded and number of families
  • Trees matched, mutations merged, nodes affected
  • Warning if any CSV families had no matching clone_id in the JSON
  • Warning if CSV rows in matched families had no corresponding derived mutation

Use -v 2 to see per-family details about which specific (site, parent_aa, child_aa) tuples didn't match.


build-config - Generate Config from Data

Introspect your data and generate a YAML config listing processing options, every discoverable field with its inferred type/label/sample values, and cross-format alias suggestions. Edit the config, then use it with process or tag.

Typical Workflow

# 1. Generate a config from your data
olmsted build-config -i data.json -o config.yaml

# 2. Edit config.yaml — remove fields you don't need, fix labels, adjust types

# 3. Use the config to tag your data
olmsted tag -i data.json -o tagged.json -c config.yaml

Options

Option | Description
-i, --input FILE | Input Olmsted JSON file to introspect
-o, --output FILE | Output YAML file (default: print to stdout)

Example Output

custom_fields:
  # --- Family level (clonal family — scatterplot axes, color, facet) ---
  - name: mean_mut_freq
    level: family
    type: continuous
    label: "Mean Mutation Frequency"
    # sample values: 0.115, 0.056, 0.036, ...

  - name: rearrangement_count
    output_name: unique_seqs_count    # suggested cross-format alias
    level: family
    type: continuous
    label: "Rearrangement Count"

  # --- Mutation level (alignment coloring) ---
  - name: selection_contribution
    level: mutation
    type: continuous
    label: "Selection Contribution"
    # range in data: [-2.5, 5.1]

  # =================================================================
  # Skipped fields (not included in output metadata)
  # =================================================================
  - name: partition
    level: family
    skip: true
    type: tooltip
    label: "Partition"

Configuration Files

Instead of passing all options on the command line, you can use a YAML configuration file. CLI arguments always override config values.

# Use a config file
olmsted process -c config.yaml

# Config with CLI overrides (CLI wins)
olmsted process -c config.yaml -i other_data.csv -o override.json
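
The precedence rule (CLI arguments win over config values) amounts to a simple overlay. A minimal Python sketch, assuming a hypothetical resolve_options helper in which a CLI value of None means the flag was not given:

```python
from typing import Any, Dict, Optional


def resolve_options(config: Dict[str, Any], cli: Dict[str, Optional[Any]]) -> Dict[str, Any]:
    """Overlay CLI values on config values; None means 'flag not given'."""
    resolved = dict(config)  # copy so the loaded config is left untouched
    for flag, value in cli.items():
        if value is not None:
            # Config keys use underscores, so "--lbi-tau" maps to "lbi_tau".
            resolved[flag.lstrip("-").replace("-", "_")] = value
    return resolved


config = {"inputs": ["data.csv"], "output": "output/result.json", "lbi_tau": 0.0125}
cli = {"--inputs": ["other_data.csv"], "--output": "override.json", "--lbi-tau": None}
options = resolve_options(config, cli)
```

Here inputs and output come from the command line, while lbi_tau keeps its config value because the flag was not passed.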

Default Configs

Default configuration files are included with the package as starting points. Copy one and customize it for your dataset:

Config | Format | Purpose
pcp.yaml | PCP | Standard PCP processing with all options documented
airr.yaml | AIRR | Standard AIRR processing with all options documented
olmsted.yaml | Tag | Custom field declarations for pre-built Olmsted JSON data

To copy a default config:

# Find the configs directory
python -c "import olmsted_cli.configs; print(olmsted_cli.configs.__path__[0])"

# Copy and customize
cp $(python -c "import olmsted_cli.configs; print(olmsted_cli.configs.__path__[0])")/pcp.yaml my_config.yaml

Config File Structure

# Standard CLI options (use underscores, not hyphens)
inputs: [data.csv]
output: output/result.json
tree: trees.csv
format: pcp
name: "My Dataset"
description: "Heavy chain BCR data from experiment X"
seed: 42
compute_metrics: true
lbi_tau: 0.0125
verbose: 1
validate: true

# Custom field declarations
custom_fields:
  - name: my_metric
    level: family             # family, node, branch, or mutation
    type: continuous          # continuous, categorical, tooltip, aa, or dna
    label: "My Metric"       # Display label in web app

  - name: internal_id
    level: family
    skip: true                # exclude from output metadata
    type: categorical
    label: "Internal ID"

Custom Fields

The custom_fields section lets you declare additional data fields that should appear in the web app's visualization controls. Each entry supports:

Key | Description
name | Field name as it appears in the input data
output_name (optional) | Renamed field in output (for cross-format alignment)
level | family (scatterplot), node (tree nodes), branch (branches), mutation (alignment)
type | continuous, categorical, tooltip, aa (amino acid), or dna (nucleotide)
label | Human-readable label for dropdowns and tooltips
skip (optional) | true to exclude from output metadata
range (optional) | [min, max] for continuous fields (color scale domain)

Levels: family is the preferred name for the clonal family level (also accepts clone as an alias). The output JSON uses clone internally for backward compatibility.

Types: aa and dna tell the web app to use the full genetic alphabet for color palettes, rather than just the values present in the data.

Standard fields (e.g., unique_seqs_count, v_call, lbi) are auto-detected and don't need to be declared. Use build-config to generate a starting config with all discoverable fields.
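
A hand-edited config is easy to get subtly wrong, so a quick pre-check of each custom_fields entry can save a processing run. The sketch below encodes the keys, levels, and types listed above. check_custom_field is a hypothetical helper, and treating name, level, type, and label as required is an assumption drawn from the fact that only the other keys are marked optional.

```python
from typing import List

LEVELS = {"family", "node", "branch", "mutation"}
TYPES = {"continuous", "categorical", "tooltip", "aa", "dna"}


def check_custom_field(entry: dict) -> List[str]:
    """Return the problems found in one custom_fields entry (empty if none)."""
    problems = []
    for key in ("name", "level", "type", "label"):
        if key not in entry:
            problems.append(f"missing key: {key}")
    # "clone" is accepted as an alias for the preferred "family" level.
    level = "family" if entry.get("level") == "clone" else entry.get("level")
    if level is not None and level not in LEVELS:
        problems.append(f"unknown level: {level}")
    if "type" in entry and entry["type"] not in TYPES:
        problems.append(f"unknown type: {entry['type']}")
    if "range" in entry:
        bounds = entry["range"]
        if entry.get("type") != "continuous":
            problems.append("range applies to continuous fields only")
        elif len(bounds) != 2 or bounds[0] > bounds[1]:
            problems.append("range must be [min, max]")
    return problems
```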


validate - Validate Data Files

Validate Olmsted/AIRR data files against schemas.

Basic Usage

# Auto-detect file type
olmsted validate data.json

# Validate specific file types
olmsted validate --dataset datasets.json
olmsted validate --clones clones.family1.json clones.family2.json
olmsted validate --tree tree.abc123.json

Options

Option | Description
--dataset FILE | Validate as dataset file
--clone FILE | Validate as single clone object
--clones FILES | Validate as clone collection
--tree FILE | Validate as single tree object
--trees FILES | Validate as tree collection
-v, --verbose | Show detailed validation output
--strict | Exit with error on first validation failure

Examples

# Validate complete consolidated file
olmsted validate output.json

# Verbose validation with strict mode
olmsted validate -v --strict processed_data.json

summary - Generate Summary Statistics

Analyze consolidated Olmsted data files and generate summary statistics.

Basic Usage

# Print summary to stdout
olmsted summary data.json

# Save summary to file
olmsted summary data.json -o summary.txt

# Output as JSON
olmsted summary --json data.json

Options

Option | Description
-o, --output FILE | Output file (default: stdout)
--json | Output summary as JSON format

Example Output

Olmsted Data Summary
====================
Datasets: 2
Total Clones: 1,234
Total Tree Nodes: 5,678
  - Leaf Nodes: 2,345
  - Internal Nodes: 3,333

Metrics Available:
  - LBI: Yes
  - LBR: Yes
  - Affinity: Yes
  - Mean Mutation Frequency: Yes

split - Split Large Files

Split consolidated Olmsted data files into smaller files for better performance.

Basic Usage

# Split into files with max 100 clones each
olmsted split -i large_data.json -o output_dir --max-clones 100

# Split with custom naming
olmsted split -i data.json -o splits --max-clones 50 --base-name my_dataset

Options

Option | Description
-i, --input FILE | Input consolidated JSON file to split
-o, --output-dir DIR | Output directory for split files
--max-clones INT | Maximum clones per output file (default: 100)
--base-name NAME | Base name for output files
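
To estimate how many files a given --max-clones value will produce, the chunking can be sketched as below. plan_split is a hypothetical helper, and the numbered file-name pattern is an assumption for illustration only; the names olmsted split actually writes are not specified here.

```python
import math
from typing import Dict, List


def plan_split(clones: List[dict], max_clones: int = 100,
               base_name: str = "split") -> Dict[str, List[dict]]:
    """Map hypothetical output file names to chunks of at most max_clones clones."""
    if max_clones < 1:
        raise ValueError("max_clones must be at least 1")
    n_files = math.ceil(len(clones) / max_clones)
    return {
        # File-name pattern is illustrative, not the tool's real naming scheme.
        f"{base_name}_{i + 1:03d}.json": clones[i * max_clones:(i + 1) * max_clones]
        for i in range(n_files)
    }
```

With 250 clones and a limit of 100, this yields three chunks of 100, 100, and 50 clones.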

Example Data

The repository includes example data for both formats:

# Clone repository to access examples
git clone https://github.com/matsengrp/olmsted-cli.git
cd olmsted-cli/example-data

# AIRR format examples
ls airr/

# PCP format examples
ls pcp/

Requirements

  • Python: 3.8 or higher
  • Dependencies (automatically installed):
    • ete3 ≥3.1.0
    • jsonschema ≥4.0.0
    • lxml ≥4.6.0
    • numpy ≥1.20.0
    • pyyaml ≥6.0
    • scipy ≥1.7.0
    • ntpl ≥0.0.4
    • tqdm ≥4.65.0

Development Setup

# Clone and install with dev dependencies
git clone https://github.com/matsengrp/olmsted-cli.git
cd olmsted-cli
pip install -e ".[dev]"

# Run tests
pytest

# Run linter
ruff check .


Last updated: 2026-06-09

