Command-line interface for Olmsted data processing

olmsted-cli

Command-line interface and data processing utilities for the Olmsted web app. The Olmsted web application can be launched locally from its git repository and is also available at https://www.olmstedviz.org.

Overview

olmsted-cli is a Python package that processes immunological data from AIRR and PCP formats into the Olmsted JSON format for visualization in the Olmsted web application. It handles sequencing data, reconstructs phylogenetic trees, and calculates various metrics for clonal family analysis.

Typical Workflow

1. Process your data: Use olmsted-cli to convert your AIRR or PCP format files into Olmsted JSON format
2. Open Olmsted web app: Launch the application locally or visit https://www.olmstedviz.org
3. Load your processed files: Upload the Olmsted JSON file(s)
4. Visualize: Explore your data with interactive visualizations

Example:

# Convert your PCP data to Olmsted format
olmsted process -i pcp.csv --tree trees.csv -o olmsted_data.json --compute-metrics

Supported Formats

AIRR (Adaptive Immune Receptor Repertoire): JSON format following AIRR Community standards
PCP (Parent-Child Pair): CSV file of parent-child pairs, with an optional separate trees CSV file containing Newick strings

Output Formats

Consolidated (default): Single JSON file containing all data - recommended for most workflows.
Unbundled (--unbundle): Separates data into component files (datasets.json, clones..json, tree..json) for backwards compatibility with older Olmsted versions.

Installation

Recommended (pipx)

Install using pipx for isolated environment:

pipx install olmsted-cli

Standard Installation

Install using pip:

pip install olmsted-cli

From source (development / latest)

Install the latest unreleased code straight from GitHub:

pipx install git+https://github.com/matsengrp/olmsted-cli.git
# or, for a local checkout you intend to edit:
pip install -e ".[dev]"

Quick Start

# Process AIRR format data (auto-detected)
olmsted process -i data.json -o output/olmsted_data.json

# Process PCP format data with phylogenetic metrics
olmsted process -i sequences.csv --tree trees.csv -o output/data.json --compute-metrics

Available Commands

Overview

Command	Purpose
`process`	Primary tool: converts AIRR or PCP format input data into Olmsted JSON format
`tag`	Add field metadata to existing Olmsted JSON files
`merge`	Merge external mutation-level CSV data into existing Olmsted JSON files
`build-config`	Generate a YAML config from your data for editing
`validate`	Verify data files conform to Olmsted schema
`summary`	Generate statistics and metadata report for processed data
`split`	Divide large consolidated files into smaller chunks for performance

Commands

`process` - Process Data Files

Convert AIRR or PCP format data into Olmsted JSON format.

Basic Usage

# Auto-detect format
olmsted process -i input.json -o output.json

# Explicitly specify format
olmsted process -i input.csv -f pcp -o output.json

Input/Output Options

Option	Description
`-i, --inputs FILES`	Input file(s). For AIRR: one or more JSON files. For PCP: CSV file
`-o, --output FILE`	Output file path for consolidated JSON
`--unbundle DIR`	Unbundle output into separate component files (datasets.json, clones..json, tree..json) for backwards compatibility with older versions of the Olmsted web app
`-f, --format {airr,pcp,auto}`	Input format (default: auto-detect)
`-t, --tree FILE`	Trees file for PCP format (optional, can be gzipped)
`--mutations FILE`	Mutation-level CSV file to merge into tree nodes after processing (see `merge` command)
`-c, --config FILE`	YAML configuration file (CLI arguments override config values)

Processing Options

Option	Description
`-n, --name NAME`	Optional dataset name (stored in metadata)
`--validate`	Validate output against schemas before writing
`--strict-validation`	Exit with error if validation fails
`--allow-duplicate-ids`	Downgrade duplicate-`*_id` errors to warnings and pass data through unchanged. Without this flag, processing fails when `dataset_id`, `clone_id`, `tree_id`, `sample_id`, or `subject_id` collide within their natural uniqueness scope.
`--seed INT`	Random seed for deterministic UUID generation
`--batch-size N`	Clonal families per streaming batch (default: 50). Bounds peak memory by spooling each batch's clones/trees to disk and stream-stitching the final consolidated JSON. Pass `0` to disable streaming and use the legacy one-shot path. Small inputs (n_families ≤ batch_size) skip the spool automatically.
`-v, --verbose {0,1,2,3}`	Verbosity: 0=quiet, 1=normal (default), 2=verbose, 3=debug
`-q, --quiet`	Quiet mode - only show errors (equivalent to `-v 0`)
`-w, --warnings`	Show warnings when tree and PCP data disagree (PCP only)
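
The `--seed` option above makes generated identifiers reproducible. One common way to get that behaviour is to draw the UUID bits from a seeded random generator. The sketch below only illustrates the idea and is not taken from the olmsted-cli source; the helper name `seeded_uuids` is hypothetical.

```python
import random
import uuid


def seeded_uuids(seed, count):
    """Return `count` version-4-style UUIDs drawn from a seeded generator."""
    rng = random.Random(seed)
    # version=4 overwrites the version/variant bits so the result is a valid UUID4.
    return [uuid.UUID(int=rng.getrandbits(128), version=4) for _ in range(count)]


first = seeded_uuids(42, 3)
second = seeded_uuids(42, 3)
print(first == second)  # the same seed yields the same identifiers
```

With this approach, two runs over the same input and seed assign identical IDs, which keeps diffs between output files meaningful.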

PCP-Specific Options

Option	Description
`--compute-metrics`	Compute LBI, LBR, affinity, and mutation frequency for all nodes
`--lbi-tau FLOAT`	Time scale parameter for LBI calculation (default: 0.0125)
`--standardize-names`	Rename nodes to standard format: naive (root), Node1, Node2, ...

AIRR-Specific Options

Option	Description
`--naive-name NAME`	Name of naive/root node for tree rooting (default: "naive")
`-r, --root-trees`	Root trees using naive node

Examples

# Auto-detect AIRR format and process multiple input files
olmsted process -i dataset1.json dataset2.json -o combined.json

# Process PCP format with separate trees file and compute metrics
olmsted process -i sequences.csv --tree trees.csv -o output.json --compute-metrics

Input Formats

PCP CSV Format

Expected columns in the PCP CSV file:

Column	Description
`sample_id`	Sample identifier
`family`	Clonal family identifier
`parent_name`	Parent node name (use "naive" for root)
`parent_heavy`	Parent heavy chain sequence
`child_name`	Child node name
`child_heavy`	Child heavy chain sequence
`branch_length`	Branch length between parent and child
`depth`	Depth in tree
`distance`	Distance from root
`v_gene_heavy`	V gene assignment
`j_gene_heavy`	J gene assignment
`cdr1_codon_start_heavy`	CDR1 start position
`cdr1_codon_end_heavy`	CDR1 end position
`cdr2_codon_start_heavy`	CDR2 start position
`cdr2_codon_end_heavy`	CDR2 end position
`cdr3_codon_start_heavy`	CDR3 start position
`cdr3_codon_end_heavy`	CDR3 end position
`parent_is_naive`	Boolean indicating if parent is naive/root
`child_is_leaf`	Boolean indicating if child is a leaf node
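
As an illustration of the column layout above, the following sketch writes a one-row PCP CSV with Python's standard `csv` module and reads it back. All values are made up for illustration (the sequences are shortened and the CDR positions are placeholders), and the `True`/`False` spelling of the boolean columns is an assumption rather than something the format description specifies.

```python
import csv
import io

PCP_COLUMNS = [
    "sample_id", "family", "parent_name", "parent_heavy", "child_name",
    "child_heavy", "branch_length", "depth", "distance", "v_gene_heavy",
    "j_gene_heavy", "cdr1_codon_start_heavy", "cdr1_codon_end_heavy",
    "cdr2_codon_start_heavy", "cdr2_codon_end_heavy",
    "cdr3_codon_start_heavy", "cdr3_codon_end_heavy",
    "parent_is_naive", "child_is_leaf",
]

# Illustrative values only; none of these come from a real dataset.
row = {
    "sample_id": "sample1", "family": "fam1",
    "parent_name": "naive", "parent_heavy": "CAGGTGCAGCTG",
    "child_name": "seq1", "child_heavy": "CAGGTGAAGCTG",
    "branch_length": 0.083, "depth": 1, "distance": 0.083,
    "v_gene_heavy": "IGHV1-2*02", "j_gene_heavy": "IGHJ4*02",
    "cdr1_codon_start_heavy": 26, "cdr1_codon_end_heavy": 33,
    "cdr2_codon_start_heavy": 51, "cdr2_codon_end_heavy": 58,
    "cdr3_codon_start_heavy": 97, "cdr3_codon_end_heavy": 110,
    "parent_is_naive": True, "child_is_leaf": True,
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=PCP_COLUMNS)
writer.writeheader()
writer.writerow(row)

# Read the text back to confirm the header and row round-trip.
reader = csv.DictReader(io.StringIO(buffer.getvalue()))
rows = list(reader)
print(reader.fieldnames == PCP_COLUMNS, rows[0]["parent_name"])
```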

Trees CSV Format

Expected columns in the trees file:

Column	Description
`family_name`	Clonal family identifier (must match `family` in PCP CSV)
`sample_id`	Sample identifier (must match `sample_id` in PCP CSV)
`newick_tree`	Newick format tree string for the family
`tree_id` (optional)	Stable per-tree identifier. Required to disambiguate multiple rows per `(family_name, sample_id)` — i.e. alternate phylogenetic reconstructions of the same clonal family. Synthesized as `tree-{family_id}` when absent.
`reconstruction_method` (optional)	Label for the method that built this tree (e.g. `"dnapars"`, `"raxml_ng"`). Written to `tree.reconstruction_method` on the output; left unset when the column is absent.

Multiple rows with the same (family_name, sample_id) produce multiple entries in clone.trees[] (one per alternate reconstruction). Duplicate tree_id values within a clone fail the output uniqueness check; see --allow-duplicate-ids to opt out.
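
The grouping rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's implementation: `group_trees` is a hypothetical helper, the output dictionaries are simplified, and the `{family_id}` in the documented fallback is assumed here to be the `family_name` value.

```python
from collections import defaultdict


def group_trees(rows):
    """Group tree rows by (family_name, sample_id), rejecting duplicate tree_ids."""
    clones = defaultdict(list)
    for row in rows:
        key = (row["family_name"], row["sample_id"])
        # Assumption: fall back to tree-<family_name> when tree_id is absent.
        tree_id = row.get("tree_id") or f"tree-{row['family_name']}"
        if any(tree["tree_id"] == tree_id for tree in clones[key]):
            raise ValueError(f"duplicate tree_id {tree_id!r} for {key}")
        entry = {"tree_id": tree_id, "newick": row["newick_tree"]}
        if row.get("reconstruction_method"):
            entry["reconstruction_method"] = row["reconstruction_method"]
        clones[key].append(entry)
    return dict(clones)


tree_rows = [
    {"family_name": "fam1", "sample_id": "s1", "newick_tree": "(a:1)naive;",
     "tree_id": "fam1-dnapars", "reconstruction_method": "dnapars"},
    {"family_name": "fam1", "sample_id": "s1", "newick_tree": "(a:2)naive;",
     "tree_id": "fam1-raxml", "reconstruction_method": "raxml_ng"},
    {"family_name": "fam2", "sample_id": "s1", "newick_tree": "(b:1)naive;"},
]
grouped = group_trees(tree_rows)
print(len(grouped[("fam1", "s1")]), grouped[("fam2", "s1")][0]["tree_id"])
```

Two rows for the same family without distinct `tree_id` values would both receive the synthesized ID and collide, which is why the column is needed for alternate reconstructions.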

`tag` - Add Field Metadata to Existing Files

Add field_metadata to pre-built Olmsted JSON files. This is useful for data produced outside the standard process pipeline.

Basic Usage

# Introspect fields and add metadata
olmsted tag -i data.json -o tagged.json

# With custom field declarations
olmsted tag -i data.json -o tagged.json -c config.yaml

# Modify file in place
olmsted tag -i data.json --in-place -c config.yaml

Options

Option	Description
`-i, --input FILE`	Input Olmsted JSON file
`-o, --output FILE`	Output file path (required unless `--in-place`)
`--in-place`	Modify the input file in place
`-c, --config FILE`	YAML config with custom field declarations
`--json-format {pretty,compact}`	JSON output format (default: pretty)
`-v, --verbose`	Show detailed output

`merge` - Merge External Mutation Data into Olmsted JSON

Attach mutation-level annotations (e.g., surprise scores, selection contributions) from an external CSV onto an existing Olmsted JSON file. For each tree node, mutations are derived from parent/child amino acid sequence diffs and matched against CSV rows by (family, site, parent_aa, child_aa). Matching CSV columns are merged onto the mutation records, and field_metadata is regenerated so the new fields appear in the web app's controls.

The same logic is also available during initial processing via olmsted process --mutations — see the process section.

merge also backfills per-node length/distance from each tree's newick string when the branch lengths are present there but missing on the nodes (the common case for a hand-built base JSON). Without this, the webapp's "evolutionary distance from naive" branch-length mode silently falls back to topological depth. Values already on the nodes are left untouched.
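
To make the backfill concrete, here is a minimal sketch that reads branch lengths from a Newick string and fills in missing `length` and `distance` values without overwriting existing ones. It is not the olmsted-cli implementation: the function names are hypothetical, nodes are represented as a plain name-to-dict mapping, and the tiny parser assumes every node is named and that labels are unquoted.

```python
def newick_distances(newick):
    """Return {name: (branch_length, distance_from_root)} for a named Newick tree."""
    text = newick.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1
                    continue
                pos += 1  # consume ")"
                break
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = 0.0  # a node without ":length" (typically the root) gets 0
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    result = {}

    def walk(node, parent_distance):
        name, length, children = node
        distance = parent_distance + length
        result[name] = (length, distance)
        for child in children:
            walk(child, distance)

    walk(parse_node(), 0.0)
    return result


def backfill(nodes, newick):
    """Fill missing 'length'/'distance' keys on node dicts; keep existing values."""
    parsed = newick_distances(newick)
    for name, node in nodes.items():
        if name in parsed:
            length, distance = parsed[name]
            node.setdefault("length", length)
            node.setdefault("distance", distance)
    return nodes


tree = "((A:0.1,B:0.2)N1:0.05,C:0.3)naive;"
nodes = backfill({"A": {}, "C": {"length": 9.0}}, tree)
print(nodes)
```

Node `C` keeps its existing `length` and only gains the missing `distance`, mirroring the "values already on the nodes are left untouched" behaviour described above.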

Basic Usage

# Merge mutation scores into an existing Olmsted JSON
olmsted merge -i base.json --mutations scores.csv -o output.json

# In-place modification
olmsted merge -i base.json --mutations scores.csv --in-place

# With a config file (preserves custom_fields declarations)
olmsted merge -i base.json --mutations scores.csv -c config.yaml -o output.json

Options

Option	Description
`-i, --input FILE`	Input Olmsted JSON file
`--mutations FILE`	Mutations CSV file (see format below)
`--mutations-use-depth`	Use the CSV's `depth` column as a match-key participant or integrity check (opt-in)
`--mutations-allow-mismatch`	Downgrade integrity mismatches from a hard failure to a warning
`--mutations-listed-only`	Treat the CSV as authoritative — drop derived mutations on CSV-matched trees that don't appear in the CSV
`-o, --output FILE`	Output file path (required unless `--in-place`)
`--in-place`	Modify the input file in place (refused if zero trees match)
`-c, --config FILE`	YAML config with custom field declarations
`--json-format {pretty,compact}`	JSON output format (default: pretty)
`-v, --verbose {0,1,2,3}`	Verbosity level (use `-v 2` for per-family unmatched detail)

Mutations CSV Format

Required columns:

Column	Description
`family`	Clonal family identifier (joined against `clone_id` in the Olmsted JSON)
`site`	Integer amino acid position (0-based)
`parent_aa`	Single-character parent amino acid
`child_aa`	Single-character child amino acid

Any additional columns become mutation-level fields on matching nodes (e.g., surprise_mutsel, selection_contribution, log_selection_factor). These known structural columns are recognized but not added to the output: sample_id, pcp_index, depth.
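
The matching step can be sketched as follows: derive mutations from a parent/child amino acid pair, then look each one up in the CSV by `(family, site, parent_aa, child_aa)`. This is a simplified illustration, not the actual merge code; the function names are hypothetical, and the sequences are assumed to be aligned and of equal length.

```python
STRUCTURAL = {"family", "site", "parent_aa", "child_aa",
              "sample_id", "pcp_index", "depth"}


def derive_mutations(parent_aa_seq, child_aa_seq):
    """List (site, parent_aa, child_aa) for each differing 0-based position."""
    return [
        (site, p, c)
        for site, (p, c) in enumerate(zip(parent_aa_seq, child_aa_seq))
        if p != c
    ]


def merge_annotations(family, parent_aa_seq, child_aa_seq, csv_rows):
    """Attach extra CSV columns to mutations matched on the four-part key."""
    index = {
        (r["family"], int(r["site"]), r["parent_aa"], r["child_aa"]): r
        for r in csv_rows
    }
    merged = []
    for site, p, c in derive_mutations(parent_aa_seq, child_aa_seq):
        record = {"site": site, "parent_aa": p, "child_aa": c}
        row = index.get((family, site, p, c))
        if row is not None:
            # Structural columns are recognized but not copied to the output.
            record.update({k: v for k, v in row.items() if k not in STRUCTURAL})
        merged.append(record)
    return merged


csv_rows = [
    {"family": "fam1", "site": "2", "parent_aa": "Q", "child_aa": "K",
     "surprise_mutsel": "1.7", "depth": "1"},
]
merged = merge_annotations("fam1", "QVQLV", "QVKLA", csv_rows)
print(merged)
```

Here the Q-to-K change at site 2 picks up `surprise_mutsel`, while the V-to-A change at site 4 has no CSV row and stays unannotated.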

Reporting

At normal verbosity, merge reports:

Total CSV rows loaded and number of families
Trees matched, mutations merged, nodes affected
Warning if any CSV families had no matching clone_id in the JSON
Warning if CSV rows in matched families had no corresponding derived mutation

Use -v 2 to see per-family details about which specific (site, parent_aa, child_aa) tuples didn't match.

`build-config` - Generate Config from Data

Introspect your data and generate a YAML config listing processing options, every discoverable field with its inferred type/label/sample values, and cross-format alias suggestions. Edit the config, then use it with process or tag.

Typical Workflow

# 1. Generate a config from your data
olmsted build-config -i data.json -o config.yaml

# 2. Edit config.yaml — remove fields you don't need, fix labels, adjust types

# 3. Use the config to tag your data
olmsted tag -i data.json -o tagged.json -c config.yaml

Options

Option	Description
`-i, --input FILE`	Input Olmsted JSON file to introspect
`-o, --output FILE`	Output YAML file (default: print to stdout)

Example Output

custom_fields:
  # --- Family level (clonal family — scatterplot axes, color, facet) ---
  - name: mean_mut_freq
    level: family
    type: continuous
    label: "Mean Mutation Frequency"
    # sample values: 0.115, 0.056, 0.036, ...

  - name: rearrangement_count
    output_name: unique_seqs_count    # suggested cross-format alias
    level: family
    type: continuous
    label: "Rearrangement Count"

  # --- Mutation level (alignment coloring) ---
  - name: selection_contribution
    level: mutation
    type: continuous
    label: "Selection Contribution"
    # range in data: [-2.5, 5.1]

  # =================================================================
  # Skipped fields (not included in output metadata)
  # =================================================================
  - name: partition
    level: family
    skip: true
    type: tooltip
    label: "Partition"
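
For intuition about what "inferred type/label" could mean, the sketch below guesses a type from sample values and builds a label from the field name. The rules used here (numeric values become `continuous`, everything else `categorical`, labels are title-cased names) are assumptions for illustration; the actual introspection in `build-config` may use different heuristics, and `infer_field` is a hypothetical name.

```python
def infer_field(name, values):
    """Guess a type, label, and range for a field from its sample values."""
    present = [v for v in values if v is not None]
    numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    entry = {
        "name": name,
        "type": "continuous" if numeric else "categorical",
        "label": name.replace("_", " ").title(),
    }
    if numeric:
        # Mirrors the "range in data" hint shown in the example output.
        entry["range"] = [min(present), max(present)]
    return entry


numeric_field = infer_field("selection_contribution", [-2.5, 0.3, 5.1])
text_field = infer_field("partition", ["a", "b", None])
print(numeric_field)
print(text_field)
```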

Configuration Files

Instead of passing all options on the command line, you can use a YAML configuration file. CLI arguments always override config values.

# Use a config file
olmsted process -c config.yaml

# Config with CLI overrides (CLI wins)
olmsted process -c config.yaml -i other_data.csv -o override.json
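
The "CLI wins" rule can be pictured as overlaying only the explicitly passed arguments on top of the config values. The sketch below uses a stand-in `argparse` parser with three of the documented options and a plain dictionary in place of the parsed YAML; it shows the precedence logic under those assumptions and is not the real olmsted-cli parser.

```python
import argparse


def resolve_options(config, argv):
    """Overlay explicitly passed CLI arguments on top of config-file values."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "not passed" apart from a real value.
    parser.add_argument("-i", "--inputs", nargs="+", default=None)
    parser.add_argument("-o", "--output", default=None)
    parser.add_argument("--lbi-tau", type=float, default=None)
    cli = vars(parser.parse_args(argv))  # argparse stores --lbi-tau as lbi_tau
    resolved = dict(config)
    resolved.update({key: value for key, value in cli.items() if value is not None})
    return resolved


# Stand-in for the dictionary a YAML loader would return for config.yaml.
config = {"inputs": ["data.csv"], "output": "output/result.json", "lbi_tau": 0.0125}
options = resolve_options(config, ["-i", "other_data.csv", "-o", "override.json"])
print(options)
# {'inputs': ['other_data.csv'], 'output': 'override.json', 'lbi_tau': 0.0125}
```

Because `argparse` converts hyphens in long option names to underscores, the config keys line up with the parsed argument names, which is consistent with the underscore convention used in config files.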

Default Configs

Default configuration files are included with the package as starting points. Copy one and customize it for your dataset:

Config	Format	Purpose
`pcp.yaml`	PCP	Standard PCP processing with all options documented
`airr.yaml`	AIRR	Standard AIRR processing with all options documented
`olmsted.yaml`	Tag	Custom field declarations for pre-built Olmsted JSON data

To copy a default config:

# Find the configs directory
python -c "import olmsted_cli.configs; print(olmsted_cli.configs.__path__[0])"

# Copy and customize
cp "$(python -c "import olmsted_cli.configs; print(olmsted_cli.configs.__path__[0])")/pcp.yaml" my_config.yaml

Config File Structure

# Standard CLI options (use underscores, not hyphens)
inputs: [data.csv]
output: output/result.json
tree: trees.csv
format: pcp
name: "My Dataset"
description: "Heavy chain BCR data from experiment X"
seed: 42
compute_metrics: true
lbi_tau: 0.0125
verbose: 1
validate: true

# Custom field declarations
custom_fields:
  - name: my_metric
    level: family             # family, node, branch, or mutation
    type: continuous          # continuous, categorical, tooltip, aa, or dna
    label: "My Metric"       # Display label in web app

  - name: internal_id
    level: family
    skip: true                # exclude from output metadata
    type: categorical
    label: "Internal ID"

Custom Fields

The custom_fields section lets you declare additional data fields that should appear in the web app's visualization controls. Each entry supports:

Key	Description
`name`	Field name as it appears in the input data
`output_name`	(optional) Renamed field in output (for cross-format alignment)
`level`	`family` (scatterplot), `node` (tree nodes), `branch` (branches), `mutation` (alignment)
`type`	`continuous`, `categorical`, `tooltip`, `aa` (amino acid), or `dna` (nucleotide)
`label`	Human-readable label for dropdowns and tooltips
`skip`	(optional) `true` to exclude from output metadata
`range`	(optional) `[min, max]` for continuous fields (color scale domain)

Levels: family is the preferred name for the clonal family level (also accepts clone as an alias). The output JSON uses clone internally for backward compatibility.

Types: aa and dna tell the web app to use the full genetic alphabet for color palettes, rather than just the values present in the data.

Standard fields (e.g., unique_seqs_count, v_call, lbi) are auto-detected and don't need to be declared. Use build-config to generate a starting config with all discoverable fields.
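
Before handing a config to `process` or `tag`, it can help to sanity-check each `custom_fields` entry against the allowed values listed above. The checker below is a hypothetical helper, not part of olmsted-cli, and which keys the tool itself treats as required is an assumption.

```python
ALLOWED_LEVELS = {"family", "clone", "node", "branch", "mutation"}
ALLOWED_TYPES = {"continuous", "categorical", "tooltip", "aa", "dna"}


def check_custom_field(entry):
    """Return a list of problems found in one custom_fields entry."""
    problems = []
    if not entry.get("name"):
        problems.append("missing name")
    if entry.get("level") not in ALLOWED_LEVELS:
        problems.append(f"unknown level: {entry.get('level')!r}")
    if entry.get("type") not in ALLOWED_TYPES:
        problems.append(f"unknown type: {entry.get('type')!r}")
    if "range" in entry:
        bounds = entry["range"]
        if entry.get("type") != "continuous":
            problems.append("range only applies to continuous fields")
        elif len(bounds) != 2 or bounds[0] > bounds[1]:
            problems.append("range must be [min, max]")
    return problems


good = {"name": "my_metric", "level": "family", "type": "continuous",
        "label": "My Metric", "range": [0, 1]}
bad = {"name": "internal_id", "level": "sample", "type": "number"}
print(check_custom_field(good))
print(check_custom_field(bad))
```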

`validate` - Validate Data Files

Validate Olmsted/AIRR data files against schemas.

Basic Usage

# Auto-detect file type
olmsted validate data.json

# Validate specific file types
olmsted validate --dataset datasets.json
olmsted validate --clones clones.family1.json clones.family2.json
olmsted validate --tree tree.abc123.json

Options

Option	Description
`--dataset FILE`	Validate as dataset file
`--clone FILE`	Validate as single clone object
`--clones FILES`	Validate as clone collection
`--tree FILE`	Validate as single tree object
`--trees FILES`	Validate as tree collection
`-v, --verbose`	Show detailed validation output
`--strict`	Exit with error on first validation failure

Examples

# Validate complete consolidated file
olmsted validate output.json

# Verbose validation with strict mode
olmsted validate -v --strict processed_data.json

`summary` - Generate Summary Statistics

Analyze consolidated Olmsted data files and generate summary statistics.

Basic Usage

# Print summary to stdout
olmsted summary data.json

# Save summary to file
olmsted summary data.json -o summary.txt

# Output as JSON
olmsted summary --json data.json

Options

Option	Description
`-o, --output FILE`	Output file (default: stdout)
`--json`	Output summary as JSON format

Example Output

Olmsted Data Summary
====================
Datasets: 2
Total Clones: 1,234
Total Tree Nodes: 5,678
  - Leaf Nodes: 2,345
  - Internal Nodes: 3,333

Metrics Available:
  - LBI: Yes
  - LBR: Yes
  - Affinity: Yes
  - Mean Mutation Frequency: Yes

`split` - Split Large Files

Split consolidated Olmsted data files into smaller files for better performance.

Basic Usage

# Split into files with max 100 clones each
olmsted split -i large_data.json -o output_dir --max-clones 100

# Split with custom naming
olmsted split -i data.json -o splits --max-clones 50 --base-name my_dataset

Options

Option	Description
`-i, --input FILE`	Input consolidated JSON file to split
`-o, --output-dir DIR`	Output directory for split files
`--max-clones INT`	Maximum clones per output file (default: 100)
`--base-name NAME`	Base name for output files
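
Conceptually, splitting amounts to cutting the list of clones into consecutive chunks of at most `--max-clones` entries. The sketch below shows only that chunking step with a hypothetical helper; how olmsted-cli names the output files or distributes datasets and trees across them is not covered here.

```python
def chunk_clones(clones, max_clones=100):
    """Yield successive lists of at most max_clones clones."""
    if max_clones < 1:
        raise ValueError("max_clones must be at least 1")
    for start in range(0, len(clones), max_clones):
        yield clones[start:start + max_clones]


# Stand-in clone records; real clones carry many more fields.
clones = [{"clone_id": f"clone-{i}"} for i in range(230)]
chunks = list(chunk_clones(clones, max_clones=100))
print([len(chunk) for chunk in chunks])  # [100, 100, 30]
```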

Example Data

The repository includes example data for both formats:

# Clone repository to access examples
git clone https://github.com/matsengrp/olmsted-cli.git
cd olmsted-cli/example-data

# AIRR format examples
ls airr/

# PCP format examples
ls pcp/

Requirements

Python: 3.8 or higher
Dependencies (automatically installed):
- ete3 ≥3.1.0
- jsonschema ≥4.0.0
- lxml ≥4.6.0
- numpy ≥1.20.0
- pyyaml ≥6.0
- scipy ≥1.7.0
- ntpl ≥0.0.4
- tqdm ≥4.65.0

Development Setup

# Clone and install with dev dependencies
git clone https://github.com/matsengrp/olmsted-cli.git
cd olmsted-cli
pip install -e ".[dev]"

# Run tests
pytest

# Run linter
ruff check .

Project details

Maintainers

david.rich27

This version

0.4.0

Jun 12, 2026

Download files

Source Distribution

olmsted_cli-0.4.0.tar.gz (733.7 kB)

Uploaded Jun 12, 2026 Source

Built Distribution

olmsted_cli-0.4.0-py3-none-any.whl (168.4 kB)

Uploaded Jun 12, 2026 Python 3

File details

Details for the file olmsted_cli-0.4.0.tar.gz.

File metadata

Download URL: olmsted_cli-0.4.0.tar.gz
Upload date: Jun 12, 2026
Size: 733.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for olmsted_cli-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`2904b1a51cae74c303d1a14e99a180a5501e6f1729d833d8208ec91ad3be7289`
MD5	`caa092f2b51a84692cb91e452a3389f8`
BLAKE2b-256	`fd21d2055d199e3136d2eae06ef391a57ba25d04932e9913a64a771ad4d979ce`

Provenance

The following attestation bundles were made for olmsted_cli-0.4.0.tar.gz:

Publisher: release.yml on matsengrp/olmsted-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: olmsted_cli-0.4.0.tar.gz
- Subject digest: 2904b1a51cae74c303d1a14e99a180a5501e6f1729d833d8208ec91ad3be7289
- Sigstore transparency entry: 1800881590
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: matsengrp/olmsted-cli@4292a64297f1ce060ce864175f8afa8209026b87
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/matsengrp
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@4292a64297f1ce060ce864175f8afa8209026b87
- Trigger Event: push

File details

Details for the file olmsted_cli-0.4.0-py3-none-any.whl.

File metadata

Download URL: olmsted_cli-0.4.0-py3-none-any.whl
Upload date: Jun 12, 2026
Size: 168.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for olmsted_cli-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0c122b29e742a90a8c84d862c8682839336fb493170f7abfd1bba122a7bc8789`
MD5	`6916764f6b644253c6819e9cfa03443b`
BLAKE2b-256	`acbb37d37372167a00dce1c1d96e279598abdc0879f3c10c63cf035bdd07e0fc`

Provenance

The following attestation bundles were made for olmsted_cli-0.4.0-py3-none-any.whl:

Publisher: release.yml on matsengrp/olmsted-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: olmsted_cli-0.4.0-py3-none-any.whl
- Subject digest: 0c122b29e742a90a8c84d862c8682839336fb493170f7abfd1bba122a7bc8789
- Sigstore transparency entry: 1800881846
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: matsengrp/olmsted-cli@4292a64297f1ce060ce864175f8afa8209026b87
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/matsengrp
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@4292a64297f1ce060ce864175f8afa8209026b87
- Trigger Event: push
