Command-line interface for Olmsted data processing
olmsted-cli
Command-line interface and data processing utilities for the Olmsted webapp. The Olmsted web application can be launched locally from its git repository or used online at https://www.olmstedviz.org.
See also:
- FORMATS.md: Input/output format specifications, field mapping, validation
- ARCHITECTURE.md: System architecture, data flow, design decisions
- DEVELOPMENT.md: Development guide, testing, contributing
- CLAUDE.md: AI assistant guidance, code quality rules
Overview
olmsted-cli is a Python package that processes immunological data from AIRR and PCP formats into the Olmsted JSON format for visualization in the Olmsted web application. It handles sequencing data, reconstructs phylogenetic trees, and can compute per-node metrics (LBI, LBR, affinity, and mutation frequency) for clonal family analysis.
Typical Workflow
- Process your data: Use olmsted-cli to convert your AIRR or PCP format files into Olmsted JSON format
- Open Olmsted web app: Launch the application locally or visit https://www.olmstedviz.org
- Load your processed files: Upload the Olmsted JSON file(s)
- Visualize: Explore your data with interactive visualizations
Example:
# Convert your PCP data to Olmsted format
olmsted process -i pcp.csv --tree trees.csv -o olmsted_data.json --compute-metrics
Supported Formats
- AIRR (Adaptive Immune Receptor Repertoire): JSON format following AIRR Community standards
- PCP (Parent-Child Pair): CSV file of parent-child pairs, optionally accompanied by a separate trees CSV file containing Newick strings
Output Formats
Consolidated (default): Single JSON file containing all data - recommended for most workflows.
Unbundled (--unbundle): Separates data into component files (datasets.json, clones..json, tree..json) for backwards compatibility with older Olmsted versions.
Installation
Recommended (pipx)
Install using pipx for isolated environment:
pipx install olmsted-cli
Standard Installation
Install using pip:
pip install olmsted-cli
From source (development / latest)
Install the latest unreleased code straight from GitHub:
pipx install git+https://github.com/matsengrp/olmsted-cli.git
# or, for a local checkout you intend to edit:
pip install -e ".[dev]"
Quick Start
# Process AIRR format data (auto-detected)
olmsted process -i data.json -o output/olmsted_data.json
# Process PCP format data with phylogenetic metrics
olmsted process -i sequences.csv --tree trees.csv -o output/data.json --compute-metrics
Available Commands
Overview
| Command | Purpose |
|---|---|
| process | This is the primary tool: Converts input AIRR or PCP format data into Olmsted-readable JSON format |
| tag | Add field metadata to existing Olmsted JSON files |
| merge | Merge external mutation-level CSV data into existing Olmsted JSON files |
| build-config | Generate a YAML config from your data for editing |
| validate | Verify data files conform to Olmsted schema |
| summary | Generate statistics and metadata report for processed data |
| split | Divide large consolidated files into smaller chunks for performance |
Commands
process - Process Data Files
Convert AIRR or PCP format data into Olmsted JSON format.
Basic Usage
# Auto-detect format
olmsted process -i input.json -o output.json
# Explicitly specify format
olmsted process -i input.csv -f pcp -o output.json
Input/Output Options
| Option | Description |
|---|---|
| -i, --inputs FILES | Input file(s). For AIRR: one or more JSON files. For PCP: CSV file |
| -o, --output FILE | Output file path for consolidated JSON |
| --unbundle DIR | Unbundle output into separate component files (datasets.json, clones..json, tree..json) for backwards compatibility with Olmsted web app |
| -f, --format {airr,pcp,auto} | Input format (default: auto-detect) |
| -t, --tree FILE | Trees file for PCP format (optional, can be gzipped) |
| --mutations FILE | Mutation-level CSV file to merge into tree nodes after processing (see merge command) |
| -c, --config FILE | YAML configuration file (CLI arguments override config values) |
Processing Options
| Option | Description |
|---|---|
| -n, --name NAME | Optional dataset name (stored in metadata) |
| --validate | Validate output against schemas before writing |
| --strict-validation | Exit with error if validation fails |
| --allow-duplicate-ids | Downgrade duplicate-*_id errors to warnings and pass data through unchanged. Without this flag, processing fails when dataset_id, clone_id, tree_id, sample_id, or subject_id collide within their natural uniqueness scope. |
| --seed INT | Random seed for deterministic UUID generation |
| --batch-size N | Clonal families per streaming batch (default: 50). Bounds peak memory by spooling each batch's clones/trees to disk and stream-stitching the final consolidated JSON. Pass 0 to disable streaming and use the legacy one-shot path. Small inputs (n_families ≤ batch_size) skip the spool automatically. |
| -v, --verbose {0,1,2,3} | Verbosity: 0=quiet, 1=normal (default), 2=verbose, 3=debug |
| -q, --quiet | Quiet mode - only show errors (equivalent to -v 0) |
| -w, --warnings | Show warnings when tree and PCP data disagree (PCP only) |
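The --seed option is described as making UUID generation deterministic. As a rough illustration of that idea only (this is not olmsted-cli's code, and seeded_uuids is a hypothetical helper), drawing UUID bits from a seeded pseudo-random generator yields the same identifiers on every run:

```python
import random
import uuid


def seeded_uuids(seed, count):
    """Return `count` version-4-shaped UUIDs drawn from a seeded PRNG."""
    rng = random.Random(seed)
    # uuid.UUID(int=..., version=4) overwrites the version and variant bits,
    # so the result is a well-formed UUID whatever bits the PRNG produced.
    return [uuid.UUID(int=rng.getrandbits(128), version=4) for _ in range(count)]


first_run = seeded_uuids(42, 3)
second_run = seeded_uuids(42, 3)
print(first_run == second_run)  # the same seed reproduces the same identifiers
```

Reproducible identifiers make it possible to diff the output of two runs over the same input, which is presumably why a seed is worth fixing in a config file.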
PCP-Specific Options
| Option | Description |
|---|---|
| --compute-metrics | Compute LBI, LBR, affinity, and mutation frequency for all nodes |
| --lbi-tau FLOAT | Time scale parameter for LBI calculation (default: 0.0125) |
| --standardize-names | Rename nodes to standard format: naive (root), Node1, Node2, ... |
AIRR-Specific Options
| Option | Description |
|---|---|
| --naive-name NAME | Name of naive/root node for tree rooting (default: "naive") |
| -r, --root-trees | Root trees using naive node |
Examples
# Auto-detect AIRR format and process multiple input files
olmsted process -i dataset1.json dataset2.json -o combined.json
# Process PCP format with separate trees file and compute metrics
olmsted process -i sequences.csv --tree trees.csv -o output.json --compute-metrics
Input Formats
PCP CSV Format
Expected columns in the PCP CSV file:
| Column | Description |
|---|---|
| sample_id | Sample identifier |
| family | Clonal family identifier |
| parent_name | Parent node name (use "naive" for root) |
| parent_heavy | Parent heavy chain sequence |
| child_name | Child node name |
| child_heavy | Child heavy chain sequence |
| branch_length | Branch length between parent and child |
| depth | Depth in tree |
| distance | Distance from root |
| v_gene_heavy | V gene assignment |
| j_gene_heavy | J gene assignment |
| cdr1_codon_start_heavy | CDR1 start position |
| cdr1_codon_end_heavy | CDR1 end position |
| cdr2_codon_start_heavy | CDR2 start position |
| cdr2_codon_end_heavy | CDR2 end position |
| cdr3_codon_start_heavy | CDR3 start position |
| cdr3_codon_end_heavy | CDR3 end position |
| parent_is_naive | Boolean indicating if parent is naive/root |
| child_is_leaf | Boolean indicating if child is a leaf node |
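Before running process on a PCP file, it can help to check the header against the columns listed above and to see how the rows group into families. The following is a minimal pre-flight sketch using only the standard library; missing_columns and edges_by_family are hypothetical helpers, and treating every listed column as expected is an assumption (the tool itself may tolerate some being absent):

```python
import csv
import io
from collections import defaultdict

# Column names copied from the PCP CSV table.
EXPECTED_COLUMNS = [
    "sample_id", "family", "parent_name", "parent_heavy", "child_name",
    "child_heavy", "branch_length", "depth", "distance", "v_gene_heavy",
    "j_gene_heavy", "cdr1_codon_start_heavy", "cdr1_codon_end_heavy",
    "cdr2_codon_start_heavy", "cdr2_codon_end_heavy",
    "cdr3_codon_start_heavy", "cdr3_codon_end_heavy", "parent_is_naive",
    "child_is_leaf",
]


def missing_columns(header):
    """Return the expected column names that are absent from a CSV header."""
    present = set(header)
    return [name for name in EXPECTED_COLUMNS if name not in present]


def edges_by_family(rows):
    """Group (parent_name, child_name) pairs by (sample_id, family)."""
    edges = defaultdict(list)
    for row in rows:
        key = (row["sample_id"], row["family"])
        edges[key].append((row["parent_name"], row["child_name"]))
    return dict(edges)


# A deliberately incomplete two-row file, inlined for illustration.
sample = io.StringIO(
    "sample_id,family,parent_name,child_name\n"
    "s1,fam1,naive,Node1\n"
    "s1,fam1,Node1,seq_a\n"
)
reader = csv.DictReader(sample)
print(missing_columns(reader.fieldnames))
print(edges_by_family(reader))
```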
Trees CSV Format
Expected columns in the trees file:
| Column | Description |
|---|---|
| family_name | Clonal family identifier (must match family in PCP CSV) |
| sample_id | Sample identifier (must match sample_id in PCP CSV) |
| newick_tree | Newick format tree string for the family |
| tree_id (optional) | Stable per-tree identifier. Required to disambiguate multiple rows per (family_name, sample_id), i.e. alternate phylogenetic reconstructions of the same clonal family. Synthesized as tree-{family_id} when absent. |
| reconstruction_method (optional) | Label for the method that built this tree (e.g. "dnapars", "raxml_ng"). Written to tree.reconstruction_method on the output; left unset when the column is absent. |
Multiple rows with the same (family_name, sample_id) produce multiple entries in clone.trees[] (one per alternate reconstruction). Duplicate tree_id values within a clone fail the output uniqueness check; see --allow-duplicate-ids to opt out.
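The grouping and uniqueness rules above can be sketched in a few lines. This is an illustration rather than the package's implementation: group_trees and duplicate_tree_ids are hypothetical names, and the sketch assumes the synthesized identifier is built from the family_name column.

```python
from collections import Counter, defaultdict


def group_trees(rows):
    """Group tree rows per (family_name, sample_id), filling in a missing tree_id."""
    grouped = defaultdict(list)
    for row in rows:
        # Assumption: the synthesized id uses the family_name value.
        tree_id = row.get("tree_id") or f"tree-{row['family_name']}"
        key = (row["family_name"], row["sample_id"])
        grouped[key].append({"tree_id": tree_id, "newick": row["newick_tree"]})
    return dict(grouped)


def duplicate_tree_ids(grouped):
    """Return {clone key: [tree_id, ...]} for ids used more than once in a clone."""
    clashes = {}
    for key, trees in grouped.items():
        counts = Counter(tree["tree_id"] for tree in trees)
        repeated = sorted(tid for tid, n in counts.items() if n > 1)
        if repeated:
            clashes[key] = repeated
    return clashes


rows = [
    # Two reconstructions of fam1, each with an explicit tree_id.
    {"family_name": "fam1", "sample_id": "s1", "tree_id": "fam1-dnapars",
     "newick_tree": "(a:0.1,b:0.2)naive;"},
    {"family_name": "fam1", "sample_id": "s1", "tree_id": "fam1-raxml",
     "newick_tree": "(b:0.2,a:0.1)naive;"},
    # Two reconstructions of fam2 without tree_id: the synthesized ids collide.
    {"family_name": "fam2", "sample_id": "s1", "newick_tree": "(c:0.3)naive;"},
    {"family_name": "fam2", "sample_id": "s1", "newick_tree": "(c:0.4)naive;"},
]
print(duplicate_tree_ids(group_trees(rows)))
```

The fam2 rows show why tree_id is needed once a family has more than one reconstruction: without it, both rows would receive the same synthesized identifier.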
tag - Add Field Metadata to Existing Files
Add field_metadata to pre-built Olmsted JSON files. This is useful for data produced outside the standard process pipeline.
Basic Usage
# Introspect fields and add metadata
olmsted tag -i data.json -o tagged.json
# With custom field declarations
olmsted tag -i data.json -o tagged.json -c config.yaml
# Modify file in place
olmsted tag -i data.json --in-place -c config.yaml
Options
| Option | Description |
|---|---|
| -i, --input FILE | Input Olmsted JSON file |
| -o, --output FILE | Output file path (required unless --in-place) |
| --in-place | Modify the input file in place |
| -c, --config FILE | YAML config with custom field declarations |
| --json-format {pretty,compact} | JSON output format (default: pretty) |
| -v, --verbose | Show detailed output |
merge - Merge External Mutation Data into Olmsted JSON
Attach mutation-level annotations (e.g., surprise scores, selection contributions) from an external CSV onto an existing Olmsted JSON file. For each tree node, mutations are derived from parent/child amino acid sequence diffs and matched against CSV rows by (family, site, parent_aa, child_aa). Matching CSV columns are merged onto the mutation records, and field_metadata is regenerated so the new fields appear in the web app's controls.
The same logic is also available during initial processing via olmsted process --mutations — see the process section.
merge also backfills per-node length/distance from each tree's newick string when the branch lengths are present there but missing on the nodes (the common case for a hand-built base JSON). Without this, the webapp's "evolutionary distance from naive" branch-length mode silently falls back to topological depth. Values already on the nodes are left untouched.
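To make the matching rule concrete, here is a small sketch of deriving mutations from a parent/child amino acid pair and looking each one up by (family, site, parent_aa, child_aa). It is an illustration under stated assumptions, not the merge implementation: the function names are hypothetical, the two sequences are assumed to be aligned and of equal length, and sites are 0-based as in the mutations CSV format.

```python
def derive_mutations(parent_aa_seq, child_aa_seq):
    """List (site, parent_aa, child_aa) for each differing 0-based position."""
    return [
        (site, parent_aa, child_aa)
        for site, (parent_aa, child_aa) in enumerate(zip(parent_aa_seq, child_aa_seq))
        if parent_aa != child_aa
    ]


def merge_annotations(family, parent_aa_seq, child_aa_seq, csv_index):
    """Attach CSV columns to the derived mutations that share the match key."""
    merged = []
    for site, parent_aa, child_aa in derive_mutations(parent_aa_seq, child_aa_seq):
        record = {"site": site, "parent_aa": parent_aa, "child_aa": child_aa}
        # Mutations with no matching CSV row keep only their derived fields.
        record.update(csv_index.get((family, site, parent_aa, child_aa), {}))
        merged.append(record)
    return merged


# One annotated mutation, keyed the way the CSV rows would be.
csv_index = {("fam1", 2, "S", "T"): {"surprise_mutsel": 1.7}}
for record in merge_annotations("fam1", "QVSLK", "QVTLR", csv_index):
    print(record)
```

In this example the S-to-T change at site 2 picks up the surprise_mutsel value, while the K-to-R change at site 4 has no CSV row and stays unannotated.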
Basic Usage
# Merge mutation scores into an existing Olmsted JSON
olmsted merge -i base.json --mutations scores.csv -o output.json
# In-place modification
olmsted merge -i base.json --mutations scores.csv --in-place
# With a config file (preserves custom_fields declarations)
olmsted merge -i base.json --mutations scores.csv -c config.yaml -o output.json
Options
| Option | Description |
|---|---|
| -i, --input FILE | Input Olmsted JSON file |
| --mutations FILE | Mutations CSV file (see format below) |
| --mutations-use-depth | Use the CSV's depth column as a match-key participant or integrity check (opt-in) |
| --mutations-allow-mismatch | Downgrade integrity mismatches from a hard failure to a warning |
| --mutations-listed-only | Treat the CSV as authoritative: drop derived mutations on CSV-matched trees that don't appear in the CSV |
| -o, --output FILE | Output file path (required unless --in-place) |
| --in-place | Modify the input file in place (refused if zero trees match) |
| -c, --config FILE | YAML config with custom field declarations |
| --json-format {pretty,compact} | JSON output format (default: pretty) |
| -v, --verbose {0,1,2,3} | Verbosity level (use -v 2 for per-family unmatched detail) |
Mutations CSV Format
Required columns:
| Column | Description |
|---|---|
| family | Clonal family identifier (joined against clone_id in the Olmsted JSON) |
| site | Integer amino acid position (0-based) |
| parent_aa | Single-character parent amino acid |
| child_aa | Single-character child amino acid |
Any additional columns become mutation-level fields on matching nodes (e.g., surprise_mutsel, selection_contribution, log_selection_factor). These known structural columns are recognized but not added to the output: sample_id, pcp_index, depth.
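The three column roles (required key columns, recognized structural columns, and everything else) can be illustrated with a short loader. This is a sketch only: index_mutation_rows is a hypothetical helper, and it keeps annotation values as the strings read from the CSV rather than guessing at how the tool converts them.

```python
import csv
import io

REQUIRED = ("family", "site", "parent_aa", "child_aa")
STRUCTURAL = ("sample_id", "pcp_index", "depth")  # recognized, not copied to output


def index_mutation_rows(handle):
    """Index CSV rows by match key, keeping only the extra annotation columns."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED if name not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    extra = [name for name in header if name not in REQUIRED + STRUCTURAL]
    index = {}
    for row in reader:
        key = (row["family"], int(row["site"]), row["parent_aa"], row["child_aa"])
        index[key] = {name: row[name] for name in extra}
    return index


sample = io.StringIO(
    "family,site,parent_aa,child_aa,depth,surprise_mutsel\n"
    "fam1,2,S,T,3,1.7\n"
)
print(index_mutation_rows(sample))
```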
Reporting
At normal verbosity, merge reports:
- Total CSV rows loaded and number of families
- Trees matched, mutations merged, nodes affected
- Warning if any CSV families had no matching clone_id in the JSON
- Warning if CSV rows in matched families had no corresponding derived mutation
Use -v 2 to see per-family details about which specific (site, parent_aa, child_aa) tuples didn't match.
build-config - Generate Config from Data
Introspect your data and generate a YAML config listing processing options, every discoverable field with its inferred type/label/sample values, and cross-format alias suggestions. Edit the config, then use it with process or tag.
Typical Workflow
# 1. Generate a config from your data
olmsted build-config -i data.json -o config.yaml
# 2. Edit config.yaml — remove fields you don't need, fix labels, adjust types
# 3. Use the config to tag your data
olmsted tag -i data.json -o tagged.json -c config.yaml
Options
| Option | Description |
|---|---|
| -i, --input FILE | Input Olmsted JSON file to introspect |
| -o, --output FILE | Output YAML file (default: print to stdout) |
Example Output
custom_fields:
  # --- Family level (clonal family — scatterplot axes, color, facet) ---
  - name: mean_mut_freq
    level: family
    type: continuous
    label: "Mean Mutation Frequency"
    # sample values: 0.115, 0.056, 0.036, ...
  - name: rearrangement_count
    output_name: unique_seqs_count # suggested cross-format alias
    level: family
    type: continuous
    label: "Rearrangement Count"
  # --- Mutation level (alignment coloring) ---
  - name: selection_contribution
    level: mutation
    type: continuous
    label: "Selection Contribution"
    # range in data: [-2.5, 5.1]
  # =================================================================
  # Skipped fields (not included in output metadata)
  # =================================================================
  - name: partition
    level: family
    skip: true
    type: tooltip
    label: "Partition"
Configuration Files
Instead of passing all options on the command line, you can use a YAML configuration file. CLI arguments always override config values.
# Use a config file
olmsted process -c config.yaml
# Config with CLI overrides (CLI wins)
olmsted process -c config.yaml -i other_data.csv -o override.json
Default Configs
Default configuration files are included with the package as starting points. Copy one and customize it for your dataset:
| Config | Format | Purpose |
|---|---|---|
| pcp.yaml | PCP | Standard PCP processing with all options documented |
| airr.yaml | AIRR | Standard AIRR processing with all options documented |
| olmsted.yaml | Tag | Custom field declarations for pre-built Olmsted JSON data |
To copy a default config:
# Find the configs directory
python -c "import olmsted_cli.configs; print(olmsted_cli.configs.__path__[0])"
# Copy and customize
cp $(python -c "import olmsted_cli.configs; print(olmsted_cli.configs.__path__[0])")/pcp.yaml my_config.yaml
Config File Structure
# Standard CLI options (use underscores, not hyphens)
inputs: [data.csv]
output: output/result.json
tree: trees.csv
format: pcp
name: "My Dataset"
description: "Heavy chain BCR data from experiment X"
seed: 42
compute_metrics: true
lbi_tau: 0.0125
verbose: 1
validate: true
# Custom field declarations
custom_fields:
  - name: my_metric
    level: family # family, node, branch, or mutation
    type: continuous # continuous, categorical, tooltip, aa, or dna
    label: "My Metric" # Display label in web app
  - name: internal_id
    level: family
    skip: true # exclude from output metadata
    type: categorical
    label: "Internal ID"
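The two rules stated in this section, that CLI arguments override config values and that config keys use underscores where flags use hyphens, can be pictured as a dictionary overlay. The sketch below is illustrative only; effective_options is a hypothetical helper and says nothing about how olmsted-cli parses its arguments internally.

```python
def effective_options(config, cli_args):
    """Overlay CLI values on config values; CLI options left unset (None) are ignored."""
    merged = dict(config)
    for flag, value in cli_args.items():
        if value is not None:
            # "--compute-metrics" on the command line maps to "compute_metrics" in YAML.
            merged[flag.lstrip("-").replace("-", "_")] = value
    return merged


config = {"inputs": ["data.csv"], "output": "output/result.json", "seed": 42}
cli_args = {"--output": "override.json", "--seed": None, "--compute-metrics": True}
print(effective_options(config, cli_args))
```

Here output comes from the command line, seed falls back to the config value, and inputs is untouched because no CLI value was given for it.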
Custom Fields
The custom_fields section lets you declare additional data fields that should appear in the web app's visualization controls. Each entry supports:
| Key | Description |
|---|---|
| name | Field name as it appears in the input data |
| output_name | (optional) Renamed field in output (for cross-format alignment) |
| level | family (scatterplot), node (tree nodes), branch (branches), mutation (alignment) |
| type | continuous, categorical, tooltip, aa (amino acid), or dna (nucleotide) |
| label | Human-readable label for dropdowns and tooltips |
| skip | (optional) true to exclude from output metadata |
| range | (optional) [min, max] for continuous fields (color scale domain) |
Levels: family is the preferred name for the clonal family level (also accepts clone as an alias). The output JSON uses clone internally for backward compatibility.
Types: aa and dna tell the web app to use the full genetic alphabet for color palettes, rather than just the values present in the data.
Standard fields (e.g., unique_seqs_count, v_call, lbi) are auto-detected and don't need to be declared. Use build-config to generate a starting config with all discoverable fields.
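A hand-edited config is easy to get slightly wrong, so a quick check of each custom_fields entry against the documented levels and types can catch typos before running tag or process. The sketch below is a hypothetical linter, not part of olmsted-cli; it assumes that name, level, type, and label are required because only the other keys are marked optional.

```python
LEVELS = {"family", "node", "branch", "mutation"}
TYPES = {"continuous", "categorical", "tooltip", "aa", "dna"}


def check_field(entry):
    """Return a list of problems found in one custom_fields entry."""
    problems = [
        f"missing key: {key}"
        for key in ("name", "level", "type", "label")
        if key not in entry
    ]
    # "clone" is documented as an accepted alias for the family level.
    level = "family" if entry.get("level") == "clone" else entry.get("level")
    if level is not None and level not in LEVELS:
        problems.append(f"unknown level: {level}")
    if "type" in entry and entry["type"] not in TYPES:
        problems.append(f"unknown type: {entry['type']}")
    bounds = entry.get("range")
    if bounds is not None and (len(bounds) != 2 or bounds[0] > bounds[1]):
        problems.append("range must be [min, max]")
    return problems


print(check_field({"name": "my_metric", "level": "family",
                   "type": "continuous", "label": "My Metric", "range": [0, 1]}))
print(check_field({"name": "x", "level": "leaf", "type": "text"}))
```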
validate - Validate Data Files
Validate Olmsted/AIRR data files against schemas.
Basic Usage
# Auto-detect file type
olmsted validate data.json
# Validate specific file types
olmsted validate --dataset datasets.json
olmsted validate --clones clones.family1.json clones.family2.json
olmsted validate --tree tree.abc123.json
Options
| Option | Description |
|---|---|
| --dataset FILE | Validate as dataset file |
| --clone FILE | Validate as single clone object |
| --clones FILES | Validate as clone collection |
| --tree FILE | Validate as single tree object |
| --trees FILES | Validate as tree collection |
| -v, --verbose | Show detailed validation output |
| --strict | Exit with error on first validation failure |
Examples
# Validate complete consolidated file
olmsted validate output.json
# Verbose validation with strict mode
olmsted validate -v --strict processed_data.json
summary - Generate Summary Statistics
Analyze consolidated Olmsted data files and generate summary statistics.
Basic Usage
# Print summary to stdout
olmsted summary data.json
# Save summary to file
olmsted summary data.json -o summary.txt
# Output as JSON
olmsted summary --json data.json
Options
| Option | Description |
|---|---|
| -o, --output FILE | Output file (default: stdout) |
| --json | Output summary as JSON format |
Example Output
Olmsted Data Summary
====================
Datasets: 2
Total Clones: 1,234
Total Tree Nodes: 5,678
- Leaf Nodes: 2,345
- Internal Nodes: 3,333
Metrics Available:
- LBI: Yes
- LBR: Yes
- Affinity: Yes
- Mean Mutation Frequency: Yes
split - Split Large Files
Split consolidated Olmsted data files into smaller files for better performance.
Basic Usage
# Split into files with max 100 clones each
olmsted split -i large_data.json -o output_dir --max-clones 100
# Split with custom naming
olmsted split -i data.json -o splits --max-clones 50 --base-name my_dataset
Options
| Option | Description |
|---|---|
| -i, --input FILE | Input consolidated JSON file to split |
| -o, --output-dir DIR | Output directory for split files |
| --max-clones INT | Maximum clones per output file (default: 100) |
| --base-name NAME | Base name for output files |
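The effect of --max-clones can be pictured as slicing the clone list into consecutive groups. The sketch below shows only that arithmetic; chunk_clones is a hypothetical helper, and how split orders clones or names its output files is not covered here.

```python
def chunk_clones(clones, max_clones=100):
    """Split a list of clones into consecutive groups of at most max_clones."""
    if max_clones < 1:
        raise ValueError("max_clones must be at least 1")
    return [clones[i:i + max_clones] for i in range(0, len(clones), max_clones)]


clone_ids = [f"clone-{n}" for n in range(250)]
print([len(group) for group in chunk_clones(clone_ids)])
```

With 250 clones and the default limit of 100, this yields two full groups and a final group of 50.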
Example Data
The repository includes example data for both formats:
# Clone repository to access examples
git clone https://github.com/matsengrp/olmsted-cli.git
cd olmsted-cli/example-data
# AIRR format examples
ls airr/
# PCP format examples
ls pcp/
Requirements
- Python: 3.8 or higher
- Dependencies (automatically installed):
  - ete3 ≥3.1.0
  - jsonschema ≥4.0.0
  - lxml ≥4.6.0
  - numpy ≥1.20.0
  - pyyaml ≥6.0
  - scipy ≥1.7.0
  - ntpl ≥0.0.4
  - tqdm ≥4.65.0
Development Setup
# Clone and install with dev dependencies
git clone https://github.com/matsengrp/olmsted-cli.git
cd olmsted-cli
pip install -e ".[dev]"
# Run tests
pytest
# Run linter
ruff check .
Links
- Olmsted Web App: https://github.com/matsengrp/olmsted
- Live Web App: https://olmstedviz.org
Last updated: 2026-06-09