Python utilities used by cyto
pycyto
Python utilities for cyto - format conversion and sample aggregation.
Overview
pycyto provides command-line tools for working with cyto outputs:
- convert: Transform MTX format to h5ad (AnnData)
- aggregate: Combine multi-probe cyto outputs into unified sample-level datasets
Installation
uv tool install pycyto
pycyto --help
Commands
convert
Convert Matrix Market (MTX) format from cyto to h5ad for downstream analysis.
pycyto convert <mtx_directory> <output.h5ad>
Arguments:
- mtx_directory: Path to MTX directory (containing matrix.mtx, features.tsv, barcodes.tsv) or direct path to .mtx file
- output.h5ad: Output h5ad file path
Options:
- --compress / --no-compress: Enable gzip compression in h5ad (default: enabled)
- --integer / --no-integer: Store counts as int32 instead of float32 (default: float32)
Examples:
# Convert MTX directory to h5ad
pycyto convert cyto_out/counts/BC001.counts.mtx sample.h5ad
# Convert without compression
pycyto convert cyto_out/counts/BC001.counts.mtx sample.h5ad --no-compress
# Store as integers for memory efficiency
pycyto convert cyto_out/counts/BC001.counts.mtx sample.h5ad --integer
Input structure (from cyto ibu count --format mtx):
mtx_directory/
├── matrix.mtx # Sparse count matrix (gene × cell)
├── features.tsv # Gene/feature names
└── barcodes.tsv # Cell barcodes
Output: An AnnData object (cell × gene) stored in CSR sparse format, ready for Scanpy or Seurat workflows.
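The orientation change described here (gene × cell on disk, cell × gene in the output) can be illustrated with a minimal sketch that uses only the standard library. It is not pycyto's implementation: `read_mtx_triplets` and `to_cell_by_gene` are hypothetical helper names, and the toy matrix is made up for illustration.

```python
import io
from collections import defaultdict


def read_mtx_triplets(handle):
    """Parse a Matrix Market coordinate file into (n_rows, n_cols, triplets)."""
    lines = (ln.strip() for ln in handle)
    # Skip the banner and comment lines (they start with '%') and blank lines.
    body = [ln for ln in lines if ln and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in body[0].split())
    triplets = []
    for ln in body[1:1 + n_entries]:
        row, col, value = ln.split()
        # Matrix Market indices are 1-based; convert to 0-based.
        triplets.append((int(row) - 1, int(col) - 1, float(value)))
    return n_rows, n_cols, triplets


def to_cell_by_gene(triplets):
    """Transpose gene x cell triplets into {cell_index: {gene_index: count}}."""
    cells = defaultdict(dict)
    for gene, cell, value in triplets:
        cells[cell][gene] = value
    return dict(cells)


# Toy gene x cell matrix: 3 genes, 2 cells, 3 non-zero entries.
toy_mtx = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "3 2 3\n"
    "1 1 5\n"
    "3 1 2\n"
    "2 2 7\n"
)
n_genes, n_cells, triplets = read_mtx_triplets(toy_mtx)
cell_by_gene = to_cell_by_gene(triplets)
print(n_genes, n_cells, cell_by_gene)
```

Grouping entries by cell mirrors the row-oriented (CSR) layout of the output, where each row holds one cell's non-zero counts.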
aggregate
Aggregate multi-probe cyto outputs across flex barcodes into unified sample-level datasets. Designed specifically for single-modal (GEX) and multi-modal Flex experiments (GEX + CRISPR).
pycyto aggregate <config.json> <cyto_outdir> <output_directory>
Arguments:
- config.json: JSON configuration specifying sample structure and barcode assignments
- cyto_outdir: Directory containing cyto workflow outputs
- output_directory: Where to write aggregated files
Options:
- --compress / --no-compress: Compress output h5ad files (default: no compression)
- --threads INT: Number of parallel sample processing threads (default: -1 for all cores)
- --verbose: Enable detailed logging
What it does:
- Concatenates data across multiple flex probe barcodes per sample
- Merges GEX and CRISPR modalities when both present
- Adds guide assignments to GEX cell metadata
- Filters CRISPR data to match filtered GEX cells (useful if using alternative guide-assignment algorithms)
- Preserves per-cell read/UMI statistics
Output structure (per sample):
output_directory/
└── sample_name/
├── sample_name_gex.h5ad # Gene expression data
├── sample_name_crispr.h5ad # Guide RNA counts (GEX-filtered)
├── sample_name_assignments.parquet # Guide assignments per cell
└── sample_name_reads.parquet # Read/UMI statistics per barcode
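To check that an aggregation run produced the files listed in this layout, the paths can be derived from the output directory and the sample name. The sketch below is illustrative only: `expected_outputs` and `missing_outputs` are hypothetical helpers, and the full list of four files is assumed to apply to gex+crispr samples (which files are written for single-modality samples is not stated here).

```python
from pathlib import Path


def expected_outputs(output_directory, sample_name):
    """Return the per-sample file paths from the documented layout."""
    sample_dir = Path(output_directory) / sample_name
    suffixes = ["gex.h5ad", "crispr.h5ad", "assignments.parquet", "reads.parquet"]
    return [sample_dir / f"{sample_name}_{suffix}" for suffix in suffixes]


def missing_outputs(output_directory, sample_name):
    """List expected files that are not present on disk."""
    return [p for p in expected_outputs(output_directory, sample_name) if not p.exists()]


paths = expected_outputs("aggr", "mysample")
print([p.name for p in paths])
```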
Configuration Format
Basic Structure
Configuration files require:
- libraries: Named paths to feature files (probe lists, guide libraries)
- samples: Array of sample specifications
Sample Specification
Each sample requires:
- experiment: Experiment identifier (must match cyto output directory names)
- sample: Unique sample name (used for output files)
- mode: Processing mode (gex, crispr, or gex+crispr)
- features: Which library/libraries to use (must match mode with +)
- barcodes: Flex probe barcode assignments
The experiment and mode together must match cyto output directories with the following path structure:
cyto_output_directory/
└── [experiment_name]_[mode]_Lane*/
└── ...
Note: All Lanes for an Experiment+Mode will be concatenated. If you have differing barcode poolings based on Lane, you will need to adjust your Experiment name to reflect that.
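The sample requirements listed in this section can be checked before running aggregate. The following is a minimal sketch of such a check, not pycyto's own validation: `validate_config` is a hypothetical helper, and the rules it applies (all five keys present, unique sample names, one library per modality, every library declared under libraries) are read off the description above.

```python
def validate_config(config):
    """Return a list of human-readable problems; empty means none were found."""
    problems = []
    libraries = config.get("libraries", {})
    required = ("experiment", "sample", "mode", "features", "barcodes")
    seen_samples = set()
    for index, sample in enumerate(config.get("samples", [])):
        label = sample.get("sample", f"samples[{index}]")
        missing = [key for key in required if key not in sample]
        if missing:
            problems.append(f"{label}: missing keys {missing}")
            continue
        if label in seen_samples:
            problems.append(f"{label}: duplicate sample name")
        seen_samples.add(label)
        modes = sample["mode"].split("+")
        features = sample["features"].split("+")
        # One library per modality, joined with '+' in the same order.
        if len(modes) != len(features):
            problems.append(f"{label}: {len(modes)} mode(s) but {len(features)} feature set(s)")
        for name in features:
            if name not in libraries:
                problems.append(f"{label}: unknown library {name!r}")
    return problems


config = {
    "libraries": {"GEX_PROBES": "./gex_probes.tsv"},
    "samples": [
        {"experiment": "exp1", "mode": "gex", "features": "GEX_PROBES",
         "sample": "control", "barcodes": "BC1..4"},
        {"experiment": "exp1", "mode": "gex+crispr", "features": "GEX_PROBES",
         "sample": "broken", "barcodes": "BC5..8+CR5..8"},
    ],
}
problems = validate_config(config)
print(problems)
```

The second sample is reported because its mode names two modalities while its features name only one library.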
Barcode Syntax
Single barcode:
"barcodes": "BC001"
Barcode range (expands to BC001, BC002, BC003):
"barcodes": "BC1..3"
Non-contiguous selection:
"barcodes": "BC001|BC003|BC005"
Combined range and selection:
"barcodes": "BC1..3|BC005|BC7..9"
Multi-modal pairing (GEX + CRISPR on same cells):
"mode": "gex+crispr",
"features": "GEX_PROBE_LIST+CRISPR_PROBE_LIST",
"barcodes": "BC1..8+CR1..8"
Pairs BC001+CR001, BC002+CR002, ..., BC008+CR008
Multiple independent combinations:
"barcodes": "BC001+CR001|BC002+CR002"
Processes BC001+CR001 as one pair, BC002+CR002 as another
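The barcode syntax can be made concrete with a small expansion routine. This is a sketch of one way to interpret the expressions above, not pycyto's actual parser: `expand_part` and `expand_barcodes` are hypothetical names, the accepted prefixes (BC, CR, AB) come from the troubleshooting notes, and three-digit zero padding is an assumption based on names such as BC001.

```python
import re

_PART = re.compile(r"^(BC|CR|AB)(\d+)(?:\.\.(\d+))?$")


def expand_part(part):
    """Expand 'BC1..3' to ['BC001', 'BC002', 'BC003'] and 'BC005' to ['BC005']."""
    match = _PART.match(part)
    if match is None:
        raise ValueError(f"Invalid barcode format: {part!r}")
    prefix, start, end = match.group(1), int(match.group(2)), match.group(3)
    stop = int(end) if end is not None else start
    if stop < start:
        raise ValueError(f"Descending range: {part!r}")
    # Three-digit zero padding is an assumption based on names like BC001.
    return [f"{prefix}{number:03d}" for number in range(start, stop + 1)]


def expand_barcodes(spec):
    """Expand a full expression into a list of tuples, one per barcode combination."""
    combos = []
    for group in spec.split("|"):
        # '+' separates modalities that are paired position by position.
        columns = [expand_part(part) for part in group.split("+")]
        if len({len(column) for column in columns}) != 1:
            raise ValueError(f"Unequal range lengths in {group!r}")
        combos.extend(zip(*columns))
    return combos


print(expand_barcodes("BC1..3|BC005"))
print(expand_barcodes("BC1..2+CR1..2"))
```

Single-modality expressions yield one-element tuples, and paired expressions yield (GEX, CRISPR) tuples. An expression such as BC1-8 raises a ValueError in this sketch because it does not use the `..` range syntax.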
Configuration Examples
Example 1: Simple GEX experiment
{
"libraries": {
"GEX_PROBES": "./gex_probes.tsv"
},
"samples": [
{
"experiment": "exp1",
"mode": "gex",
"features": "GEX_PROBES",
"sample": "control",
"barcodes": "BC1..4"
},
{
"experiment": "exp1",
"mode": "gex",
"features": "GEX_PROBES",
"sample": "treatment",
"barcodes": "BC5..8"
}
]
}
Example 2: Multi-modal Perturb-seq
{
"libraries": {
"GEX_PROBES": "./gex_probes.tsv",
"GUIDE_LIBRARY": "./guides.tsv"
},
"samples": [
{
"experiment": "perturbseq_screen",
"mode": "gex+crispr",
"features": "GEX_PROBES+GUIDE_LIBRARY",
"sample": "screen_replicate1",
"barcodes": "BC1..8+CR1..8"
},
{
"experiment": "perturbseq_screen",
"mode": "gex+crispr",
"features": "GEX_PROBES+GUIDE_LIBRARY",
"sample": "screen_replicate2",
"barcodes": "BC9..16+CR9..16"
}
]
}
Example 3: Developmental timecourse
{
"libraries": {
"GEX_PROBES": "./gex_probes.tsv",
"GUIDES": "./guides.tsv"
},
"samples": [
{
"experiment": "timecourse_20250101",
"mode": "gex+crispr",
"features": "GEX_PROBES+GUIDES",
"sample": "day0",
"barcodes": "BC1..4+CR1..4"
},
{
"experiment": "timecourse_20250101",
"mode": "gex+crispr",
"features": "GEX_PROBES+GUIDES",
"sample": "day7",
"barcodes": "BC5..7+CR5..7"
},
{
"experiment": "timecourse_20250101",
"mode": "gex+crispr",
"features": "GEX_PROBES+GUIDES",
"sample": "day14",
"barcodes": "BC8..12+CR8..12"
}
]
}
Typical Workflows
Single Sample Conversion
# 1. Run cyto workflow
cyto workflow gex -c probes.tsv -w whitelist.txt -o cyto_out sample.vbq
# 2. Convert probe barcode to h5ad
pycyto convert cyto_out/counts/BC001.counts.mtx sample_BC001.h5ad
Multi-Modal Perturb-seq Aggregation
# 1. Run cyto for GEX and CRISPR
cyto workflow gex -c gex_probes.tsv -w whitelist.txt -p probes.txt -o cyto_out/perturbseq_GEX_Lane1 sample.vbq
cyto workflow crispr -c guides.tsv -w whitelist.txt -p probes.txt -o cyto_out/perturbseq_CRISPR_Lane1 sample.vbq
# 2. Create aggregation config
cat > config.json << 'EOF'
{
"libraries": {
"GEX": "./gex_probes.tsv",
"GUIDES": "./guides.tsv"
},
"samples": [
{
"experiment": "perturbseq",
"mode": "gex+crispr",
"features": "GEX+GUIDES",
"sample": "mysample",
"barcodes": "BC1..16+CR1..16"
}
]
}
EOF
# 3. Aggregate (merges GEX + CRISPR modalities)
pycyto aggregate config.json ./cyto_out ./aggr
# Output:
# ./aggr/mysample/
# ├── mysample_gex.h5ad # Gene expression with guide annotations
# ├── mysample_crispr.h5ad # Guide counts (filtered to GEX cells)
# ├── mysample_assignments.parquet # Guide assignments
# └── mysample_reads.parquet # QC statistics
Multiple Samples from Same Experiment
# 1. Run cyto workflows
cyto workflow gex -c probes.tsv -w whitelist.txt -p probes.txt -o cyto_out/exp_GEX_Lane1 samples.vbq
cyto workflow crispr -c guides.tsv -w whitelist.txt -p probes.txt -o cyto_out/exp_CRISPR_Lane1 samples.vbq
# 2. Create config assigning barcodes to biological samples
cat > config.json << 'EOF'
{
"libraries": {
"GEX": "./gex_probes.tsv",
"GUIDES": "./guides.tsv"
},
"samples": [
{
"experiment": "exp",
"mode": "gex+crispr",
"features": "GEX+GUIDES",
"sample": "control_rep1",
"barcodes": "BC1..2+CR1..2"
},
{
"experiment": "exp",
"mode": "gex+crispr",
"features": "GEX+GUIDES",
"sample": "control_rep2",
"barcodes": "BC3..4+CR3..4"
},
{
"experiment": "exp",
"mode": "gex+crispr",
"features": "GEX+GUIDES",
"sample": "treatment_rep1",
"barcodes": "BC5..6+CR5..6"
},
{
"experiment": "exp",
"mode": "gex+crispr",
"features": "GEX+GUIDES",
"sample": "treatment_rep2",
"barcodes": "BC7..8+CR7..8"
}
]
}
EOF
# 3. Aggregate all samples in parallel
pycyto aggregate config.json ./cyto_out ./aggr
Aggregation Details
What Gets Aggregated
For each sample, aggregate combines data across all specified flex barcodes:
GEX mode: Concatenates gene expression matrices
CRISPR mode: Concatenates guide count matrices
GEX + CRISPR mode:
- Merges guide assignments into GEX .obs metadata
- Filters CRISPR data to cells present in filtered GEX data
- Adds read/UMI statistics for both modalities
Cell Metadata in Aggregated h5ad
The aggregated GEX h5ad includes:
- experiment: Experiment identifier
- sample: Sample name
- flex_barcode: Original flex probe barcode (e.g., "BC001")
- lane_id: Sequencing lane identifier
- assignment: Assigned guide(s) from CRISPR data
- moi: Multiplicity of infection (number of guides per cell)
- umis: Guide UMI counts
- n_reads_gex: Total GEX reads per cell
- n_umis_gex: Total GEX UMIs per cell
- n_reads_crispr: Total CRISPR reads per cell
- n_umis_crispr: Total CRISPR UMIs per cell
Barcode Matching
When pairing GEX and CRISPR data:
- CRISPR barcodes (CR) are automatically converted to match GEX format (BC) for cell matching
- Cells are matched on: cell_barcode + flex_barcode + lane_id
- Only cells present in filtered GEX data are retained in CRISPR output
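The matching rule above can be sketched with plain tuples. This illustrates the described behavior under stated assumptions and is not pycyto's code: `cell_key` and `filter_crispr_to_gex` are hypothetical helpers, and the CR-to-BC conversion is assumed to be a simple prefix swap that keeps the numeric part.

```python
def cell_key(cell_barcode, flex_barcode, lane_id):
    """Build the match key; CR-prefixed flex barcodes are rewritten to BC."""
    if flex_barcode.startswith("CR"):
        flex_barcode = "BC" + flex_barcode[2:]
    return (cell_barcode, flex_barcode, lane_id)


def filter_crispr_to_gex(gex_cells, crispr_cells):
    """Keep only CRISPR records whose key is present among the GEX cells."""
    gex_keys = {cell_key(*cell) for cell in gex_cells}
    return [cell for cell in crispr_cells if cell_key(*cell) in gex_keys]


gex_cells = [("AAAC", "BC001", "Lane1"), ("AAAG", "BC002", "Lane1")]
crispr_cells = [
    ("AAAC", "CR001", "Lane1"),  # matches the first GEX cell
    ("AAAG", "CR002", "Lane2"),  # same barcodes, different lane: dropped
    ("TTTT", "CR001", "Lane1"),  # not in filtered GEX: dropped
]
kept = filter_crispr_to_gex(gex_cells, crispr_cells)
print(kept)
```

Including lane_id in the key prevents the same cell barcode seen in two lanes from being treated as one cell.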
Cyto Output Directory Structure
aggregate expects cyto outputs organized as:
cyto_outdir/
├── {experiment}_GEX_Lane*/
│ └── counts/
│ ├── BC001.h5ad
│ ├── BC002.h5ad
│ └── ...
└── {experiment}_CRISPR_Lane*/
├── counts/
│ ├── CR001.h5ad
│ └── ...
└── assignments/
├── CR001.assignments.tsv
└── ...
Where {experiment} matches the experiment field in your config.
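Before aggregating, it can help to confirm which lane directories a given experiment and mode will pick up. The sketch below builds a throwaway directory tree and globs it; `find_lane_dirs` is a hypothetical helper, and the upper-cased mode in the directory name (GEX, CRISPR) is taken from the layout shown above rather than from pycyto's source.

```python
import tempfile
from pathlib import Path


def find_lane_dirs(cyto_outdir, experiment, mode):
    """Return sorted lane directories for one experiment and modality."""
    # The upper-cased mode (GEX, CRISPR) follows the documented layout.
    pattern = f"{experiment}_{mode.upper()}_Lane*"
    return sorted(p for p in Path(cyto_outdir).glob(pattern) if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    for name in ("exp_GEX_Lane1", "exp_GEX_Lane2", "exp_CRISPR_Lane1", "other_GEX_Lane1"):
        (Path(tmp) / name / "counts").mkdir(parents=True)
    gex_dirs = [p.name for p in find_lane_dirs(tmp, "exp", "gex")]
    crispr_dirs = [p.name for p in find_lane_dirs(tmp, "exp", "crispr")]

print(gex_dirs, crispr_dirs)
```

An empty result for a sample's experiment and mode is a likely cause of the "No data found to process for sample" error.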
Advanced Usage
Processing Subset of Barcodes
Process only specific barcode combinations by modifying the config:
{
"samples": [
{
"experiment": "exp1",
"mode": "gex+crispr",
"features": "GEX+GUIDES",
"sample": "high_quality_subset",
"barcodes": "BC001+CR001|BC003+CR003|BC007+CR007"
}
]
}
Different Feature Libraries per Sample
Reference different probe/guide libraries:
{
"libraries": {
"TISSUE_PANEL": "./tissue_probes.tsv",
"IMMUNE_PANEL": "./immune_probes.tsv",
"SCREEN_GUIDES": "./guides.tsv"
},
"samples": [
{
"experiment": "exp1",
"mode": "gex+crispr",
"features": "TISSUE_PANEL+SCREEN_GUIDES",
"sample": "tissue_sample",
"barcodes": "BC1..8+CR1..8"
},
{
"experiment": "exp1",
"mode": "gex+crispr",
"features": "IMMUNE_PANEL+SCREEN_GUIDES",
"sample": "pbmc_sample",
"barcodes": "BC9..16+CR9..16"
}
]
}
Parallel Processing Control
# Use all available cores
pycyto aggregate config.json cyto_out output --threads -1
# Limit to 8 samples processed simultaneously
pycyto aggregate config.json cyto_out output --threads 8
# Single-threaded (minimal memory)
pycyto aggregate config.json cyto_out output --threads 1
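The effect of the --threads values above can be pictured as a worker count for a pool that handles one sample per worker. This is a sketch of the general pattern, not pycyto's scheduler: `resolve_workers` and `process_sample` are hypothetical, and capping the worker count at the number of samples is an assumption made for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_workers(threads, n_samples):
    """Map a --threads value to a worker count, treating -1 as all cores."""
    available = os.cpu_count() or 1
    requested = available if threads == -1 else max(1, threads)
    # There is no benefit in starting more workers than there are samples.
    return max(1, min(requested, n_samples))


def process_sample(name):
    """Stand-in for the per-sample aggregation work."""
    return f"{name}: done"


samples = ["control_rep1", "control_rep2", "treatment_rep1"]
workers = resolve_workers(threads=2, n_samples=len(samples))
with ThreadPoolExecutor(max_workers=workers) as pool:
    # map() returns results in input order, regardless of completion order.
    results = list(pool.map(process_sample, samples))
print(workers, results)
```

Because each worker holds one sample in memory at a time, lowering the worker count trades speed for a smaller peak memory footprint.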
Troubleshooting
Missing Files Error
Problem: "Expected Feature file does not exist"
Solution: Ensure MTX directory contains matrix.mtx, features.tsv, and barcodes.tsv
Barcode Format Errors
Problem: "Invalid barcode format"
Solution: Barcodes must be BC/CR/AB followed by numbers. Use range syntax: BC1..8 not BC1-8
No Data Found
Problem: "No data found to process for sample"
Solution:
- Verify experiment name in config matches cyto output directory prefix
- Check that barcode specifications match available probe barcodes in cyto output
- Ensure cyto outputs are in expected directory structure
Memory Issues
Problem: Out of memory during aggregation
Solution:
- Reduce --threads to process fewer samples concurrently
- Use --compress to reduce output file sizes
- Process samples in smaller batches with separate configs
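Splitting one configuration into several smaller ones, as the last suggestion describes, can be scripted. The sketch below is one possible approach: `split_config` is a hypothetical helper, and it assumes that every batch should carry the same top-level keys (such as libraries) as the original config.

```python
import copy
import json


def split_config(config, batch_size):
    """Split one config into several, each with at most batch_size samples."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    samples = config["samples"]
    batches = []
    for start in range(0, len(samples), batch_size):
        # Copy every top-level key except samples into the new config.
        batch = {key: copy.deepcopy(value) for key, value in config.items() if key != "samples"}
        batch["samples"] = copy.deepcopy(samples[start:start + batch_size])
        batches.append(batch)
    return batches


config = {
    "libraries": {"GEX": "./gex_probes.tsv", "GUIDES": "./guides.tsv"},
    "samples": [
        {"experiment": "exp", "mode": "gex+crispr", "features": "GEX+GUIDES",
         "sample": f"sample{i}", "barcodes": f"BC{i}..{i}+CR{i}..{i}"}
        for i in range(1, 4)
    ],
}
batches = split_config(config, batch_size=2)
for number, batch in enumerate(batches, start=1):
    print(f"config_batch{number}.json", json.dumps([s["sample"] for s in batch["samples"]]))
```

Each batch can then be written with `json.dump` and passed to a separate pycyto aggregate run.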
Performance Notes
- Parallel aggregation: Samples are processed independently in parallel (one per thread)
- Lazy loading: Uses anndata experimental lazy loading to minimize memory overhead
- Compression: Optional compression reduces h5ad file sizes ~40-50%
- Integer storage: --integer flag in convert reduces memory vs float32
See Also
- cyto: Main processing pipeline - https://github.com/arcinstitute/cyto
- anndata: Annotated data structures - https://anndata.readthedocs.io
Citation
If you use pycyto in your research, please cite:
Teyssier, N. and Dobin, A. (2025). cyto: ultra-high throughput processing
of 10x-flex single cell sequencing. bioRxiv.
Support
- Issues: https://github.com/arcinstitute/pycyto/issues
- Documentation: Run pycyto --help or pycyto <command> --help