SniffCell: Annotate SV cell types based on CpG methylation

SniffCell - annotate structural variants with methylation-derived cell-type signals

SniffCell analyzes long-read methylation around SVs and provides cell-type-aware annotations.

Version

Current package version in code: v0.6.0.

Install

pip install sniffcell

For local development:

pip install -e .

CLI commands

sniffcell {find,deconv,anno,svanno,dmsv,viz}

Command status in the current code:

  • find: implemented.
  • anno: implemented.
  • svanno: implemented.
  • dmsv: implemented.
  • viz: implemented.
  • deconv: placeholder stub (currently prints args only).

Input assumptions

  • BAM: long-read BAM with modified base tags; HP haplotype tag is optional.
  • Reference: FASTA indexed for region fetches.
  • VCF: INS and DEL records are used.
  • VCF INFO field RNAMES is used for supporting reads unless overridden by --kanpig_read_names.
  • VCF INFO fields STDEV_POS, STDEV_LEN, SVLEN are used to derive ref_start and ref_end windows.
  • BED for anno: one tab-delimited hierarchical DMR file from sniffcell find with at least the columns chr, start, end, best_group, best_dir.
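A quick pre-flight check of the BED/TSV passed to anno can catch a missing column early. The sketch below is not part of SniffCell; it assumes the file has a header row naming its columns, and the helper name missing_columns and the toy values are made up for illustration.

```python
import csv
import io

# Columns the README lists as the minimum for `sniffcell anno`.
REQUIRED_COLUMNS = ("chr", "start", "end", "best_group", "best_dir")

def missing_columns(handle):
    """Return the required columns absent from a tab-delimited header line."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])
    # Tolerating a leading '#' on the header is an assumption, not documented behavior.
    names = {name.lstrip("#").strip() for name in header}
    return [col for col in REQUIRED_COLUMNS if col not in names]

# Toy header and row; the values (including "hypo") are made up.
toy_bed = io.StringIO(
    "chr\tstart\tend\tbest_group\tbest_dir\n"
    "chr1\t1000\t1500\tT-cell\thypo\n"
)
print(missing_columns(toy_bed))  # []
```

For a real file, pass an open text handle, for example open("pbmc_hierarchy.tsv", newline="").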

find: call hierarchical ctDMRs from atlas matrices

Finds cell-type-specific DMRs (ctDMRs) from an explicit hierarchy schema in atlas/index_to_major_celltypes.json, then writes one annotation-ready BED/TSV.

Hierarchy schema:

  • Add a top-level __hierarchy__ object.
  • Define each hierarchy key with source_key and optional children.
  • Each child can point to another source_key and optional groups.

Example:

"__hierarchy__": {
  "pbmc-lymphocytes": {
    "source_key": "pbmc-lymphocytes",
    "children": {
      "lymphocytes": {
        "source_key": "pbmc",
        "groups": ["T-cell", "NK-cell", "B-cell"]
      }
    }
  }
}
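To see what a schema like the one above contains, it can help to list every node with its source_key and groups. The following is a minimal sketch, not SniffCell code: walk_hierarchy is a hypothetical helper, and it assumes children may nest to any depth, which the README does not state explicitly.

```python
import json

# The example schema from above, wrapped in braces so it parses as JSON.
schema_text = """
{
  "__hierarchy__": {
    "pbmc-lymphocytes": {
      "source_key": "pbmc-lymphocytes",
      "children": {
        "lymphocytes": {
          "source_key": "pbmc",
          "groups": ["T-cell", "NK-cell", "B-cell"]
        }
      }
    }
  }
}
"""

def walk_hierarchy(nodes, path=()):
    """Yield (path, source_key, groups) for every node, parents before children."""
    for name, node in nodes.items():
        node_path = path + (name,)
        yield node_path, node.get("source_key"), node.get("groups", [])
        # Recurse into optional children; a missing key means a leaf node.
        yield from walk_hierarchy(node.get("children", {}), node_path)

hierarchy = json.loads(schema_text)["__hierarchy__"]
for node_path, source_key, groups in walk_hierarchy(hierarchy):
    print("/".join(node_path), source_key, groups)
```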

Example command:

sniffcell find \
  -n atlas/all_celltypes_blocks.npy \
  -i atlas/all_celltypes_blocks.index.gz \
  -cf atlas/index_to_major_celltypes.json \
  -m atlas/all_celltypes.txt \
  -ck pbmc-lymphocytes \
  -o pbmc_hierarchy.tsv \
  --diff_threshold 0.40 \
  --min_rows 2 \
  --min_cpgs 3 \
  --max_gap_bp 500

Outputs:

  • <output>: annotation-ready hierarchical BED/TSV for sniffcell anno.
  • <output>.igv.bed: companion IGV BED9 (headerless, IGV-ready).

Key columns in <output> include:

  • best_group, best_dir
  • code_order (global leaf schema)
  • best_group_leaves, other_group_leaves
  • hierarchy_level, hierarchy_path, hierarchy_source_key
  • per-node means (mean_<group>)

anno: annotate SVs with one hierarchical BED file

anno processes DMR regions near SVs, classifies reads per region, then summarizes per-SV assignment.
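The "near SVs" step can be pictured as an interval test on the same chromosome. This is a rough sketch under assumptions: it treats the window as a symmetric base-pair padding around each SV interval, and regions_near_svs with its tuple layout is hypothetical, not the package's internal function.

```python
def regions_near_svs(regions, svs, window):
    """Keep regions lying within `window` bp of any SV on the same chromosome."""
    kept = []
    for chrom, start, end in regions:
        for sv_chrom, sv_start, sv_end in svs:
            if chrom != sv_chrom:
                continue
            # Overlap test against the SV interval padded by `window` on both sides.
            if start <= sv_end + window and end >= sv_start - window:
                kept.append((chrom, start, end))
                break
    return kept

# Made-up coordinates for illustration.
svs = [("chr1", 50_000, 50_300)]
regions = [
    ("chr1", 45_000, 45_400),  # within 10 kb of the SV
    ("chr1", 80_000, 80_200),  # same chromosome, too far away
    ("chr2", 50_000, 50_100),  # same coordinates, different chromosome
]
print(regions_near_svs(regions, svs, window=10_000))  # [('chr1', 45000, 45400)]
```

The nested loop is quadratic and only meant to show the rule; a sorted or indexed lookup would be the usual choice for genome-scale inputs.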

Basic example:

sniffcell anno \
  -i sample.bam \
  -v sample.vcf.gz \
  -r ref.fa \
  -b pbmc_hierarchy.tsv \
  -o anno_out \
  -w 10000 \
  -t 8

anno outputs:

  • reads_classification.tsv: per-read region-level assignments.
  • blocks_classification.tsv: per-region methylation summaries.
  • sv_assignment.tsv: SV-level assignment summary (produced by running svanno internally at the end of anno).
  • sv_assignment_readable.tsv: readable SV summary focused on classified cell types per SV.
  • sv_assignment_readable_long.tsv: long-format SV x celltype table with counts/fractions.
  • anno_run_manifest.json: run log/manifest with input paths and outputs (used by sniffcell viz --anno_output).

SV assignment options (available in both anno and svanno):

  • --evidence_mode {all_rows,per_read}: how ctDMR evidence is aggregated for each SV.
  • --min_overlap_pct: minimum overlap fraction required to keep assigned_code.
  • --min_agreement_pct: minimum majority agreement required to keep assigned_code.

Defaults are strict:

  • --evidence_mode all_rows (uses every supporting-read x ctDMR row; no per-read vote collapse)
  • --min_agreement_pct 1.0 (any conflicting code makes assigned_code empty / unreliable)
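The difference between the two evidence modes is easiest to see on a toy example. The sketch below is illustrative only: majority_agreement is a hypothetical helper, and the per_read collapse rule used here (each read votes once with its most common code) is an assumption, since the README does not spell out how the collapse is done.

```python
from collections import Counter

def majority_agreement(rows, evidence_mode="all_rows"):
    """Return (majority_code, agreement_fraction) for (read_name, code) rows."""
    if evidence_mode == "per_read":
        # Assumed collapse rule: each read votes once, with its most common code.
        by_read = {}
        for read_name, code in rows:
            by_read.setdefault(read_name, Counter())[code] += 1
        votes = [counts.most_common(1)[0][0] for counts in by_read.values()]
    else:
        # all_rows: every supporting-read x ctDMR row counts as its own vote.
        votes = [code for _, code in rows]
    if not votes:
        return None, 0.0
    code, n = Counter(votes).most_common(1)[0]
    return code, n / len(votes)

# Made-up rows: read r1 overlaps three ctDMRs, read r2 overlaps one.
rows = [("r1", "1110"), ("r1", "1110"), ("r1", "1000"), ("r2", "1110")]
print(majority_agreement(rows, "all_rows"))  # ('1110', 0.75)
print(majority_agreement(rows, "per_read"))  # ('1110', 1.0)
```

With a threshold of 1.0, the all_rows result (0.75) would not keep an assigned code, while the per_read result (1.0) would; this is why the all_rows default is the stricter of the two.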

Conflict rule:

  • assigned_code is forced empty when evidence has a hard conflict (has_hard_conflict=True), i.e. code constraints intersect to an empty set (for example 1110 with 0001 in the same schema).
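The conflict rule can be read as a bitwise AND over all observed codes in one schema: if no bit survives, the constraints are mutually incompatible. A minimal sketch follows, assuming codes are equal-length strings of 0 and 1; intersect_codes is a hypothetical name, not a SniffCell function.

```python
def intersect_codes(codes):
    """Bitwise-AND equal-length bit strings; return (intersection, has_hard_conflict)."""
    if not codes:
        return "", False
    width = len(codes[0])
    if any(len(code) != width for code in codes):
        raise ValueError("codes must share one schema (equal length)")
    mask = (1 << width) - 1  # start with every leaf still allowed
    for code in codes:
        mask &= int(code, 2)
    intersection = format(mask, f"0{width}b")
    # An all-zero intersection means no leaf satisfies every constraint.
    return intersection, mask == 0

print(intersect_codes(["1110", "0001"]))  # ('0000', True)
print(intersect_codes(["1110", "0110"]))  # ('0110', False)
```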

How hierarchical codes are handled

  1. One BED/TSV from find is loaded.
  2. Regions are filtered by SV proximity with --window.
  3. Every kept region is processed independently to generate per-read codes.
  4. code_order defines the shared leaf-level bit schema.
  5. best_group_leaves defines which bits are set for the target cluster in each DMR.
  6. During SV assignment, reads are linked to SVs by chromosome-aware interval matching (--window), then evidence is aggregated by --evidence_mode (all_rows by default; per_read is optional).
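Steps 4 and 5 can be sketched as building one bit per leaf in code_order and setting the bits named in best_group_leaves. This is an illustration under assumptions: both columns are taken as already parsed into Python lists (their on-disk encoding is not described here), leaves_to_code is a hypothetical helper, and the fourth leaf name is invented. Whether a given read receives this code or its complement depends on the read's methylation and best_dir, which the sketch does not model.

```python
def leaves_to_code(code_order, best_group_leaves):
    """Return a bit string with one character per leaf in code_order."""
    unknown = set(best_group_leaves) - set(code_order)
    if unknown:
        raise ValueError(f"leaves not in code_order: {sorted(unknown)}")
    target = set(best_group_leaves)
    # Bit i is 1 when the i-th leaf of the shared schema belongs to the target cluster.
    return "".join("1" if leaf in target else "0" for leaf in code_order)

# Hypothetical leaf schema; "Monocyte" is made up for illustration.
code_order = ["T-cell", "NK-cell", "B-cell", "Monocyte"]
print(leaves_to_code(code_order, ["T-cell", "NK-cell", "B-cell"]))  # 1110
print(leaves_to_code(code_order, ["Monocyte"]))                     # 0001
```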

svanno: recompute SV-level assignment from precomputed read classifications

Use this when you already have reads_classification.tsv and want to regenerate SV summaries.

Example:

sniffcell svanno \
  -v sample.vcf.gz \
  -i anno_out/reads_classification.tsv \
  -w 10000 \
  --evidence_mode all_rows \
  --min_agreement_pct 1.0 \
  -o anno_out

Output:

  • sv_assignment.tsv
  • sv_assignment_readable.tsv
  • sv_assignment_readable_long.tsv

Readable summary columns include:

  • id, sv_chr, sv_pos, sv_len, vaf
  • n_supporting, n_overlapped, overlap_pct, majority_pct
  • classified_celltypes, classified_celltype_count
  • classified_celltype_counts, classified_celltype_fractions, classification_summary
  • is_multi_celltype_link

sv_assignment.tsv also includes:

  • has_hard_conflict: whether constraints are mutually incompatible.
  • intersection_code: bitwise intersection of observed constraints in the dominant schema.

Long-format columns include:

  • id, sv_chr, sv_pos, sv_len
  • celltype, rank, supporting_read_count, supporting_read_fraction
  • n_supporting, n_overlapped, overlap_pct
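The long-format table lends itself to simple downstream summaries, such as picking the best-ranked cell type per SV. The sketch below uses only column names listed above; the rows are toy values, top_celltype_per_sv is a hypothetical helper, and it assumes rank is an integer with 1 as the best rank.

```python
import csv
import io

# Toy rows using the documented long-format column names; all values are made up.
toy_long = io.StringIO(
    "id\tsv_chr\tsv_pos\tsv_len\tcelltype\trank\t"
    "supporting_read_count\tsupporting_read_fraction\n"
    "SV1\tchr1\t50000\t300\tT-cell\t1\t6\t0.75\n"
    "SV1\tchr1\t50000\t300\tB-cell\t2\t2\t0.25\n"
    "SV2\tchr2\t12000\t150\tNK-cell\t1\t4\t1.0\n"
)

def top_celltype_per_sv(handle):
    """Map each SV id to (celltype, fraction) of its best-ranked row."""
    best = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        rank = int(row["rank"])  # assumption: lower rank means stronger support
        if row["id"] not in best or rank < best[row["id"]][0]:
            best[row["id"]] = (rank, row["celltype"], float(row["supporting_read_fraction"]))
    return {sv_id: (celltype, frac) for sv_id, (_, celltype, frac) in best.items()}

print(top_celltype_per_sv(toy_long))
```

To run it on real output, replace toy_long with an open handle on anno_out/sv_assignment_readable_long.tsv.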

viz: visualize one SV with reads and ctDMR overlap

Generate a figure (PNG/PDF) centered on one SV ID, showing:

  • all reads in SV +/- window (supporting reads highlighted),
  • SV interval,
  • overlapping ctDMRs from a find BED/TSV,
  • all cell-type methylation values on those ctDMRs from mean_* columns (heatmap panel).

Simple example (from an anno output folder):

sniffcell viz \
  --anno_output anno_out \
  -s sniffles.SV123 \
  -o anno_out/sniffles.SV123

Outputs:

  • Default output: anno_out/sniffles.SV123.png (or .pdf)
  • Add --export_tables if you also want TSV outputs (.summary.tsv, .supporting_reads_assignment.tsv, .supporting_reads_ctdmr_methylation.tsv)

dmsv: test differential methylation around SVs

Computes per-CpG statistics between supporting and non-supporting reads near each SV.

Example:

sniffcell dmsv \
  -i sample.bam \
  -v sample.vcf.gz \
  -r ref.fa \
  -o dmsv_out \
  -m 3 \
  -f 1000 \
  -c 5 \
  -t 8

Outputs:

  • dmsv_out/significant_SVs.tsv: per-SV summary including significance counts and effect summaries.
  • dmsv_out/sv_details/<sv_id>.tsv.gz: per-CpG stats table for each SV.

Current implementation note:

  • dmsv parses --test_type, but the current backend path uses consistency-aware Mann-Whitney U (MWU) screening in statistical_test_around_sv.py.
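For readers unfamiliar with the underlying test, here is a plain Mann-Whitney U comparison of per-read methylation values at one CpG. It is a generic textbook sketch, not the code in statistical_test_around_sv.py: the consistency-aware part is not modeled, the p-value uses a tie-corrected normal approximation without continuity correction, and the input values are made up.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups need at least one value")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1..j
        t = j - i                      # size of this tie group
        tie_term += t**3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return u1, 1.0  # all values tied: no evidence of a difference
    z = (u1 - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of the normal
    return u1, p

# Made-up per-read methylation values at a single CpG.
supporting = [0.9, 0.8, 0.95, 0.85, 0.9]
non_supporting = [0.1, 0.2, 0.15, 0.3, 0.2]
u_stat, p_value = mann_whitney_u(supporting, non_supporting)
print(u_stat, p_value)
```

For small groups an exact test is usually preferable to the normal approximation; a statistics library would normally be used rather than hand-rolled code.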

deconv

The deconv CLI arguments exist, but the implementation is currently a placeholder (deconv_main only prints its arguments).

Practical example

sniffcell anno \
  -i data/sample.bam \
  -v data/sample.vcf.gz \
  -b dmrs/pbmc_hierarchy.tsv \
  -o results/anno.w10000 \
  -r refs/GRCh38.fa \
  -w 10000 \
  -t 8
