
Utilities for the genomic chain format


chaintools_bio: utilities for the genomic chain format

This toolkit provides utilities to process whole-genome maps in the chain format.

A chain can be used to convert genomic information from the "target" coordinate system to the "query" coordinate system. For example, in the hg38ToHg19.over.chain file, hg38 occupies the target fields and hg19 occupies the query fields. Lift-over software such as UCSC LiftOver, CrossMap, or levioSAM can use the chain file to convert a locus from hg38 coordinates to hg19 coordinates.
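To make the target-to-query conversion concrete, here is a minimal Python sketch of how a single position could be lifted through one chain. It is not the implementation used by UCSC LiftOver, CrossMap, levioSAM, or this toolkit; the function name `lift_position` and the toy chain are made up for illustration, and the sketch assumes both strands are "+".

```python
from typing import List, Optional


def lift_position(header_line: str, data_lines: List[str], pos: int) -> Optional[int]:
    """Lift a 0-based target position to the query, or return None if it
    falls in a gap or outside the chain. Hypothetical helper; assumes both
    the target and query strands are '+'."""
    f = header_line.split()
    if f[4] != "+" or f[9] != "+":
        raise ValueError("this sketch only handles '+' strands")
    t_pos, q_pos = int(f[5]), int(f[10])  # tStart and qStart
    for line in data_lines:
        nums = [int(x) for x in line.split()]
        size = nums[0]
        # Inside an ungapped segment the offset is the same on both sides.
        if t_pos <= pos < t_pos + size:
            return q_pos + (pos - t_pos)
        t_pos += size
        q_pos += size
        if len(nums) == 3:  # "size dt dq": skip the gaps after the segment
            t_pos += nums[1]
            q_pos += nums[2]
    return None


# Toy chain (made up): three ungapped segments of 100, 50, and 100 bases.
header = "chain 1000 chr1 1000 + 100 400 chr2 2000 + 500 790 1"
data = ["100 50 0", "50 0 40", "100"]
print(lift_position(header, data, 120))  # inside the first segment
print(lift_position(header, data, 220))  # inside a target-only gap
```

Positions inside a gap have no counterpart on the other side, which is why the sketch returns None for them rather than guessing a nearby coordinate.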

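Utilities supported

  • Annotate
  • Convert to BED
  • Convert to PAF
  • Convert to SAM
  • Convert to VCF
  • Filter
  • Invert
  • Split
  • Stats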

Install

git clone git@github.com:milkschen/chaintools_bio.git
cd chaintools_bio
# Option 1: pip
pip install -e .
# Option 2: uv
uv pip install -e .

  • Requires Python 3.8+
  • See INSTALL.md for dependencies and installation instructions.

Usage

Annotate

Annotate a chain file:

  • Specify the contig and start/end positions of each segment
  • Calculate the sequence identity of each segment (optional)
  • Write liftable regions to a pair of BED files, one for target and one for query (optional)

# Annotate contig and positions
chaintools_bio annotate -c <in.chain> -o <out.chain>
# Add identity
chaintools_bio annotate -c <in.chain> -o <out.chain> -fs <target.fasta> -ft <query.fasta>
# Also write liftable regions to BED files
chaintools_bio annotate -c <in.chain> -o <out.chain> -fs <target.fasta> -ft <query.fasta> -b <bed_prefix>

Convert to BED

Convert a chain file to the BED format using either target or query coordinates.

# Report using the target coordinates
chaintools_bio to_bed -c <in.chain> -o <out.bed> --coord target
# Report using the query coordinates
chaintools_bio to_bed -c <in.chain> -o <out.bed> --coord query
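As a rough illustration of what reporting in target or query coordinates means, the Python sketch below walks one chain and emits a BED-style interval for every ungapped segment. The function name `chain_to_bed` and the toy chain are hypothetical, the sketch only handles "+" strands, and the actual `to_bed` output may differ (for example, it may merge or annotate segments).

```python
from typing import List, Tuple


def chain_to_bed(header_line: str, data_lines: List[str],
                 coord: str = "target") -> List[Tuple[str, int, int]]:
    """Return (contig, start, end) for each ungapped segment of one chain.
    Hypothetical helper; assumes the chosen side is on the '+' strand."""
    f = header_line.split()
    if coord == "target":
        name, strand, pos, gap_col = f[2], f[4], int(f[5]), 1
    elif coord == "query":
        name, strand, pos, gap_col = f[7], f[9], int(f[10]), 2
    else:
        raise ValueError("coord must be 'target' or 'query'")
    if strand != "+":
        raise ValueError("this sketch only handles the '+' strand")
    records = []
    for line in data_lines:
        nums = [int(x) for x in line.split()]
        size = nums[0]
        records.append((name, pos, pos + size))  # half-open, like BED
        pos += size
        if len(nums) == 3:  # "size dt dq": skip this side's gap
            pos += nums[gap_col]
    return records


# Toy chain (made up for illustration).
header = "chain 1000 chr1 1000 + 100 400 chr2 2000 + 500 790 1"
data = ["100 50 0", "50 0 40", "100"]
for contig, start, end in chain_to_bed(header, data, coord="query"):
    print(f"{contig}\t{start}\t{end}")
```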

Convert to PAF

Convert a chain file to the PAF format.

The target side of the chain is written as the PAF target sequence, and the query side of the chain is written as the PAF query sequence.

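If both target.fa and query.fa are provided, this script checks the reference sequences and updates the CIGAR string (cg:Z tag) using [=XID]+ operators. Otherwise, it uses [MID]+ operators, with [X]+ at chain breakpoints. A breakpoint is a gap with respect to both target and query, e.g., the chain line 149 341 2894 (a 149-base segment followed by a 341-base gap in the target and a 2894-base gap in the query).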

chaintools_bio to_paf -c <in.chain> -o <out.paf> [-t <target.fa> -q <query.fa>]
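The breakpoint definition above can be sketched in a few lines of Python. The helper `classify_gap` and its return labels are hypothetical names chosen for this example; they are not part of the chaintools_bio API.

```python
def classify_gap(data_line: str) -> str:
    """Classify the gap that follows an ungapped segment in a chain data line.
    A data line is "size dt dq"; the last line of a chain holds only "size"."""
    nums = [int(x) for x in data_line.split()]
    if len(nums) == 1:
        return "end_of_chain"
    if len(nums) != 3:
        raise ValueError(f"unexpected chain data line: {data_line!r}")
    _size, dt, dq = nums
    if dt > 0 and dq > 0:
        return "breakpoint"   # gap on both target and query
    if dt > 0:
        return "target_gap"   # bases present only in the target
    if dq > 0:
        return "query_gap"    # bases present only in the query
    return "no_gap"


print(classify_gap("149 341 2894"))
```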

Convert to SAM

Convert a chain file to the SAM format, using the target fasta file for the genome from which the chain lifts, and the query fasta file for the genome to which the chain lifts.

chaintools_bio to_sam -c <in.chain> -t <target.fa> -q <query.fa> -o <out.sam>

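Note: For a chain file used to convert from a target genome's coordinates to a query genome's coordinates, each chain header line carries the target data (name, size, strand, start, end) in the second through sixth fields and the query data in the seventh through eleventh fields. This counts the score as the first field, i.e., it does not count the leading "chain" keyword.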
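The header layout described in the note can be made explicit with a small parser. This is a sketch based on the UCSC chain format, not code taken from chaintools_bio; `ChainHeader` and `parse_chain_header` are hypothetical names, and the example header line is made up.

```python
from typing import NamedTuple


class ChainHeader(NamedTuple):
    score: int
    t_name: str
    t_size: int
    t_strand: str
    t_start: int
    t_end: int
    q_name: str
    q_size: int
    q_strand: str
    q_start: int
    q_end: int
    chain_id: str


def parse_chain_header(line: str) -> ChainHeader:
    """Parse a line of the form
    'chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id'."""
    fields = line.split()
    if len(fields) != 13 or fields[0] != "chain":
        raise ValueError(f"not a chain header line: {line!r}")
    (score, t_name, t_size, t_strand, t_start, t_end,
     q_name, q_size, q_strand, q_start, q_end, chain_id) = fields[1:]
    return ChainHeader(int(score),
                       t_name, int(t_size), t_strand, int(t_start), int(t_end),
                       q_name, int(q_size), q_strand, int(q_start), int(q_end),
                       chain_id)


# Made-up header line for illustration.
hdr = parse_chain_header("chain 1000 chr1 1000 + 100 400 chr2 2000 + 500 790 1")
print(hdr.t_name, hdr.t_start, hdr.t_end)  # target side
print(hdr.q_name, hdr.q_start, hdr.q_end)  # query side
```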

Convert to VCF

Convert a chain file to the VCF format, using the target fasta file for the genome from which the chain lifts, and the query fasta file for the genome to which the chain lifts.

chaintools_bio to_vcf -c <in.chain> -t <target.fa> -q <query.fa> -o <out.vcf>

Filter

Filter a chain file by criteria including chain size and overlap status. The size of a chain is the sum of all its segments, including matches (=) and mismatches (X). The overlap filter ensures that no chains overlap with respect to either the target or the query reference. If two chains overlap, the smaller one is removed.

# Filter by chain size
chaintools_bio chain_filter -c <in.chain> -o <out.filtered.chain> -s <size>
# Filter by both chain size and overlap status
chaintools_bio chain_filter -c <in.chain> -o <out.filtered.chain> -u -oc <out.overlapped.chain> -s <size>
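The two criteria can be illustrated with a small Python sketch. It is one plausible reading of the description, not the toolkit's actual algorithm: it assumes that chains with size of at least the threshold are kept, that overlaps are resolved greedily from the largest chain down, and that query coordinates are already expressed on the forward strand. The names `Chain`, `chain_size`, and `filter_chains` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Chain(NamedTuple):
    t_name: str
    t_start: int
    t_end: int
    q_name: str
    q_start: int
    q_end: int
    size: int  # sum of ungapped segment sizes


def chain_size(data_lines: List[str]) -> int:
    """Sum the first column (segment size) of every chain data line."""
    return sum(int(line.split()[0]) for line in data_lines)


def _overlap(name_a: str, start_a: int, end_a: int,
             name_b: str, start_b: int, end_b: int) -> bool:
    # Half-open intervals on the same contig overlap if each starts before the other ends.
    return name_a == name_b and start_a < end_b and start_b < end_a


def filter_chains(chains: List[Chain], min_size: int) -> Tuple[List[Chain], List[Chain]]:
    """Return (kept, removed_for_overlap). Chains below min_size are dropped."""
    kept: List[Chain] = []
    removed: List[Chain] = []
    for c in sorted(chains, key=lambda ch: ch.size, reverse=True):
        if c.size < min_size:
            continue
        clash = any(
            _overlap(c.t_name, c.t_start, c.t_end, k.t_name, k.t_start, k.t_end)
            or _overlap(c.q_name, c.q_start, c.q_end, k.q_name, k.q_start, k.q_end)
            for k in kept
        )
        (removed if clash else kept).append(c)
    return kept, removed


# Made-up chains for illustration.
a = Chain("chr1", 0, 1000, "chr1", 0, 1000, 900)
b = Chain("chr1", 500, 800, "chr2", 0, 300, 250)    # overlaps a on the target
c = Chain("chr1", 2000, 2100, "chr3", 0, 100, 80)   # below the size threshold
d = Chain("chr2", 0, 400, "chr5", 0, 400, 400)
kept, removed = filter_chains([b, c, a, d], min_size=100)
print(kept)
print(removed)
```

Processing the chains from largest to smallest means a chain is only ever compared against chains that are at least as large, so the smaller member of an overlapping pair is the one set aside.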

Invert

Invert a chain file by switching the target and query references.

chaintools_bio invert -c <a_to_b.chain> -o <b_to_a.chain>
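For intuition, inverting a chain whose target and query are both on the "+" strand amounts to swapping the two halves of the header and the two gap columns of each data line. The Python sketch below shows only that simple case; `invert_chain` is a hypothetical helper, and chains with a "-" strand need extra coordinate handling that this sketch deliberately refuses to attempt.

```python
from typing import List, Tuple


def invert_chain(header_line: str, data_lines: List[str]) -> Tuple[str, List[str]]:
    """Swap target and query in one chain. Assumes both strands are '+'."""
    f = header_line.split()
    if len(f) != 13 or f[0] != "chain":
        raise ValueError(f"not a chain header line: {header_line!r}")
    if f[4] != "+" or f[9] != "+":
        raise ValueError("this sketch only handles '+' strands")
    # Header: chain score | 5 target fields | 5 query fields | id
    new_header = " ".join(f[:2] + f[7:12] + f[2:7] + f[12:])
    new_data = []
    for line in data_lines:
        nums = line.split()
        if len(nums) == 3:  # "size dt dq" becomes "size dq dt"
            nums = [nums[0], nums[2], nums[1]]
        new_data.append(" ".join(nums))
    return new_header, new_data


# Made-up chain for illustration.
header = "chain 1000 chr1 1000 + 100 400 chr2 2000 + 500 790 1"
data = ["100 50 0", "50 0 40", "100"]
inv_header, inv_data = invert_chain(header, data)
print(inv_header)
print("\n".join(inv_data))
```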

Split

Split a chain at large gaps or breakpoints. A breakpoint is a gap with respect to both target and query, e.g., 149 341 2894.

chaintools_bio split -c <in.chain> -o <split.chain> [--min_gap <INT> --min_bp <INT>]

Stats

Calculate summary statistics of a chain file.

chaintools_bio stats -c <in.chain> -o <stats.tsv>

Download files

Source Distribution

chaintools_bio-0.4.1.tar.gz (321.7 kB view details)

Uploaded Source

Built Distribution

chaintools_bio-0.4.1-py3-none-any.whl (25.3 kB view details)

Uploaded Python 3

File details

Details for the file chaintools_bio-0.4.1.tar.gz.

File metadata

  • Download URL: chaintools_bio-0.4.1.tar.gz
  • Upload date:
  • Size: 321.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for chaintools_bio-0.4.1.tar.gz
Algorithm Hash digest
SHA256 41e23f82e936d888d0729ba73655a805f06f938ea1de2164dd55b989a426d430
MD5 56dd48df1f7a4a014d1636fc369be14f
BLAKE2b-256 446eabd1e059e06975385c0738556ebdc2469cf1a14c98d433fcc001cc38928e

File details

Details for the file chaintools_bio-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: chaintools_bio-0.4.1-py3-none-any.whl
  • Upload date:
  • Size: 25.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for chaintools_bio-0.4.1-py3-none-any.whl
Algorithm Hash digest
SHA256 589f0bf2ae82d48f495b5cf6d7733ad1881d4bfef68000f042c8a1e7c419f700
MD5 d0f708b3fbd682cda97c5ae76de99549
BLAKE2b-256 161f260841c19235592dca9f805e842140235f4d268401f20371600f1fc3d9c3
