CLI tools to process mapped Hi-C data

pairtools

Process Hi-C pairs with pairtools

pairtools is a simple and fast command-line framework to process sequencing data from a Hi-C experiment.

pairtools processes paired-end sequence alignments and performs the following operations:

  • detect ligation junctions (a.k.a. Hi-C pairs) in aligned paired-end sequences of Hi-C DNA molecules
  • sort .pairs files for downstream analyses
  • detect, tag and remove PCR/optical duplicates
  • generate extensive statistics of Hi-C datasets
  • select Hi-C pairs given flexibly defined criteria
  • restore .sam alignments from Hi-C pairs
  • annotate restriction digestion sites
  • get the mutated positions in Hi-C pairs

Data formats

pairtools produces and operates on tab-separated files compliant with the .pairs format defined by the 4D Nucleome Consortium. All pairtools commands properly manage file headers and keep track of the data processing history.

Additionally, pairtools defines the .pairsam format, an extension of .pairs that includes the SAM alignments of a sequenced Hi-C molecule. .pairsam complies with the .pairs format and can be processed by any tool that operates on .pairs files.

pairtools can also produce a set of extra columns, which describe properties of alignments, phase, mutations, restriction and complex walks. The full list of possible extra columns is provided in the pairtools format specification.
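To make the layout described above more concrete, here is a minimal sketch that splits a small in-memory .pairs text into header and data lines and reads the column names from the `#columns:` header line. The sample content and the function name `read_pairs_text` are hypothetical, and the header keys reflect our reading of the 4DN .pairs specification; this is not how pairtools itself parses files.

```python
import io

# Hypothetical toy .pairs content; real files are usually compressed.
SAMPLE = (
    "## pairs format v1.0\n"
    "#chromsize: chrI 230218\n"
    "#chromsize: chrII 813184\n"
    "#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type\n"
    "read1\tchrI\t100\tchrII\t5000\t+\t-\tUU\n"
    "read2\tchrI\t250\tchrI\t900\t-\t+\tUU\n"
)


def read_pairs_text(handle):
    """Return (chromsizes, columns, rows) from an open .pairs text handle."""
    chromsizes, columns, rows = {}, [], []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#chromsize:"):
            _, name, size = line.split()
            chromsizes[name] = int(size)
        elif line.startswith("#columns:"):
            columns = line.split()[1:]
        elif line.startswith("#") or not line:
            continue  # other header lines, e.g. "## pairs format v1.0"
        else:
            # Data lines are TAB-separated; map each field to its column name.
            rows.append(dict(zip(columns, line.split("\t"))))
    return chromsizes, columns, rows


chromsizes, columns, rows = read_pairs_text(io.StringIO(SAMPLE))
print(columns)
print(rows[0]["pair_type"], rows[1]["pos2"])
```

Because the column names come from the header rather than from fixed positions, the same reader would also pick up extra columns if they are listed in `#columns:`.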

Installation

Requirements:

  • Python 3.x
  • Python packages cython, pysam, bioframe, pyyaml, numpy, scipy, pandas and click.
  • Command-line utilities sort (the Unix version), samtools and bgzip (shipped with samtools). If available, pairtools can compress outputs with pbgzip and lz4.

For the full list of recommended versions, see the requirements section in the pyproject.toml.
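Before installing, it can help to check which of the command-line utilities listed above are already on your PATH. The following is a small sketch using only the Python standard library; the helper name `missing_tools` is ours, and it only checks that an executable with that name exists, not its version or whether `sort` is the Unix one.

```python
import shutil

# Utilities named in the requirements list above.
REQUIRED = ["sort", "samtools", "bgzip"]
OPTIONAL = ["pbgzip", "lz4"]


def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


print("missing required:", missing_tools(REQUIRED))
print("missing optional:", missing_tools(OPTIONAL))
```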

There are three options for installing pairtools:

  1. We highly recommend using the conda package manager to install pairtools together with all its dependencies. To get it, you can either install the full Anaconda Python distribution or just the standalone conda package manager.

With conda, you can install pairtools and all of its dependencies from the bioconda channel:

$ conda install -c conda-forge -c bioconda pairtools

  2. Alternatively, install non-Python dependencies (sort, samtools, bgzip, pbgzip and lz4) separately and download pairtools with Python dependencies from PyPI using pip:

$ pip install pairtools

  3. Finally, when the two options above don't work or when you want to modify pairtools, build pairtools from source via pip's "editable" mode:

$ pip install numpy cython pysam
$ git clone https://github.com/open2c/pairtools
$ cd pairtools
$ pip install -e ./ --no-build-isolation

Quick example

Set up a new test folder and download a small Hi-C dataset mapped to the sacCer3 genome:

$ mkdir /tmp/test-pairtools
$ cd /tmp/test-pairtools
$ wget https://github.com/open2c/distiller-test-data/raw/master/bam/MATalpha_R1.bam

Additionally, we will need a .chromsizes file, a TAB-separated plain text table describing the names, sizes and the order of chromosomes in the genome assembly used during mapping:

$ wget https://raw.githubusercontent.com/open2c/distiller-test-data/master/genome/sacCer3.reduced.chrom.sizes
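As an illustration of what such a table carries, the sketch below reads a TAB-separated chromsizes table into a dictionary that preserves the order of chromosomes, and derives a rank for each chromosome. The function name `read_chromsizes` and the in-memory toy table are hypothetical stand-ins for the downloaded file.

```python
import io


def read_chromsizes(handle):
    """Parse a TAB-separated chromsizes table into an insertion-ordered dict."""
    sizes = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, length = line.split("\t")[:2]
        sizes[name] = int(length)
    return sizes


# Toy stand-in for a downloaded .chromsizes file.
demo = io.StringIO("chrI\t230218\nchrII\t813184\nchrIII\t316620\n")
sizes = read_chromsizes(demo)

# The position of each chromosome in the file defines its order.
order = {name: rank for rank, name in enumerate(sizes)}
print(sizes)
print(order)
```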

With pairtools parse, we can convert paired-end sequence alignments stored in .sam/.bam format into .pairs, a TAB-separated table of Hi-C ligation junctions:

$ pairtools parse -c sacCer3.reduced.chrom.sizes -o MATalpha_R1.pairs.gz --drop-sam MATalpha_R1.bam 

Inspect the resulting table:

$ less MATalpha_R1.pairs.gz
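The table can also be inspected programmatically. Below is a minimal sketch that streams a gzip-compressed .pairs file and counts the values in the pair_type column, locating that column from the `#columns:` header line. The helper name `count_pair_types` and the toy file are hypothetical; to try it on the real output, pass the path to MATalpha_R1.pairs.gz instead.

```python
import gzip
import os
import tempfile
from collections import Counter


def count_pair_types(path):
    """Count pair_type values in a gzip-compressed .pairs file."""
    counts = Counter()
    col = 7  # assumed position of pair_type if no #columns line is found
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#columns:"):
                col = line.split()[1:].index("pair_type")
            elif not line.startswith("#"):
                counts[line.rstrip("\n").split("\t")[col]] += 1
    return counts


# Toy file standing in for a real .pairs.gz output.
toy = (
    "## pairs format v1.0\n"
    "#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type\n"
    "r1\tchrI\t10\tchrI\t500\t+\t-\tUU\n"
    "r2\tchrI\t20\tchrII\t70\t+\t+\tUU\n"
    "r3\t!\t0\tchrII\t90\t-\t+\tNU\n"
)
toy_path = os.path.join(tempfile.mkdtemp(), "toy.pairs.gz")
with gzip.open(toy_path, "wt") as out:
    out.write(toy)

toy_counts = count_pair_types(toy_path)
print(dict(toy_counts))
```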

Pipelines

  • We provide a simple working example of a mapping bash pipeline in /examples/.
  • distiller is a powerful Hi-C data analysis workflow, based on pairtools and nextflow.

Tools

  • parse: read .sam/.bam files produced by bwa and form Hi-C pairs

    • form Hi-C pairs by reporting the outer-most mapped positions and the strand on either side of each molecule;
    • report unmapped, multimapped (ambiguous) and chimeric alignments as chromosome "!", position 0, strand "-";
    • perform upper-triangular flipping of the sides of Hi-C molecules such that the first side has a lower sorting index than the second side;
    • form hybrid pairsam output, where each line contains all available data for one Hi-C molecule (outer-most mapped positions on either side, read ID, pair type, and .sam entries for each alignment);
    • report .sam tags or mutations of the alignments;
    • print the .sam header as #-comment lines at the start of the file.
  • parse2: read .sam/.bam files with long paired-end or single-end reads and form Hi-C pairs from complex walks

    • identify and rescue chimeric alignments produced by singly-ligated Hi-C molecules with a sequenced ligation junction on one of the sides;
    • annotate chimeric alignments by restriction fragments and report true junctions and hops (One-Read-Based Interactions Annotation, ORBITA);
    • perform intra-molecule deduplication of paired-end data when one side reads through the DNA on the other side of the read;
    • report the index of the pair in the complex walk;
    • make combinatorial expansion of pairs produced from the same walk.
  • sort: sort pairs files (the lexicographic order for chromosomes, the numeric order for the positions, the lexicographic order for pair types).

  • merge: merge sorted .pairs files

    • merge-sort the input .pairs files;
    • combine the .pairs headers from all input files;
    • check that each .pairs file was mapped to the same reference genome index (by checking the identity of the @SQ sam header lines).
  • select: select pairs according to specified criteria

    • select pair entries according to the provided condition. A programmable interface allows for arbitrarily complex queries on specific pair types, chromosomes, positions, strands and read IDs (including matches to a wildcard/regexp/list);
    • optionally print the non-matching entries into a separate file.
  • dedup: remove PCR duplicates from a sorted triu-flipped .pairs file

    • remove PCR duplicates by finding pairs of entries with both sides mapped to similar genomic locations (+/- N bp);
    • optionally output the PCR duplicate entries into a separate file;
    • detect optical duplicates from the original Illumina read ids;
    • apply filtering by various properties of pairs (MAPQ; orientation; distance) together with deduplication;
    • output deduplication stats into a text file, in either yaml or tsv format;
    • NOTE: in order to remove all PCR duplicates, the input must contain *all* mapped read pairs from a single experimental replicate.
  • maskasdup: mark all pairs in a pairsam as Hi-C duplicates

    • change the field pair_type to DD;
    • change the pair_type tag (Yt:Z:) for all sam alignments;
    • set the PCR duplicate binary flag for all sam alignments (0x400).
  • split: split a .pairsam file into .pairs and .sam.

  • flip: flip pairs to get an upper-triangular matrix

  • header: manipulate the .pairs/.pairsam header

    • generate a new header for a headerless .pairs file;
    • transfer the header from one .pairs file to another;
    • set column names for the .pairs file;
    • validate that the header corresponds to the information stored in the .pairs file.
  • stats: calculate various statistics of .pairs files

  • restrict: identify the span of the restriction fragment forming a Hi-C junction

  • phase: phase pairs mapped to a diploid genome
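To illustrate how flipping, sorting and deduplication described in the list above fit together, here is a simplified sketch on a handful of made-up pairs. It is not the pairtools implementation: the names `Pair`, `flip`, `sort_key` and `dedup` are hypothetical, `max_mismatch` only loosely mirrors the "+/- N bp" tolerance, the requirement that duplicates share strands is our assumption, pair types are left untouched when flipping, and each pair is compared only against previously kept pairs.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    read_id: str
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    strand1: str
    strand2: str
    pair_type: str


def flip(pair, chrom_rank):
    """Swap sides so that side 1 does not come after side 2 (upper-triangular)."""
    side1 = (chrom_rank[pair.chrom1], pair.pos1)
    side2 = (chrom_rank[pair.chrom2], pair.pos2)
    if side1 <= side2:
        return pair
    return Pair(pair.read_id, pair.chrom2, pair.pos2, pair.chrom1, pair.pos1,
                pair.strand2, pair.strand1, pair.pair_type)


def sort_key(pair):
    """Lexicographic chromosomes, numeric positions, lexicographic pair type."""
    return (pair.chrom1, pair.chrom2, pair.pos1, pair.pos2, pair.pair_type)


def dedup(sorted_pairs, max_mismatch=3):
    """Split sorted, flipped pairs into (kept, duplicates)."""
    kept, dups, window = [], [], []
    for pair in sorted_pairs:
        # Forget kept pairs that can no longer match on chromosomes or pos1.
        window = [
            k for k in window
            if (k.chrom1, k.chrom2) == (pair.chrom1, pair.chrom2)
            and pair.pos1 - k.pos1 <= max_mismatch
        ]
        is_dup = any(
            abs(pair.pos2 - k.pos2) <= max_mismatch
            and (pair.strand1, pair.strand2) == (k.strand1, k.strand2)
            for k in window
        )
        if is_dup:
            dups.append(pair)
        else:
            kept.append(pair)
            window.append(pair)
    return kept, dups


chrom_rank = {"chrI": 0, "chrII": 1}  # toy chromosome order
raw = [
    Pair("r1", "chrII", 5000, "chrI", 100, "-", "+", "UU"),  # needs flipping
    Pair("r2", "chrI", 101, "chrII", 5002, "+", "-", "UU"),  # near-copy of r1
    Pair("r3", "chrI", 100, "chrI", 900, "+", "-", "UU"),
    Pair("r4", "chrI", 100, "chrII", 5000, "-", "-", "UU"),  # other strands
]
flipped = [flip(p, chrom_rank) for p in raw]
ordered = sorted(flipped, key=sort_key)
kept, dups = dedup(ordered)
print([p.read_id for p in kept], [p.read_id for p in dups])
```

In this toy input, r2 falls within the tolerance of the flipped r1 on both sides and shares its strands, so it is reported as a duplicate, while r4 is kept because its strands differ.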

Contributing

Pull requests are welcome.

For development, clone and install in "editable" (i.e. development) mode with the -e option. This way you can also pull changes on the fly.

$ git clone https://github.com/open2c/pairtools.git
$ cd pairtools
$ pip install -e .

Citing pairtools

Open2C*, Nezar Abdennur, Geoffrey Fudenberg, Ilya M. Flyamer, Aleksandra A. Galitsyna*, Anton Goloborodko*, Maxim Imakaev, Sergey V. Venev. "Pairtools: from sequencing data to chromosome contacts". bioRxiv, February 13, 2023. doi: https://doi.org/10.1101/2023.02.13.528389

License

MIT
