Skip to main content

Preprocess micros-C data for increasing the number of contact identified

Project description

Microsplit

Microsplit is a command-line tool designed for processing Micro-C data by identifying and managing chimeric reads in BAM files. It follows the logic and structure of the Parasplit tool but is tailored for Micro-C data. The tool reads alignment files (SAM, BAM, or CRAM) using pysam and identifies events of soft-clipping or hard-clipping. The identified clipping points are treated as restriction sites to generate new pairs of sequences.

Features

  • Parallel Processing: Microsplit utilizes parallel processing to enhance performance and efficiency.
  • Soft-Clipping and Hard-Clipping Detection: It detects soft-clipping ('S') and hard-clipping ('H') in CIGAR strings to identify chimeric reads.
  • Fragment Generation: Generates new fragments by considering the identified clipping points as restriction sites.
  • Error Margin Handling: Adds a fixed number of base pairs to new fragments to account for potential over-mapping by Bowtie2, ensuring more accurate downstream analysis.
  • Output Paired Reads: Outputs both end-to-end aligned pairs and newly generated fragment pairs.

Installation

Microsplit is available on PyPI and can be installed using pip:

pip install microsplit

Usage

Before using Microsplit, you need to perform an initial alignment of reads using Bowtie2 with the --local-very-sensitive mode and the -xeq option to obtain explicit CIGAR strings. Below is an example of how to use Microsplit from the command line:

microsplit --bam_for_file path/to/forward.bam \
           --bam_rev_file path/to/reverse.bam \
           --output_forward path/to/output_forward.fastq.gz \
           --output_reverse path/to/output_reverse.fastq.gz \
           --num_threads 8 \
           --seed_size 20 \
           --length_added 10

Command-Line Arguments

  • --bam_for_file: Path to the forward BAM file.
  • --bam_rev_file: Path to the reverse BAM file.
  • --output_forward: Path to the output forward FastQ file.
  • --output_reverse: Path to the output reverse FastQ file.
  • --num_threads: Total number of threads for parallel processing.
  • --seed_size: Minimum size of a fragment to be generated.
  • --length_added: Number of base pairs added to the new fragment after soft clipping to account for potential over-mapping by Bowtie2.

Methodology

Microsplit processes Micro-C data by:

  1. Reading BAM Files: Reads the forward and reverse BAM files simultaneously.
  2. Identifying Clipping Events: Identifies soft-clipping and hard-clipping events from the CIGAR strings in the BAM files.
  3. Generating Fragments: Uses the clipping points as restriction sites to generate new fragments, adding a fixed number of base pairs (length_added) to account for potential over-mapping by Bowtie2.
  4. Outputting Paired Reads: Outputs both end-to-end aligned pairs and the newly generated fragment pairs.

The length added to new fragments helps to handle potential misalignments due to Bowtie2's tendency to over-map reads and not soft-clip enough, ensuring more accurate results.

License

Microsplit is released under the AGPLv3 license. The code is freely available on Gitbio.

Contributing

We welcome contributions! If you'd like to contribute, please fork the repository and submit a pull request. For major changes, please open an issue first to discuss what you would like to change.

Contact

For any questions or issues, please contact samir.bertache.djenadi@gmail.com

Thank you for using Microsplit!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

microsplit-1.1.0.tar.gz (34.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

microsplit-1.1.0-py3-none-any.whl (30.8 kB view details)

Uploaded Python 3

File details

Details for the file microsplit-1.1.0.tar.gz.

File metadata

  • Download URL: microsplit-1.1.0.tar.gz
  • Upload date:
  • Size: 34.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for microsplit-1.1.0.tar.gz
Algorithm Hash digest
SHA256 828943c0bdd7d38ed19375fdba87a00435c1025fb65c4adf0126e1eba439822f
MD5 4ada9581ad8d49ace23c0df68b3edb82
BLAKE2b-256 bbf44cf9677b107461eb9cffbd6e66d4506c5c70a8071bfd8855bc2434305b21

See more details on using hashes here.

File details

Details for the file microsplit-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: microsplit-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 30.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for microsplit-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b81626babbe53288882c8240719b0819852dad27db665f9a4f4306b2418d2a3f
MD5 7eea1f8ca7c27068559e0b5939a2db2d
BLAKE2b-256 537c4ed663d352fb7018f8842da6ce5fcd1d78a97343c55145d9e2dd3c08f0aa

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page