A Hi-C tool for cutting sequences using specified enzymes

PARASPLIT

Overview

Parasplit is a Python script designed to process paired-end FASTQ files by fragmenting DNA sequences at specified restriction enzyme sites. It efficiently handles large datasets by leveraging multi-threading for decompression and compression using pigz.
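As an illustration of the pigz-based decompression mentioned above, here is a minimal sketch of streaming a gzipped FASTQ file through an external pigz process. The helper name read_gz_lines is hypothetical and is not taken from Parasplit's source; the fallback to Python's gzip module when pigz is not on the PATH is an addition made only so the sketch runs anywhere.

```python
import gzip
import shutil
import subprocess
from typing import Iterator


def read_gz_lines(path: str, threads: int = 8) -> Iterator[str]:
    """Yield text lines from a gzipped file (hypothetical helper)."""
    if shutil.which("pigz") is None:
        # Assumption: fall back to the standard library when pigz is absent.
        with gzip.open(path, "rt") as handle:
            yield from handle
        return

    # pigz flags: -d decompress, -c write to stdout, -p number of threads.
    proc = subprocess.Popen(
        ["pigz", "-d", "-c", "-p", str(threads), path],
        stdout=subprocess.PIPE,
        text=True,
    )
    assert proc.stdout is not None
    try:
        yield from proc.stdout
    finally:
        proc.stdout.close()
        proc.wait()
```

Reading through a pipe keeps memory use flat, because lines are consumed as pigz produces them rather than after the whole file has been decompressed.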

Features

  • Find and Utilize Restriction Enzyme Sites: Automatically identifies ligation sites from provided enzyme names and generates regex patterns to locate these sites in sequences.

  • Fragmentation: Splits sequences at restriction enzyme sites, creating smaller fragments based on specified seed size.

  • Multi-threading: Efficiently processes large datasets by utilizing multiple threads for decompression and compression.

  • Custom Modes: Supports different pairing modes for sequence fragments.
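The first two features can be sketched as follows. This is a simplified illustration, not Parasplit's actual code: the function names are hypothetical, the recognition sites are hard-coded instead of being looked up in Biopython, and the ligation junction is derived under the assumption of a 5' overhang that is filled in before ligation (so EcoRI, G^AATTC, gives GAATTAATTC).

```python
import re

# IUPAC nucleotide codes translated to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def ligation_pattern(site_with_cut: str) -> str:
    """Build a regex for the ligation junction of a site such as 'G^AATTC'.

    Assumption: the enzyme leaves a 5' overhang that is filled in, so the
    junction is the site up to the bottom-strand cut followed by the site
    from the top-strand cut onwards.
    """
    cut = site_with_cut.index("^")
    site = site_with_cut.replace("^", "")
    left = site[: len(site) - cut]
    right = site[cut:]
    return "".join(IUPAC[base] for base in left + right)


def split_fragments(sequence: str, pattern: str, seed_size: int = 20) -> list[str]:
    """Split a read at every junction and keep fragments of at least seed_size."""
    return [frag for frag in re.split(pattern, sequence) if len(frag) >= seed_size]


ecori = ligation_pattern("G^AATTC")  # "GAATTAATTC"
read = "A" * 25 + "GAATTAATTC" + "C" * 10 + "GAATTAATTC" + "T" * 22
fragments = split_fragments(read, ecori, seed_size=20)
```

In this example the middle fragment is only 10 bases long, so it is dropped and two fragments remain.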

Installation

Ensure you have Python 3 installed along with the required dependencies:

sudo apt-get install pigz
pip install parasplit

Usage

The script can be executed from the command line with various arguments to customize its behavior.

Command-Line Arguments

  • --source_forward (str): Input file path for forward reads. Default is ../data/R1.fq.gz.

  • --source_reverse (str): Input file path for reverse reads. Default is ../data/R2.fq.gz.

  • --output_forward (str): Output file path for processed forward reads. Default is ../data/output_forward.fq.gz.

  • --output_reverse (str): Output file path for processed reverse reads. Default is ../data/output_reverse.fq.gz.

  • --list_enzyme (str): Comma-separated list of restriction enzymes. Default is "No restriction enzyme found."

  • --mode (str): Mode of pairing fragments. Options are all or fr. Default is fr.

  • --seed_size (int): Minimum length of fragments to keep. Default is 20.

  • --num_threads (int): Number of threads to use for processing. Default is 8.

  • --borderless (flag): Do not conserve ligation sites.
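For readers who want to see how these options fit together, the following is a reconstruction of an argparse parser built only from the list above. It is not Parasplit's own parser: the function name build_parser is hypothetical, and treating --borderless as a boolean switch is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(prog="parasplit")
    parser.add_argument("--source_forward", type=str, default="../data/R1.fq.gz")
    parser.add_argument("--source_reverse", type=str, default="../data/R2.fq.gz")
    parser.add_argument("--output_forward", type=str,
                        default="../data/output_forward.fq.gz")
    parser.add_argument("--output_reverse", type=str,
                        default="../data/output_reverse.fq.gz")
    parser.add_argument("--list_enzyme", type=str,
                        default="No restriction enzyme found.")
    parser.add_argument("--mode", type=str, choices=["all", "fr"], default="fr")
    parser.add_argument("--seed_size", type=int, default=20)
    parser.add_argument("--num_threads", type=int, default=8)
    # Assumption: --borderless takes no value and acts as an on/off switch.
    parser.add_argument("--borderless", action="store_true")
    return parser


args = build_parser().parse_args(["--list_enzyme=EcoRI,HinfI", "--mode=all"])
enzymes = args.list_enzyme.split(",")  # comma-separated names become a list
```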

Example Command

parasplit --source_forward="../data/R1.fq.gz" --source_reverse="../data/R2.fq.gz" --output_forward="../data/output_forward.fq.gz" --output_reverse="../data/output_reverse.fq.gz" --list_enzyme=EcoRI,HinfI --mode=all --seed_size=20 --num_threads=8
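When the tool is driven from another Python script, the same command can be assembled as an argument list. The helper below is a hypothetical convenience wrapper, not part of Parasplit; it uses only the flags documented above and does not run anything by itself.

```python
import shlex


def build_parasplit_command(source_forward, source_reverse,
                            output_forward, output_reverse, enzymes,
                            mode="fr", seed_size=20, num_threads=8,
                            borderless=False):
    """Return the parasplit command line as a list of arguments."""
    if mode not in ("all", "fr"):
        raise ValueError("mode must be 'all' or 'fr'")
    cmd = [
        "parasplit",
        f"--source_forward={source_forward}",
        f"--source_reverse={source_reverse}",
        f"--output_forward={output_forward}",
        f"--output_reverse={output_reverse}",
        f"--list_enzyme={','.join(enzymes)}",
        f"--mode={mode}",
        f"--seed_size={seed_size}",
        f"--num_threads={num_threads}",
    ]
    if borderless:
        cmd.append("--borderless")
    return cmd


cmd = build_parasplit_command(
    "../data/R1.fq.gz", "../data/R2.fq.gz",
    "../data/output_forward.fq.gz", "../data/output_reverse.fq.gz",
    ["EcoRI", "HinfI"], mode="all",
)
printable = shlex.join(cmd)  # shell-quoted form, handy for logging
# To execute it: subprocess.run(cmd, check=True)
```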

Main Script

  • Pretreatment: Retrieval of restriction sites from the Biopython database and allocation of resources for the different processes.

  • Read: Decompresses and reads both FASTQ files simultaneously, then sends the reads to a multiprocessing queue.

  • Frag: Retrieves sequences from the queue, splits them into fragments at restriction enzyme sites, builds pairs, and sends them to a second multiprocessing queue.

  • WriteAndControl: Streams data from the output queue to disk, with compression running in parallel.
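The stages above form a producer/consumer chain. The sketch below shows that shape with threads and queue.Queue as a lightweight stand-in for the separate processes and multiprocessing queues; all names are hypothetical, the site is matched as a plain string, and pairing every forward fragment with every reverse fragment is an assumption made for illustration only.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream


def reader(pairs, out_q):
    """Read stage: push (forward, reverse) read pairs into the first queue."""
    for pair in pairs:
        out_q.put(pair)
    out_q.put(SENTINEL)


def fragmenter(in_q, out_q, site, seed_size):
    """Frag stage: split both reads at the site and emit fragment pairs."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            break
        forward, reverse = item
        f_frags = [f for f in forward.split(site) if len(f) >= seed_size]
        r_frags = [r for r in reverse.split(site) if len(r) >= seed_size]
        for f in f_frags:
            for r in r_frags:
                out_q.put((f, r))


def writer(in_q, sink):
    """Write stage: drain the output queue (a list stands in for the file)."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        sink.append(item)


def run_pipeline(pairs, site, seed_size):
    read_q = queue.Queue(maxsize=100)   # bounded queues give back-pressure
    write_q = queue.Queue(maxsize=100)
    sink = []
    stages = [
        threading.Thread(target=reader, args=(pairs, read_q)),
        threading.Thread(target=fragmenter, args=(read_q, write_q, site, seed_size)),
        threading.Thread(target=writer, args=(write_q, sink)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return sink


result = run_pipeline([("AAAAGATCCCCC", "TTTT")], site="GATC", seed_size=4)
```

Bounded queues keep a fast reader from filling memory while the slower stages catch up, which matters for large FASTQ inputs.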

Project architecture

Architecture diagram - License: CC BY-NC 4.0

Dependencies

  • pigz

Project tree structure:

├── myproject/
│   ├── __init__.py
│   ├── main.py
│   ├── Frag.py
│   ├── Read.py
│   ├── Pretreatment.py
│   └── WriteAndControl.py
├── pyproject.toml
├── requirements-dev.txt
├── docs/
│   └── requirements.txt
├── test/
│   ├── __init__.py
│   ├── test_main.py
│   ├── input_data/
│   │   ├── R1.fq.gz
│   │   └── R2.fq.gz
│   └── output_data/
│       ├── output_ref_R1.fq.gz
│       ├── output_ref_R2.fq.gz
│       ├── output_ref_all_R1.fq.gz
│       └── output_ref_all_R2.fq.gz
└── README.md

Contact

For questions or issues, please contact samir.bertache.djenadi@gmail.com.


This README provides an overview of Parasplit's functionality, usage instructions, and implementation details. For more detailed information, refer to the script's source code and docstrings.
