
A Hi-C tool for cutting sequences using specified enzymes


CUTSITE SCRIPT README

Overview

Parasplit is a Python script designed to process paired-end FASTQ files by fragmenting DNA sequences at specified restriction enzyme sites. It handles large datasets efficiently by using pigz for multi-threaded decompression and compression.

Features

  • Find and Utilize Restriction Enzyme Sites: Automatically identifies ligation sites from the provided enzyme names and generates regex patterns to locate these sites in the reads (a minimal sketch of this step follows the list).

  • Fragmentation: Splits sequences at restriction enzyme sites, keeping only fragments at least as long as the specified seed size.

  • Multi-threading: Efficiently processes large datasets by utilizing multiple threads for decompression and compression.

  • Custom Modes: Supports different pairing modes for sequence fragments.
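
To illustrate the first two features, here is a minimal sketch (not Parasplit's actual code) of how enzyme names can be looked up with Biopython, turned into a regex, and used to split a read into fragments filtered by seed size. For simplicity it matches the enzymes' recognition sites; the real tool derives ligation sites from the enzymes, which can differ from the recognition sites.

import re

from Bio import Restriction  # Biopython

def build_site_pattern(enzyme_names):
    """Compile a regex matching the recognition site of each enzyme."""
    sites = []
    for name in enzyme_names:
        enzyme = getattr(Restriction, name)  # e.g. Restriction.DpnII
        sites.append(enzyme.site)            # e.g. "GATC"
    return re.compile("|".join(sites))

def fragment(sequence, pattern, seed_size=20):
    """Split a read at every site occurrence, keeping fragments >= seed_size."""
    cuts = [m.start() for m in pattern.finditer(sequence)]
    bounds = [0] + cuts + [len(sequence)]
    pieces = [sequence[a:b] for a, b in zip(bounds, bounds[1:])]
    return [p for p in pieces if len(p) >= seed_size]

print(fragment("A" * 30 + "GATC" + "C" * 30, build_site_pattern(["DpnII"])))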

Installation

Ensure you have Python 3 installed along with the required dependencies:

sudo apt-get install pigz
pip install parasplit

Usage

The script can be executed from the command line with various arguments to customize its behavior.

Command-Line Arguments

  • --source_forward (str): Input file path for forward reads. Default is ../data/R1.fq.gz.

  • --source_reverse (str): Input file path for reverse reads. Default is ../data/R2.fq.gz.

  • --output_forward (str): Output file path for processed forward reads. Default is ../data/output_forward.fq.gz.

  • --output_reverse (str): Output file path for processed reverse reads. Default is ../data/output_reverse.fq.gz.

  • --list_enzyme (str): Comma-separated list of restriction enzymes. Default is "No restriction enzyme found."

  • --mode (str): Mode of pairing fragments. Options are all or fr. Default is fr.

  • --seed_size (int): Minimum length of fragments to keep. Default is 20.

  • --num_threads (int): Number of threads to use for processing. Default is 8.

  • --borderless: Do not conserve ligation sites in the output fragments.

Example Command

parasplit --source_forward="../data/R1.fq.gz" --source_reverse="../data/R2.fq.gz" --output_forward="../data/output_forward.fq.gz" --output_reverse="../data/output_reverse.fq.gz" --list_enzyme=EcoRI,HinfI --mode=all --seed_size=20 --num_threads=8

Main Script

  • Pretreatment: Retrieves restriction sites from the Biopython database and allocates resources to the different processes.

  • Read: Decompresses and reads both FASTQ files simultaneously and sends the reads to a multiprocessing queue.

  • Frag: Retrieves reads from the queue, splits the sequences into fragments at restriction enzyme sites, builds fragment pairs, and sends them to a multiprocessing output queue.

  • WriteAndControl: Streams data from the output queue to the output files and compresses it in parallel (a minimal sketch of this queue pipeline follows the list).
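
The following is a minimal sketch of this three-stage queue pipeline, assuming simplified stage functions and in-memory data; the real modules (Read.py, Frag.py, WriteAndControl.py) stream from and to pigz and implement the actual fragmentation and pairing logic.

import multiprocessing as mp

SENTINEL = None  # marks the end of the stream

def read_stage(reads, read_queue):
    """Read stage: push reads onto the queue (Parasplit streams them from pigz)."""
    for read in reads:
        read_queue.put(read)
    read_queue.put(SENTINEL)

def frag_stage(read_queue, pair_queue):
    """Frag stage: split each read and forward the result (placeholder logic)."""
    while (read := read_queue.get()) is not SENTINEL:
        pair_queue.put(read.split("GATC"))
    pair_queue.put(SENTINEL)

def write_stage(pair_queue):
    """Write stage: drain the output queue (Parasplit streams to pigz instead)."""
    while (item := pair_queue.get()) is not SENTINEL:
        print(item)

if __name__ == "__main__":
    read_queue, pair_queue = mp.Queue(), mp.Queue()
    stages = [
        mp.Process(target=read_stage, args=(["ACGATCACGT", "TTGATCGGAA"], read_queue)),
        mp.Process(target=frag_stage, args=(read_queue, pair_queue)),
        mp.Process(target=write_stage, args=(pair_queue,)),
    ]
    for process in stages:
        process.start()
    for process in stages:
        process.join()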

Project architecture

Architecture diagram - Licence: CC BY-NC 4.0

Implementation Details

  • The script uses pigz for parallel decompression and compression to handle large datasets efficiently (see the sketch after this list).
  • Signal handlers are implemented to ensure graceful termination of processes.
  • The main processing function reads input files, processes sequences to identify and fragment them at restriction sites, and writes the results to output files.
  • Multi-threading is utilized for various stages of processing, including decompression, fragmentation, and compression.
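
As a rough sketch of the first two points, assuming standard pigz options (-d, -c, -p) rather than Parasplit's exact invocation, streaming through pigz with subprocess and registering a termination handler could look like this:

import signal
import subprocess
import sys

def open_pigz_reader(path, threads=8):
    """Decompress `path` with pigz and expose its stdout as a text stream."""
    return subprocess.Popen(
        ["pigz", "-d", "-c", "-p", str(threads), path],
        stdout=subprocess.PIPE,
        text=True,
    )

def open_pigz_writer(path, threads=8):
    """Compress everything written to the returned process's stdin into `path`."""
    return subprocess.Popen(
        ["pigz", "-c", "-p", str(threads)],
        stdin=subprocess.PIPE,
        stdout=open(path, "wb"),
    )

def terminate_gracefully(signum, frame):
    """Example handler: close streams and child processes here before exiting."""
    sys.exit(0)

signal.signal(signal.SIGINT, terminate_gracefully)
signal.signal(signal.SIGTERM, terminate_gracefully)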

Dependencies

  • Python 3
  • pigz

Testing

Documentation of the tests/ Directory

File test_main.py

  • Purpose: This file contains unit tests that verify the correct functioning of the tool. The reference files were generated with hicstuff cutsite (version 3.2.3) using a seed size of zero and the DpnII enzyme.

  • Examples of Tests:

    • test_process_file: Verifies that the cut function correctly processes an input file and generates the expected output file (sketched below).
    • Additional tests specific to the different functionalities (modes) of the program.
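
A minimal sketch of such an output-comparison test is shown below. The import path and the exact signature of cut are assumptions; only its role (process the bundled inputs and compare against the hicstuff-generated references) comes from the description above.

import gzip

from parasplit.main import cut  # assumed import path

def fastq_lines(path):
    """Decompress a gzipped FASTQ file and return its lines."""
    with gzip.open(path, "rt") as handle:
        return handle.read().splitlines()

def test_process_file(tmp_path):
    out_r1 = tmp_path / "out_R1.fq.gz"
    out_r2 = tmp_path / "out_R2.fq.gz"
    # Hypothetical call: seed size 0 and DpnII, matching how the reference
    # files were generated with hicstuff cutsite.
    cut("tests/input_data/R1.fq.gz", "tests/input_data/R2.fq.gz",
        str(out_r1), str(out_r2), list_enzyme="DpnII", seed_size=0)
    assert fastq_lines(out_r1) == fastq_lines("tests/output_data/output_ref_R1.fq.gz")
    assert fastq_lines(out_r2) == fastq_lines("tests/output_data/output_ref_R2.fq.gz")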

Directory input_data/

  • Purpose: Contains specific input data used to test various configurations of the program.
  • Examples:
    • R1.fq.gz, R2.fq.gz: Compressed FASTQ files containing DNA sequences for testing fragmentation.

Directory output_data/

  • Purpose: Contains the expected results of the tests.
  • Examples:
    • output_ref_R1.fq.gz, output_ref_R2.fq.gz: Compressed FASTQ files representing the expected result after processing by the program.

Running Tests

To run the tests, use the following command:

pytest tests/

This command will execute all tests defined in the tests/ directory and ensure that the program functions correctly.

The project tree structure:

		├── myproject/
		│   ├── __init__.py
		│   ├── main.py
		│   ├── Frag.py
		│   ├── Read.py
		│   ├── Pretreatment.py
		│   └── WriteAndControl.py
		├── pyproject.toml
		├── requirements-dev.txt
		├── docs/
		│   ├── requirements.txt
		├── test/
		│   ├── __init__.py
		│   ├── test_main.py	
		│   ├── input_data/
		│   │   ├── R1.fq.gz
		│   │   └── R2.fq.gz
		│   └── output_data/
		│       ├── output_ref_R1.fq.gz
		│       ├── output_ref_R2.fq.gz
		│       ├── output_ref_all_R1.fq.gz
		│       └── output_ref_all_R2.fq.gz
		└── README.md

Contact

For questions or issues, please contact samir.bertache.djenadi@gmail.com.


This README provides an overview of the Cutsite Script's functionality, usage instructions, and implementation details. For more detailed information, refer to the script's source code and docstrings.
