
A Hi-C tool for cutting sequences at the sites of specified restriction enzymes


CUTSITE SCRIPT README

Overview

Parasplit is a Python script designed to process paired-end FASTQ files by fragmenting DNA sequences at specified restriction enzyme sites. It efficiently handles large datasets by leveraging multi-threading for decompression and compression using pigz.

Features

  • Find and Utilize Restriction Enzyme Sites: Automatically identifies ligation sites from provided enzyme names and generates regex patterns to locate these sites in sequences.

  • Fragmentation: Splits sequences at restriction enzyme sites and keeps only fragments at least as long as the specified seed size.

  • Multi-threading: Efficiently processes large datasets by utilizing multiple threads for decompression and compression.

  • Custom Modes: Supports different pairing modes for sequence fragments.
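
The site-detection feature can be illustrated with a minimal sketch. It assumes 5'-overhang enzymes whose ends are filled in before blunt ligation, and it uses a small hard-coded table (SITES) in place of the Biopython lookup that Parasplit performs. The function name ligation_regex is hypothetical and is not part of the package.

```python
import re
from itertools import product

# Recognition site and top-strand cut offset for a few 5'-overhang enzymes.
# Hard-coded for illustration only; Parasplit reads these from Biopython.
SITES = {
    "DpnII": ("GATC", 0),
    "HinfI": ("GANTC", 1),
    "EcoRI": ("GAATTC", 1),
}


def ligation_regex(enzymes):
    """Build a regex matching every ligation junction the enzymes can form."""
    lefts, rights = [], []
    for name in enzymes:
        site, cut = SITES[name]
        # After fill-in, the left end keeps the site minus its last `cut` bases,
        # and the right end keeps the site minus its first `cut` bases.
        lefts.append(site[: len(site) - cut])
        rights.append(site[cut:])
    # Any left end can be ligated to any right end.
    junctions = sorted({left + right for left, right in product(lefts, rights)})
    # The ambiguity code N matches any single base.
    return re.compile("|".join(j.replace("N", ".") for j in junctions))
```

For DpnII alone this yields the pattern GATCGATC; with several enzymes, the mixed junctions are included as alternatives.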

Installation

Ensure Python 3 is installed, then install pigz and Parasplit:

sudo apt-get install pigz
pip install parasplit

Usage

The script can be executed from the command line with various arguments to customize its behavior.

Command-Line Arguments

  • --source_forward (str): Input file path for forward reads. Default is ../data/R1.fq.gz.

  • --source_reverse (str): Input file path for reverse reads. Default is ../data/R2.fq.gz.

  • --output_forward (str): Output file path for processed forward reads. Default is ../data/output_forward.fq.gz.

  • --output_reverse (str): Output file path for processed reverse reads. Default is ../data/output_reverse.fq.gz.

  • --list_enzyme (str): Comma-separated list of restriction enzymes. Default is "No restriction enzyme found."

  • --mode (str): Mode of pairing fragments. Options are all or fr. Default is fr.

  • --seed_size (int): Minimum length of fragments to keep. Default is 20.

  • --num_threads (int): Number of threads to use for processing. Default is 8.

  • --borderless (flag): When set, ligation sites are not conserved in the output fragments.
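
The two values of --mode can be pictured with a short sketch. The interpretation below is an assumption borrowed from the similarly named modes of hicstuff cutsite: fr pairs each forward fragment with each reverse fragment, while all pairs every fragment with every other one. The helper pair_fragments is hypothetical and is not Parasplit's own function.

```python
from itertools import combinations, product


def pair_fragments(forward_frags, reverse_frags, mode="fr"):
    """Return fragment pairs under an assumed reading of the two modes."""
    if mode == "fr":
        # Assumed: only forward-versus-reverse combinations.
        return list(product(forward_frags, reverse_frags))
    if mode == "all":
        # Assumed: every unordered pair among all fragments of both reads.
        return list(combinations(forward_frags + reverse_frags, 2))
    raise ValueError(f"unknown mode: {mode!r}")
```

With two forward fragments and one reverse fragment, fr gives two pairs and all gives three.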

Example Command

parasplit --source_forward="../data/R1.fq.gz" --source_reverse="../data/R2.fq.gz" --output_forward="../data/output_forward.fq.gz" --output_reverse="../data/output_reverse.fq.gz" --list_enzyme=EcoRI,HinfI --mode=all --seed_size=20 --num_threads=8
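
When Parasplit is called from a Python workflow, the same command can be assembled programmatically. The sketch below only uses the flags documented above; the helper name build_parasplit_command is hypothetical, and the call that actually runs the tool is left as a comment because it requires Parasplit and pigz to be installed.

```python
import subprocess


def build_parasplit_command(r1, r2, out1, out2, enzymes,
                            mode="fr", seed_size=20, num_threads=8):
    """Assemble the argument list for the parasplit command-line tool."""
    return [
        "parasplit",
        f"--source_forward={r1}",
        f"--source_reverse={r2}",
        f"--output_forward={out1}",
        f"--output_reverse={out2}",
        f"--list_enzyme={','.join(enzymes)}",
        f"--mode={mode}",
        f"--seed_size={seed_size}",
        f"--num_threads={num_threads}",
    ]


cmd = build_parasplit_command(
    "../data/R1.fq.gz", "../data/R2.fq.gz",
    "../data/output_forward.fq.gz", "../data/output_reverse.fq.gz",
    ["EcoRI", "HinfI"], mode="all",
)
# To launch the tool (requires parasplit on the PATH):
# subprocess.run(cmd, check=True)
```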

Main Script

  • Pretreatment: Retrieves restriction sites from the Biopython database and allocates resources to the different processes.

  • Read: Decompresses and reads both FASTQ files simultaneously, then sends the reads to a multiprocessing queue.

  • Frag: Retrieves sequences from the queue, splits them into fragments at restriction enzyme sites, creates pairs, and sends them to a multiprocessing queue.

  • WriteAndControl: Streams data from the output queue to the output files and compresses it in parallel.
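
The splitting performed in the Frag step can be sketched as follows. This is an illustration under stated assumptions, not Parasplit's code: the cut is placed in the middle of each ligation-site match, the quality string is cut at the same positions, and fragments shorter than seed_size are dropped. The function name split_read is hypothetical.

```python
import re


def split_read(seq, qual, pattern, seed_size=20):
    """Cut a read at each ligation-site match and keep long enough fragments."""
    cuts = [0]
    for match in pattern.finditer(seq):
        # Assumed convention: cut at the midpoint of the matched site.
        cuts.append((match.start() + match.end()) // 2)
    cuts.append(len(seq))
    fragments = []
    for start, end in zip(cuts, cuts[1:]):
        if end - start >= seed_size:
            # Sequence and quality stay aligned base for base.
            fragments.append((seq[start:end], qual[start:end]))
    return fragments
```

A read without any match is returned whole, provided it is at least seed_size bases long.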

Project architecture

Architecture diagram - License: CC BY-NC 4.0

Implementation Details

  • The script uses pigz for parallel decompression and compression to handle large datasets efficiently.
  • Signal handlers are implemented to ensure graceful termination of processes.
  • The main processing function reads input files, processes sequences to identify and fragment them at restriction sites, and writes the results to output files.
  • Multi-threading is utilized for various stages of processing, including decompression, fragmentation, and compression.
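
Parallel decompression with pigz can be sketched as a generator that streams lines from a pigz subprocess. This is a minimal sketch rather than Parasplit's implementation: the function name read_gz_lines is hypothetical, and the fallback to Python's gzip module when pigz is missing is an addition made here so the example runs anywhere.

```python
import gzip
import shutil
import subprocess


def read_gz_lines(path, num_threads=8):
    """Yield text lines from a gzip file, using pigz when it is available."""
    if shutil.which("pigz"):
        # -d: decompress, -c: write to stdout, -p: number of threads.
        command = ["pigz", "-dc", "-p", str(num_threads), path]
        with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
            yield from proc.stdout
    else:
        # Fallback (not part of the original design): single-threaded gzip.
        with gzip.open(path, "rt") as handle:
            yield from handle
```

Streaming from the subprocess keeps memory use flat, since lines are consumed as they are decompressed.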

Dependencies

  • Python 3
  • pigz

Testing

Documentation of the tests/ Directory

File test_main.py

  • Purpose: This file contains unit tests that verify the correct functioning of the tool. The reference files were generated by hicstuff (version 3.2.3) cutsite with a seed size of zero and the DpnII enzyme.

  • Examples of Tests:

    • test_process_file: Verifies that the cut function correctly processes an input file and generates the expected output file.
    • Additional tests specific to the different functionalities (modes) of the program.

Directory input_data/

  • Purpose: Contains the input data used to test various configurations of the program.
  • Examples:
    • R1.fq.gz, R2.fq.gz: Compressed FASTQ files containing DNA sequences for testing fragmentation.

Directory output_data/

  • Purpose: Contains the expected results of the tests.
  • Examples:
    • output_ref_R1.fq.gz, output_ref_R2.fq.gz: Compressed FASTQ files representing the expected result after processing by the program.

Running Tests

To run the tests, use the following command:

pytest tests/

This command executes all tests defined in the tests/ directory to check that the program functions correctly.
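
A comparison against reference files, as done in such tests, could look like the sketch below. It is not the content of test_main.py: the helper names fastq_records and same_fastq are hypothetical, and sorting the records before comparing is an assumption made here because parallel writers may not preserve read order.

```python
import gzip


def fastq_records(path):
    """Read a gzip-compressed FASTQ file as a list of 4-line records."""
    with gzip.open(path, "rt") as handle:
        lines = [line.rstrip("\n") for line in handle]
    return [tuple(lines[i:i + 4]) for i in range(0, len(lines), 4)]


def same_fastq(path_a, path_b):
    """Return True when both files hold the same records, in any order."""
    return sorted(fastq_records(path_a)) == sorted(fastq_records(path_b))
```

Inside a pytest function, the check would then be a single assertion such as assert same_fastq(produced_path, reference_path), where both paths are placeholders.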

Project tree structure:

		├── myproject/
		│   ├── __init__.py
		│   ├── main.py
		│   ├── Frag.py
		│   ├── Read.py
		│   ├── Pretreatment.py
		│   └── WriteAndControl.py
		├── pyproject.toml
		├── requirements-dev.txt
		├── docs/
		│   └── requirements.txt
		├── test/
		│   ├── __init__.py
		│   ├── test_main.py	
		│   ├── input_data/
		│   │   ├── R1.fq.gz
		│   │   └── R2.fq.gz
		│   └── output_data/
		│       ├── output_ref_R1.fq.gz
		│       ├── output_ref_R2.fq.gz
		│       ├── output_ref_all_R1.fq.gz
		│       └── output_ref_all_R2.fq.gz
		└── README.md

Contact

For questions or issues, please contact samir.bertache.djenadi@gmail.com.


This README provides an overview of the Cutsite Script's functionality, usage instructions, and implementation details. For more detailed information, refer to the script's source code and docstrings.
