A Hi-C tool for cutting sequences using specified enzymes
CUTSITE SCRIPT README
Overview
Parasplit is a Python script designed to process paired-end FASTQ files by fragmenting DNA sequences at specified restriction enzyme sites. It efficiently handles large datasets by leveraging multi-threading for decompression and compression using pigz.
Features
- Find and Utilize Restriction Enzyme Sites: Automatically identifies ligation sites from provided enzyme names and generates regex patterns to locate these sites in sequences.
- Fragmentation: Splits sequences at restriction enzyme sites, creating smaller fragments based on specified seed size.
- Multi-threading: Efficiently processes large datasets by utilizing multiple threads for decompression and compression.
- Custom Modes: Supports different pairing modes for sequence fragments.
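The site matching and fragmentation described above can be illustrated with a minimal sketch. It is not parasplit's own code: the function names (`site_to_regex`, `build_pattern`, `fragment`) are hypothetical, and the ligation sites are passed in as plain strings rather than derived from enzyme names. The example site `GATCGATC` is the junction one would expect for DpnII after fill-in and blunt ligation, which is an assumption of this sketch. The sketch also discards the matched site itself, which may differ from how parasplit treats those bases.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def site_to_regex(site):
    """Translate one site (possibly with ambiguity codes) into a regex string."""
    return "".join(IUPAC[base] for base in site.upper())


def build_pattern(ligation_sites):
    """Compile a single pattern that matches any of the given sites."""
    return re.compile("|".join(site_to_regex(s) for s in ligation_sites))


def fragment(sequence, pattern, seed_size):
    """Split a sequence at every match and keep fragments of at least seed_size bases."""
    return [piece for piece in pattern.split(sequence) if len(piece) >= seed_size]


# Hypothetical read containing two assumed DpnII ligation junctions.
pattern = build_pattern(["GATCGATC"])
read = "AAAAACCCCC" + "GATCGATC" + "TTTTTGGGGG" + "GATCGATC" + "ACG"
fragments = fragment(read, pattern, seed_size=5)
```

Here the trailing `ACG` piece is shorter than the seed size and is dropped, leaving two fragments.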
Installation
Ensure you have Python 3 installed along with the required dependencies:
sudo apt-get install pigz
pip install parasplit
Usage
The script can be executed from the command line with various arguments to customize its behavior.
Command-Line Arguments
- --source_forward (str): Input file path for forward reads. Default is ../data/R1.fq.gz.
- --source_reverse (str): Input file path for reverse reads. Default is ../data/R2.fq.gz.
- --output_forward (str): Output file path for processed forward reads. Default is ../data/output_forward.fq.gz.
- --output_reverse (str): Output file path for processed reverse reads. Default is ../data/output_reverse.fq.gz.
- --list_enzyme (str): Comma-separated list of restriction enzymes. Default is "No restriction enzyme found."
- --mode (str): Mode of pairing fragments. Options are all or fr. Default is fr.
- --seed_size (int): Minimum length of fragments to keep. Default is 20.
- --num_threads (int): Number of threads to use for processing. Default is 8.
- --borderless: Do not conserve ligation sites.
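For readers who want to see how these options fit together, the following is a sketch of an `argparse` parser that mirrors the arguments and defaults listed above. It is an illustration only, not parasplit's actual parser: the `build_parser` helper is hypothetical, and treating `--borderless` as a boolean switch is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="parasplit")
    parser.add_argument("--source_forward", type=str, default="../data/R1.fq.gz",
                        help="Input file path for forward reads.")
    parser.add_argument("--source_reverse", type=str, default="../data/R2.fq.gz",
                        help="Input file path for reverse reads.")
    parser.add_argument("--output_forward", type=str,
                        default="../data/output_forward.fq.gz",
                        help="Output file path for processed forward reads.")
    parser.add_argument("--output_reverse", type=str,
                        default="../data/output_reverse.fq.gz",
                        help="Output file path for processed reverse reads.")
    parser.add_argument("--list_enzyme", type=str,
                        default="No restriction enzyme found.",
                        help="Comma-separated list of restriction enzymes.")
    parser.add_argument("--mode", type=str, choices=["all", "fr"], default="fr",
                        help="Mode of pairing fragments.")
    parser.add_argument("--seed_size", type=int, default=20,
                        help="Minimum length of fragments to keep.")
    parser.add_argument("--num_threads", type=int, default=8,
                        help="Number of threads to use for processing.")
    # Assumption: --borderless takes no value and acts as an on/off switch.
    parser.add_argument("--borderless", action="store_true",
                        help="Do not conserve ligation sites.")
    return parser


# Parse a sample argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--list_enzyme=EcoRI,HinfI", "--mode=all"])
enzymes = args.list_enzyme.split(",")
```

Options that are not given on the command line keep the defaults documented above.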
Example Command
parasplit --source_forward="../data/R1.fq.gz" --source_reverse="../data/R2.fq.gz" --output_forward="../data/output_forward.fq.gz" --output_reverse="../data/output_reverse.fq.gz" --list_enzyme=EcoRI,HinfI --mode=all --seed_size=20 --num_threads=8
Main Script
- Pretreatment: Retrieves restriction sites from the Biopython database and allocates resources to the different processes.
- Read: Decompresses and reads both FASTQ files simultaneously, then sends the reads to a multiprocessing queue.
- Frag: Retrieves sequences from the queue, splits them into fragments at restriction enzyme sites, creates pairs, and sends them to a multiprocessing queue.
- WriteAndControl: Streams data from the output queue to the output files, compressing in parallel.
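The Read, Frag, and WriteAndControl stages form a producer/consumer chain connected by queues. The sketch below shows that pattern in a simplified form. It uses threads and `queue.Queue` as a stand-in for the multiprocessing queues of the real tool, runs a single worker per stage, and collects results in a list instead of writing compressed files. All function names and the `SENTINEL` marker are hypothetical.

```python
import queue
import threading

# Marker placed on a queue to tell the next stage that no more items will come.
SENTINEL = None


def reader(read_pairs, in_queue):
    """Read stage: push each (forward, reverse) pair, then signal the end."""
    for pair in read_pairs:
        in_queue.put(pair)
    in_queue.put(SENTINEL)


def fragmenter(in_queue, out_queue, split):
    """Frag stage: split both reads of each pair and forward the result."""
    while True:
        pair = in_queue.get()
        if pair is SENTINEL:
            out_queue.put(SENTINEL)  # propagate the end-of-stream signal
            break
        forward, reverse = pair
        out_queue.put((split(forward), split(reverse)))


def writer(out_queue, results):
    """Write stage: drain the output queue until the end-of-stream signal."""
    while True:
        item = out_queue.get()
        if item is SENTINEL:
            break
        results.append(item)


def run_pipeline(read_pairs, split):
    """Run the three stages concurrently and return what the writer collected."""
    in_queue = queue.Queue(maxsize=100)   # bounded queues limit memory use
    out_queue = queue.Queue(maxsize=100)
    results = []
    workers = [
        threading.Thread(target=reader, args=(read_pairs, in_queue)),
        threading.Thread(target=fragmenter, args=(in_queue, out_queue, split)),
        threading.Thread(target=writer, args=(out_queue, results)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


# Toy input: one read pair, split on a literal GATC site.
results = run_pipeline([("AAGATCTT", "CCGATCGG")], lambda seq: seq.split("GATC"))
```

With several fragmenting workers, one sentinel per worker would be needed and output order would no longer be guaranteed; the single-worker version keeps the sketch short.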
Project architecture
Architecture diagram - License: CC BY-NC 4.0
Implementation Details
- The script uses pigz for parallel decompression and compression to handle large datasets efficiently.
- Signal handlers are implemented to ensure graceful termination of processes.
- The main processing function reads input files, processes sequences to identify and fragment them at restriction sites, and writes the results to output files.
- Multi-threading is utilized for various stages of processing, including decompression, fragmentation, and compression.
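As an illustration of the pigz-based decompression mentioned above, the sketch below streams lines from a gzip-compressed FASTQ file through a `pigz` subprocess. The helper name `read_fastq_lines` is hypothetical and this is not the tool's own reader. The fallback to Python's `gzip` module when `pigz` is not on the PATH is an addition of this sketch so that it runs anywhere.

```python
import gzip
import shutil
import subprocess


def read_fastq_lines(path, threads=8):
    """Yield the lines of a gzip-compressed FASTQ file without trailing newlines."""
    if shutil.which("pigz") is None:
        # Fallback (assumption of this sketch): use the standard library.
        with gzip.open(path, "rt") as handle:
            for line in handle:
                yield line.rstrip("\n")
        return

    # pigz flags: -d decompress, -c write to stdout, -p number of processes.
    command = ["pigz", "-dc", "-p", str(threads), path]
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as process:
        for line in process.stdout:
            yield line.rstrip("\n")
    # Leaving the with-block waits for pigz to exit, so returncode is set here.
    if process.returncode != 0:
        raise RuntimeError(f"pigz exited with status {process.returncode}")
```

Streaming line by line keeps memory use flat regardless of file size, which matters for large paired-end datasets.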
Dependencies
- Python 3
- pigz
Testing
Documentation of the tests/ directory
File test_main.py
- Purpose: This file contains unit tests to verify the correct functioning of the tool. The reference files were generated by hicstuff (version 3.2.3) cutsite for a zero seed size and the DpnII enzyme.
- Examples of Tests:
  - test_process_file: Verifies that the cut function correctly processes an input file and generates the expected output file.
  - Additional tests specific to the different functionalities (modes) of the program.
Directory input_data/
- Purpose: Contains specific input data used to test various configurations of your program.
- Examples:
  - R1.fq.gz, R2.fq.gz: Compressed FASTQ files containing DNA sequences for testing fragmentation.
Directory output_data/
- Purpose: Contains the expected results of the tests.
- Examples:
  - output_ref_R1.fq.gz, output_ref_R2.fq.gz: Compressed FASTQ files representing the expected result after processing by your program.
Running Tests
To run the tests, use the following command:
pytest tests/
This command executes all tests defined in the tests/ directory to check that the program functions correctly.
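When comparing a produced file against a reference such as output_ref_R1.fq.gz, it is usually safer to compare decompressed content than raw bytes, because two gzip files holding the same records can differ at the byte level. The helpers below are a hypothetical sketch of such a check and are not taken from test_main.py.

```python
import gzip


def read_gz_text(path):
    """Return the full decompressed text of a gzip file."""
    with gzip.open(path, "rt") as handle:
        return handle.read()


def same_fastq_content(produced_path, reference_path):
    """Return True when both gzip files decompress to identical text."""
    return read_gz_text(produced_path) == read_gz_text(reference_path)
```

Inside a pytest test this could be used as `assert same_fastq_content(produced, reference)`, where both paths are supplied by the test.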
The tree structure of the project:
├── myproject/
│   ├── __init__.py
│   ├── main.py
│   ├── Frag.py
│   ├── Read.py
│   ├── Pretreatment.py
│   └── WriteAndControl.py
├── pyproject.toml
├── requirements-dev.txt
├── docs/
│   └── requirements.txt
├── test/
│   ├── __init__.py
│   ├── test_main.py
│   ├── input_data/
│   │   ├── R1.fq.gz
│   │   └── R2.fq.gz
│   └── output_data/
│       ├── output_ref_R1.fq.gz
│       ├── output_ref_R2.fq.gz
│       ├── output_ref_all_R1.fq.gz
│       └── output_ref_all_R2.fq.gz
└── README.md
Contact
For questions or issues, please contact samir.bertache.djenadi@gmail.com.
This README provides an overview of the Cutsite Script's functionality, usage instructions, and implementation details. For more detailed information, refer to the script's source code and docstrings.