Tool for low coverage region correction

CLCR

License: GPL v3

The CLCR approach

To improve the completeness of draft genome assemblies, we created CLCR (Correction of Low Coverage Regions). The tool detects and corrects short indels in high coverage genome assemblies by searching for errors in the regions that are typically error prone: those with significantly lower read coverage than the rest of the assembly. Frameshifts are detected with the alignment tool Diamond, run against a protein database containing sequences of closely related organisms. The detected frameshifts are then located in the underlying genome and passed through extensive filter heuristics to avoid false corrections. Reading frames disturbed by frameshift mutations that were putatively caused by sequencing errors are healed locally by inserting N's into the assembly.
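The healing step described above can be illustrated with a minimal sketch. It is not CLCR's implementation: the function name `insert_n_padding`, the 0-based coordinates, and the rule for choosing how many N's to insert are assumptions made for illustration only.

```python
from typing import Iterable, Tuple


def insert_n_padding(sequence: str, edits: Iterable[Tuple[int, int]]) -> str:
    """Insert runs of N into `sequence` (hypothetical helper, not CLCR's API).

    Each edit is (position, count): `count` N's are inserted before the
    0-based index `position` of the original sequence.
    """
    healed = sequence
    # Apply edits from right to left so that earlier insertions do not
    # shift the coordinates of the edits that are still pending.
    for position, count in sorted(edits, reverse=True):
        if not 0 <= position <= len(sequence):
            raise ValueError(f"position {position} is outside the sequence")
        if count < 1:
            raise ValueError("count must be at least 1")
        healed = healed[:position] + "N" * count + healed[position:]
    return healed


# A reading frame that lost one base in its second codon:
# intended ATG AAA TTT TAA, observed ATG AAT TTT AA.
broken = "ATGAATTTTAA"
healed = insert_n_padding(broken, [(5, 1)])
print(healed)
```

Inserting a single N restores the length to a multiple of three, so the codons downstream of the insertion are read in their original frame again.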

Installation

CLCR is available as a PyPI package. It requires Python 3.7 or higher and a Diamond installation.

The program can be installed with the following commands:

# Install pip, if necessary
sudo apt update
sudo apt install python3-pip

# Update pip
pip install --upgrade pip

# Install CLCR
pip install clcr

If this fails, the code can be cloned directly from GitHub:

# Install git, if necessary
sudo apt install git-all

# Clone CLCR code from github
git clone https://github.com/Johannes-Zi/CLCR.git

# Move to the cloned directory
cd CLCR

# Install CLCR from the local source
pip install .
# Or manually add the program to PATH when there are problems with pip

Quickstart

Query creation with clcr.query_creation

The first step is the creation of the query sequences for the Diamond blastx runs. The function detects regions with low read coverage in a given pbc (per base coverage) file. The sequences of the detected regions are extracted from the supplied assembly and stored as .fasta query files in the query_files directory of the supplied project. The query_files directory will be overwritten! A log file with run information and an original_low_cov_regions.tsv file, which lists the low coverage regions as originally detected before the merging step, are stored in the storage_files dir.

ATTENTION!: The old query directory and the DIAMOND output directory of the current project dir are overwritten in preparation for a new cluster run!

Three arguments are mandatory:

  • genome assembly
  • the matching per base coverage file
  • project directory for the CLCR run

There are the following options:

  -h, --help            show the help message and exit

required arguments:
  -p PROJECT_DIR, --project_dir PROJECT_DIR
                        Path of the project directory
                        
  -c COV_FILE_PATH, --cov_file_path COV_FILE_PATH
                        Path of the coverage file
                        
  -a ASSEMBLY_FILE, --assembly_file ASSEMBLY_FILE
                        Path of the assembly file

optional arguments:
  --low_cov_start LOW_COV_START
                        Threshold for detecting a low cov region
                        
  --low_cov_end LOW_COV_END
                        Threshold for ending a low cov region
                        
  --min_query_len MIN_QUERY_LEN
                        Minimum query length
                        
  --queries_per_file QUERIES_PER_FILE
                        Queries sequences per query file
                        
  --verbose             Run information is print in the command line
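As a rough illustration of how two separate thresholds can delimit a low coverage region, here is a minimal sketch. It assumes that a region opens when the coverage falls below the start threshold and closes when it rises above the end threshold; the exact semantics of --low_cov_start and --low_cov_end in CLCR may differ. The function name and the (scaffold, position, coverage) record layout are also assumptions, since the pbc file format is not specified here.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, int, int]   # (scaffold, position, coverage), assumed layout
Region = Tuple[str, int, int]   # (scaffold, first position, last position)


def find_low_cov_regions(records: Iterable[Record],
                         low_cov_start: int,
                         low_cov_end: int) -> List[Region]:
    """Collect low coverage regions from position-sorted coverage records."""
    regions: List[Region] = []
    current = None  # [scaffold, start, end] of the region that is still open
    for scaffold, position, coverage in records:
        # Close the open region at a scaffold boundary or once the
        # coverage has recovered above the end threshold.
        if current is not None and (scaffold != current[0] or coverage > low_cov_end):
            regions.append((current[0], current[1], current[2]))
            current = None
        if current is None:
            # Open a new region when the coverage drops below the start threshold.
            if coverage < low_cov_start:
                current = [scaffold, position, position]
        else:
            # Still inside the open region: extend it to this position.
            current[2] = position
    if current is not None:
        regions.append((current[0], current[1], current[2]))
    return regions


coverage = [30, 10, 16, 17, 25, 30]
records = [("s1", pos, cov) for pos, cov in enumerate(coverage, start=1)]
print(find_low_cov_regions(records, low_cov_start=15, low_cov_end=18))
```

Using a higher end threshold than start threshold keeps a region from being split by small fluctuations just above the start threshold.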

Diamond cluster run with clcr.cluster_run

After the query files are created, the next step is to perform the Diamond blastx searches against a protein database with the sequences of closely related organisms. This can be done locally on a single computer or on a computer cluster. Using a cluster is highly recommended, although the CLCR workflow then has to be adapted manually to the cluster. The clcr.cluster_run function creates a Slurm file for the Diamond blastx cluster run of the supplied CLCR project. The jobs are started automatically on the cluster when the --auto_run parameter is activated. The Slurm log files are stored in the slurm_files dir, and a log file for the CLCR run is stored in the storage_files dir.

ATTENTION!: Use this function only if your cluster supports Slurm, and adapt the Slurm file manually to your local setup! (In this case, automatic submission is not recommended!)

There are the following options:

  -h, --help            show this help message and exit

required arguments:
  -p PROJECT_DIR, --project_dir PROJECT_DIR
                        Path of the project directory
                        
  -c PROTEIN_DATABASE, --protein_database PROTEIN_DATABASE
                        Path of the protein database

optional arguments:
  --auto_run [AUTO_RUN]
                        Activate automatic slurm job submission, when parameter is True.
  
  --verbose             Run information is print in the command line
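To give an idea of what a Slurm file for such a run might contain, the following sketch builds a job array script that runs one Diamond blastx search per query file. The layout is an assumption and not the file that clcr.cluster_run writes: the function name, the job name, the output directory name, the frameshift penalty, and the output format are all illustrative choices that have to be adapted to the local cluster.

```python
def build_slurm_script(query_dir: str, database: str, output_dir: str,
                       n_query_files: int, frameshift_penalty: int = 15) -> str:
    """Return the text of a Slurm array job (hypothetical layout)."""
    if n_query_files < 1:
        raise ValueError("at least one query file is required")
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=clcr_diamond",
        # One array task per query file, indexed from 0.
        f"#SBATCH --array=0-{n_query_files - 1}",
        # %A is the array job ID, %a the array task index.
        "#SBATCH --output=slurm_files/%A_%a.out",
        "",
        # Each task picks its own query file from the sorted glob.
        f'files=("{query_dir}"/*.fasta)',
        'query="${files[$SLURM_ARRAY_TASK_ID]}"',
        # --frameshift enables frameshift-aware alignment in diamond blastx.
        f'diamond blastx --db "{database}" --query "$query" '
        f'--out "{output_dir}/$(basename "$query" .fasta).out" '
        f"--frameshift {frameshift_penalty} --outfmt 0",
    ]
    return "\n".join(lines) + "\n"


script = build_slurm_script("query_files", "proteins.dmnd", "diamond_output", 4)
print(script)
```

Writing the script to a file and inspecting it before submitting it with sbatch matches the recommendation above to avoid automatic submission until the file fits the local setup.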

Creation of healed assembly version with clcr.assembly_healing

The last step of the analysis is the creation of an adapted assembly version. The frameshifts detected in the Diamond blastx output are evaluated, extensively filtered, and used to create an adapted assembly version with locally healed reading frames. The healed assembly version is stored in the healed_assembly dir, and a log file for the CLCR run is stored in the storage_files dir.

There are the following options:

  -h, --help            show this help message and exit

required arguments:
  -p PROJECT_DIR, --project_dir PROJECT_DIR
                        Path of the project directory
                        
  -c UNHEALED_ASSEMBLY, --unhealed_assembly UNHEALED_ASSEMBLY
                        Path of the original unhealed assembly file

optional arguments:
  --dynamic_threshold_dist DYNAMIC_THRESHOLD_DIST
                        The max_detect_distance defines the distance from a detected frameshift position to
                        the original low cov. region, where a frameshift is still considered and not
                        excluded in the further analysis.
                        
  --verbose             Run information is print in the command line
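The distance criterion described for --dynamic_threshold_dist can be sketched as a simple filter. This is one plausible reading of the help text, not CLCR's actual filter: the function names and the choice to measure the distance to the nearest region boundary are assumptions.

```python
from typing import Iterable, List


def distance_to_region(position: int, region_start: int, region_end: int) -> int:
    """Distance from a position to a closed interval (0 if it lies inside)."""
    if position < region_start:
        return region_start - position
    if position > region_end:
        return position - region_end
    return 0


def filter_frameshifts(positions: Iterable[int], region_start: int,
                       region_end: int, max_detect_distance: int) -> List[int]:
    """Keep frameshift positions close enough to the low coverage region."""
    return [
        position for position in positions
        if distance_to_region(position, region_start, region_end) <= max_detect_distance
    ]


# Low coverage region 100..150, frameshifts tolerated up to 20 bases away.
kept = filter_frameshifts([90, 120, 160, 260], 100, 150, 20)
print(kept)
```

Frameshifts far from the original low coverage region are less likely to stem from the coverage drop, so discarding them reduces the risk of false corrections.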

Contact

For error reports, questions, comments or suggestions contact us via email
