

This project has been archived by its maintainers; no new releases are expected.

toxolib (v0.1.11)

A Python package for metagenomic taxonomic profiling and abundance matrix generation.

Installation

Using pip

pip install toxolib

Install directly from GitHub

pip install git+https://github.com/dhruvac29/toxolib.git

Using conda

We recommend using conda to install all dependencies. An environment file is included in the package:

# Clone the repository
git clone https://github.com/dhruvac29/toxolib.git
cd toxolib

# Create and activate the conda environment
conda env create -f environment.yml
conda activate taxonomy_env

# Install the package
pip install -e .

Requirements

This package requires the following external tools to be installed and available in your PATH:

  • Kraken2
  • Bracken
  • Krona (for visualization)
  • fastp (for preprocessing)
  • bowtie2 (for host removal)
  • samtools

All these dependencies are included in the conda environment file.
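A quick way to sanity-check that these tools are visible on PATH is sketched below. This is an illustrative helper, not part of the toxolib API, and Krona's command-line entry point is assumed here to be `ktImportTaxonomy`:

```python
import shutil

# External tools toxolib expects on PATH (Krona's CLI entry point is
# assumed to be ktImportTaxonomy)
TOOLS = ["kraken2", "bracken", "ktImportTaxonomy", "fastp", "bowtie2", "samtools"]

missing = [tool for tool in TOOLS if shutil.which(tool) is None]
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All external dependencies found on PATH.")
```

Running this inside the activated `taxonomy_env` environment should report nothing missing.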

Database Setup

Automated Database Setup

Toxolib provides automated database setup for both local and HPC environments.

Local Database Setup

# Set up both Kraken2 and corn genome databases
toxolib db-setup -o /path/to/databases --kraken --corn

# Set up only Kraken2 database
toxolib db-setup -o /path/to/databases --kraken

# Set up only corn genome database
toxolib db-setup -o /path/to/databases --corn

# Force re-download of databases even if they exist
toxolib db-setup -o /path/to/databases --kraken --corn --force

After setup, set the KRAKEN2_DB_DIR environment variable so toxolib can locate the Kraken2 database:

export KRAKEN2_DB_DIR=/path/to/databases/Kraken2_DB

HPC Database Setup

When submitting jobs, toxolib can automatically download and extract the databases on your local machine and then upload them to the HPC:

# Automatically download locally and upload both databases to the HPC
toxolib hpc -r sample1_L001_R1.fastq.gz sample1_L001_R2.fastq.gz sample1_L002_R1.fastq.gz sample1_L002_R2.fastq.gz -o /path/on/hpc/output_dir \
    --setup-kraken-db --setup-corn-db

# Automatically download locally and upload only Kraken2 database
toxolib hpc -r sample1_L001_R1.fastq.gz sample1_L001_R2.fastq.gz sample1_L002_R1.fastq.gz sample1_L002_R2.fastq.gz -o /path/on/hpc/output_dir \
    --setup-kraken-db

# Automatically download locally and upload only corn genome database
toxolib hpc -r sample1_L001_R1.fastq.gz sample1_L001_R2.fastq.gz sample1_L002_R1.fastq.gz sample1_L002_R2.fastq.gz -o /path/on/hpc/output_dir \
    --setup-corn-db

When using these options, toxolib will:

  1. Download the databases to your local machine
  2. Extract the databases locally
  3. Upload the extracted databases to the HPC
  4. Configure the Snakefile to use the correct database paths

This approach works even if your HPC has restricted internet access or firewalls that prevent direct downloads.

Manual Database Setup

If you prefer to set up the databases manually, you can follow these steps:

Kraken2 Database

You can download the standard Kraken2 database from: https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20240112.tar.gz

wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20240112.tar.gz
tar -xzf k2_standard_20240112.tar.gz -C /path/to/kraken2/database
export KRAKEN2_DB_DIR=/path/to/kraken2/database

Corn Genome Database

For host removal, you can download the corn genome reference from: https://glwasoilmetagenome.s3.us-east-1.amazonaws.com/corn_db.zip

wget https://glwasoilmetagenome.s3.us-east-1.amazonaws.com/corn_db.zip
unzip corn_db.zip -d /path/to/corn_db

Usage

Local Usage

Generate abundance matrix from raw data

toxolib abundance -r raw_data_1.fastq.gz raw_data_2.fastq.gz -o output_directory

This will:

  1. Run Kraken2 on the raw data
  2. Run Bracken on the Kraken2 results
  3. Generate an abundance matrix from the Bracken results
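The first two steps can be sketched as subprocess calls. The flags below are standard Kraken2/Bracken options, but the exact commands toxolib runs internally may differ; `kraken2_cmd` and `bracken_cmd` are illustrative helpers, not part of the toxolib API:

```python
import os
import subprocess

def kraken2_cmd(r1, r2, out_prefix, db=None, threads=8):
    """Build a Kraken2 command line for one paired-end, gzipped sample."""
    db = db or os.environ["KRAKEN2_DB_DIR"]
    return [
        "kraken2", "--db", db, "--threads", str(threads),
        "--gzip-compressed", "--paired", r1, r2,
        "--report", f"{out_prefix}.kreport",
        "--output", f"{out_prefix}.kraken",
    ]

def bracken_cmd(out_prefix, db=None, level="S", read_len=150):
    """Build a Bracken command line re-estimating abundances at species level."""
    db = db or os.environ["KRAKEN2_DB_DIR"]
    return [
        "bracken", "-d", db,
        "-i", f"{out_prefix}.kreport",
        "-o", f"{out_prefix}_species.bracken",
        "-l", level, "-r", str(read_len),
    ]

# Run both stages for one sample (requires the tools on PATH and a database):
# subprocess.run(kraken2_cmd("raw_data_1.fastq.gz", "raw_data_2.fastq.gz", "sample1"), check=True)
# subprocess.run(bracken_cmd("sample1"), check=True)
```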

Create abundance matrix from existing Bracken files

toxolib matrix -i sample1_species.bracken sample2_species.bracken -o abundance_matrix.csv
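The merge itself can be sketched with pandas, assuming each input is a standard tab-separated Bracken species-level table with `name` and `new_est_reads` (re-estimated read count) columns; `build_matrix` is an illustrative helper, not part of the toxolib API:

```python
import pandas as pd
from pathlib import Path

def build_matrix(bracken_files):
    """Merge per-sample Bracken outputs into one taxa-by-samples count matrix."""
    columns = {}
    for path in map(Path, bracken_files):
        # Derive a sample name from the file name,
        # e.g. sample1_species.bracken -> sample1
        sample = path.stem.replace("_species", "")
        table = pd.read_csv(path, sep="\t")
        columns[sample] = table.set_index("name")["new_est_reads"]
    # Outer-join on taxon name; taxa absent from a sample get a count of 0
    matrix = pd.DataFrame(columns).fillna(0).astype(int)
    matrix.index.name = "taxon"
    return matrix

# build_matrix(["sample1_species.bracken", "sample2_species.bracken"]).to_csv("abundance_matrix.csv")
```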

HPC Usage

Toxolib can run the analysis pipeline on an HPC cluster using SLURM for job scheduling.

1. Set up HPC connection

toxolib hpc-setup --hostname your-hpc-server.edu --username your-username --key-file ~/.ssh/id_rsa

This will save your HPC connection details to ~/.toxolib/hpc_config.yaml.
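The resulting file is plain YAML; a plausible shape, with field names that are illustrative rather than confirmed from the toxolib source, is:

```yaml
# ~/.toxolib/hpc_config.yaml (illustrative field names)
hostname: your-hpc-server.edu
username: your-username
key_file: ~/.ssh/id_rsa
```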

2. Run the pipeline on HPC

toxolib hpc -r raw_data_1.fastq.gz raw_data_2.fastq.gz -o /path/on/hpc/output_dir \
    --kraken-db /path/on/hpc/kraken2_db \
    --corn-db /path/on/hpc/corn_db \
    --partition normal --threads 32 --memory 200 --time 144:00:00

This will:

  1. Upload your raw data files to the HPC
  2. Create a Snakemake workflow file
  3. Upload an environment.yml file to the HPC
  4. Submit a SLURM job to run the analysis
  5. Return a job ID for tracking

Automatic Conda Environment Creation

When submitting a job to the HPC, toxolib will automatically:

  1. Upload a conda environment.yml file to the HPC
  2. Create a conda environment in the output directory if it doesn't exist
  3. Activate the environment before running the analysis

This ensures all required dependencies are available on the HPC without requiring manual environment setup.

3. Check job status

toxolib hpc-status --job-id your_job_id

4. Download results when complete

toxolib hpc-download --job-id your_job_id --output-dir ./local_results

5. HPC File Management

Toxolib provides several commands to manage files and directories on the HPC:

Interactive HPC Shell

toxolib hpc-shell

This starts an interactive shell session with the HPC that keeps the connection open until you explicitly exit. Features include:

  • Persistent connection until you type exit or quit
  • Colored prompt showing username, hostname, and current directory
  • Built-in commands like help, cd, and pwd
  • Direct execution of any shell command

Keeping Connection Open for Any Command

You can add the --keep-open flag to any HPC command to keep the connection open and start an interactive shell after executing the command:

# Execute a command and then start an interactive shell
toxolib hpc-pwd --keep-open
toxolib hpc-ls --keep-open
toxolib hpc-cd --path /some/path --keep-open
toxolib hpc-mkdir --path /new/directory --keep-open
Get Current Working Directory

toxolib hpc-pwd

Change Directory

toxolib hpc-cd --path /path/to/directory
# Go up one level
toxolib hpc-cd --path ..

Create Directory

toxolib hpc-mkdir --path /path/to/new/directory

List Files

# List files in current directory
toxolib hpc-ls

# List files in specific directory
toxolib hpc-ls --path /path/to/directory

# Long format listing (like ls -l)
toxolib hpc-ls --long

# Show hidden files (like ls -a)
toxolib hpc-ls --all

# Combine options
toxolib hpc-ls --path /path/to/directory --long --all

Manual Setup on HPC

If the automated upload is not suitable for your setup, you can manually download the Kraken2 and corn genome databases, upload them, and extract them on your HPC system:

# On your local machine, download the databases
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20240112.tar.gz
wget https://glwasoilmetagenome.s3.us-east-1.amazonaws.com/corn_db.zip

# Upload to HPC (using scp)
scp k2_standard_20240112.tar.gz your-username@your-hpc-server.edu:/path/on/hpc/
scp corn_db.zip your-username@your-hpc-server.edu:/path/on/hpc/

# SSH into HPC and extract
ssh your-username@your-hpc-server.edu
mkdir -p /path/on/hpc/kraken2_db
tar -xzf /path/on/hpc/k2_standard_20240112.tar.gz -C /path/on/hpc/kraken2_db
mkdir -p /path/on/hpc/corn_db
unzip /path/on/hpc/corn_db.zip -d /path/on/hpc/corn_db

Then when running toxolib, specify these paths:

toxolib hpc -r raw_data_1.fastq.gz raw_data_2.fastq.gz -o /path/on/hpc/output_dir \
    --kraken-db /path/on/hpc/kraken2_db \
    --corn-db /path/on/hpc/corn_db

License

MIT
