
A tool to upload Fastq files to the INSaFLU database and perform metagenomics pathogen detection


findONTime

A tool to upload fastq files (fastq or fastq.gz format) to the INSaFLU-TELEVIR platform and launch the metagenomics pathogen detection analysis using the TELEVIR module

Motivation

Reducing the time needed for pathogen detection and the sequencing costs per sample is crucial in the context of diagnostics using metagenomics sequencing. When performing hypothesis-free viral diagnosis by sequencing complex biological samples, the proportion of the virus in a sample is unknown. As such, the amount of sequencing data, and consequently the run length, needed to accurately detect the virus cannot be predicted a priori. findONTime runs concurrently with MinION sequencing and monitors the FASTQ files that are being generated in real time for each sample, merges the files (at user-defined time intervals), uploads them to the INSaFLU-TELEVIR platform and launches the metagenomics virus detection analysis using the TELEVIR module. This allows users to detect a virus in a sample as early as possible during the sequencing run, reducing the time gap between obtaining the sample and the diagnosis, and also reducing sequencing costs (as ONT runs can be stopped at any time and the flow cells can be cleaned and reused). findONTime can be used as a “start-to-end” solution or for particular tasks (e.g., merging ONT output files, metadata preparation and upload to INSaFLU-TELEVIR).

Introduction

The insaflu-upload tool uploads fastq files to the INSaFLU-TELEVIR platform (docker installation or local server), and launches the metagenomics pathogen detection analysis using the TELEVIR module. The tool relies on fastq-handler, a package to monitor and process outputs of ONT runs, upload the reads, launch TELEVIR projects and generate a report with the results.

Details

The user has the option to upload all files collected throughout the ONT run (sampling occurs at a user-defined interval) or to upload only the last file (i.e., the file compiling all reads generated until the latest sampling point). For upload, metadata files are also generated for each sequence file, according to the INSaFLU-TELEVIR input template file. Metadata files are stored in the metadata sub-directory of the output directory specified by the user.
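The two upload strategies can be illustrated with a minimal sketch. The function names, the assumption that merged files are ordered oldest first, and the metadata file naming are all hypothetical and are not taken from the findONTime source:

```python
from pathlib import Path
from typing import List


def select_for_upload(merged_files: List[Path], strategy: str = "all") -> List[Path]:
    """Pick which merged files to upload.

    Assumes `merged_files` is ordered by sampling point, oldest first.
    """
    if strategy not in ("all", "last"):
        raise ValueError(f"unknown upload strategy: {strategy!r}")
    if strategy == "last":
        # The last file compiles all reads up to the latest sampling point.
        return merged_files[-1:]
    return list(merged_files)


def metadata_path(out_dir: Path, fastq_file: Path) -> Path:
    """One metadata file per sequence file, under <out_dir>/metadata.

    The "_metadata.tsv" suffix is an assumption made for illustration.
    """
    name = fastq_file.name
    for suffix in (".fastq.gz", ".fastq"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return out_dir / "metadata" / f"{name}_metadata.tsv"
```

With `strategy="last"` only one file is transferred per sampling point, which may matter when the merged files grow large over a long run.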

Upload

insaflu-upload can interact with the INSaFLU-TELEVIR platform in two ways:

  • Docker. The user needs to have docker installed and running. The tool will then upload the files to the docker image. The user needs to provide the name of the docker image and the path for uploads.

  • SSH. The user needs to have access to the database server. The tool will then upload the files to the database using SSH. The user needs to provide the path for uploads and the credentials for the database server.
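As a rough illustration of the two routes, the sketch below builds the command lines for a `docker cp` transfer and an `scp` transfer. This is an assumption about the mechanism: the text above does not say how the tool moves files, and since paramiko is among its requirements the real SSH path may well differ. The function names and example values are hypothetical.

```python
from typing import List


def docker_copy_command(local_file: str, container: str, upload_dir: str) -> List[str]:
    """Build a `docker cp SRC CONTAINER:DEST` command.

    Note: `docker cp` addresses a running container, so `container` here
    is a container name (an assumption; the README speaks of an image name).
    """
    return ["docker", "cp", local_file, f"{container}:{upload_dir.rstrip('/')}/"]


def scp_command(
    local_file: str, username: str, ip_address: str, rsa_key: str, upload_dir: str
) -> List[str]:
    """Build an `scp -i KEY SRC USER@HOST:DEST` command."""
    target = f"{username}@{ip_address}:{upload_dir.rstrip('/')}/"
    return ["scp", "-i", rsa_key, local_file, target]
```

Either list can be passed to `subprocess.run(cmd, check=True)` so that a failed transfer raises instead of passing silently.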

INSaFLU-TELEVIR

The tool creates one INSaFLU-TELEVIR project for each directory containing fastq files. The project name is the name of the directory. Files generated within the same directory are uploaded to the same project.
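The directory-to-project mapping described above could be sketched as follows. The function name is hypothetical, and the sketch keys projects on the directory's base name only, so two directories with the same name under different parents would share a project here (the text does not say how the tool handles that case):

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

FASTQ_SUFFIXES = (".fastq", ".fastq.gz")


def group_by_project(in_dir: Path) -> Dict[str, List[Path]]:
    """Map each directory name to the fastq files found directly inside it."""
    projects: Dict[str, List[Path]] = defaultdict(list)
    # rglob also visits subfolders, since ONT output may be nested.
    for path in sorted(in_dir.rglob("*")):
        if path.is_file() and path.name.endswith(FASTQ_SUFFIXES):
            projects[path.parent.name].append(path)
    return dict(projects)
```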

Input Files

  • fastq.gz - Output directory for the ONT run, containing sequence files. The files can be in subfolders. The files can be gzipped or not.

  • config.ini - A configuration file containing the parameters for the tool. The file is generated by the tool when it is run for the first time. The user can edit the file to change the parameters.

Config must contain:

  • section [INSAFLU] containing the insaflu username and the app directory path.

  • (optional) section [SSH] containing the ssh credentials: username, ip_address and rsa key.

  • (optional) section [DOCKER] containing the docker image name.

see example config.ini
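For orientation, a config with the three sections listed above might be parsed as in the sketch below. The section names come from the list; the key names (`username`, `app_dir`, `ip_address`, `rsa_key`, `image_name`), all values, and the rule that the section matching the chosen connection must be present are assumptions made for illustration, so the example config.ini shipped with the tool remains the reference.

```python
import configparser

# Hypothetical contents; key names and values are illustrative only.
EXAMPLE_CONFIG = """
[INSAFLU]
username = demo_user
app_dir = /insaflu_web/INSaFLU

[SSH]
username = server_user
ip_address = 192.0.2.10
rsa_key = ~/.ssh/id_rsa

[DOCKER]
image_name = insaflu-server
"""


def load_config(text: str, connect: str = "docker") -> configparser.ConfigParser:
    """Parse the config and check that the sections needed for `connect` exist."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section("INSAFLU"):
        raise ValueError("config must contain an [INSAFLU] section")
    # Assumption: the optional section matching the connection type is required.
    needed = {"docker": "DOCKER", "ssh": "SSH"}[connect]
    if not parser.has_section(needed):
        raise ValueError(f"--connect {connect} needs a [{needed}] section")
    return parser
```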

API

usage: findontime [-h] -i IN_DIR -o OUT_DIR [-s SLEEP] [-n TAG] [--config CONFIG] [--max_size MAX_SIZE] [--merge] [--downsize] [--upload {last,all}] [--connect {docker,ssh}] [--keep_names] [--monitor] [--televir]

Process fastq files.

optional arguments:
  -h, --help            show this help message and exit
  -i IN_DIR, --in_dir IN_DIR
                        Input directory
  -o OUT_DIR, --out_dir OUT_DIR
                        Output directory
  -s SLEEP, --sleep SLEEP
                        Sleep time between checks in monitor mode
  -n TAG, --tag TAG     name tag, if given, will be added to the output file names
  --config CONFIG       config file
  --max_size MAX_SIZE   max size of the output file, in kilobytes
  --merge               merge files
  --downsize            downsize fastq files
  --upload {last,all}   file upload strategy (default: all)
  --connect {docker,ssh}
                        file upload strategy (default: docker)
  --keep_names          keep original file names
  --monitor             monitor directory until killed
  --televir             deploy televir pathogen identification on each sample
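The interface documented above can be reproduced with a small `argparse` sketch, which may help when scripting around the tool. It is a reconstruction from the help text, not the tool's own parser: the integer types for `--sleep` and `--max_size` are assumptions, and defaults that the help text does not state are left as `None`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the documented findontime options (reconstruction, not the original)."""
    p = argparse.ArgumentParser(prog="findontime", description="Process fastq files.")
    p.add_argument("-i", "--in_dir", required=True, help="Input directory")
    p.add_argument("-o", "--out_dir", required=True, help="Output directory")
    # Type and default are not stated in the help text; int/None are assumptions.
    p.add_argument("-s", "--sleep", type=int, default=None,
                   help="Sleep time between checks in monitor mode")
    p.add_argument("-n", "--tag", help="name tag added to the output file names")
    p.add_argument("--config", help="config file")
    p.add_argument("--max_size", type=int, default=None,
                   help="max size of the output file, in kilobytes")
    p.add_argument("--merge", action="store_true", help="merge files")
    p.add_argument("--downsize", action="store_true", help="downsize fastq files")
    p.add_argument("--upload", choices=["last", "all"], default="all")
    p.add_argument("--connect", choices=["docker", "ssh"], default="docker")
    p.add_argument("--keep_names", action="store_true", help="keep original file names")
    p.add_argument("--monitor", action="store_true", help="monitor directory until killed")
    p.add_argument("--televir", action="store_true",
                   help="deploy televir pathogen identification on each sample")
    return p


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-i", "test_run/", "-o", "test_new", "--tag", "another", "-s", "5", "--merge"]
    )
    print(args.in_dir, args.out_dir, args.tag, args.sleep, args.merge, args.upload)
```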

REQUIREMENTS

Modules

  • python 3.6 or higher
  • dataclasses==0.6
  • natsort==8.3.1
  • pandas==1.5.3
  • paramiko==3.1.0
  • pip==21.2.3
  • setuptools==57.4.0
  • xopen==1.7.0

INSTALLATION

python -m venv .venv
source .venv/bin/activate
python -m pip install findontime

USAGE

findontime -i test_run/ -o test_new -d test_new --tag another -s 5 --merge --televir

TESTING

Running pytest in the root directory will run all tests that do not interact with INSaFLU-TELEVIR. In order to test the upload and metagenomics functionalities, the user needs to provide a valid config file for a local docker installation and pass the --docker flag to pytest:

pytest --docker --config-file config.ini

OUTPUT

Note: The output directory structure is maintained.

  • fastq.gz - files containing all reads from the previous files.
  • log.txt - file recording the concatenation process.
  • metadata - sub-directory containing individual metadata files for each fastq file uploaded.
  • results.tsv - file containing the results of the pathogen detection. One file per project.
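How a cumulative fastq.gz and its log could be produced is sketched below with the standard library only. The function names and the tab-separated log format are hypothetical; the tool itself lists xopen among its requirements and may merge files differently.

```python
import gzip
import shutil
from pathlib import Path
from typing import Iterable


def open_maybe_gzip(path: Path):
    """Open a fastq file for binary reading, gzipped or not."""
    return gzip.open(path, "rb") if path.name.endswith(".gz") else open(path, "rb")


def merge_fastq(inputs: Iterable[Path], merged: Path, log: Path) -> None:
    """Concatenate all reads from `inputs` into one gzipped file and log each source."""
    merged.parent.mkdir(parents=True, exist_ok=True)
    log.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(merged, "wb") as out, open(log, "a") as log_handle:
        for src in inputs:
            with open_maybe_gzip(src) as handle:
                # Stream in chunks so large ONT files are never fully in memory.
                shutil.copyfileobj(handle, out)
            log_handle.write(f"{src}\t{merged}\n")
```

The sketch assumes every input ends with a newline, as well-formed FASTQ files do; otherwise the last record of one file would run into the first record of the next.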

