Merge MGI fastq files

MergeGI

MergeGI provides a single command line to merge and select barcoded raw data from MGI sequencing runs into a set of FastQ files ready for subsequent bioinformatics analysis.

Installation

We provide MergeGI as a Python library available on PyPI. The standalone application is called mergegi and can be installed in an environment with Python>3.6 as follows:

pip install mergegi

There are no dependencies except for the click package, so installation should be straightforward.

For developers:

git clone git@github.com:sequana/MergeGI.git
cd MergeGI
pip install -e .[testing]

Overview

The main goal of MergeGI is to select and merge the FastQ files generated by an MGI sequencer into a list of FastQ files directly usable for subsequent bioinformatics analysis. Why do we need this preprocessing?

First, MGI generates one FastQ file per barcode. You may not need all those barcodes, yet the demultiplexing performs a systematic search of all barcodes. Consequently, you will end up with FastQ files corresponding to your barcodes and a bunch of FastQ files that should be ignored. From the information given by your wetlab colleagues, you should have the list of samples and their relevant barcodes.

Second, MGI technology requires barcodes to be processed in a specific manner, meaning that a given sample may be split across several barcodes (files). Therefore we need a tool to merge such files. Again, the wetlab should provide the barcodes corresponding to a given sample. See the image below for more explanation.

Third, an MGI flowcell has several lanes. You may want to merge the lanes or not.

Those three steps should be managed seamlessly by our tool given a sample sheet and the output directory of the MGI runs.

General Usage and Examples

The data structure expected by MergeGI is the standard output directory of MGI runs:

OutputFq/Flowcell/L01
OutputFq/Flowcell/L02

Here, L01 and L02 stand for lanes 1 and 2.
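To make the expected layout concrete, here is a minimal sketch that lists the lane directories found under a run directory. It is not part of MergeGI: the function name find_lane_directories is hypothetical, and it assumes that the directory passed in is the one containing the flowcell directories (OutputFq in the example above).

```python
import re
from pathlib import Path

# Lane directories are assumed to be named L01, L02, ... as in the example above.
LANE_PATTERN = re.compile(r"^L(\d{2})$")


def find_lane_directories(root):
    """Return {lane_number: [paths]} for every <root>/<flowcell>/Lxx directory.

    Hypothetical helper for illustration; MergeGI may locate lanes differently.
    """
    lanes = {}
    for path in sorted(Path(root).glob("*/L[0-9][0-9]")):
        match = LANE_PATTERN.match(path.name)
        if path.is_dir() and match:
            lanes.setdefault(int(match.group(1)), []).append(path)
    return lanes


if __name__ == "__main__":
    for lane, paths in find_lane_directories("OutputFq").items():
        print(lane, [str(p) for p in paths])
```

A lane number maps to a list of paths because several flowcell directories may sit under the same root.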

The software needs a sample sheet that describes the sample name, the associated barcode identifier, an optional second barcode (if there is none, the column must still be present with empty strings), the project name (used to create the new output directory), and the lane where the sample/barcode pair is located. Here is an example:

samplename,barcode,barcode2,project,lane
A,         1,,              projectA, 1
B,         20,,              projectA, 1
A,         1,,              projectA, 2
C,         20,,              projectB, 2
C,         30,,              projectB, 1
B,         30,,              projectA, 2

If you have pooled a sample on the four lanes, meaning that the same barcode is used on each lane, you can use the * character to simplify the sample sheet:

samplename,barcode,barcode2,project,lane
A,         1,,              projectA, *
B,         20,,             projectA, *
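As an illustration of what the * shorthand stands for, the sketch below expands a wildcard lane into one explicit row per lane. This is not MergeGI's code: the function name expand_lanes is hypothetical, and the default of four lanes is an assumption taken from the sentence above.

```python
def expand_lanes(rows, n_lanes=4):
    """Replace a '*' lane with one row per lane, from 1 to n_lanes.

    Each row is (samplename, barcode, barcode2, project, lane).
    Hypothetical helper; n_lanes=4 is an assumption about the flowcell.
    """
    expanded = []
    for sample, barcode, barcode2, project, lane in rows:
        lanes = range(1, n_lanes + 1) if lane == "*" else [int(lane)]
        for number in lanes:
            expanded.append((sample, barcode, barcode2, project, number))
    return expanded


if __name__ == "__main__":
    rows = [
        ("A", "1", "", "projectA", "*"),
        ("B", "20", "", "projectA", "*"),
    ]
    for row in expand_lanes(rows):
        print(row)
```

With the two wildcard rows above, the expansion yields eight explicit rows, four per sample.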

IMPORTANT NOTE1: the current version uses the barcode 1 only (column barcode).

IMPORTANT NOTE2: The header must be present. The header names are not important, but the columns must appear in the expected order: sample name, barcode 1, barcode 2, project name, lane.
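The two notes can be illustrated with a short sketch that reads a sample sheet by column position, skips the header line, ignores the second barcode, and groups the barcodes of each sample. It is only an illustration under these assumptions and not MergeGI's actual parser; the function name read_samplesheet is hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_samplesheet(handle):
    """Group (lane, barcode) pairs by (project, sample).

    Columns are read by position, so the header names do not matter.
    Hypothetical helper for illustration only.
    """
    reader = csv.reader(handle)
    next(reader, None)  # the header line is required but its names are unused
    groups = defaultdict(list)
    for fields in reader:
        fields = [field.strip() for field in fields]
        if not any(fields):
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(f"expected 5 columns, got {len(fields)}: {fields}")
        sample, barcode, _barcode2, project, lane = fields
        groups[(project, sample)].append((lane, barcode))
    return dict(groups)


if __name__ == "__main__":
    sheet = io.StringIO(
        "samplename,barcode,barcode2,project,lane\n"
        "A,         1,,              projectA, 1\n"
        "B,         20,,              projectA, 1\n"
        "B,         30,,              projectA, 2\n"
    )
    for key, pairs in read_samplesheet(sheet).items():
        print(key, pairs)
```

Stripping each field makes the padded layout used in the examples above equivalent to a compact CSV file.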

Given the sample sheet, and the input directory (top level of the MGI runs), this command should create a new clean directory with the relevant FastQ files (here in merge_data directory):

mergegi --samplesheet samplesheet.csv --input-directory mgi_raw_data --output-directory merge_data 

If the data is paired, add the --paired argument:

mergegi --samplesheet samplesheet.csv --input-directory mgi_raw_data --output-directory merge_data --paired

By default, lanes are merged. If this is not what you want, you may disable this behavior with --no-merge:

mergegi --samplesheet samplesheet.csv --input-directory mgi_raw_data --output-directory merge_data --paired --no-merge
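For readers curious about what merging FastQ files can involve, here is a minimal sketch based on plain concatenation. It relies on the fact that gzip members written one after another still form a valid gzip file. Whether MergeGI merges files this way is an assumption, and the function name concatenate_fastq and the file names in the example are hypothetical.

```python
import shutil
from pathlib import Path


def concatenate_fastq(sources, destination):
    """Append each source file, in order, to a new destination file.

    Works for plain FastQ and for gzip-compressed FastQ alike, as long as
    all sources use the same format. Hypothetical helper, not MergeGI's code.
    """
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    with open(destination, "wb") as out:
        for source in sources:
            with open(source, "rb") as chunk:
                shutil.copyfileobj(chunk, out)
    return destination


if __name__ == "__main__":
    # Hypothetical per-lane files for one sample.
    lane_files = ["lane1_sampleA.fq.gz", "lane2_sampleA.fq.gz"]
    if all(Path(name).exists() for name in lane_files):
        concatenate_fastq(lane_files, "merge_data/projectA/A.fastq.gz")
```

For paired data, the same order of source files would have to be used for both mates so that read pairs stay in step.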

Changelog

========= ==========================================================================
Version   Description
========= ==========================================================================
0.1.0     * simplify the CI action workflow and setup
0.0.1     * first release
========= ==========================================================================

Barcode distribution example
