A package for the management of analyses and data in DNA metabarcoding.

The OBITools3: A package for the management of analyses and data in DNA metabarcoding

Website: https://metabarcoding.org/obitools3

DNA metabarcoding offers new perspectives for biodiversity research [1]. This approach to ecosystem studies relies heavily on Next-Generation Sequencing (NGS), and consequently requires the ability to process large volumes of data. The OBITools package satisfies this requirement thanks to a set of programs specifically designed for analyzing NGS data in a DNA metabarcoding context [2] (https://metabarcoding.org/obitools). Their capacity to filter and edit sequences while taking taxonomic annotation into account helps to set up tailor-made analysis pipelines for a broad range of DNA metabarcoding applications, including biodiversity surveys and diet analyses.

The OBITools3. This new version of the OBITools aims to significantly improve storage efficiency and data processing speed. To this end, the OBITools3 rely on an ad hoc database system that stores all the data a DNA metabarcoding experiment must consider: the sequences, the metadata (describing for instance the samples), the database containing the reference sequences used for the taxonomic annotation, as well as the taxonomic databases. Besides the gain in efficiency, this new structure allows easier access to all the data associated with an experiment.

Column-oriented storage. An analysis pipeline is a succession of commands, each computing one step of the analysis, where the result of command n is used by command n+1. DNA metabarcoding data can easily be represented as tables, and each command can be regarded as an operation transforming one or several 'input' tables into one or several 'output' tables, which can in turn be used by the next command. Many of the basic operations in a pipeline copy a large part of the input tables to the result tables without modification, and use only a small part of the input data for their calculations. In the original OBITools, those tables are kept as annotated sequence files in the FASTA or FASTQ format. This has two consequences: i) keeping the intermediate results of the analysis pipeline means using disk space for a large volume of redundant data; ii) encoding and decoding information that is not actually used represents a large part of the processing. The new database system used by the OBITools3 (called DMS, for Data Management System) relies on column-oriented storage. The columns are immutable and can be assembled into views representing the data tables. This way, the data in an input table that a command does not modify can be associated with the result table without duplicating any information, and the data that a command does not use at all can be associated with the result table without being read. This strategy saves disk space by limiting data redundancy, and saves execution time by limiting data reading, writing and conversion operations. Finally, to optimize data access, each column is stored in a binary file directly mapped in memory for reading and writing operations.
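The idea of views that share immutable columns can be illustrated with a minimal Python sketch. It is not the DMS implementation (which is written in C99 and stores columns in memory-mapped files); the Column and View classes, the annotate_length command and the column names are hypothetical and exist only for this illustration.

```python
from types import MappingProxyType


class Column:
    """Hypothetical immutable column: a name and a tuple of values."""

    def __init__(self, name, values):
        self.name = name
        self.values = tuple(values)  # tuples cannot be modified in place


class View:
    """Hypothetical view: a table assembled from references to columns."""

    def __init__(self, columns):
        # Read-only mapping from column name to the shared Column object.
        self.columns = MappingProxyType({c.name: c for c in columns})

    def replace(self, new_column):
        """Return a new view with one column added or replaced.

        All other columns are shared by reference, not copied.
        """
        cols = dict(self.columns)
        cols[new_column.name] = new_column
        return View(cols.values())

    def row(self, i):
        """Assemble row i as a dict, reading one value from each column."""
        return {name: col.values[i] for name, col in self.columns.items()}


def annotate_length(view):
    """Toy command: reads only the 'sequence' column and adds 'length'."""
    seqs = view.columns["sequence"].values
    return view.replace(Column("length", [len(s) for s in seqs]))


reads = View([
    Column("id", ["r1", "r2"]),
    Column("sequence", ["ACGT", "ACGTAC"]),
    Column("sample", ["s1", "s2"]),
])
annotated = annotate_length(reads)
```

In this sketch the 'id' and 'sample' columns of the result view are the very same objects as in the input view, so nothing is duplicated and the command never reads them, while the input view is left unchanged.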

Storage optimization. DNA metabarcoding data is intrinsically very redundant. For example, the same sequence corresponding to a species will be present several thousand times across all samples. In order to limit the disk space used and make comparison operations more efficient, data in the form of character strings is stored in columns using a complex indexing structure, efficient on millions of values, coupling hash functions, Bloom filters and AVL trees. Finally, DNA sequences are compressed by encoding each nucleotide on two or four bits depending on whether the sequences contain only the four nucleotides (A, C, G, T) or use the IUPAC codes.
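The two-bit and four-bit encodings can be sketched as follows. The code tables, the bit order within a byte and the function names are assumptions chosen for this illustration; the actual tables used by the DMS may differ.

```python
# Assumed 2-bit codes for the four unambiguous nucleotides.
CODE2 = {"A": 0, "C": 1, "G": 2, "T": 3}

# Assumed 4-bit codes: one bit per base (A=1, C=2, G=4, T=8), so each
# IUPAC symbol is the bitwise OR of the bases it stands for.
CODE4 = {
    "A": 1, "C": 2, "G": 4, "T": 8,
    "R": 5, "Y": 10, "S": 6, "W": 9, "K": 12, "M": 3,
    "B": 14, "D": 13, "H": 11, "V": 7, "N": 15,
}


def encode(seq):
    """Pack a sequence on 2 bits per base if possible, else on 4 bits.

    Returns (bits per base, sequence length, packed bytes).
    """
    seq = seq.upper()
    if all(base in CODE2 for base in seq):
        table, bits = CODE2, 2
    else:
        table, bits = CODE4, 4
    per_byte = 8 // bits
    packed = bytearray((len(seq) + per_byte - 1) // per_byte)
    for i, base in enumerate(seq):
        # Fill each byte starting from its high-order bits.
        shift = 8 - bits * (i % per_byte + 1)
        packed[i // per_byte] |= table[base] << shift
    return bits, len(seq), bytes(packed)


def decode(bits, length, packed):
    """Inverse of encode(): rebuild the sequence string."""
    table = CODE2 if bits == 2 else CODE4
    symbols = {code: base for base, code in table.items()}
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    out = []
    for i in range(length):
        shift = 8 - bits * (i % per_byte + 1)
        out.append(symbols[(packed[i // per_byte] >> shift) & mask])
    return "".join(out)
```

With these tables, a six-base sequence made only of A, C, G and T fits in two bytes, whereas a sequence containing an ambiguity code such as N needs one byte for every two bases. The sequence length is kept alongside the packed bytes because the last byte may be only partly filled.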

Saving the data processing history. All the information used by the OBITools3 is stored in immutable data structures in the DMS. If a command has to modify a column used as input to produce its result, a new version of that column is created, leaving the initial version intact. This storage system makes it possible to keep, at minimal cost, all the intermediate results produced by the pipeline. The metadata stored in the DMS, which describes all the operations that have produced a view (a result table), makes it possible to build a directed hypergraph in which each node corresponds to a view and each arrow to an operation. By retracing the dependency relationships in this hypergraph, it is possible to reconstruct a posteriori the entire process that has produced a result table.
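Retracing the history of a view can be sketched as a traversal of the recorded operations. In the sketch below, the HISTORY dictionary, the view names and the command names are all hypothetical; they only stand in for the metadata that the DMS stores with each view.

```python
# Hypothetical history metadata: for each view, the command that produced
# it and the views it took as input (an operation may have several inputs).
HISTORY = {
    "reads": ("import", []),
    "reference": ("import", []),
    "filtered": ("filter", ["reads"]),
    "unique": ("dereplicate", ["filtered"]),
    "assigned": ("assign", ["unique", "reference"]),
}


def rebuild(view, history, done=None):
    """Return the operations needed to produce `view`, in execution order.

    Each step is a tuple (command, input views, output view). A view that
    several operations depend on is listed only once.
    """
    if done is None:
        done = set()
    steps = []
    if view in done:
        return steps
    done.add(view)
    command, inputs = history[view]
    for parent in inputs:
        # Everything a view depends on must be produced before it.
        steps.extend(rebuild(parent, history, done))
    steps.append((command, tuple(inputs), view))
    return steps


pipeline = rebuild("assigned", HISTORY)
```

Here pipeline lists the two imports, the filtering and the dereplication before the final assignment, which is the order in which the commands would have to be run again to reproduce the 'assigned' view.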

Tools. The OBITools3 offer the same tools as the original OBITools, plus ecoPCR (in silico PCR) [4] and Sumatra (sequence alignment, not multithreaded yet) [5]. Eventually, new versions of ecoPrimers (PCR primer design) [3], as well as Sumaclust (sequence alignment and clustering) [5] will be added, taking advantage of the database structure developed for the OBITools3.

Implementation and availability. The lower layers managing the DMS as well as all the compute-intensive functions are coded in C99 for efficiency reasons. A Cython (http://www.cython.org) object layer allows for a simple but efficient implementation of the OBITools3 commands in Python 3. The OBITools3 are now being released; check the wiki for more information.

References.

  1. Taberlet P, Coissac E, Hajibabaei M, Rieseberg LH: Environmental DNA. Mol Ecol 2012:1789–1793.
  2. Boyer F, Mercier C, Bonin A, Le Bras Y, Taberlet P, Coissac E: OBITools: a Unix-inspired software package for DNA metabarcoding. Mol Ecol Resour 2016:176–182.
  3. Riaz T, Shehzad W, Viari A, Pompanon F, Taberlet P, Coissac E: ecoPrimers: inference of new DNA barcode markers from whole genome sequence analysis. Nucleic Acids Res 2011, 39:e145.
  4. Ficetola GF, Coissac E, Zundel S, Riaz T, Shehzad W, Bessière J, Taberlet P, Pompanon F: An in silico approach for the evaluation of DNA barcodes. BMC Genomics 2010, 11:434.
  5. Mercier C, Boyer F, Bonin A, Coissac E: SUMATRA and SUMACLUST: fast and exact comparison and clustering of sequences. 2013. Available: http://metabarcoding.org/sumatra and http://metabarcoding.org/sumaclust
