py-dcqc


Python package for performing quality control (QC) for data coordination (DC)

Purpose

This Python package provides a framework for performing quality control (QC) on data files. Quality control can range from low-level integrity checks (e.g. MD5 checksum, file extension) to high-level checks such as conformance to a format specification and consistency with associated metadata.

The tool is designed to be flexible and extensible, allowing for:

  • File integrity validation
  • Format specification conformance
  • Metadata consistency checks
  • Custom test suite creation
  • Integration with external QC tools
  • Batch processing of multiple files
  • Comprehensive reporting in JSON format

Core Concepts

Files and FileTypes

A File represents a local or remote file along with its metadata. Each file has an associated FileType that bundles information about:

  • Valid file extensions
  • EDAM format ontology identifiers
  • File type-specific validation rules

Built-in file types include TXT, JSON, JSON-LD, TIFF, OME-TIFF, TSV, CSV, BAM, FASTQ, and HDF5.
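
As a rough illustration (not py-dcqc's actual classes), a file type can be thought of as a small record bundling these pieces of information, which an extension check can then consult:

from dataclasses import dataclass

# Hypothetical mirror of the FileType concept above; names and fields
# are illustrative and do not reflect py-dcqc's actual API.
@dataclass(frozen=True)
class FileTypeInfo:
    name: str
    extensions: tuple[str, ...]  # valid file extensions
    edam_id: str                 # EDAM format ontology identifier (placeholder below)

OME_TIFF = FileTypeInfo("OME-TIFF", (".ome.tif", ".ome.tiff"), "format_XXXX")

def has_valid_extension(path: str, file_type: FileTypeInfo) -> bool:
    # The kind of extension check a FileType makes possible.
    return any(path.lower().endswith(ext) for ext in file_type.extensions)

print(has_valid_extension("image.ome.tiff", OME_TIFF))  # True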

Targets

A Target represents one or more files that should be validated together. There are two types of targets (sketched after this list):

  • SingleTarget: For validating individual files
  • PairedTarget: For validating exactly two related files together (e.g., paired-end sequencing data)
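
A stdlib-only sketch of the distinction (hypothetical names, not the py-dcqc API): a paired target simply insists on exactly two files.

from dataclasses import dataclass

# Illustrative stand-ins for the two target kinds; not py-dcqc's classes.
@dataclass
class SingleTargetSketch:
    file: str  # path or URL of the single file to validate

@dataclass
class PairedTargetSketch:
    files: tuple[str, str]  # exactly two related files, e.g. paired-end reads

    def __post_init__(self) -> None:
        if len(self.files) != 2:
            raise ValueError("a paired target requires exactly two files")

pair = PairedTargetSketch(("sample_R1.fastq.gz", "sample_R2.fastq.gz"))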

Tests

Tests are individual validation checks that can be run on targets. There are two types of tests (contrasted in the sketch after this list):

  1. Internal Tests: Tests written and executed in Python

    • File extension validation
    • Metadata consistency checks
    • Format validation
  2. External Tests: Tests that utilize external tools or processes

    • File integrity checks (MD5, checksums)
    • Format-specific validation tools
    • Custom validation scripts
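
The sketch below (hypothetical helper functions, not py-dcqc's test classes) shows the difference in spirit: an internal test runs entirely in Python, while an external test shells out to another tool and interprets its exit code.

import subprocess
from pathlib import Path

def internal_extension_test(path: str, allowed: tuple[str, ...]) -> bool:
    # Internal test: implemented and executed purely in Python.
    return Path(path).suffix.lower() in allowed

def external_gzip_test(path: str) -> bool:
    # External test: delegate to an outside tool ("gzip -t" verifies that a
    # gzip archive decompresses cleanly) and treat exit code 0 as a pass.
    result = subprocess.run(["gzip", "-t", path], capture_output=True)
    return result.returncode == 0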

Tests are further organized into tiers:

  • Tier 1 - File Integrity: Checking that the file is whole and "available". These tests verify basic file integrity and usually require additional information, such as an expected MD5 checksum (a stdlib sketch of this check follows the list). Typical checks include:

    • MD5 checksum verification
    • Expected file extension checks
    • Format-specific checks (e.g., first/last bytes)
    • Decompression checks if applicable
  • Tier 2 - Internal Conformance: Checking that the file is internally consistent and compliant with its stated format. These tests only need the files themselves and their format specification:

    • File format validation using available tools
    • Internal metadata validation against schema (e.g., OME XML)
    • Additional checks on internal metadata
  • Tier 3 - External Conformance: Checking that file features are consistent with separately submitted metadata. These tests use additional information but remain objective/quantitative:

    • Channel count consistency
    • File/image size consistency
    • Antibody nomenclature conformance
    • Secondary file presence (e.g., CRAI file for CRAM)
  • Tier 4 - Subjective Conformance: Checking files against qualitative criteria that may need expert review. These tests often involve metrics, heuristics, or sophisticated models:

    • Sample swap detection
    • PHI detection in images and metadata
    • Outlier detection using metrics (e.g., file size)
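
For example, the Tier 1 checksum check amounts to hashing the file and comparing the digest against the submitted value; below is a generic, stdlib-only version of that idea (not py-dcqc's implementation).

import hashlib

def md5_matches(path: str, expected_md5: str, chunk_size: int = 1 << 20) -> bool:
    # Stream the file in chunks so large files never need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()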

Suites

A Suite is a collection of tests that are specific to a particular file type (e.g., FASTQ, BAM, CSV). Each file type has its own suite of tests that are appropriate for that format. Suites:

  • Group tests together based on the target file type
  • Can specify required vs optional tests:
    • By default, Tier 1 (File Integrity) and Tier 2 (Internal Conformance) tests are required
    • Users can explicitly specify which tests are required by name
  • Allow tests to be skipped if specified in the suite
  • Provide overall validation status (summarized in the sketch after this list):
    • GREEN: All tests passed
    • RED: One or more required tests failed
    • AMBER: All required tests passed, but one or more optional tests failed
    • GREY: Error occurred during testing
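
Those rules reduce to a short decision function; the version below is an illustrative re-implementation, not py-dcqc's internal code.

def suite_status(required: dict[str, bool], optional: dict[str, bool],
                 errored: bool = False) -> str:
    # required/optional map test names to pass (True) or fail (False).
    if errored:
        return "GREY"   # an error occurred during testing
    if not all(required.values()):
        return "RED"    # one or more required tests failed
    if not all(optional.values()):
        return "AMBER"  # required tests passed, but an optional test failed
    return "GREEN"      # all tests passed

print(suite_status({"ChecksumTest": True}, {"FormatTest": False}))  # AMBER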

Reports

Reports provide structured output of test results in various formats:

  • JSON reports for machine readability
  • CSV updates for batch processing
  • Detailed test status and error messages
  • Aggregated results across multiple suites
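
As a concrete picture of the CSV update, the sketch below appends a dcqc_status column (the column name comes from the update_csv command described later) to each input row; the file names and the per-row status list are hypothetical.

import csv

# Illustrative only: append a dcqc_status column to an input CSV.
# File names and the status list are made up for the example.
statuses = ["GREEN", "RED", "AMBER"]  # e.g. one suite status per input row

with open("input_targets.csv", newline="") as src, \
        open("updated_targets.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + ["dcqc_status"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for row, status in zip(reader, statuses):
        row["dcqc_status"] = status
        writer.writerow(row)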

Installation

You can install py-dcqc directly from PyPI:

pip install dcqc

For development installation from source:

git clone https://github.com/Sage-Bionetworks-Workflows/py-dcqc.git
cd py-dcqc
pip install -e .

Docker

You can also use the official Docker container:

docker pull ghcr.io/sage-bionetworks-workflows/py-dcqc:latest

To run commands using the Docker container:

docker run ghcr.io/sage-bionetworks-workflows/py-dcqc:latest dcqc --help

For processing local files, remember to mount your data directory:

docker run -v /path/to/your/data:/data ghcr.io/sage-bionetworks-workflows/py-dcqc:latest dcqc qc_file --input-file /data/myfile.csv --file-type csv

Command Line Interface

To see all available commands and their options:

dcqc --help

Main commands include:

  • create_targets: Create target JSON files from a targets CSV file
  • create_tests: Create test JSON files from a target JSON file
  • create_process: Create external process JSON file from a test JSON file
  • compute_test: Compute the test status from a test JSON file
  • create_suite: Create a suite from a set of test JSON files sharing the same target
  • combine_suites: Combine several suite JSON files into a single JSON report
  • list_tests: List the tests available for each file type
  • qc_file: Run QC tests on a single file (external tests are skipped)
  • update_csv: Update input CSV file with dcqc_status column

For detailed help on any command:

dcqc <command> --help

Example Usage

Basic File QC

Run QC on a single file:

dcqc qc-file --input-file data.csv --file-type csv --metadata '{"author": "John Doe"}'

Creating and Running Test Suites

  1. Create targets from a CSV file:
dcqc create-targets input_targets.csv output_dir/
  2. Create tests for a target:
dcqc create-tests target.json tests_dir/ --required-tests "ChecksumTest" "FormatTest"
  3. Run tests and create a suite:
dcqc create-suite --output-json results.json test1.json test2.json test3.json

Listing Available Tests

To see all available tests for different file types:

dcqc list-tests

Integration with nf-dcqc

Early versions of this package were developed to be used by its sibling, the nf-dcqc Nextflow workflow. The initial command-line interface was developed with nf-dcqc in mind, favoring smaller steps to enable parallelism in Nextflow.

PyScaffold

This project has been set up using PyScaffold 4.3. For details and usage information on PyScaffold see https://pyscaffold.org/.

putup --name dcqc --markdown --github-actions --pre-commit --license Apache-2.0 py-dcqc
