py-dcqc


Python package for performing quality control (QC) for data coordination (DC)

Purpose

This Python package provides a framework for performing quality control (QC) on data files. Quality control can range from low-level integrity checks (e.g. MD5 checksum, file extension) to high-level checks such as conformance to a format specification and consistency with associated metadata.

The tool is designed to be flexible and extensible, allowing for:

  • File integrity validation
  • Format specification conformance
  • Metadata consistency checks
  • Custom test suite creation
  • Integration with external QC tools
  • Batch processing of multiple files
  • Comprehensive reporting in JSON format

Core Concepts

Files and FileTypes

A File represents a local or remote file along with its metadata. Each file has an associated FileType that bundles information about:

  • Valid file extensions
  • EDAM format ontology identifiers
  • File type-specific validation rules

Built-in file types include TXT, JSON, JSON-LD, TIFF, OME-TIFF, TSV, CSV, BAM, FASTQ, and HDF5.
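
As a rough illustration (not py-dcqc's actual classes), a file type can be thought of as a small record bundling these pieces of information, which an extension check can then consult:

from dataclasses import dataclass

# Hypothetical mirror of the FileType concept above; names and fields
# are illustrative and do not reflect py-dcqc's actual API.
@dataclass(frozen=True)
class FileTypeInfo:
    name: str
    extensions: tuple[str, ...]  # valid file extensions
    edam_id: str                 # EDAM format ontology identifier (placeholder below)

OME_TIFF = FileTypeInfo("OME-TIFF", (".ome.tif", ".ome.tiff"), "format_XXXX")

def has_valid_extension(path: str, file_type: FileTypeInfo) -> bool:
    # The kind of extension check a FileType makes possible.
    return any(path.lower().endswith(ext) for ext in file_type.extensions)

print(has_valid_extension("image.ome.tiff", OME_TIFF))  # True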

Targets

A Target represents one or more files that should be validated together. There are two types of targets (sketched after this list):

  • SingleTarget: For validating individual files
  • PairedTarget: For validating exactly two related files together (e.g., paired-end sequencing data)
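
A stdlib-only sketch of the distinction (hypothetical names, not the py-dcqc API): a paired target simply insists on exactly two files.

from dataclasses import dataclass

# Illustrative stand-ins for the two target kinds; not py-dcqc's classes.
@dataclass
class SingleTargetSketch:
    file: str  # path or URL of the single file to validate

@dataclass
class PairedTargetSketch:
    files: tuple[str, str]  # exactly two related files, e.g. paired-end reads

    def __post_init__(self) -> None:
        if len(self.files) != 2:
            raise ValueError("a paired target requires exactly two files")

pair = PairedTargetSketch(("sample_R1.fastq.gz", "sample_R2.fastq.gz"))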

Tests

Tests are individual validation checks that can be run on targets. There are two types of tests (contrasted in the sketch after this list):

  1. Internal Tests: Tests written and executed in Python

    • File extension validation
    • Metadata consistency checks
    • Format validation
  2. External Tests: Tests that utilize external tools or processes

    • File integrity checks (MD5, checksums)
    • Format-specific validation tools
    • Custom validation scripts
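
The sketch below (hypothetical helper functions, not py-dcqc's test classes) shows the difference in spirit: an internal test runs entirely in Python, while an external test shells out to another tool and interprets its exit code.

import subprocess
from pathlib import Path

def internal_extension_test(path: str, allowed: tuple[str, ...]) -> bool:
    # Internal test: implemented and executed purely in Python.
    return Path(path).suffix.lower() in allowed

def external_gzip_test(path: str) -> bool:
    # External test: delegate to an outside tool ("gzip -t" verifies that a
    # gzip archive decompresses cleanly) and treat exit code 0 as a pass.
    result = subprocess.run(["gzip", "-t", path], capture_output=True)
    return result.returncode == 0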

Tests are further organized into tiers:

  • Tier 1 - File Integrity: Checking that the file is whole and "available". These tests verify basic file integrity and usually require additional information, such as an expected MD5 checksum (a stdlib sketch of this check follows the list). Typical checks include:

    • MD5 checksum verification
    • Expected file extension checks
    • Format-specific checks (e.g., first/last bytes)
    • Decompression checks if applicable
  • Tier 2 - Internal Conformance: Checking that the file is internally consistent and compliant with its stated format. These tests only need the files themselves and their format specification:

    • File format validation using available tools
    • Internal metadata validation against schema (e.g., OME XML)
    • Additional checks on internal metadata
  • Tier 3 - External Conformance: Checking that file features are consistent with separately submitted metadata. These tests use additional information but remain objective/quantitative:

    • Channel count consistency
    • File/image size consistency
    • Antibody nomenclature conformance
    • Secondary file presence (e.g., CRAI file for CRAM)
  • Tier 4 - Subjective Conformance: Checking files against qualitative criteria that may need expert review. These tests often involve metrics, heuristics, or sophisticated models:

    • Sample swap detection
    • PHI detection in images and metadata
    • Outlier detection using metrics (e.g., file size)
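
For example, the Tier 1 checksum check amounts to hashing the file and comparing the digest against the submitted value; below is a generic, stdlib-only version of that idea (not py-dcqc's implementation).

import hashlib

def md5_matches(path: str, expected_md5: str, chunk_size: int = 1 << 20) -> bool:
    # Stream the file in chunks so large files never need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()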

Suites

A Suite is a collection of tests that are specific to a particular file type (e.g., FASTQ, BAM, CSV). Each file type has its own suite of tests that are appropriate for that format. Suites:

  • Group tests together based on the target file type
  • Can specify required vs optional tests:
    • By default, Tier 1 (File Integrity) and Tier 2 (Internal Conformance) tests are required
    • Users can explicitly specify which tests are required by name
  • Allow tests to be skipped if specified in the suite
  • Provide overall validation status (summarized in the sketch after this list):
    • GREEN: All tests passed
    • RED: One or more required tests failed
    • AMBER: All required tests passed, but one or more optional tests failed
    • GREY: Error occurred during testing
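
Those rules reduce to a short decision function; the version below is an illustrative re-implementation, not py-dcqc's internal code.

def suite_status(required: dict[str, bool], optional: dict[str, bool],
                 errored: bool = False) -> str:
    # required/optional map test names to pass (True) or fail (False).
    if errored:
        return "GREY"   # an error occurred during testing
    if not all(required.values()):
        return "RED"    # one or more required tests failed
    if not all(optional.values()):
        return "AMBER"  # required tests passed, but an optional test failed
    return "GREEN"      # all tests passed

print(suite_status({"ChecksumTest": True}, {"FormatTest": False}))  # AMBER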

Reports

Reports provide structured output of test results in various formats:

  • JSON reports for machine readability
  • CSV updates for batch processing
  • Detailed test status and error messages
  • Aggregated results across multiple suites
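
As a concrete picture of the CSV update, the sketch below appends a dcqc_status column (the column name comes from the update_csv command described later) to each input row; the file names and the per-row status list are hypothetical.

import csv

# Illustrative only: append a dcqc_status column to an input CSV.
# File names and the status list are made up for the example.
statuses = ["GREEN", "RED", "AMBER"]  # e.g. one suite status per input row

with open("input_targets.csv", newline="") as src, \
        open("updated_targets.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + ["dcqc_status"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for row, status in zip(reader, statuses):
        row["dcqc_status"] = status
        writer.writerow(row)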

Installation

You can install py-dcqc directly from PyPI:

pip install dcqc

For development installation from source:

git clone https://github.com/Sage-Bionetworks-Workflows/py-dcqc.git
cd py-dcqc
pip install -e .

Docker

You can also use the official Docker container:

docker pull ghcr.io/sage-bionetworks-workflows/py-dcqc:latest

To run commands using the Docker container:

docker run ghcr.io/sage-bionetworks-workflows/py-dcqc:latest dcqc --help

For processing local files, remember to mount your data directory:

docker run -v /path/to/your/data:/data ghcr.io/sage-bionetworks-workflows/py-dcqc:latest dcqc qc_file --input-file /data/myfile.csv --file-type csv

Command Line Interface

To see all available commands and their options:

dcqc --help

Main commands include:

  • create_targets: Create target JSON files from a targets CSV file
  • create_tests: Create test JSON files from a target JSON file
  • create_process: Create external process JSON file from a test JSON file
  • compute_test: Compute the test status from a test JSON file
  • create_suite: Create a suite from a set of test JSON files sharing the same target
  • combine_suites: Combine several suite JSON files into a single JSON report
  • list_tests: List the tests available for each file type
  • qc_file: Run QC tests on a single file (external tests are skipped)
  • update_csv: Update input CSV file with dcqc_status column

For detailed help on any command:

dcqc <command> --help

Example Usage

Basic File QC

Run QC on a single file:

dcqc qc-file --input-file data.csv --file-type csv --metadata '{"author": "John Doe"}'

Creating and Running Test Suites

  1. Create targets from a CSV file:
dcqc create-targets input_targets.csv output_dir/
  2. Create tests for a target:
dcqc create-tests target.json tests_dir/ --required-tests "ChecksumTest" "FormatTest"
  3. Run tests and create a suite:
dcqc create-suite --output-json results.json test1.json test2.json test3.json

Listing Available Tests

To see all available tests for different file types:

dcqc list-tests

Integration with nf-dcqc

Early versions of this package were developed to be used by its sibling, the nf-dcqc Nextflow workflow. The initial command-line interface was developed with nf-dcqc in mind, favoring smaller steps to enable parallelism in Nextflow.

PyScaffold

This project has been set up using PyScaffold 4.3. For details and usage information on PyScaffold see https://pyscaffold.org/.

putup --name dcqc --markdown --github-actions --pre-commit --license Apache-2.0 py-dcqc
