py-dcqc
Python package for performing quality control (QC) for data coordination (DC)
Purpose
This Python package provides a framework for performing quality control (QC) on data files. Quality control can range from low-level integrity checks (e.g. MD5 checksum, file extension) to high-level checks such as conformance to a format specification and consistency with associated metadata.
The tool is designed to be flexible and extensible, allowing for:
- File integrity validation
- Format specification conformance
- Metadata consistency checks
- Custom test suite creation
- Integration with external QC tools
- Batch processing of multiple files
- Comprehensive reporting in JSON format
Core Concepts
Files and FileTypes
A File represents a local or remote file along with its metadata. Each file has an associated FileType that bundles information about:
- Valid file extensions
- EDAM format ontology identifiers
- File type-specific validation rules
Built-in file types include TXT, JSON, JSON-LD, TIFF, OME-TIFF, TSV, CSV, BAM, FASTQ, and HDF5.
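As a rough illustration, a file and its metadata might be constructed as follows. The module path, constructor signature, and metadata keys are assumptions inferred from the package layout and the nf-dcqc input format, not a documented API:

```python
from dcqc.file import File  # module path is an assumption

# Minimal sketch: wrap a local or remote file together with its metadata.
# The metadata keys (file_type, md5_checksum) are illustrative.
file = File(
    url="data/sample.csv",
    metadata={"file_type": "csv", "md5_checksum": "d41d8cd98f00b204e9800998ecf8427e"},
)
```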
Targets
A Target represents one or more files that should be validated together. There are two types of targets:
- SingleTarget: For validating individual files
- PairedTarget: For validating exactly two related files together (e.g., paired-end sequencing data)
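In code, the two target types might look like the sketch below; the class locations and constructor shapes are assumptions inferred from the names above:

```python
from dcqc.file import File
from dcqc.target import SingleTarget, PairedTarget  # module path assumed

r1 = File("reads_R1.fastq.gz", {"file_type": "fastq"})
r2 = File("reads_R2.fastq.gz", {"file_type": "fastq"})

single = SingleTarget(r1)        # validate one file on its own
paired = PairedTarget([r1, r2])  # validate paired-end reads together
```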
Tests
Tests are individual validation checks that can be run on targets. There are two types of tests:
- Internal Tests: Tests written and executed in Python
  - File extension validation
  - Metadata consistency checks
  - Format validation
- External Tests: Tests that utilize external tools or processes
  - File integrity checks (e.g., MD5 checksums)
  - Format-specific validation tools
  - Custom validation scripts
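For instance, an internal test could be instantiated and evaluated roughly as follows; the test class name is a placeholder (run `dcqc list-tests` for the real names) and the status accessor is assumed:

```python
from dcqc.file import File
from dcqc.target import SingleTarget        # module path assumed
from dcqc.tests import FileExtensionTest    # placeholder; see `dcqc list-tests`

target = SingleTarget(File("data.csv", {"file_type": "csv"}))
test = FileExtensionTest(target)
print(test.get_status())  # assumed accessor; e.g., PASS or FAIL
```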
Tests are further organized into tiers:
- Tier 1 - File Integrity: Checking that the file is whole and "available". These tests verify basic file integrity and usually require additional information (e.g., an expected checksum):
  - MD5 checksum verification
  - Expected file extension checks
  - Format-specific checks (e.g., first/last bytes)
  - Decompression checks, if applicable
- Tier 2 - Internal Conformance: Checking that the file is internally consistent and compliant with its stated format. These tests only need the files themselves and their format specification:
  - File format validation using available tools
  - Internal metadata validation against a schema (e.g., OME-XML)
  - Additional checks on internal metadata
- Tier 3 - External Conformance: Checking that file features are consistent with separately submitted metadata. These tests use additional information but remain objective/quantitative:
  - Channel count consistency
  - File/image size consistency
  - Antibody nomenclature conformance
  - Secondary file presence (e.g., a CRAI index for a CRAM file)
- Tier 4 - Subjective Conformance: Checking files against qualitative criteria that may need expert review. These tests often involve metrics, heuristics, or sophisticated models:
  - Sample swap detection
  - PHI detection in images and metadata
  - Outlier detection using metrics (e.g., file size)
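In practice, the tier numbering drives which tests a suite requires by default (see Suites below). The following self-contained sketch, using illustrative test names, shows that rule:

```python
from dataclasses import dataclass

@dataclass
class TestInfo:
    name: str
    tier: int  # 1-4, as described above

# Illustrative names only; the real tests are listed by `dcqc list-tests`.
tests = [
    TestInfo("Md5Checksum", 1),
    TestInfo("FileExtension", 1),
    TestInfo("JsonLoad", 2),
    TestInfo("FileSizeOutlier", 4),
]

# Tier 1 and Tier 2 tests are required by default (see Suites below).
required_by_default = [t.name for t in tests if t.tier <= 2]
print(required_by_default)  # ['Md5Checksum', 'FileExtension', 'JsonLoad']
```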
Suites
A Suite is a collection of tests that are specific to a particular file type (e.g., FASTQ, BAM, CSV). Each file type has its own suite of tests that are appropriate for that format. Suites:
- Group tests together based on the target file type
- Can specify required vs. optional tests:
  - By default, Tier 1 (File Integrity) and Tier 2 (Internal Conformance) tests are required
  - Users can explicitly specify which tests are required by name
- Allow tests to be skipped if specified in the suite
- Provide an overall validation status:
  - GREEN: All tests passed
  - RED: One or more required tests failed
  - AMBER: All required tests passed, but one or more optional tests failed
  - GREY: An error occurred during testing
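The four statuses follow directly from these rules. Here is a conceptual sketch of the decision logic (a restatement of the semantics above, not dcqc's actual implementation):

```python
def suite_status(results: dict[str, str], required: set[str]) -> str:
    """Sketch only: results maps test name -> 'PASS', 'FAIL', or 'ERROR'."""
    if any(s == "ERROR" for s in results.values()):
        return "GREY"   # an error occurred during testing
    if any(results[name] == "FAIL" for name in required):
        return "RED"    # a required test failed
    if any(s == "FAIL" for s in results.values()):
        return "AMBER"  # only optional tests failed
    return "GREEN"      # all tests passed

# A failing optional test downgrades GREEN to AMBER (test names illustrative):
print(suite_status({"Md5Checksum": "PASS", "GrepDate": "FAIL"},
                   required={"Md5Checksum"}))  # -> AMBER
```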
Reports
Reports provide structured output of test results in various formats:
- JSON reports for machine readability
- CSV updates for batch processing
- Detailed test status and error messages
- Aggregated results across multiple suites
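Because the reports are plain JSON, they can be explored with nothing more than the standard library; the snippet below simply pretty-prints a report rather than assuming a particular schema:

```python
import json

# Load a report produced by, e.g., create-suite or combine-suites,
# and pretty-print it to inspect its structure.
with open("results.json") as fp:
    report = json.load(fp)
print(json.dumps(report, indent=2))
```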
Installation
You can install py-dcqc directly from PyPI:
pip install dcqc
For development installation from source:
git clone https://github.com/Sage-Bionetworks-Workflows/py-dcqc.git
cd py-dcqc
pip install -e .
Docker
You can also use the official Docker container:
docker pull ghcr.io/sage-bionetworks-workflows/py-dcqc:latest
To run commands using the Docker container:
docker run ghcr.io/sage-bionetworks-workflows/py-dcqc:latest dcqc --help
For processing local files, remember to mount your data directory:
docker run -v /path/to/your/data:/data ghcr.io/sage-bionetworks-workflows/py-dcqc:latest dcqc qc-file --input-file /data/myfile.csv --file-type csv
Command Line Interface
To see all available commands and their options:
dcqc --help
Main commands include:
- create-targets: Create target JSON files from a targets CSV file
- create-tests: Create test JSON files from a target JSON file
- create-process: Create an external process JSON file from a test JSON file
- compute-test: Compute the test status from a test JSON file
- create-suite: Create a suite from a set of test JSON files sharing the same target
- combine-suites: Combine several suite JSON files into a single JSON report
- list-tests: List the tests available for each file type
- qc-file: Run QC tests on a single file (external tests are skipped)
- update-csv: Update the input CSV file with a dcqc_status column
For detailed help on any command:
dcqc <command> --help
Example Usage
Basic File QC
Run QC on a single file:
dcqc qc-file --input-file data.csv --file-type csv --metadata '{"author": "John Doe"}'
Creating and Running Test Suites
1. Create targets from a CSV file:
   dcqc create-targets input_targets.csv output_dir/
2. Create tests for a target:
   dcqc create-tests target.json tests_dir/ --required-tests "ChecksumTest" "FormatTest"
3. Run tests and create a suite:
   dcqc create-suite --output-json results.json test1.json test2.json test3.json
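4. Optionally, merge several suites into a single report (the --output-json flag here is an assumption by analogy with create-suite):
   dcqc combine-suites --output-json report.json suite1.json suite2.json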
Listing Available Tests
To see all available tests for different file types:
dcqc list-tests
Integration with nf-dcqc
Early versions of this package were developed to be used by its sibling, the nf-dcqc Nextflow workflow. The initial command-line interface was developed with nf-dcqc in mind, favoring smaller steps to enable parallelism in Nextflow.
PyScaffold
This project has been set up using PyScaffold 4.3. For details and usage information on PyScaffold see https://pyscaffold.org/.
putup --name dcqc --markdown --github-actions --pre-commit --license Apache-2.0 py-dcqc