
A timeseries data quality control and processing tool/framework

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

SaQC: System for automated Quality Control

SaQC is a tool/framework/application for the quality control of time series data. It provides a growing collection of algorithms and methods to analyze, annotate and process time series data. It supports the end-to-end enrichment of metadata and provides various user interfaces: 1) a Python API, 2) a command line interface with a text-based configuration system, and a web-based user interface.

SaQC is designed with a particular focus on the needs of active data professionals, including sensor hardware-oriented engineers, domain experts, and data scientists, all of whom can benefit from its capabilities to improve the quality standards of given data products.

For a (continuously improving) overview of features, typical usage patterns, the specific system components and how to customize SaQC to your own needs, please refer to our online documentation.

Installation

SaQC is available on the Python Package Index (PyPI) and can be installed using pip:

python -m pip install saqc

Additionally, SaQC is available via conda and can be installed with:

conda create -c conda-forge -n saqc saqc

For more details, see the installation guide.

Usage

SaQC is both a command line application controlled by a text-based configuration and a Python module with a simple API.

SaQC as a command line application

The command line application is controlled by a semicolon-separated text file listing the variables in the dataset and the routines to inspect, quality control and/or process them. The content of such a configuration could look like this:

varname    ; test
#----------; ---------------------------------------------------------------------
SM2        ; align(freq="15Min")
'SM(1|2)+' ; flagMissing()
SM1        ; flagRange(min=10, max=60)
SM2        ; flagRange(min=10, max=40)
SM2        ; flagZScore(window="30d", thresh=3.5, method='modified', center=False)
Dummy      ; flagGeneric(field=["SM1", "SM2"], func=(isflagged(x) | isflagged(y)))
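To make the layout of such a file concrete, here is a minimal sketch of how it could be read into (variable, test) pairs using only the Python standard library. This is not SaQC's own parser: the function name `parse_config` is hypothetical, and the handling of the header row, of `#` comment lines and of the quotes around regular-expression variable names is an assumption based on the example above.

```python
from typing import List, Tuple


def parse_config(text: str) -> List[Tuple[str, str]]:
    """Return (varname, test) pairs from a semicolon-separated configuration.

    Illustrative only; SaQC ships its own configuration reader.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' carry no tests.
        if not line or line.startswith("#"):
            continue
        # Split on the first semicolon only, so the test expression stays intact.
        varname, _, test = line.partition(";")
        varname, test = varname.strip(), test.strip()
        # Skip the header row shown in the example configuration.
        if (varname, test) == ("varname", "test"):
            continue
        # Assumption: quotes only mark a regular expression and can be dropped.
        entries.append((varname.strip("'\""), test))
    return entries


CONFIG = """\
varname    ; test
#----------; -----------------
SM2        ; align(freq="15Min")
'SM(1|2)+' ; flagMissing()
SM1        ; flagRange(min=10, max=60)
"""

entries = parse_config(CONFIG)
for varname, test in entries:
    print(f"{varname:10s} -> {test}")
```

Each remaining row pairs one variable name (or pattern) with exactly one routine call, which is why the same variable may appear on several lines.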

Once the two basic inputs, a dataset and a configuration file, are prepared, run SaQC:

saqc \
    --config PATH_TO_CONFIGURATION \
    --data PATH_TO_DATA \
    --outfile PATH_TO_OUTPUT

A full SaQC run against provided example data can be invoked with:

saqc \
    --config https://git.ufz.de/rdm-software/saqc/raw/develop/docs/resources/data/config.csv \
    --data https://git.ufz.de/rdm-software/saqc/raw/develop/docs/resources/data/data.csv \
    --outfile saqc_test.csv
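When the command line application is driven from a script, the same call can be assembled programmatically. The sketch below only builds and prints the argument list; the helper name `build_saqc_command` is hypothetical, and only the `--config`, `--data` and `--outfile` options shown above are used.

```python
import shlex
from typing import List


def build_saqc_command(config: str, data: str, outfile: str) -> List[str]:
    """Assemble the saqc call shown above as an argument list."""
    return ["saqc", "--config", config, "--data", data, "--outfile", outfile]


cmd = build_saqc_command("config.csv", "data.csv", "saqc_test.csv")

# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))

# To actually run it (requires saqc to be installed and the files to exist):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with paths that contain spaces.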

SaQC as a Python module

The following snippet implements the same configuration given above through the Python API:

import pandas as pd
from saqc import SaQC

data = pd.read_csv(
    "https://git.ufz.de/rdm-software/saqc/raw/develop/docs/resources/data/data.csv",
    index_col=0, parse_dates=True,
)

qc = SaQC(data=data)
qc = (qc
      .align("SM2", freq="15Min")
      .flagMissing("SM(1|2)+", regex=True)
      .flagRange("SM1", min=10, max=60)
      .flagRange("SM2", min=10, max=40)
      .flagZScore("SM2", window="30d", thresh=3.5, method='modified', center=False)
      .flagGeneric(field=["SM1", "SM2"], target="Dummy", func=lambda x, y: (isflagged(x) | isflagged(y))))

A more detailed description of the Python API is available in the respective section of the documentation.
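For readers new to the idea of flagging, the logic behind the missing-value, range and generic steps can be illustrated without SaQC at all. The following is a conceptual sketch in plain Python, not SaQC's implementation: the helpers `flag_missing` and `flag_range`, the boolean flag representation and the sample values are all assumptions made for illustration, and SaQC stores and combines flags in its own way.

```python
from typing import List, Optional

Series = List[Optional[float]]


def flag_missing(values: Series) -> List[bool]:
    """Flag positions that hold no value."""
    return [v is None for v in values]


def flag_range(values: Series, lower: float, upper: float) -> List[bool]:
    """Flag present values that fall outside the closed interval [lower, upper]."""
    return [v is not None and not (lower <= v <= upper) for v in values]


def combine(a: List[bool], b: List[bool]) -> List[bool]:
    """Element-wise OR of two flag lists."""
    return [x or y for x, y in zip(a, b)]


# Made-up sample values for two variables named like those in the example.
sm1: Series = [12.0, 75.0, None, 30.0]
sm2: Series = [35.0, 20.0, 45.0, None]

sm1_flags = combine(flag_missing(sm1), flag_range(sm1, 10, 60))
sm2_flags = combine(flag_missing(sm2), flag_range(sm2, 10, 40))

# Analogous to the "Dummy" target: flagged wherever SM1 or SM2 is flagged.
dummy_flags = combine(sm1_flags, sm2_flags)

print(sm1_flags)
print(sm2_flags)
print(dummy_flags)
```

The point of the sketch is the composition: each test contributes flags for one variable, and a generic expression can derive a new flag series from the flags of several others.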

Get involved

Contributing

Found a bug or want to suggest a new feature? Please refer to our contributing guidelines to see how you can contribute to SaQC.

User support

If you need help or have questions, send an email to saqc-support@ufz.de.

Copyright and License

Copyright (c) 2021, Helmholtz-Zentrum für Umweltforschung GmbH - UFZ. All rights reserved.

For full details, see LICENSE.

Publications

Lennart Schmidt, David Schäfer, Juliane Geller, Peter Lünenschloss, Bert Palm, Karsten Rinke, Corinna Rebmann, Michael Rode, Jan Bumberger, System for automated Quality Control (SaQC) to enable traceable and reproducible data streams in environmental science, Environmental Modelling & Software, 2023, 105809, ISSN 1364-8152, https://doi.org/10.1016/j.envsoft.2023.105809. (https://www.sciencedirect.com/science/article/pii/S1364815223001950)

How to cite SaQC

If SaQC is advancing your research, please cite as:

Schäfer, David, Palm, Bert, Lünenschloß, Peter, Schmidt, Lennart, & Bumberger, Jan. (2023). System for automated Quality Control - SaQC (2.3.0). Zenodo. https://doi.org/10.5281/zenodo.5888547

or

Lennart Schmidt, David Schäfer, Juliane Geller, Peter Lünenschloss, Bert Palm, Karsten Rinke, Corinna Rebmann, Michael Rode, Jan Bumberger, System for automated Quality Control (SaQC) to enable traceable and reproducible data streams in environmental science, Environmental Modelling & Software, 2023, 105809, ISSN 1364-8152, https://doi.org/10.1016/j.envsoft.2023.105809. (https://www.sciencedirect.com/science/article/pii/S1364815223001950)

