A timeseries data quality control and processing tool/framework
System for automated Quality Control (SaQC)
Anomalies and errors are the rule, not the exception, when working with time series data. This is especially true if such data originate from in-situ measurements of environmental properties. Almost all applications, however, implicitly rely on data that comply with some definition of 'correct'. To derive reliable data products and tools, there is no alternative to quality control. SaQC provides all the building blocks to comfortably bridge the gap between 'usually faulty' and 'expected to be corrected' in an accessible, consistent, objective and reproducible way.
For a (continuously improving) overview of features, typical usage patterns, the specific system components and how to customize SaQC to your needs, please refer to our online documentation.
Installation
SaQC is available on the Python Package Index (PyPI) and can be installed using pip:
python -m pip install saqc
For more detailed instructions, see the installation guide.
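To confirm that the installation succeeded, the installed version can be read from the package metadata. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is an illustrative assumption and not part of SaQC.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# "saqc" is the distribution name used in the pip command above.
found = installed_version("saqc")
print(f"saqc version: {found}" if found else "saqc is not installed")
```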
Usage
SaQC is both a command line application controlled by a text-based configuration and a Python module with a simple API.
SaQC as a command line application
The command line application is controlled by a semicolon-separated text file listing the variables in the dataset and the routines to inspect, quality control and/or process them. The content of such a configuration could look like this:
varname ; test
#----------; ---------------------------------------------------------------------
SM2 ; shift(freq="15Min")
'SM(1|2)+' ; flagMissing()
SM1 ; flagRange(min=10, max=60)
SM2 ; flagRange(min=10, max=40)
SM2 ; flagMAD(window="30d", z=3.5)
Dummy ; flagGeneric(field=["SM1", "SM2"], func=(isflagged(x) | isflagged(y)))
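To make the layout of such a file concrete, the sketch below splits a shortened copy of the configuration above into (variable, routine) pairs. It is not SaQC's own parser: the helper `parse_config` is hypothetical, and it assumes that the first non-comment row is the header and that lines starting with `#` are separators or comments.

```python
from typing import List, Tuple

# Shortened copy of the configuration shown above.
CONFIG_TEXT = """\
varname    ; test
#----------; ------------------------------
SM2        ; shift(freq="15Min")
'SM(1|2)+' ; flagMissing()
SM1        ; flagRange(min=10, max=60)
"""


def parse_config(text: str) -> List[Tuple[str, str]]:
    """Split a semicolon-separated configuration into (varname, test) pairs."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank and '#' lines carry no instructions
        varname, _, test = line.partition(";")  # split at the first semicolon only
        rows.append((varname.strip(), test.strip()))
    return rows[1:]  # assumption: the first remaining row is the header


for varname, test in parse_config(CONFIG_TEXT):
    print(f"{varname} -> {test}")
```

Each pair names a variable (or a quoted regular expression matching several variables) and the routine applied to it, in the order the rows appear.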
As soon as the basic inputs, a dataset and a configuration file, are prepared, run SaQC:
saqc \
--config PATH_TO_CONFIGURATION \
--data PATH_TO_DATA \
--outfile PATH_TO_OUTPUT
A full SaQC run against the provided example data can be invoked with:
saqc \
--config https://git.ufz.de/rdm-software/saqc/raw/develop/docs/resources/data/config.csv \
--data https://git.ufz.de/rdm-software/saqc/raw/develop/docs/resources/data/data.csv \
--outfile saqc_test.csv
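When the command line application has to be called from a script, the same invocation can be assembled as an argument list. This is a minimal sketch that uses only the three options shown above; the helper `build_saqc_command` is a hypothetical name, and the sketch only prints the command rather than executing it.

```python
import shlex
from typing import List


def build_saqc_command(config: str, data: str, outfile: str) -> List[str]:
    """Assemble the argument list for the saqc command line call shown above."""
    return ["saqc", "--config", config, "--data", data, "--outfile", outfile]


command = build_saqc_command(
    config="PATH_TO_CONFIGURATION",
    data="PATH_TO_DATA",
    outfile="PATH_TO_OUTPUT",
)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(command))
```

To actually run it, the list can be passed to `subprocess.run(command, check=True)`, which raises an error if the call exits with a non-zero status.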
SaQC as a Python module
The following snippet implements the same configuration as above through the Python API:
import pandas as pd
from saqc import SaQC

# Load the example dataset with a datetime index
data = pd.read_csv(
    "https://git.ufz.de/rdm-software/saqc/raw/develop/docs/resources/data/data.csv",
    index_col=0, parse_dates=True,
)

saqc = SaQC(data=data)

# Chain the same routines as in the configuration file
saqc = (saqc
        .shift("SM2", freq="15Min")
        .flagMissing("SM(1|2)+", regex=True)
        .flagRange("SM1", min=10, max=60)
        .flagRange("SM2", min=10, max=40)
        .flagMAD("SM2", window="30d", z=3.5)
        .flagGeneric(field=["SM1", "SM2"], target="Dummy", func=lambda x, y: (isflagged(x) | isflagged(y))))
A more detailed description of the Python API is available in the respective section of the documentation.
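For readers unfamiliar with the kind of checks used above, the following sketch illustrates the general idea behind a range test and a MAD-based outlier test in plain Python. It is not SaQC's implementation: the functions `flag_range` and `flag_mad` are hypothetical, the inclusive bounds and the modified z-score formula (with the conventional constant 0.6745) are assumptions, and the rolling window used by `flagMAD` is ignored for brevity.

```python
from statistics import median
from typing import List, Sequence


def flag_range(values: Sequence[float], lower: float, upper: float) -> List[bool]:
    """Flag every value outside the closed interval [lower, upper]."""
    return [not (lower <= v <= upper) for v in values]


def flag_mad(values: Sequence[float], z: float = 3.5) -> List[bool]:
    """Flag values whose modified z-score exceeds z (no windowing)."""
    center = median(values)
    mad = median(abs(v - center) for v in values)  # median absolute deviation
    if mad == 0:
        return [False] * len(values)  # no spread, so nothing can be scored
    return [0.6745 * abs(v - center) / mad > z for v in values]


# Made-up readings with one obvious spike
readings = [21.0, 22.5, 20.8, 21.7, 95.0, 22.1]
print(flag_range(readings, lower=10, upper=60))
print(flag_mad(readings, z=3.5))
```

Both checks flag only the spike at 95.0 in this made-up series; on real data the two tests typically complement each other, since a value can lie inside the plausible range and still deviate strongly from its neighbours.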
Changelog
All notable changes to this project will be documented in CHANGELOG.md.
Get involved
Contributing
Found a bug or want to suggest a feature? Please refer to our contributing guidelines to see how you can contribute to SaQC.
User support
If you need help or have a question, you can use the SaQC user support mailing list: saqc-support@ufz.de
Copyright and License
Copyright(c) 2021, Helmholtz-Zentrum für Umweltforschung GmbH -- UFZ. All rights reserved.
- Documentation: Creative Commons Attribution 4.0 International
- Source code: GNU General Public License 3
For full details, see LICENSE.
Acknowledgements
...
Publications
coming soon...
How to cite SaQC
If SaQC is advancing your research, please cite as:
Schäfer, David; Palm, Bert; Lünenschloß, Peter. (2021). System for automated Quality Control - SaQC. Zenodo. https://doi.org/10.5281/zenodo.5888547