Fast and efficient data quality assessment for IoT time-series data.

IoT-DQA

The IoT-DQA library is a Python package for Data Quality Assessment (DQA) of IoT time-series data. It provides tools for validating and scoring IoT data streams so that downstream applications can rely on the data they consume.


Documentation: https://jeafreezy.github.io/iot-dqa/

Source Code: https://github.com/jeafreezy/iot-dqa


Key Features

  • Optimized Performance: Handles large-scale IoT datasets efficiently, powered by the high-performance Polars library.
  • Streamlined Validation: Simplifies the process of validating and analyzing IoT data streams.
  • Custom Metrics: Tailor metrics to meet specific requirements.
  • Comprehensive Scoring: Generates detailed data quality scores across multiple dimensions.
  • Seamless Integration: Export results in formats like CSV and GeoJSON for easy integration with other tools.
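
To make the GeoJSON export concrete, here is a minimal sketch of how a per-device score could be represented as a GeoJSON point feature. The property names (`device_id`, `dq_score`), the helper `to_feature`, and the sample values are assumptions for illustration only; the files written by the library may be structured differently. Note that GeoJSON stores coordinates as longitude first, then latitude.

```python
import json

def to_feature(device_id, lat, lon, score):
    """Build one GeoJSON Feature for a device (hypothetical property names)."""
    return {
        "type": "Feature",
        # GeoJSON positions are ordered [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"device_id": device_id, "dq_score": score},
    }

# Illustrative values only, not output from the library.
collection = {
    "type": "FeatureCollection",
    "features": [to_feature("meter-001", 6.5244, 3.3792, 0.86)],
}

geojson_text = json.dumps(collection, indent=2)
```

A file in this shape can be opened directly in most GIS tools, which is the main reason to prefer GeoJSON over CSV when devices carry coordinates.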

Dimensions of Data Quality

  • Validity: Verifies data adherence to expected formats and ranges.
  • Accuracy: Identifies and quantifies outliers using Isolation Forest, IQR, and MAD detectors, optionally combined as an ensemble.
  • Completeness: Evaluates the presence of missing or null values.
  • Timeliness: Measures data arrival punctuality based on timestamps.
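
As a rough illustration of two of these dimensions, the sketch below computes a completeness ratio (the share of readings that are not null) and a timeliness ratio (the share of inter-arrival gaps that do not exceed the expected interval). Taking the smallest observed gap as the expected interval is only an assumed reading of the `iat_method: "min"` setting. The function names and formulas are illustrative, not the library's implementation.

```python
from datetime import datetime

def completeness(values):
    """Share of readings that are not None (assumed definition)."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)

def timeliness(timestamps):
    """Share of inter-arrival gaps no longer than the smallest observed gap."""
    gaps = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
    if not gaps:
        return 1.0
    expected = min(gaps)  # assumption: "min" means the smallest gap is the expected one
    return sum(g <= expected for g in gaps) / len(gaps)

# Illustrative data: one missing value and one late arrival.
readings = [10.0, 12.5, None, 15.0]
stamps = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 1, 0),
    datetime(2024, 1, 1, 2, 0),
    datetime(2024, 1, 1, 4, 0),  # arrives one hour late
]

completeness_score = completeness(readings)  # 3 of 4 readings present
timeliness_score = timeliness(stamps)        # 2 of 3 gaps on time
```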

Note:

  • Designed for cumulative time-series data (e.g., utility consumption).
  • Sample data is available in tests/test_data.csv.
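
"Cumulative" here means that each reading is a running total, such as a meter counter, rather than the amount used in one interval. The sketch below shows what that implies in practice: per-interval consumption is the difference between consecutive readings, and a reading lower than its predecessor is suspect. The helper names are hypothetical, and whether the library applies exactly this check is an assumption.

```python
def interval_deltas(cumulative):
    """Differences between consecutive cumulative readings."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]

def decreasing_positions(cumulative):
    """Indices of readings that are lower than the reading before them."""
    return [i + 1 for i, d in enumerate(interval_deltas(cumulative)) if d < 0]

# Illustrative meter readings; the fourth value drops below the third.
meter = [100, 104, 109, 107, 115]

deltas = interval_deltas(meter)        # [4, 5, -2, 8]
suspect = decreasing_positions(meter)  # [3]
```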

Installation

pip install iot-dqa

Quick Start

Example: Calculate Data Quality Score for IoT time-series data

from iot_dqa import DataQualityScore, Dimension, OutlierDetectionAlgorithm, CompletenessStrategy

# Initialize and compute the Data Quality Score
dq_score = DataQualityScore(
    "./data/sample_data.csv",
    multiple_devices=True,
    dimensions=[
        Dimension.VALIDITY.value,
        Dimension.ACCURACY.value,
        Dimension.COMPLETENESS.value,
        Dimension.TIMELINESS.value,
    ],
    col_mapping={
        "latitude": "LAT",
        "longitude": "LONG",
        "date": "DATE",
        "value": "VALUE",
        "id": "DEVICE_ID",
    },
    metrics_config={
        "timeliness": {"iat_method": "min"},
        "accuracy": {
            "ensemble": True,
            "strategy": "validity",
            "algorithms": [
                OutlierDetectionAlgorithm.IF.value,
                OutlierDetectionAlgorithm.IQR.value,
                OutlierDetectionAlgorithm.MAD.value,
            ],
        },
        "completeness_strategy": CompletenessStrategy.ONLY_NULLS.value,
    },
).compute_score(
    weighting_mechanism="ahp",
    output_format="geojson",
    output_path="./output",
    ahp_weights={
        Dimension.VALIDITY.value: 0.3,
        Dimension.ACCURACY.value: 0.3,
        Dimension.COMPLETENESS.value: 0.3,
        Dimension.TIMELINESS.value: 0.1,
    },
)

print("Data Quality Score computed successfully!")
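
The weights passed as `ahp_weights` above sum to 1.0. One plausible way such weights combine per-dimension scores is a weighted sum, sketched below. This is an assumption about the aggregation step, not the library's confirmed method. The string keys and the per-dimension scores are made up for illustration. AHP normally derives weights from pairwise comparisons; here they are supplied directly, as in the example above.

```python
def weighted_score(scores, weights):
    """Combine per-dimension scores in [0, 1] using weights that sum to 1."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total}")
    return sum(weights[name] * scores[name] for name in weights)

# Same weights as the Quick Start example; the keys are hypothetical labels.
weights = {"validity": 0.3, "accuracy": 0.3, "completeness": 0.3, "timeliness": 0.1}

# Illustrative per-dimension scores, not real output.
scores = {"validity": 1.0, "accuracy": 0.9, "completeness": 0.8, "timeliness": 0.5}

overall = weighted_score(scores, weights)  # 0.30 + 0.27 + 0.24 + 0.05
```

With these numbers the low timeliness score has little effect on the total because its weight is only 0.1.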

Configuration Overview

Configuration     Attribute                 Default Value   Description
----------------  ------------------------  --------------  -------------------------------------------------------
Isolation Forest  n_estimators              100             Number of trees in the forest.
                  max_samples               0.8             Proportion of samples for training each base estimator.
                  contamination             0.1             Proportion of outliers in the dataset.
                  max_features              1               Number of features for training each base estimator.
                  random_state              42              Random seed for reproducibility.
Accuracy          ensemble                  True            Use ensemble methods for accuracy.
                  mad_threshold             3               Threshold for Median Absolute Deviation (MAD).
                  optimize_iqr_with_optuna  True            Enable IQR optimization using Optuna.
                  iqr_optuna_q1_max         0.5             Maximum value for Q1 in IQR optimization.
                  iqr_optuna_q3_min         0.5             Minimum value for Q3 in IQR optimization.
                  iqr_optuna_q3_max         1               Maximum value for Q3 in IQR optimization.
                  algorithms                All algorithms  List of outlier detection algorithms.
                  strategy                  NONE            Strategy for accuracy computation.
Timeliness        iat_method                min             Method to calculate inter-arrival time.
Completeness      completeness_strategy     ONLY_NULLS      Strategy for handling completeness.
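
To show what settings such as `mad_threshold` and `ensemble` control, the sketch below implements a simple MAD detector, a simple IQR detector, and a majority vote over both. It is a stdlib-only illustration under stated assumptions: the function names are hypothetical, the MAD is unscaled, the IQR fence multiplier of 1.5 is the conventional default rather than an Optuna-tuned value, and Isolation Forest is left out because it needs an external library. The library's own detectors may differ.

```python
import statistics

def mad_outliers(values, threshold=3.0):
    """Flag values whose distance from the median exceeds threshold * MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)  # no spread, nothing to flag
    return [abs(v - med) / mad > threshold for v in values]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in values]

def ensemble_outliers(values, detectors):
    """Flag a value when more than half of the detectors flag it."""
    votes = [detector(values) for detector in detectors]
    return [sum(flags) * 2 > len(detectors) for flags in zip(*votes)]

# Illustrative readings with one obvious spike at the end.
data = [10, 11, 12, 11, 10, 12, 11, 95]
flagged = ensemble_outliers(data, [mad_outliers, iqr_outliers])
```

With two detectors, the majority rule means a point is flagged only when both agree, which trades some sensitivity for fewer false positives.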

For more details on configuration, refer to the documentation.

Documentation

Visit the documentation at https://jeafreezy.github.io/iot-dqa/ for comprehensive details.

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License. See the LICENSE file for more information.
