An Apache Airflow provider for whylogs

whylogs Airflow Operator

This is a package for the whylogs provider, the open source standard for data and ML logging. With whylogs, users are able to generate summaries of their datasets (called whylogs profiles) which they can use to:

  • Track changes in their dataset
  • Create data constraints to know whether their data looks the way it should
  • Quickly visualize key summary statistics about their datasets

This Airflow operator focuses on simplifying the use of whylogs together with Airflow. Users are encouraged to build on their existing data profiles, created with whylogs, which bring value and visibility for tracking data changes over time.

Installation

You can install this package on top of an existing Airflow 2.0+ installation (Requirements) by simply running:

$ pip install airflow-provider-whylogs

To install this provider from source, run the following commands instead:

$ git clone git@github.com:whylabs/airflow-provider-whylogs.git
$ cd airflow-provider-whylogs
$ python3 -m venv .env && source .env/bin/activate
$ pip3 install -e .

Usage example

In order to benefit from the existing operators, users must profile their data first, using their processing environment of choice. To create and store a profile locally, run the following code on a pandas DataFrame:

import pandas as pd
import whylogs as why

df = pd.read_csv("some_file.csv")
results = why.log(df)
results.writer("local").write()
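For intuition, a whylogs profile captures per-column summary statistics rather than the raw rows. The sketch below shows the kind of statistics involved (count, mean, min, max) using only the Python standard library; it is a conceptual stand-in, not the whylogs implementation:

```python
from statistics import mean

def summarize(values):
    """Toy per-column summary, loosely mirroring what a profile tracks."""
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }

column = [3.2, 1.5, 4.8, 2.0]
print(summarize(column))
```

Because only these summaries are stored, profiles stay small and can be written and compared later without re-reading the original dataset.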

And after that, you can use our operators to either:

  • Create a Summary Drift Report, to visually help you identify whether there was drift in your data:

from whylogs_provider.operators.whylogs import WhylogsSummaryDriftOperator

summary_drift = WhylogsSummaryDriftOperator(
        task_id="drift_report",
        target_profile_path="data/profile.bin",
        reference_profile_path="data/profile.bin",
        reader="local",
        write_report_path="data/Profile.html",
    )
  • Run a Constraints check, to verify that your profiled data meets some criteria:

from whylogs_provider.operators.whylogs import WhylogsConstraintsOperator
from whylogs.core.constraints.factories import greater_than_number

constraints = WhylogsConstraintsOperator(
        task_id="constraints_check",
        profile_path="data/profile.bin",
        reader="local",
        constraint=greater_than_number(column_name="my_column", number=0.0),
    )
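Both operators read an existing profile rather than raw data. For intuition, a constraint such as greater_than_number passes only when the profiled column satisfies the comparison. The sketch below mimics that pass/fail behavior in plain Python; the check_greater_than helper is illustrative only and is not part of the whylogs API:

```python
def check_greater_than(values, number):
    """Illustrative constraint: passes if every value in the column exceeds `number`."""
    return all(v > number for v in values)

column = [0.5, 1.2, 3.4]
print(check_greater_than(column, 0.0))  # True: every value exceeds 0.0
print(check_greater_than(column, 1.0))  # False: 0.5 does not exceed 1.0
```

In Airflow, the WhylogsConstraintsOperator task succeeds or fails based on the constraint result, so a failing check can halt downstream tasks in the DAG.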

NOTE: Even though it is possible to create a Dataset Profile with the Python Operator, Airflow aims to separate the concern of orchestration from processing. That is why this provider does not take a strong opinion on how data is read and profiled, leaving users free to adapt that step to their existing scenario.

A full DAG example can be found on the whylogs_provider package directory.

Requirements

The current requirements to use this Airflow Provider are described in the table below.

PIP package        Version required
apache-airflow     >=2.0
whylogs[viz, s3]   >=1.0.10

Contributing

Users are always welcome to ask questions and contribute to this repository by submitting issues and communicating with us through our community Slack. Feel free to reach out and make whylogs even more awesome to use with Airflow.

Happy coding! 😄
