Profile and monitor your ML data pipeline end-to-end
The open source standard for data logging
Documentation • Slack Community • Python Quickstart
What is whylogs
whylogs is the open source standard for logging your data. With whylogs, users are able to generate summaries of their datasets (called whylogs profiles) which they can use to:
- Track changes in their dataset
- Create data constraints to know whether their data looks the way it should
- Quickly visualize key summary statistics about their datasets
These three functionalities enable a variety of use cases for data scientists, machine learning engineers, and data engineers:
- Detecting data drift (and resultant ML model performance degradation)
- Data quality validation
- Exploratory data analysis via data profiling
- Tracking data for ML experiments
- And many more
whylogs can be run in Python or Apache Spark (both PySpark and Scala) environments on a variety of data types. We integrate with lots of other tools including Pandas, AWS Sagemaker, MLflow, Flask, Ray, RAPIDS, Apache Kafka, and more.
If you have any questions, comments, or just want to hang out with us, please join our Slack Community. In addition to joining the Slack Community, you can also help this project by giving us a ⭐ in the upper right corner of this page.
Python Quickstart
Install whylogs
Install whylogs using the pip package manager by running
```shell
pip install whylogs
```
Log some data
whylogs is easy to get up and running:
```python
from whylogs import get_or_create_session
import pandas as pd

session = get_or_create_session()

df = pd.read_csv("path/to/file.csv")

with session.logger(dataset_name="my_dataset") as logger:
    # dataframe
    logger.log_dataframe(df)
    # dict
    logger.log({"name": 1})
    # images
    logger.log_image("path/to/image.png")
```
Table of Contents
- whylogs Profiles
- Visualizing Profiles
- Features
- Data Types
- Integrations
- Examples
- Roadmap
- Community
- Contribute
whylogs Profiles
whylogs profiles are the core of the whylogs library. They capture key statistical properties of data, such as the distribution (far beyond simple mean, median, and standard deviation measures), the number of missing values, and a wide range of configurable custom metrics. By capturing these summary statistics, we are able to accurately represent the data and enable all of the use cases described in the introduction.
whylogs profiles have three properties that make them ideal for data logging: they are descriptive, lightweight, and mergeable.
Descriptive: whylogs profiles describe the dataset that they represent. This high fidelity representation of datasets is what enables whylogs profiles to be effective snapshots of the data. They are better at capturing the characteristics of a dataset than a sample would be, as discussed in our Data Logging: Sampling versus Profiling blog post.
Lightweight: In addition to being a high fidelity representation of data, whylogs profiles also have high information density. You can easily profile terabytes or even petabytes of data in profiles that are only megabytes large. Because whylogs profiles are lightweight, they are very inexpensive to store, transport, and interact with.
Mergeable: One of the most powerful features of whylogs profiles is their mergeability. Mergeability means that whylogs profiles can be combined together to form new profiles which represent the aggregate of their constituent profiles. This enables logging for distributed and streaming systems, and allows users to view aggregated data across any time granularity.
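To make the mergeability idea concrete, here is a toy illustration in plain Python (not the whylogs API): a mergeable summary keeps only statistics that can be combined associatively, so summaries of two data chunks merge into exactly the summary you would get from profiling the concatenated data.

```python
from dataclasses import dataclass


@dataclass
class ToySummary:
    """A minimal mergeable column summary: count, sum, min, and max."""
    count: int
    total: float
    minimum: float
    maximum: float

    @classmethod
    def of(cls, values):
        return cls(len(values), sum(values), min(values), max(values))

    def merge(self, other):
        # Merging two summaries yields the summary of the combined data,
        # without revisiting any raw values.
        return ToySummary(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )


day_1 = ToySummary.of([1.0, 2.0, 3.0])
day_2 = ToySummary.of([4.0, 5.0])
combined = day_1.merge(day_2)
assert combined == ToySummary.of([1.0, 2.0, 3.0, 4.0, 5.0])
```

whylogs profiles apply the same principle to richer metrics (distribution sketches, cardinality estimators, frequent items), which is what allows profiles from distributed workers or streaming windows to be rolled up across any time granularity.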
Visualizing Profiles
Multiple profile plots
To view your logger profiles, you can use methods within `whylogs.viz`:

```python
from whylogs.viz import ProfileVisualizer

visualization = ProfileVisualizer()
visualization.set_profiles([profile_day_1, profile_day_2])
figure = visualization.plot_distribution("<feature_name>")
figure.savefig("/my/image/path.png")
```
Individual profiles are automatically saved to disk, AWS S3, or the WhyLabs API when loggers are closed, per the Session configuration.
Current profiles from active loggers can be loaded from memory with:
```python
profile = logger.profile()
```
Profile viewer
You can also load a local profile viewer, where you upload the json summary file. The default path for the json files is set as `output/{dataset_name}/{session_id}/json/dataset_profile.json`.
```python
from whylogs.viz import profile_viewer

profile_viewer()
```
This will open a viewer in your default browser where you can load a profile json summary, using the `Select JSON profile` button. Once the json is selected, you can view your profile's features and their associated statistics.
Features
whylogs collects approximate statistics and sketches of data on a column-basis into a statistical profile. These metrics include:
- Simple counters: boolean, null values, data types.
- Summary statistics: sum, min, max, median, variance.
- Unique value counter or cardinality: tracks the approximate number of unique values of your feature using the HyperLogLog algorithm.
- Histograms for numerical features. whylogs binary output can be queried with dynamic binning based on the shape of your data.
- Top frequent items (default is 128). Note that this configuration affects the memory footprint, especially for text features.
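The frequent-items metric above is the kind of fixed-capacity sketch whose memory footprint the note referses to: the tracker keeps a bounded number of candidate counters no matter how many distinct values the stream contains. As a rough illustration of one such algorithm (a plain-Python Misra-Gries sketch; whylogs uses its own sketch implementations internally):

```python
def misra_gries(stream, capacity):
    """Track up to `capacity` candidate heavy hitters over a stream."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # At capacity: decrement every counter and drop any that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 5 + ["d"] * 5
top = misra_gries(stream, capacity=2)
# The heavy hitters "a" and "b" survive; rare items are evicted.
```

The capacity bound is why the frequent-items configuration (128 by default) trades memory for accuracy, which matters most for high-cardinality text features.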
Some other key features about whylogs:
- Accurate data profiling: whylogs calculates statistics from 100% of the data, never requiring sampling, ensuring an accurate representation of data distributions
- Lightweight runtime: whylogs utilizes approximate statistical methods to achieve minimal memory footprint that scales with the number of features in the data
- Any architecture: whylogs scales with your system, from local development mode to live production systems in multi-node clusters, and works well with batch and streaming architectures
- Configuration-free: whylogs infers the schema of the data, requiring zero manual configuration to get started
- Tiny storage footprint: whylogs turns data batches and streams into statistical fingerprints, 10-100MB uncompressed
- Unlimited metrics: whylogs collects all possible statistical metrics about structured or unstructured data
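For a flavor of how a profiler can see 100% of the data while keeping memory constant, here is a plain-Python sketch of a single-pass mean/variance computation (Welford's algorithm); this illustrates the general technique, not the whylogs internals.

```python
def running_stats(stream):
    """Single-pass mean and sample variance: consumes every value in the
    stream but stores only three numbers, regardless of stream size."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


# Works on any iterable, including ones too large to hold in memory.
count, mean, variance = running_stats(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

Because each per-column state is tiny and mergeable, total memory scales with the number of features rather than the number of rows.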
Data Types
whylogs supports both structured and unstructured data, specifically:

Data type | Features | Notebook Example
---|---|---
Structured Data | Distribution, cardinality, schema, counts, missing values | Getting started with structured data
Images | exif metadata, derived pixel features, bounding boxes | Getting started with images
Video | In development | GitHub Issue #214
Tensors | derived 1d features (more in development) | GitHub Issue #216
Text | top k values, counts, cardinality | String Features
Audio | In development | GitHub Issue #212
Integrations
Integration | Features | Resources
---|---|---
Spark | Run whylogs in an Apache Spark environment |
Pandas | Log and monitor any pandas dataframe |
Kafka | Log and monitor Kafka topics with whylogs |
MLflow | Enhance MLflow metrics with whylogs |
GitHub Actions | Unit test data with whylogs and GitHub Actions |
RAPIDS | Use whylogs in a RAPIDS environment |
Java | Run whylogs in a Java environment |
Docker | Run whylogs in Docker |
AWS S3 | Store whylogs profiles in S3 |
Examples
For a full set of our examples, please check out whylogs-examples.
You can also try our example notebooks on Binder.
Roadmap
whylogs is maintained by WhyLabs.
Community
If you have any questions, comments, or just want to hang out with us, please join our Slack channel.
Contribute
We welcome contributions to whylogs. Please see our contribution guide and our development guide for details.