
A library for detecting critical data slices in structured and unstructured data based on features, metadata and model predictions.


sliceguard

Detect problematic data slices in unstructured and structured data fast.

🚀 Introduction

sliceguard is built to quickly discover problematic segments in your data. It aims to support structured data as well as unstructured data such as images, text, or audio, while keeping the interface simple: most of its functionality sits behind a single find_issues function.

It also allows for interactive reporting and exploration of found data issues using Renumics Spotlight.
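To make the idea of a "problematic data slice" concrete, here is a minimal, standard-library-only sketch that evaluates a metric separately for each value of one categorical feature and compares it to the overall score. It is not sliceguard's detection algorithm, which is not described here; the helper names (error_rate, metric_per_slice) and the toy rows are assumptions made up for illustration.

```python
from collections import defaultdict


def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the label (lower is better)."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def metric_per_slice(rows, feature, label_key, pred_key, metric):
    """Group rows by one categorical feature and evaluate the metric per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row)
    return {
        value: metric(
            [r[label_key] for r in members],
            [r[pred_key] for r in members],
        )
        for value, members in groups.items()
    }


# Hypothetical toy data: the model is perfect on one accent and poor on the other.
rows = [
    {"accent": "us", "label": "a", "pred": "a"},
    {"accent": "us", "label": "b", "pred": "b"},
    {"accent": "us", "label": "c", "pred": "c"},
    {"accent": "scottish", "label": "a", "pred": "b"},
    {"accent": "scottish", "label": "b", "pred": "b"},
    {"accent": "scottish", "label": "c", "pred": "a"},
]

overall = error_rate([r["label"] for r in rows], [r["pred"] for r in rows])
per_slice = metric_per_slice(rows, "accent", "label", "pred", error_rate)

print(f"overall error rate: {overall:.2f}")
for value, score in per_slice.items():
    print(f"accent={value}: {score:.2f}")
```

The global error rate hides the fact that all errors fall into a single accent group. A tool like sliceguard is meant to surface such segments without you having to pick the grouping by hand.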

⏱️ Quickstart

Install sliceguard by running pip install sliceguard.

Download the Example Dataset.

Install the jiwer package, which is used to compute the word error rate metric, by running pip install jiwer.

Get started by loading your first dataset and letting sliceguard do its work:

import pandas as pd
import numpy as np
from jiwer import wer
from sliceguard import SliceGuard

# Load the example data
df = pd.read_json("example_data.json")

# Define a metric function to evaluate your model
def wer_metric(y_true, y_pred):
    return np.mean([wer(s_y, s_pred) for s_y, s_pred in zip(y_true, y_pred)])

# Detect problematic data slices using the features age, gender and accent
sg = SliceGuard()
issue_df = sg.find_issues(
    df,
    ["age", "gender", "accent"],
    "sentence",
    "prediction",
    wer_metric,
    metric_mode="min"
)
sg.report()
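The metric function passed to find_issues in the quickstart takes two sequences (ground truth and predictions) and returns a single number. The following sketch shows that contract with a pure-Python word error rate, so it runs without jiwer. It is an illustration only: the names word_error_rate and wer_metric_sketch are hypothetical, and jiwer may tokenize or normalize text differently, so the values need not match jiwer's exactly.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def wer_metric_sketch(y_true, y_pred):
    """Mean per-sentence word error rate; same shape as wer_metric above."""
    scores = [word_error_rate(t, p) for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


y_true = ["the cat sat", "hello world"]
y_pred = ["the dog sat", "hello world"]
print(wer_metric_sketch(y_true, y_pred))
```

Because a lower word error rate is better, a metric like this pairs with metric_mode="min", as in the quickstart call.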

🔧 Use case-specific examples

Also check this post on Medium:

Evaluating automatic speech recognition models beyond global metrics — A tutorial using OpenAI’s Whisper as an example

🗺️ Public Roadmap

  • Detection of problematic data slices
  • Basic explanation of found issues via feature importances
  • Limited embedding computation for images, audio, text
  • Extended embedding support, e.g., more embedding models and allow precomputed embeddings
  • Speed up embedding computation using datasets library
  • Soft dependencies for embedding computation, as the torch dependencies are large
  • Improve Spotlight report with embeddings in simmap and histogram for univariate analysis
  • Extensive documentation and examples for common cases
  • Data connectors for faster application on common data formats
  • Improved explanations for found issues, e.g., via SHAP
  • Generation of a summary report doing predefined checks
  • Allow for control features in order to account for expected variations when running checks
  • Improved issue detection algorithm, avoiding duplicate detections of similar problems and outliers influencing the segment detection
