A library for detecting critical data slices in structured and unstructured data based on features, metadata and model predictions.

Project description

sliceguard

Detect problematic data slices in unstructured and structured data fast.

🚀 Introduction

sliceguard is built to quickly discover problematic segments in your data. It aims to support structured data as well as unstructured data such as images, text, or audio, while keeping a simple interface that hides most of its functionality behind a single find_issues function.

It also allows for interactive reporting and exploration of found data issues using Renumics Spotlight.
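
To make the idea of a "problematic data slice" concrete, here is a minimal, standard-library-only sketch. It is not sliceguard's detection algorithm, only an illustration of the underlying idea: group the rows by one feature, score each group with a metric, and flag the groups that score worse than the dataset as a whole. The helper names (per_slice_metric, accuracy) and the toy rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the label (higher is better)."""
    return mean(1.0 if t == p else 0.0 for t, p in zip(y_true, y_pred))


def per_slice_metric(rows, feature, metric):
    """Group rows by the value of one feature and score each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row)
    return {
        value: metric(
            [r["label"] for r in members],
            [r["prediction"] for r in members],
        )
        for value, members in groups.items()
    }


# Toy data: one metadata feature ("accent"), a label and a model prediction.
rows = [
    {"accent": "us", "label": "a", "prediction": "a"},
    {"accent": "us", "label": "b", "prediction": "b"},
    {"accent": "scottish", "label": "a", "prediction": "b"},
    {"accent": "scottish", "label": "b", "prediction": "b"},
]

overall = accuracy([r["label"] for r in rows], [r["prediction"] for r in rows])
by_accent = per_slice_metric(rows, "accent", accuracy)

# Slices scoring below the overall metric are candidates for a closer look.
weak_slices = [value for value, score in by_accent.items() if score < overall]
print(overall, by_accent, weak_slices)
```

A real tool has to deal with several features at once, continuous values, and embeddings of unstructured data, which this sketch does not attempt.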

⏱️ Quickstart

Install sliceguard by running pip install sliceguard.

Download the Example Dataset.

Install the jiwer package, which computes the word error rate metric, by running pip install jiwer.

Get started by loading your first dataset and letting sliceguard do its work:

import pandas as pd
import numpy as np
from jiwer import wer
from sliceguard import SliceGuard

# Load the example data
df = pd.read_json("example_data.json")

# Define a metric function to evaluate your model
def wer_metric(y_true, y_pred):
    return np.mean([wer(s_y, s_pred) for s_y, s_pred in zip(y_true, y_pred)])

# Detect problematic data slices using the features age, gender and accent
sg = SliceGuard()
issue_df = sg.find_issues(
    df,
    ["age", "gender", "accent"],
    "sentence",
    "prediction",
    wer_metric,
    metric_mode="min"
)
sg.report()
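
The metric function above takes the true and predicted values and returns a single number. Word error rate is an error measure, so lower is better, which is presumably why the call passes metric_mode="min". As a rough sketch of what such a metric computes, the following dependency-free version counts word-level substitutions, deletions, and insertions and divides by the number of reference words. It is an illustration only: the function names are hypothetical, it assumes non-empty reference sentences, and it skips any text normalization that jiwer may apply.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)  # assumes the reference is not empty


def mean_wer(y_true, y_pred):
    """Same shape as the metric function above: two sequences in, one float out."""
    scores = [word_error_rate(t, p) for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


exact = word_error_rate("the cat sat", "the cat sat")
noisy = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(exact, noisy)
```

In the second call, one word is substituted and one is dropped, so two errors over six reference words give a rate of one third.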

🔧 Use case-specific examples

Also check this post on Medium:

Evaluating automatic speech recognition models beyond global metrics — A tutorial using OpenAI’s Whisper as an example

🗺️ Public Roadmap

Detection of problematic data slices
Basic explanation of found issues via feature importances
Limited embedding computation for images, audio, text
Extended embedding support, e.g., more embedding models and allow precomputed embeddings
Speed up embedding computation using datasets library
Soft Dependencies for embedding computation as torch dependencies are large
Improve Spotlight report with embeddings in simmap and histogram for univariate analysis
Extensive documentation and examples for common cases
Data connectors for faster application on common data formats
Improved explanations for found issues, e.g., via SHAP
Generation of a summary report doing predefined checks
Allow for control features in order to account for expected variations when running checks
Improved issue detection algorithm, avoiding duplicate detections of similar problems and outliers influencing the segment detection

Release history

Version	Released
0.0.35	Dec 4, 2023
0.0.34	Nov 28, 2023
0.0.33	Nov 28, 2023
0.0.32	Nov 1, 2023
0.0.31	Oct 20, 2023
0.0.30	Sep 20, 2023
0.0.29	Sep 18, 2023
0.0.28	Sep 18, 2023
0.0.27	Sep 18, 2023
0.0.26	Sep 8, 2023
0.0.25	Sep 7, 2023
0.0.24	Sep 6, 2023
0.0.23	Sep 6, 2023
0.0.22	Aug 28, 2023
0.0.21	Aug 28, 2023
0.0.20	Aug 28, 2023
0.0.19	Aug 28, 2023
0.0.18	Aug 23, 2023
0.0.17	Aug 23, 2023
0.0.16	Aug 23, 2023
0.0.15	Aug 22, 2023
0.0.14	Aug 18, 2023
0.0.13	Aug 17, 2023
0.0.12	Aug 17, 2023
0.0.11	Aug 14, 2023
0.0.10	Jul 25, 2023
0.0.9	Jul 21, 2023
0.0.8	Jul 19, 2023
0.0.7 (this version)	Jul 13, 2023
0.0.6	Jul 11, 2023
0.0.5	Jul 7, 2023
0.0.4	Jul 7, 2023
0.0.3	Jul 5, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sliceguard-0.0.7.tar.gz (15.5 kB)

Uploaded Jul 13, 2023 Source

Built Distribution

sliceguard-0.0.7-py3-none-any.whl (14.8 kB)

Uploaded Jul 13, 2023 Python 3

Hashes for sliceguard-0.0.7.tar.gz
Algorithm	Hash digest
SHA256	`c185c3aaa448671c23f00826b3fd6452ca122f9270f84ac7b4d4aa0800c97579`
MD5	`0f515947b328a9ea399e129ef3994ec7`
BLAKE2b-256	`f9634dc584c90b2617b333a8b80f23c46e547d495e80173daf0e0b8999194e40`

Hashes for sliceguard-0.0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7de3710eadf320e78e5f5103d78dab483c2fce34b4f37f1deb3e74a45f357f97`
MD5	`3285654936048df65c45d229ea99a2e0`
BLAKE2b-256	`dcc2a8efae2ab5296cfb3a192f1cd5526a6add607792fe7499e7a835d7cb925e`