A library for detecting critical data slices in structured and unstructured data based on features, metadata and model predictions.

Project description

sliceguard

Detect problematic data slices in unstructured and structured data fast.

🚀 Introduction

sliceguard is built to quickly discover problematic segments in your data. It aims to support structured data as well as unstructured data such as images, text, or audio, while keeping a simple interface that hides most of its functionality behind a single find_issues function.

It also allows for interactive reporting and exploration of found data issues using Renumics Spotlight.
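
To make the idea of a "problematic data slice" concrete, here is a minimal sketch in plain Python. It is not sliceguard's implementation or API: the function find_weak_slices and the toy data are hypothetical, and the parameter names min_drop and min_support are only borrowed from the roadmap below. It assumes a slice is problematic when its accuracy falls at least min_drop below the overall accuracy and it contains at least min_support samples.

```python
from collections import defaultdict


def find_weak_slices(rows, feature, y, y_pred, min_drop=0.1, min_support=3):
    """Hypothetical helper: flag values of `feature` whose accuracy is at
    least `min_drop` below the overall accuracy (illustration only)."""
    # Overall accuracy across all rows serves as the reference.
    overall = sum(r[y] == r[y_pred] for r in rows) / len(rows)

    # Collect per-row correctness for each value of the feature.
    groups = defaultdict(list)
    for r in rows:
        groups[r[feature]].append(r[y] == r[y_pred])

    issues = []
    for value, hits in groups.items():
        if len(hits) < min_support:
            continue  # too few samples to report as a slice
        accuracy = sum(hits) / len(hits)
        drop = overall - accuracy
        if drop >= min_drop:
            issues.append(
                {"feature": feature, "value": value,
                 "support": len(hits), "accuracy": accuracy, "drop": drop}
            )
    # Largest drop first.
    return sorted(issues, key=lambda issue: issue["drop"], reverse=True)


# Toy data (made up): predictions are mostly wrong on "tablet" rows.
rows = (
    [{"device": "phone", "label": 1, "pred": 1}] * 6
    + [{"device": "tablet", "label": 1, "pred": 0}] * 3
    + [{"device": "tablet", "label": 1, "pred": 1}] * 1
    + [{"device": "watch", "label": 0, "pred": 1}] * 1
)
issues = find_weak_slices(rows, "device", "label", "pred")
for issue in issues:
    print(issue)
```

In this toy run only the "tablet" slice is reported; the single "watch" row is skipped because it falls below min_support. A real tool has to go further than this sketch, for example by handling numeric features, feature combinations, and embeddings of unstructured data.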

⏱️ Quickstart

Install sliceguard by running pip install sliceguard.

Go straight to our quickstart examples for your use case:

🔧 Use case-specific examples

🗺️ Public Roadmap

  • Detection of problematic data slices
  • Basic explanation of found issues via feature importances
  • Limited embedding computation for images, audio, text
  • Extended embedding support, e.g., more embedding models and allow precomputed embeddings
  • Speed up embedding computation using datasets library
  • Improved issue detection algorithm, avoiding duplicate detections of similar problems and preventing outliers from influencing the segment detection
  • Support application on datasets without labels (outlier based)
  • Adaptive drop reference for datasets that contain a wide variety of data
  • Large data support for detection and reporting, e.g., 500k audio samples with transcriptions
  • Alternative interfaces to min_drop and min_support. Maybe n_slices and sort by criterion?
  • Support application without model (by training simple baseline model)
  • Improve normalization for mixed-type runs, e.g., embedding + one categorical or numeric variable
  • Walkthroughs for unstructured, structured and mixed data. Also, an in-depth tutorial explaining all the parameters.
  • Soft dependencies for embedding computation, as torch dependencies are large
  • Allow for model comparisons via intersection, difference, ...
  • Robustify outlier detection algorithm. Probably better parameter choice.
  • Interpretable features for images, audio, text. E.g., dark image, quiet audio, long audio, contains common word x, ...
  • Generation of a summary report doing predefined checks
  • "Supervised" clustering that incorporates classes, probabilities, metrics, not only features
  • Data connectors for faster application on common data formats
  • Support embedding generation for remote resources, e.g. audio/images hosted on webservers
  • Improved explanations for found issues, e.g., via SHAP

Download files

Source Distribution

sliceguard-0.0.15.tar.gz (24.2 kB)

Uploaded Source

Built Distribution

sliceguard-0.0.15-py3-none-any.whl (23.7 kB)

Uploaded Python 3
