# Training Data Debugger

Find and fix issues in your ML training data before they degrade model performance.

Training Data Debugger automatically identifies data quality issues that commonly cause ML models to underperform, including duplicates, label errors, outliers, class imbalance, missing values, and PII exposure.
## Features
- 🔍 Duplicate Detection - Find exact and near-duplicate samples using hash-based and n-gram similarity
- 🏷️ Label Error Detection - Identify mislabeled samples and label inconsistencies
- 📊 Outlier Detection - Detect statistical outliers using multiple methods
- ⚖️ Class Imbalance Analysis - Measure imbalance ratios and get rebalancing suggestions
- 🕳️ Missing Value Analysis - Find null, empty, and placeholder values
- 📝 Text Quality Checks - Detect encoding errors, length anomalies, and quality issues
- 🔒 PII Exposure Detection - Find exposed emails, phone numbers, SSNs in your data
- 📈 Comprehensive Statistics - Get detailed dataset and column-level statistics
- 🎯 Actionable Suggestions - Receive specific recommendations for each issue
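Near-duplicate detection via n-gram similarity can be illustrated with character trigrams and Jaccard overlap. This is a generic, library-independent sketch of the idea, not the package's actual implementation:

```python
def char_ngrams(text, n=3):
    """Set of character n-grams in text (at least one, even for short strings)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard_similarity(a, b, n=3):
    """Jaccard overlap of the two texts' character n-gram sets, in [0, 1]."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

# A typo-ridden copy still shares most trigrams with the original
sim = jaccard_similarity("Great movie, loved it!", "Gret movie, loved it!")
print(round(sim, 2))
```

Two texts whose similarity exceeds a threshold (the package's `duplicate_threshold` defaults to 0.9) would be flagged as near-duplicates.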
## Installation

```bash
# Basic installation (zero dependencies)
pip install training-data-debugger

# With ML capabilities (numpy, pandas, scikit-learn)
pip install training-data-debugger[ml]

# With visualization (matplotlib, seaborn)
pip install training-data-debugger[viz]

# Full installation
pip install training-data-debugger[full]
```
## Quick Start

### Basic Usage

```python
from training_data_debugger import DataDebugger

# Your training data
texts = [
    "The product is great!",
    "The product is great!",  # Duplicate
    "Terrible experience, never buying again",
    "Contact me at john@email.com",  # PII
    "",  # Empty
    "Love it! ⭐⭐⭐⭐⭐",
]
labels = ["positive", "negative", "negative", "positive", "positive", "positive"]
# ^ labels[1] conflicts: a duplicate of texts[0] with a different label

# Debug your data
debugger = DataDebugger()
results = debugger.debug(texts, labels)

# View results
print(f"Total issues found: {results.total_issues}")
print(f"Critical issues: {results.critical_count}")

for issue in results.issues:
    print(f"[{issue.severity.value}] {issue.issue_type.value}: {issue.description}")
```
### Text Data Debugging

```python
from training_data_debugger import TextDataDebugger, IssueType

debugger = TextDataDebugger()

# Debug text classification data
texts = [
    "Great movie, loved every minute!",
    "Great movie, loved every minute!",  # Exact duplicate
    "Gret moive, lovd evry minite!",  # Near duplicate (typos)
    "Bad",  # Too short
    "Contact: test@email.com 555-1234",  # PII
    "",  # Empty
]
labels = ["positive", "negative", "positive", "positive", "neutral", "positive"]

results = debugger.debug(texts, labels)

# Access specific issue types
duplicates = results.get_issues_by_type(IssueType.DUPLICATE)
label_errors = results.get_issues_by_type(IssueType.LABEL_ERROR)
pii_issues = results.get_issues_by_type(IssueType.PII_EXPOSURE)
```
### Tabular Data Debugging

```python
from training_data_debugger import TabularDataDebugger

debugger = TabularDataDebugger()

# Debug tabular data (as a list of dicts)
data = [
    {"age": 25, "income": 50000, "city": "NYC", "target": 1},
    {"age": 25, "income": 50000, "city": "NYC", "target": 0},  # Duplicate, different label
    {"age": 150, "income": 50000, "city": "LA", "target": 1},  # Outlier age
    {"age": None, "income": 75000, "city": "", "target": 1},  # Missing values
    {"age": 30, "income": -5000, "city": "Chicago", "target": 0},  # Invalid income
]

results = debugger.debug(data, label_column="target")

# Get statistics
print(f"Dataset size: {results.statistics.total_samples}")
print(f"Missing value rate: {results.statistics.missing_rate:.2%}")
```
## Configuration

Customize detection thresholds and enable or disable specific checks:

```python
from training_data_debugger import DataDebugger, DebugConfig, IssueSeverity

config = DebugConfig(
    # Detection toggles
    check_duplicates=True,
    check_label_errors=True,
    check_outliers=True,
    check_class_imbalance=True,
    check_missing_values=True,
    check_text_quality=True,
    check_pii=True,
    # Thresholds
    duplicate_threshold=0.9,     # Similarity threshold for near-duplicates
    outlier_std_threshold=3.0,   # Standard deviations for outlier detection
    imbalance_threshold=0.1,     # Minority class ratio threshold
    min_text_length=10,          # Minimum acceptable text length
    max_text_length=10000,       # Maximum acceptable text length
    # N-gram settings for near-duplicate detection
    ngram_size=3,
    # Severity configuration
    duplicate_severity=IssueSeverity.HIGH,
    label_error_severity=IssueSeverity.CRITICAL,
)

debugger = DataDebugger(config=config)
results = debugger.debug(texts, labels)
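The `outlier_std_threshold=3.0` setting follows the common z-score convention: flag values more than three standard deviations from the mean. A minimal, library-independent sketch of that rule (not the package's actual implementation):

```python
from statistics import mean, stdev

def zscore_outliers(values, std_threshold=3.0):
    """Return indices of values more than std_threshold standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical: nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > std_threshold]

ages = [25, 26, 27, 28, 29] * 4 + [150]  # one extreme age among 20 plausible ones
print(zscore_outliers(ages))  # [20]
```

Note that a single extreme value inflates the standard deviation itself, so z-score methods can miss outliers in small samples; this is one reason a debugger may combine multiple detection methods.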
## Issue Types

| Issue Type | Description | Default Severity |
|---|---|---|
| `DUPLICATE` | Exact duplicate samples | HIGH |
| `NEAR_DUPLICATE` | Similar samples (n-gram based) | MEDIUM |
| `LABEL_ERROR` | Same content with different labels | CRITICAL |
| `LABEL_INCONSISTENCY` | Inconsistent labeling patterns | HIGH |
| `OUTLIER` | Statistical outliers | MEDIUM |
| `CLASS_IMBALANCE` | Imbalanced class distribution | MEDIUM |
| `MISSING_VALUE` | Null or empty values | HIGH |
| `TEXT_TOO_SHORT` | Text below minimum length | LOW |
| `TEXT_TOO_LONG` | Text above maximum length | LOW |
| `ENCODING_ERROR` | Character encoding issues | MEDIUM |
| `PII_EXPOSURE` | Personal information exposed | CRITICAL |
| `INVALID_FORMAT` | Data format issues | MEDIUM |
| `DISTRIBUTION_SHIFT` | Feature distribution anomalies | MEDIUM |
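To illustrate what a `PII_EXPOSURE` check looks for, here is a small regex-based sketch for emails, US-style phone numbers, and SSNs. These patterns are illustrative only and are not the library's actual detectors (real ones are typically more thorough):

```python
import re

# Illustrative patterns; production PII detectors handle many more formats
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text):
    """Return the sorted names of PII pattern types found in text."""
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(text))

print(find_pii("Reach me at john@email.com or 555-123-4567"))  # ['email', 'phone']
```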
## Working with Results

### Filtering Issues

```python
from training_data_debugger import IssueType, IssueSeverity

# Get issues by type
duplicates = results.get_issues_by_type(IssueType.DUPLICATE)

# Get issues by severity
critical = results.get_issues_by_severity(IssueSeverity.CRITICAL)
high_priority = results.get_issues_by_severity(IssueSeverity.HIGH)

# Get affected indices
affected_indices = results.get_affected_indices()
```
### Generating Reports

```python
# JSON report
json_report = results.to_json(indent=2)

# Dictionary format
report_dict = results.to_dict()

# Summary statistics
print(f"Total samples: {results.statistics.total_samples}")
print(f"Unique samples: {results.statistics.unique_samples}")
print(f"Duplicate rate: {results.statistics.duplicate_rate:.2%}")
print(f"Missing rate: {results.statistics.missing_rate:.2%}")
print(f"Label distribution: {results.statistics.label_distribution}")
```
### Getting Recommendations

```python
# Each issue includes suggestions
for issue in results.issues:
    print(f"Issue: {issue.description}")
    print(f"Suggestion: {issue.suggestion}")
    print(f"Affected indices: {issue.indices}")
    print("---")
```
## Advanced Usage

### Custom Issue Detection

```python
from training_data_debugger import DataDebugger, DataIssue, IssueType, IssueSeverity

class CustomDebugger(DataDebugger):
    def _detect_custom_issues(self, data, labels):
        """Add custom detection logic."""
        issues = []
        for i, text in enumerate(data):
            # Custom check: detect specific patterns
            if "CONFIDENTIAL" in str(text).upper():
                issues.append(DataIssue(
                    issue_type=IssueType.PII_EXPOSURE,
                    severity=IssueSeverity.CRITICAL,
                    indices=[i],
                    description=f"Confidential marker found at index {i}",
                    suggestion="Remove or redact confidential content",
                    details={"pattern": "CONFIDENTIAL"},
                ))
        return issues

    def debug(self, data, labels=None):
        results = super().debug(data, labels)
        custom_issues = self._detect_custom_issues(data, labels)
        results.issues.extend(custom_issues)
        return results
```
### Batch Processing

```python
from training_data_debugger import DataDebugger

debugger = DataDebugger()

# Process large datasets in batches
def debug_in_batches(data, labels, batch_size=10000):
    all_issues = []
    for i in range(0, len(data), batch_size):
        batch_data = data[i:i + batch_size]
        batch_labels = labels[i:i + batch_size] if labels else None
        results = debugger.debug(batch_data, batch_labels)
        # Adjust indices for the batch offset
        for issue in results.issues:
            issue.indices = [idx + i for idx in issue.indices]
        all_issues.extend(results.issues)
    return all_issues
```
### Integration with Pandas

```python
import pandas as pd
from training_data_debugger import TabularDataDebugger

# Load your data
df = pd.read_csv("training_data.csv")

# Convert to a list of dicts for debugging
data = df.to_dict(orient="records")

debugger = TabularDataDebugger()
results = debugger.debug(data, label_column="target")

# Drop the problematic rows
# (positional indices line up with the default RangeIndex from read_csv)
problematic_indices = results.get_affected_indices()
clean_df = df.drop(index=problematic_indices)

print(f"Removed {len(problematic_indices)} problematic samples")
print(f"Clean dataset size: {len(clean_df)}")
```
## Statistics Reference

### Dataset Statistics

```python
stats = results.statistics

# Basic counts
stats.total_samples       # Total number of samples
stats.unique_samples      # Number of unique samples
stats.duplicate_count     # Number of duplicates
stats.missing_count       # Number of missing values

# Rates
stats.duplicate_rate      # Proportion of duplicates
stats.missing_rate        # Proportion of missing values

# Label information
stats.label_distribution  # Dict of label -> count
stats.class_weights       # Suggested weights for imbalance

# Text statistics (if applicable)
stats.avg_text_length     # Average text length
stats.min_text_length     # Minimum text length
stats.max_text_length     # Maximum text length
```
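The package does not document the formula behind `class_weights`, but a common convention for imbalance weights is inverse-frequency weighting, `total / (n_classes * count)`, sketched here as an assumption:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * count): rarer classes get larger weights."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}

weights = inverse_frequency_weights(["pos"] * 90 + ["neg"] * 10)
print(weights)  # {'pos': 0.55..., 'neg': 5.0}
```

Weights of this form can be passed to training APIs that accept per-class weights to counteract imbalance.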
### Column Statistics (Tabular Data)

```python
for col_name, col_stats in results.column_statistics.items():
    print(f"Column: {col_name}")
    print(f"  Type: {col_stats.data_type}")
    print(f"  Missing: {col_stats.missing_count}")
    print(f"  Unique: {col_stats.unique_count}")
    print(f"  Mean: {col_stats.mean}")
    print(f"  Std: {col_stats.std}")
```
## Best Practices
- Run Early: Debug your data before training to catch issues early
- Prioritize Critical Issues: Address CRITICAL and HIGH severity issues first
- Iterative Cleaning: Run the debugger multiple times as you fix issues
- Document Decisions: Keep track of why certain samples were removed/modified
- Validate Fixes: Verify that fixes don't introduce new problems
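The iterative-cleaning practice can be sketched generically. Here `find_issue_indices` is a hypothetical placeholder for any checker (such as a debug run that returns affected indices), not part of this library's API:

```python
def iterative_clean(data, find_issue_indices, max_rounds=5):
    """Repeatedly drop flagged samples until the checker reports no issues."""
    for _ in range(max_rounds):
        bad = set(find_issue_indices(data))
        if not bad:
            break
        data = [sample for i, sample in enumerate(data) if i not in bad]
    return data

# Toy checker: flag empty strings and exact duplicates of earlier samples
def toy_checker(samples):
    seen, flagged = set(), []
    for i, s in enumerate(samples):
        if s == "" or s in seen:
            flagged.append(i)
        seen.add(s)
    return flagged

print(iterative_clean(["a", "a", "", "b"], toy_checker))  # ['a', 'b']
```

Capping the number of rounds guards against checkers that keep flagging something on every pass.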
## API Reference

### Classes

- `DataDebugger` - Main debugger for general data
- `TextDataDebugger` - Specialized debugger for text data
- `TabularDataDebugger` - Specialized debugger for tabular data
- `DebugConfig` - Configuration for detection thresholds
- `DebugResults` - Container for debugging results
- `DataIssue` - Individual issue representation

### Enums

- `IssueType` - Types of data issues
- `IssueSeverity` - Issue severity levels (CRITICAL, HIGH, MEDIUM, LOW, INFO)
- `DataType` - Data type classifications
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

MIT License - see LICENSE for details.

## Author

Created by Pranay M
Found an issue? Report it on GitHub