# Training Data Debugger

Find and fix issues in your ML training data before they degrade model performance.

Training Data Debugger automatically identifies data quality issues that commonly cause ML models to underperform, including duplicates, label errors, outliers, class imbalance, missing values, and PII exposure.
## Features
- 🔍 Duplicate Detection - Find exact and near-duplicate samples using hash-based and n-gram similarity
- 🏷️ Label Error Detection - Identify mislabeled samples and label inconsistencies
- 📊 Outlier Detection - Detect statistical outliers using multiple methods
- ⚖️ Class Imbalance Analysis - Measure imbalance ratios and get rebalancing suggestions
- 🕳️ Missing Value Analysis - Find null, empty, and placeholder values
- 📝 Text Quality Checks - Detect encoding errors, length anomalies, and quality issues
- 🔒 PII Exposure Detection - Find exposed emails, phone numbers, SSNs in your data
- 📈 Comprehensive Statistics - Get detailed dataset and column-level statistics
- 🎯 Actionable Suggestions - Receive specific recommendations for each issue
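Near-duplicate detection via n-gram similarity can be illustrated with character trigrams and Jaccard overlap. This is a generic, library-independent sketch of the idea, not the package's actual implementation:

```python
def char_ngrams(text, n=3):
    """Set of character n-grams in text (at least one, even for short strings)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard_similarity(a, b, n=3):
    """Jaccard overlap of the two texts' character n-gram sets, in [0, 1]."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

# A typo-ridden copy still shares most trigrams with the original
sim = jaccard_similarity("Great movie, loved it!", "Gret movie, loved it!")
print(round(sim, 2))
```

Two texts whose similarity exceeds a threshold (the package's `duplicate_threshold` defaults to 0.9) would be flagged as near-duplicates.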
## Installation

```bash
# Basic installation (zero dependencies)
pip install training-data-debugger

# With ML capabilities (numpy, pandas, scikit-learn)
pip install training-data-debugger[ml]

# With visualization (matplotlib, seaborn)
pip install training-data-debugger[viz]

# Full installation
pip install training-data-debugger[full]
```
## Quick Start

### Basic Usage

```python
from training_data_debugger import DataDebugger

# Your training data
texts = [
    "The product is great!",
    "The product is great!",  # Duplicate
    "Terrible experience, never buying again",
    "Contact me at john@email.com",  # PII
    "",  # Empty
    "Love it! ⭐⭐⭐⭐⭐",
]
labels = ["positive", "negative", "negative", "positive", "positive", "positive"]
# ^ labels[1] conflicts: a duplicate of texts[0] with a different label

# Debug your data
debugger = DataDebugger()
results = debugger.debug(texts, labels)

# View results
print(f"Total issues found: {results.total_issues}")
print(f"Critical issues: {results.critical_count}")

for issue in results.issues:
    print(f"[{issue.severity.value}] {issue.issue_type.value}: {issue.description}")
```
### Text Data Debugging

```python
from training_data_debugger import TextDataDebugger, IssueType

debugger = TextDataDebugger()

# Debug text classification data
texts = [
    "Great movie, loved every minute!",
    "Great movie, loved every minute!",  # Exact duplicate
    "Gret moive, lovd evry minite!",  # Near duplicate (typos)
    "Bad",  # Too short
    "Contact: test@email.com 555-1234",  # PII
    "",  # Empty
]
labels = ["positive", "negative", "positive", "positive", "neutral", "positive"]

results = debugger.debug(texts, labels)

# Access specific issue types
duplicates = results.get_issues_by_type(IssueType.DUPLICATE)
label_errors = results.get_issues_by_type(IssueType.LABEL_ERROR)
pii_issues = results.get_issues_by_type(IssueType.PII_EXPOSURE)
```
### Tabular Data Debugging

```python
from training_data_debugger import TabularDataDebugger

debugger = TabularDataDebugger()

# Debug tabular data (as a list of dicts)
data = [
    {"age": 25, "income": 50000, "city": "NYC", "target": 1},
    {"age": 25, "income": 50000, "city": "NYC", "target": 0},  # Duplicate, different label
    {"age": 150, "income": 50000, "city": "LA", "target": 1},  # Outlier age
    {"age": None, "income": 75000, "city": "", "target": 1},  # Missing values
    {"age": 30, "income": -5000, "city": "Chicago", "target": 0},  # Invalid income
]

results = debugger.debug(data, label_column="target")

# Get statistics
print(f"Dataset size: {results.statistics.total_samples}")
print(f"Missing value rate: {results.statistics.missing_rate:.2%}")
```
## Configuration

Customize detection thresholds and enable or disable specific checks:

```python
from training_data_debugger import DataDebugger, DebugConfig, IssueSeverity

config = DebugConfig(
    # Detection toggles
    check_duplicates=True,
    check_label_errors=True,
    check_outliers=True,
    check_class_imbalance=True,
    check_missing_values=True,
    check_text_quality=True,
    check_pii=True,
    # Thresholds
    duplicate_threshold=0.9,     # Similarity threshold for near-duplicates
    outlier_std_threshold=3.0,   # Standard deviations for outlier detection
    imbalance_threshold=0.1,     # Minority class ratio threshold
    min_text_length=10,          # Minimum acceptable text length
    max_text_length=10000,       # Maximum acceptable text length
    # N-gram settings for near-duplicate detection
    ngram_size=3,
    # Severity configuration
    duplicate_severity=IssueSeverity.HIGH,
    label_error_severity=IssueSeverity.CRITICAL,
)

debugger = DataDebugger(config=config)
results = debugger.debug(texts, labels)
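The `outlier_std_threshold=3.0` setting follows the common z-score convention: flag values more than three standard deviations from the mean. A minimal, library-independent sketch of that rule (not the package's actual implementation):

```python
from statistics import mean, stdev

def zscore_outliers(values, std_threshold=3.0):
    """Return indices of values more than std_threshold standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical: nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > std_threshold]

ages = [25, 26, 27, 28, 29] * 4 + [150]  # one extreme age among 20 plausible ones
print(zscore_outliers(ages))  # [20]
```

Note that a single extreme value inflates the standard deviation itself, so z-score methods can miss outliers in small samples; this is one reason a debugger may combine multiple detection methods.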
## Issue Types

| Issue Type | Description | Default Severity |
|---|---|---|
| `DUPLICATE` | Exact duplicate samples | HIGH |
| `NEAR_DUPLICATE` | Similar samples (n-gram based) | MEDIUM |
| `LABEL_ERROR` | Same content with different labels | CRITICAL |
| `LABEL_INCONSISTENCY` | Inconsistent labeling patterns | HIGH |
| `OUTLIER` | Statistical outliers | MEDIUM |
| `CLASS_IMBALANCE` | Imbalanced class distribution | MEDIUM |
| `MISSING_VALUE` | Null or empty values | HIGH |
| `TEXT_TOO_SHORT` | Text below minimum length | LOW |
| `TEXT_TOO_LONG` | Text above maximum length | LOW |
| `ENCODING_ERROR` | Character encoding issues | MEDIUM |
| `PII_EXPOSURE` | Personal information exposed | CRITICAL |
| `INVALID_FORMAT` | Data format issues | MEDIUM |
| `DISTRIBUTION_SHIFT` | Feature distribution anomalies | MEDIUM |
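To illustrate what a `PII_EXPOSURE` check looks for, here is a small regex-based sketch for emails, US-style phone numbers, and SSNs. These patterns are illustrative only and are not the library's actual detectors (real ones are typically more thorough):

```python
import re

# Illustrative patterns; production PII detectors handle many more formats
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text):
    """Return the sorted names of PII pattern types found in text."""
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(text))

print(find_pii("Reach me at john@email.com or 555-123-4567"))  # ['email', 'phone']
```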
## Working with Results

### Filtering Issues

```python
from training_data_debugger import IssueType, IssueSeverity

# Get issues by type
duplicates = results.get_issues_by_type(IssueType.DUPLICATE)

# Get issues by severity
critical = results.get_issues_by_severity(IssueSeverity.CRITICAL)
high_priority = results.get_issues_by_severity(IssueSeverity.HIGH)

# Get affected indices
affected_indices = results.get_affected_indices()
```
### Generating Reports

```python
# JSON report
json_report = results.to_json(indent=2)

# Dictionary format
report_dict = results.to_dict()

# Summary statistics
print(f"Total samples: {results.statistics.total_samples}")
print(f"Unique samples: {results.statistics.unique_samples}")
print(f"Duplicate rate: {results.statistics.duplicate_rate:.2%}")
print(f"Missing rate: {results.statistics.missing_rate:.2%}")
print(f"Label distribution: {results.statistics.label_distribution}")
```
### Getting Recommendations

```python
# Each issue includes suggestions
for issue in results.issues:
    print(f"Issue: {issue.description}")
    print(f"Suggestion: {issue.suggestion}")
    print(f"Affected indices: {issue.indices}")
    print("---")
```
## Advanced Usage

### Custom Issue Detection

```python
from training_data_debugger import DataDebugger, DataIssue, IssueType, IssueSeverity

class CustomDebugger(DataDebugger):
    def _detect_custom_issues(self, data, labels):
        """Add custom detection logic."""
        issues = []
        for i, text in enumerate(data):
            # Custom check: detect specific patterns
            if "CONFIDENTIAL" in str(text).upper():
                issues.append(DataIssue(
                    issue_type=IssueType.PII_EXPOSURE,
                    severity=IssueSeverity.CRITICAL,
                    indices=[i],
                    description=f"Confidential marker found at index {i}",
                    suggestion="Remove or redact confidential content",
                    details={"pattern": "CONFIDENTIAL"},
                ))
        return issues

    def debug(self, data, labels=None):
        results = super().debug(data, labels)
        custom_issues = self._detect_custom_issues(data, labels)
        results.issues.extend(custom_issues)
        return results
```
### Batch Processing

```python
from training_data_debugger import DataDebugger

debugger = DataDebugger()

# Process large datasets in batches
def debug_in_batches(data, labels, batch_size=10000):
    all_issues = []
    for i in range(0, len(data), batch_size):
        batch_data = data[i:i + batch_size]
        batch_labels = labels[i:i + batch_size] if labels else None
        results = debugger.debug(batch_data, batch_labels)
        # Adjust indices for the batch offset
        for issue in results.issues:
            issue.indices = [idx + i for idx in issue.indices]
        all_issues.extend(results.issues)
    return all_issues
```
### Integration with Pandas

```python
import pandas as pd
from training_data_debugger import TabularDataDebugger

# Load your data
df = pd.read_csv("training_data.csv")

# Convert to a list of dicts for debugging
data = df.to_dict(orient="records")

debugger = TabularDataDebugger()
results = debugger.debug(data, label_column="target")

# Drop the problematic rows
# (positional indices line up with the default RangeIndex from read_csv)
problematic_indices = results.get_affected_indices()
clean_df = df.drop(index=problematic_indices)

print(f"Removed {len(problematic_indices)} problematic samples")
print(f"Clean dataset size: {len(clean_df)}")
```
## Statistics Reference

### Dataset Statistics

```python
stats = results.statistics

# Basic counts
stats.total_samples       # Total number of samples
stats.unique_samples      # Number of unique samples
stats.duplicate_count     # Number of duplicates
stats.missing_count       # Number of missing values

# Rates
stats.duplicate_rate      # Proportion of duplicates
stats.missing_rate        # Proportion of missing values

# Label information
stats.label_distribution  # Dict of label -> count
stats.class_weights       # Suggested weights for imbalance

# Text statistics (if applicable)
stats.avg_text_length     # Average text length
stats.min_text_length     # Minimum text length
stats.max_text_length     # Maximum text length
```
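The package does not document the formula behind `class_weights`, but a common convention for imbalance weights is inverse-frequency weighting, `total / (n_classes * count)`, sketched here as an assumption:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * count): rarer classes get larger weights."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}

weights = inverse_frequency_weights(["pos"] * 90 + ["neg"] * 10)
print(weights)  # {'pos': 0.55..., 'neg': 5.0}
```

Weights of this form can be passed to training APIs that accept per-class weights to counteract imbalance.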
### Column Statistics (Tabular Data)

```python
for col_name, col_stats in results.column_statistics.items():
    print(f"Column: {col_name}")
    print(f"  Type: {col_stats.data_type}")
    print(f"  Missing: {col_stats.missing_count}")
    print(f"  Unique: {col_stats.unique_count}")
    print(f"  Mean: {col_stats.mean}")
    print(f"  Std: {col_stats.std}")
```
## Best Practices
- Run Early: Debug your data before training to catch issues early
- Prioritize Critical Issues: Address CRITICAL and HIGH severity issues first
- Iterative Cleaning: Run the debugger multiple times as you fix issues
- Document Decisions: Keep track of why certain samples were removed/modified
- Validate Fixes: Verify that fixes don't introduce new problems
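The iterative-cleaning practice can be sketched generically. Here `find_issue_indices` is a hypothetical placeholder for any checker (such as a debug run that returns affected indices), not part of this library's API:

```python
def iterative_clean(data, find_issue_indices, max_rounds=5):
    """Repeatedly drop flagged samples until the checker reports no issues."""
    for _ in range(max_rounds):
        bad = set(find_issue_indices(data))
        if not bad:
            break
        data = [sample for i, sample in enumerate(data) if i not in bad]
    return data

# Toy checker: flag empty strings and exact duplicates of earlier samples
def toy_checker(samples):
    seen, flagged = set(), []
    for i, s in enumerate(samples):
        if s == "" or s in seen:
            flagged.append(i)
        seen.add(s)
    return flagged

print(iterative_clean(["a", "a", "", "b"], toy_checker))  # ['a', 'b']
```

Capping the number of rounds guards against checkers that keep flagging something on every pass.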
## API Reference

### Classes

- `DataDebugger` - Main debugger for general data
- `TextDataDebugger` - Specialized debugger for text data
- `TabularDataDebugger` - Specialized debugger for tabular data
- `DebugConfig` - Configuration for detection thresholds
- `DebugResults` - Container for debugging results
- `DataIssue` - Individual issue representation

### Enums

- `IssueType` - Types of data issues
- `IssueSeverity` - Issue severity levels (CRITICAL, HIGH, MEDIUM, LOW, INFO)
- `DataType` - Data type classifications
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

MIT License - see LICENSE for details.

## Author

Created by Pranay M
Found an issue? Report it on GitHub