DataLint

Automated data validation for ML teams
Find data quality issues before they break your models.



Overview

DataLint learns from clean datasets to automatically validate new data and prevent ML training failures. It catches the data quality issues that commonly derail ML projects before they break your models.

Key Features

Feature                  Description
Zero Configuration       Works out of the box with sensible defaults
ML-Focused               Optimized specifically for model training data quality
Learn from Data          Automatically generates validation rules from clean datasets
Schema Drift Detection   Catches when production data differs from training data
CI/CD Ready              JSON output for integration with automated pipelines

Installation

pip install datalint

Requirements: Python 3.8+


Quick Start

Validate a Dataset

datalint validate mydata.csv

Output:

Loaded dataset: 150 rows x 5 columns

  missing_values: No missing values found
  data_types: Data types appear consistent
  outliers: Outlier levels appear normal
  correlations: Found 1 highly correlated feature pair
  constant_columns: Found 1 column with constant values

Summary: 3 passed, 1 warning, 1 failed
Tip: Address failed checks before training ML models

Learn from Clean Data

# Create a validation profile from your training data
datalint profile training_data.csv --learn

# Validate new data against the learned profile
datalint profile new_data.csv --profile training_data_profile.json

Export for CI/CD

datalint validate data.csv --format json --output results.json
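The exact JSON layout of results.json isn't documented here; as an illustration, a CI gate over that file might look like the following sketch, which assumes each result object carries "name" and "status" keys (an assumption, not the documented format):

```python
import json

def gate(path: str) -> int:
    """Return a nonzero exit code if any check failed.

    Assumes each result object has 'name' and 'status' keys;
    the real JSON layout may differ.
    """
    with open(path) as f:
        results = json.load(f)
    failed = [r["name"] for r in results if r.get("status") == "failed"]
    if failed:
        print("Failed checks:", ", ".join(failed))
    return 1 if failed else 0

# Demo with a hand-written results file standing in for DataLint's output:
with open("results.json", "w") as f:
    json.dump([{"name": "missing_values", "status": "passed"},
               {"name": "constant_columns", "status": "failed"}], f)

print(gate("results.json"))  # 1
```

In a pipeline, the returned code would be passed to sys.exit() so a failed check fails the build.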

What It Checks

DataLint performs five core validation checks:

1. Missing Values

Identifies columns with excessive null values that will crash or degrade ML models.

# Example: 43% missing values in 'age' column
# Recommendation: Impute or remove before training
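DataLint's internal implementation isn't shown on this page; a minimal pandas sketch of such a check (the 40% threshold and the column names are illustrative):

```python
import pandas as pd

def missing_value_ratios(df: pd.DataFrame, threshold: float = 0.4) -> dict:
    """Return columns whose fraction of null values exceeds `threshold`."""
    ratios = df.isna().mean()  # per-column fraction of missing values
    return ratios[ratios > threshold].to_dict()

df = pd.DataFrame({"age": [34, None, None, 41, None, 29, None],
                   "id": range(7)})
print(missing_value_ratios(df))  # flags 'age' at ~57% missing
```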

2. Data Type Consistency

Detects mixed types (e.g., numbers and text in the same column) that cause parsing errors.

# Example: price column has [10.99, 25.50, 'FREE', 15.00]
# Recommendation: Convert to consistent type
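The check's actual code isn't reproduced here; one simple way to detect mixed types in a pandas column, as a sketch:

```python
import pandas as pd

def mixed_type_columns(df: pd.DataFrame) -> dict:
    """Report object-dtype columns that hold more than one Python type."""
    report = {}
    for col in df.select_dtypes(include="object"):
        types = df[col].dropna().map(lambda v: type(v).__name__).unique()
        if len(types) > 1:
            report[col] = sorted(types)
    return report

df = pd.DataFrame({"price": [10.99, 25.50, "FREE", 15.00]})
print(mixed_type_columns(df))  # {'price': ['float', 'str']}
```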

3. Outlier Detection

Uses the IQR (Interquartile Range) method to find statistical anomalies that can dominate model training.

# Example: salary column has values [50k, 55k, 48k, 5M]
# Recommendation: Investigate or cap extreme values
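The IQR method flags values outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR]. A minimal pandas sketch of the idea (the multiplier 1.5 is the conventional default, not necessarily DataLint's):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

salaries = pd.Series([50_000, 55_000, 48_000, 5_000_000])
print(iqr_outliers(salaries).tolist())  # [5000000]
```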

4. High Correlations

Finds feature pairs with >95% correlation that provide redundant information.

# Example: height_cm and height_inches are 100% correlated
# Recommendation: Remove one redundant feature
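A correlation check of this kind can be sketched with the pandas correlation matrix (the 0.95 threshold matches the text; the data below is illustrative):

```python
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j]))
    return pairs

df = pd.DataFrame({"height_cm": [170, 180, 160, 175, 165],
                   "height_in": [66.93, 70.87, 62.99, 68.90, 64.96],
                   "age": [30, 40, 25, 22, 38]})
print(correlated_pairs(df))  # [('height_cm', 'height_in')]
```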

5. Constant Columns

Detects columns with zero variance that provide no predictive information.

# Example: 'country' column is 'USA' for all rows
# Recommendation: Remove before training
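A zero-variance check reduces to counting distinct values per column; a one-line pandas sketch:

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    """Return columns with at most one distinct non-null value."""
    return [c for c in df.columns if df[c].nunique(dropna=True) <= 1]

df = pd.DataFrame({"country": ["USA"] * 4, "sales": [10, 20, 15, 30]})
print(constant_columns(df))  # ['country']
```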

Comparison with Other Tools

Feature            DataLint     Great Expectations    Pandera                Deequ
Zero config        Yes          No (YAML required)    No (schema required)   No
Auto-learn rules   Yes          No                    No                     Partial
ML-focused         Yes          General               General                General
Setup time         5 minutes    Hours/days            Hours                  Hours
Pricing            Free         Free                  Free                   Free (AWS)

Architecture

datalint/
├── cli.py              # Command-line interface
├── engine/
│   ├── validators.py   # Core validation checks
│   ├── learner.py      # Rule learning from clean data
│   └── profiler.py     # Statistical profiling
└── utils/
    ├── io.py           # File loading (CSV, Excel, Parquet)
    └── reporting.py    # Output formatter (text, JSON, HTML)

Architecture Diagrams

Class Diagram

Shows the class hierarchy and relationships

classDiagram
    class BaseValidator {
        <<abstract>>
        +String name*
        +ValidationResult validate(DataFrame df)*
        +String __repr__()
    }

    class Formatter {
        <<abstract>>
        +String format(List~ValidationResult~ results)*
    }

    class ValidationResult {
        +String name
        +String status
        +String message
        +List issues
        +List recommendations
        +Dict details
        +Boolean passed
        +Dict to_dict()
    }

    class ValidationRunner {
        -List~BaseValidator~ validators
        +ValidationRunner(List~BaseValidator~ validators)
        +void add_validator(BaseValidator validator)
        +List~ValidationResult~ run(DataFrame df)
        +Dict~String,ValidationResult~ run_dict(DataFrame df)
    }

    class ConcreteValidator {
        +String name
        +ValidationResult validate(DataFrame df)
    }

    class ConcreteFormatter {
        +String format(List~ValidationResult~ results)
    }

    BaseValidator <|.. ConcreteValidator : implements
    Formatter <|.. ConcreteFormatter : implements
    ValidationRunner --> BaseValidator : uses
    BaseValidator --> ValidationResult : returns
    ConcreteValidator --> ValidationResult : returns
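The class diagram above translates to Python roughly as follows. This is a sketch reconstructed from the diagram, not the package's actual source:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class ValidationResult:
    name: str
    status: str  # 'passed', 'warning', or 'failed'
    message: str = ""
    issues: list = field(default_factory=list)
    recommendations: list = field(default_factory=list)
    details: dict = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        return self.status == "passed"

    def to_dict(self) -> dict:
        return {"name": self.name, "status": self.status,
                "message": self.message, "issues": self.issues}

class BaseValidator(ABC):
    name: str

    @abstractmethod
    def validate(self, df) -> ValidationResult: ...

class ValidationRunner:
    def __init__(self, validators=None):
        self.validators = list(validators or [])

    def add_validator(self, validator: BaseValidator) -> None:
        self.validators.append(validator)

    def run(self, df) -> list:
        # Each validator inspects the DataFrame and returns a ValidationResult.
        return [v.validate(df) for v in self.validators]

    def run_dict(self, df) -> dict:
        return {r.name: r for r in self.run(df)}
```

Concrete checks (missing values, outliers, and so on) would subclass BaseValidator and be registered on a ValidationRunner, which the CLI drives.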

Interface Diagram

Shows key interfaces and abstraction contracts

classDiagram
    class BaseValidator {
        <<abstract>>
        +name: str*
        +validate(df: DataFrame): ValidationResult*
    }

    class Formatter {
        <<abstract>>
        +format(results: List[ValidationResult]): str*
    }

    class ValidationResult {
        +name: str
        +status: Literal['passed', 'warning', 'failed']
        +message: str
        +issues: List
        +recommendations: List
        +details: Dict
        +passed: bool
        +to_dict(): Dict
    }

    class ValidationRunner {
        -validators: List[BaseValidator]
        +__init__(validators=None)
        +add_validator(validator: BaseValidator)
        +run(df: DataFrame): List[ValidationResult]
        +run_dict(df: DataFrame): Dict[str, ValidationResult]
    }

    BaseValidator <|.. ConcreteValidator : implements
    Formatter <|.. ConcreteFormatter : implements
    ValidationRunner --> BaseValidator : uses
    BaseValidator --> ValidationResult : returns

Component Diagram

Illustrates high-level software components

graph TD
    CLI[Command Line Interface]
    ENG[Core Validation Engine] 
    UTI[Utility Functions]

    CLI --> ENG
    CLI --> UTI
    ENG --> UTI

Deployment Diagram

Shows how the system is deployed

graph TD
    subgraph Local[Local Machine]
        Python[Python Environment]
        DataLint[DataLint Package]
    end
    Data[Data Files]
    Reports[Output Reports]

    DataLint --> Data
    DataLint --> Reports
    Python --> DataLint

Sequence Diagram

Displays the validation workflow sequence

sequenceDiagram
    participant U as User
    participant C as CLI
    participant V as ValidationRunner
    participant B as BaseValidator
    participant D as DataFrame

    U->>C: datalint validate file.csv
    C->>V: run(df)
    loop for each validator
        V->>B: validate(df)
        B->>D: analyze data
        D-->>B: return analysis
        B-->>V: ValidationResult
    end
    V-->>C: results list
    C-->>U: formatted output

Activity Diagram

Shows the validation pipeline activities

flowchart TD
    Start([Start])
    Run[User runs datalint validate]
    Parse[Parse command line arguments]
    Load[Load data file]
    Check{File loaded successfully?}
    Init[Initialize ValidationRunner]
    Validate[Run all validators]
    CheckResult{Validation passed?}
    Success[Generate success report]
    Fail[Generate failure report]
    Recomm[Show recommendations]
    Error[Show error message]
    Exit([Exit])

    Start --> Run
    Run --> Parse
    Parse --> Load
    Load --> Check
    Check -->|Yes| Init
    Init --> Validate
    Validate --> CheckResult
    CheckResult -->|Yes| Success
    CheckResult -->|No| Fail
    Fail --> Recomm
    Success --> Exit
    Recomm --> Exit
    Check -->|No| Error
    Error --> Exit

Use Case Diagram

Illustrates user interactions with the system

flowchart LR
    DS([Data Scientist])
    MLE([ML Engineer])
    DevOps([DevOps Engineer])

    UC1[Validate Dataset]
    UC2[Learn from Clean Data]
    UC3[Profile Data Quality]
    UC4[Generate Reports]
    UC5[CI/CD Integration]

    DS --> UC1
    DS --> UC2
    MLE --> UC3
    DevOps --> UC5
    UC1 --> UC4
    UC2 --> UC4
    UC3 --> UC4

Roadmap

  • Phase 1: Core validation engine with CLI
  • Phase 2: Learning system (profile command with --learn and --profile)
  • Phase 3: HTML reports + GitHub Actions integration
  • Phase 4: Web dashboard + team collaboration

Contributing

DataLint is in active development. We welcome contributions:

  • Bug Reports: Open an issue with reproduction steps
  • Feature Requests: Describe your use case
  • Pull Requests: See CONTRIBUTING.md for guidelines
  • Feedback: Share your experience using DataLint

License

MIT License - see LICENSE for details.


DataLint - Because good models start with good data.
