DataLint
Automated data validation for ML teams
Find data quality issues before they break your models.
Overview
DataLint learns from clean datasets to automatically validate new data and prevent training failures, catching the data quality issues that cause 60% of ML project failures before they break your models.
Key Features
| Feature | Description |
|---|---|
| Zero Configuration | Works out of the box with sensible defaults |
| ML-Focused | Optimized specifically for model training data quality |
| Learn from Data | Automatically generates validation rules from clean datasets |
| Schema Drift Detection | Catches when production data differs from training data |
| CI/CD Ready | JSON output for integration with automated pipelines |
Installation
pip install datalint
Requirements: Python 3.8+
Quick Start
Validate a Dataset
datalint validate mydata.csv
Output:
Loaded dataset: 150 rows x 5 columns
missing_values: No missing values found
data_types: Data types appear consistent
outliers: Outlier levels appear normal
correlations: Found 1 highly correlated feature pair
constant_columns: Found 1 column with constant values
Summary: 3 passed, 1 warning, 1 failed
Tip: Address failed checks before training ML models
Learn from Clean Data
# Create a validation profile from your training data
datalint profile training_data.csv --learn
# Validate new data against the learned profile
datalint profile new_data.csv --profile training_data_profile.json
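Under the hood, a learned profile amounts to recorded per-column statistics that new data is checked against. The sketch below is a hypothetical illustration of that flow; the actual profile format produced by `datalint profile --learn` is not documented here, and the function names are invented for the example.

```python
# Hypothetical profile-then-validate flow (not the real datalint internals).
import pandas as pd

def learn_profile(df: pd.DataFrame) -> dict:
    """Record simple per-column statistics from a known-clean dataset."""
    profile = {}
    for col in df.columns:
        series = df[col]
        stats = {"dtype": str(series.dtype),
                 "null_rate": float(series.isna().mean())}
        if pd.api.types.is_numeric_dtype(series):
            stats["min"] = float(series.min())
            stats["max"] = float(series.max())
        profile[col] = stats
    return profile

def check_against_profile(df: pd.DataFrame, profile: dict) -> list:
    """Return human-readable drift warnings for a new dataset."""
    issues = []
    for col, stats in profile.items():
        if col not in df.columns:
            issues.append(f"missing column: {col}")
            continue
        if str(df[col].dtype) != stats["dtype"]:
            issues.append(f"{col}: dtype changed to {df[col].dtype}")
        if "max" in stats and df[col].max() > stats["max"]:
            issues.append(f"{col}: values exceed training max {stats['max']}")
    return issues

clean = pd.DataFrame({"age": [25, 32, 41]})
new = pd.DataFrame({"age": [29, 140]})
print(check_against_profile(new, learn_profile(clean)))
# ['age: values exceed training max 41.0']
```

The key design point is that the profile is learned once from trusted data and then applied cheaply to every new batch, which is what makes schema drift detection practical in a pipeline.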
Export for CI/CD
datalint validate data.csv --format json --output results.json
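A CI job can then gate on the exported report. The exact schema of `results.json` is an assumption here (a list of checks, each with `name` and `status` fields); adapt the sketch to whatever the JSON output actually contains.

```python
# Sketch of a CI gate over the exported JSON report.
# The report schema below is assumed, not taken from datalint docs.
import json

def gate(path: str) -> int:
    """Return a nonzero exit code when any check failed."""
    with open(path) as fh:
        results = json.load(fh)
    failed = [r["name"] for r in results if r["status"] == "failed"]
    if failed:
        print("Failed checks:", ", ".join(failed))
        return 1
    print("All checks passed")
    return 0

# Demo with a fabricated report file:
with open("results.json", "w") as fh:
    json.dump([{"name": "missing_values", "status": "passed"},
               {"name": "constant_columns", "status": "failed"}], fh)
print(gate("results.json"))  # 1
```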
What It Checks
DataLint performs five core validation checks:
1. Missing Values
Identifies columns with excessive null values that will crash or degrade ML models.
# Example: 43% missing values in 'age' column
# Recommendation: Impute or remove before training
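In pandas terms, this check is essentially a per-column null-rate scan. A minimal sketch, assuming a 50% threshold (DataLint's actual cutoff is not stated here):

```python
# Flag columns whose null fraction exceeds a threshold (assumed 0.5).
import pandas as pd

def missing_value_check(df: pd.DataFrame, threshold: float = 0.5) -> dict:
    """Map column name -> null fraction for columns above the threshold."""
    rates = df.isna().mean()
    return {col: float(rate) for col, rate in rates.items() if rate > threshold}

df = pd.DataFrame({"age": [34, None, None, 41, None],
                   "city": ["NY", "LA", "SF", "NY", "LA"]})
print(missing_value_check(df))  # {'age': 0.6}
```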
2. Data Type Consistency
Detects mixed types (e.g., numbers and text in the same column) that cause parsing errors.
# Example: price column has [10.99, 25.50, 'FREE', 15.00]
# Recommendation: Convert to consistent type
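One straightforward way to detect this is a per-cell type scan over object-dtype columns; the sketch below is an illustration, not necessarily how DataLint's validator is implemented.

```python
# Report object columns containing more than one Python type.
import pandas as pd

def mixed_type_columns(df: pd.DataFrame) -> dict:
    """Map column -> sorted type names when more than one type is present."""
    result = {}
    for col in df.select_dtypes(include="object"):
        kinds = {type(v).__name__ for v in df[col].dropna()}
        if len(kinds) > 1:
            result[col] = sorted(kinds)
    return result

df = pd.DataFrame({"price": [10.99, 25.50, "FREE", 15.00]})
print(mixed_type_columns(df))  # {'price': ['float', 'str']}
```

Note that a single stray string silently turns a numeric column into object dtype, which is why the scan targets object columns.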
3. Outlier Detection
Uses the IQR (Interquartile Range) method to find statistical anomalies that can dominate model training.
# Example: salary column has values [50k, 55k, 48k, 5M]
# Recommendation: Investigate or cap extreme values
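The IQR rule in its common form flags values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]; the 1.5 multiplier below is the textbook default, assumed rather than confirmed for DataLint.

```python
# Classic IQR fence: values outside [Q1 - k*IQR, Q3 + k*IQR] are outliers.
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]

salaries = pd.Series([50_000, 55_000, 48_000, 52_000, 5_000_000])
print(iqr_outliers(salaries).tolist())  # [5000000]
```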
4. High Correlations
Finds feature pairs with >95% correlation that provide redundant information.
# Example: height_cm and height_inches are 100% correlated
# Recommendation: Remove one redundant feature
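A sketch of the redundancy check using Pearson correlation; the 0.95 cutoff mirrors the ">95%" figure above, and the column names are illustrative.

```python
# List off-diagonal feature pairs whose |correlation| exceeds a cutoff.
import pandas as pd

def correlated_pairs(df: pd.DataFrame, cutoff: float = 0.95) -> list:
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only
            if corr.iloc[i, j] > cutoff:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

df = pd.DataFrame({"height_cm": [170, 180, 165, 175]})
df["height_inches"] = df["height_cm"] / 2.54  # perfectly linear: r = 1.0
df["weight"] = [70, 90, 60, 65]
print(correlated_pairs(df))
```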
5. Constant Columns
Detects columns with zero variance that provide no predictive information.
# Example: 'country' column is 'USA' for all rows
# Recommendation: Remove before training
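A zero-variance check reduces to counting distinct values per column; whether nulls count as a value is a design choice in this sketch, not necessarily DataLint's.

```python
# Columns with at most one distinct non-null value carry no signal.
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    return [col for col in df.columns if df[col].nunique(dropna=True) <= 1]

df = pd.DataFrame({"country": ["USA"] * 4, "age": [25, 32, 41, 29]})
print(constant_columns(df))  # ['country']
```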
Comparison with Other Tools
| Feature | DataLint | Great Expectations | Pandera | Deequ |
|---|---|---|---|---|
| Zero config | Yes | No (YAML required) | No (schema required) | No |
| Auto-learn rules | Yes | No | No | Partial |
| ML-focused | Yes | General | General | General |
| Setup time | 5 minutes | Hours/Days | Hours | Hours |
| Pricing | Free | Free | Free | Free (AWS) |
Architecture
datalint/
├── cli.py # Command-line interface
├── engine/
│ ├── validators.py # Core validation checks
│ ├── learner.py # Rule learning from clean data
│ └── profiler.py # Statistical profiling
└── utils/
├── io.py # File loading (CSV, Excel, Parquet)
└── reporting.py # Output formatter (text, JSON, HTML)
Architecture Diagrams
Class Diagram
Shows the class hierarchy and relationships
classDiagram
class BaseValidator {
<<abstract>>
+String name*
+ValidationResult validate(DataFrame df)*
+String __repr__()
}
class Formatter {
<<abstract>>
+String format(List~ValidationResult~ results)*
}
class ValidationResult {
+String name
+String status
+String message
+List issues
+List recommendations
+Dict details
+Boolean passed
+Dict to_dict()
}
class ValidationRunner {
-List~BaseValidator~ validators
+ValidationRunner(List~BaseValidator~ validators)
+void add_validator(BaseValidator validator)
+List~ValidationResult~ run(DataFrame df)
+Dict~String,ValidationResult~ run_dict(DataFrame df)
}
class ConcreteValidator {
+String name
+ValidationResult validate(DataFrame df)
}
class ConcreteFormatter {
+String format(List~ValidationResult~ results)
}
BaseValidator <|.. ConcreteValidator : implements
Formatter <|.. ConcreteFormatter : implements
ValidationRunner --> BaseValidator : uses
BaseValidator --> ValidationResult : returns
ConcreteValidator --> ValidationResult : returns
Interface Diagram
Shows key interfaces and abstraction contracts
classDiagram
class BaseValidator {
<<abstract>>
+name: str*
+validate(df: DataFrame): ValidationResult*
}
class Formatter {
<<abstract>>
+format(results: List[ValidationResult]): str*
}
class ValidationResult {
+name: str
+status: Literal['passed', 'warning', 'failed']
+message: str
+issues: List
+recommendations: List
+details: Dict
+passed: bool
+to_dict(): Dict
}
class ValidationRunner {
-validators: List[BaseValidator]
+__init__(validators=None)
+add_validator(validator: BaseValidator)
+run(df: DataFrame): List[ValidationResult]
+run_dict(df: DataFrame): Dict[str, ValidationResult]
}
BaseValidator <|.. ConcreteValidator : implements
Formatter <|.. ConcreteFormatter : implements
ValidationRunner --> BaseValidator : uses
BaseValidator --> ValidationResult : returns
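The contracts in the two diagrams above can be rendered as a short Python sketch. This is one interpretation of the diagrammed interfaces; the actual datalint source may differ in details, and the concrete validator here is an invented example.

```python
# Minimal rendering of the diagrammed contracts (illustrative, not the
# real datalint source).
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

import pandas as pd

@dataclass
class ValidationResult:
    name: str
    status: str  # 'passed' | 'warning' | 'failed'
    message: str
    issues: list = field(default_factory=list)
    recommendations: list = field(default_factory=list)
    details: dict = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        return self.status == "passed"

    def to_dict(self) -> dict:
        return {"name": self.name, "status": self.status,
                "message": self.message, "issues": self.issues,
                "recommendations": self.recommendations,
                "details": self.details}

class BaseValidator(ABC):
    name: str

    @abstractmethod
    def validate(self, df: pd.DataFrame) -> ValidationResult: ...

class ValidationRunner:
    def __init__(self, validators=None):
        self.validators = list(validators or [])

    def add_validator(self, validator: BaseValidator) -> None:
        self.validators.append(validator)

    def run(self, df: pd.DataFrame) -> list:
        return [v.validate(df) for v in self.validators]

    def run_dict(self, df: pd.DataFrame) -> dict:
        return {r.name: r for r in self.run(df)}

class ConstantColumnValidator(BaseValidator):
    """Example concrete validator: flags zero-variance columns."""
    name = "constant_columns"

    def validate(self, df: pd.DataFrame) -> ValidationResult:
        constant = [c for c in df.columns if df[c].nunique() <= 1]
        status = "failed" if constant else "passed"
        return ValidationResult(self.name, status,
                                f"Found {len(constant)} constant column(s)",
                                issues=constant)

runner = ValidationRunner([ConstantColumnValidator()])
results = runner.run(pd.DataFrame({"country": ["USA", "USA"],
                                   "age": [25, 30]}))
print([r.to_dict()["status"] for r in results])  # ['failed']
```

The abstract base class is what makes the runner extensible: any object exposing `name` and `validate(df)` can be added with `add_validator` without touching the runner itself.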
Component Diagram
Illustrates high-level software components
graph TD
CLI[Command Line Interface]
ENG[Core Validation Engine]
UTI[Utility Functions]
CLI --> ENG
CLI --> UTI
ENG --> UTI
Deployment Diagram
Shows how the system is deployed
graph TD
subgraph Local[Local Machine]
Python[Python Environment]
DataLint[DataLint Package]
end
Data[Data Files]
Reports[Output Reports]
DataLint --> Data
DataLint --> Reports
Python --> DataLint
Sequence Diagram
Displays the validation workflow sequence
sequenceDiagram
participant U as User
participant C as CLI
participant V as ValidationRunner
participant B as BaseValidator
participant D as DataFrame
U->>C: datalint validate file.csv
C->>V: run(df)
loop for each validator
V->>B: validate(df)
B->>D: analyze data
D-->>B: return analysis
B-->>V: ValidationResult
end
V-->>C: results list
C-->>U: formatted output
Activity Diagram
Shows the validation pipeline activities
flowchart TD
Start([Start])
Run[User runs datalint validate]
Parse[Parse command line arguments]
Load[Load data file]
Check{File loaded successfully?}
Init[Initialize ValidationRunner]
Validate[Run all validators]
CheckResult{Validation passed?}
Success[Generate success report]
Fail[Generate failure report]
Recomm[Show recommendations]
Error[Show error message]
Exit([Exit])
Start --> Run
Run --> Parse
Parse --> Load
Load --> Check
Check -->|Yes| Init
Init --> Validate
Validate --> CheckResult
CheckResult -->|Yes| Success
CheckResult -->|No| Fail
Fail --> Recomm
Success --> Exit
Recomm --> Exit
Check -->|No| Error
Error --> Exit
Use Case Diagram
Illustrates user interactions with the system
flowchart LR
DS([Data Scientist])
MLE([ML Engineer])
DevOps([DevOps Engineer])
UC1[Validate Dataset]
UC2[Learn from Clean Data]
UC3[Profile Data Quality]
UC4[Generate Reports]
UC5[CI/CD Integration]
DS --> UC1
DS --> UC2
MLE --> UC3
DevOps --> UC5
UC1 --> UC4
UC2 --> UC4
UC3 --> UC4
Roadmap
- Phase 1: Core validation engine with CLI
- Phase 2: Learning system (profile command with `--learn` and `--profile`)
- Phase 3: HTML reports + GitHub Actions integration
- Phase 4: Web dashboard + team collaboration
Contributing
DataLint is in active development. We welcome contributions:
- Bug Reports: Open an issue with reproduction steps
- Feature Requests: Describe your use case
- Pull Requests: See `CONTRIBUTING.md` for guidelines
- Feedback: Share your experience using DataLint
License
MIT License - see LICENSE for details.
DataLint - Because good models start with good data.
Download files
File details
Details for the file datalint-0.1.0.tar.gz.
File metadata
- Download URL: datalint-0.1.0.tar.gz
- Upload date:
- Size: 24.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `6732dccef199763dae728de5c074871c4e914007736f9c65e76dd56293884378` |
| MD5 | `05698cb048ea64a2aa54422157884241` |
| BLAKE2b-256 | `0cb07cc339e2e30f521adae580fea8178d9386982e68156de3741dca457f1e60` |
File details
Details for the file datalint-0.1.0-py3-none-any.whl.
File metadata
- Download URL: datalint-0.1.0-py3-none-any.whl
- Upload date:
- Size: 24.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `455bc03d8f84f49335a48987a9e75cacb9f3e876c4672cf4e250e4889ffacf03` |
| MD5 | `a25882f767c53ce6d071c5d624a30224` |
| BLAKE2b-256 | `947ea8e20deb679efd116be772d8334eb93a6e27a8d8d5887a9e3e36663d604f` |