DataTidy
A powerful, configuration-driven data processing and cleaning package for Python with robust fallback capabilities. DataTidy lets you define complex data transformations, validations, and cleaning steps in simple YAML configuration files, and its fallback mechanisms are designed to keep production pipelines returning data even when a transformation fails.
🚀 Key Features
- 🔧 Configuration-Driven: Define all transformations in YAML - no code required
- 📊 Multiple Data Sources: CSV, Excel, databases (PostgreSQL, MySQL, Snowflake, etc.)
- 🔗 Multi-Input Joins: Combine data from multiple sources with flexible join operations
- ⚡ Advanced Operations: Map/reduce/filter with lambda functions and chained operations
- 🧠 Dependency Resolution: Automatic execution order planning for complex transformations
- 📈 Time Series Support: Lag operations and rolling window calculations
- 🛡️ Safe Expressions: Secure evaluation with whitelist-based security
- 🎯 Data Validation: Comprehensive validation rules with detailed error reporting
- ⚙️ CLI Interface: Easy-to-use command-line tools for batch processing
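The dependency resolution feature listed above can be pictured as a topological sort over column dependencies. The following is a minimal sketch using only the standard library, not DataTidy's actual planner; the column names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical derived columns, each mapped to the columns its expression reads.
dependencies = {
    "net_price": {"price", "discount"},
    "tax": {"net_price"},
    "total": {"net_price", "tax"},
}


def execution_order(deps, source_columns):
    """Return derived columns so that every dependency is computed first."""
    sorter = TopologicalSorter(deps)
    # static_order() raises graphlib.CycleError on circular references.
    return [col for col in sorter.static_order() if col not in source_columns]


order = execution_order(dependencies, source_columns={"price", "discount"})
print(order)  # ['net_price', 'tax', 'total']
```

A cycle such as `a` depending on `b` and `b` on `a` would surface as an error at planning time rather than midway through processing.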
🔄 Enhanced Fallback System (v0.1.0)
- 🛡️ 100% Reliability: Automatic fallback mechanisms are designed so that the consuming application (for example, a dashboard) always receives data
- ⚖️ Graceful Degradation: Applies the full transformations when possible and returns basic data when needed
- 🔍 Enhanced Error Logging: Detailed error categorization with actionable debugging suggestions
- 📊 Data Quality Metrics: Compare DataTidy results with fallback data for quality assessment
- 🎛️ Multiple Processing Modes: Strict, partial, and fallback modes for different reliability requirements
- 🔧 Partial Processing: Skip problematic columns while processing successful ones
- 📋 Processing Recommendations: Get specific suggestions for improving configurations
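To make the three processing modes concrete, here is a small plain-Python sketch of how strict, partial, and fallback handling of a failing column could differ. It illustrates the idea only; `process_columns` and its arguments are hypothetical and are not part of DataTidy's API.

```python
def process_columns(row, transforms, mode="strict", fallbacks=None):
    """Apply per-column transforms to one record under a given mode.

    strict   -> re-raise the first error
    partial  -> skip columns whose transform fails
    fallback -> substitute a default value for failed columns
    """
    fallbacks = fallbacks or {}
    output, failed = {}, []
    for name, func in transforms.items():
        try:
            output[name] = func(row)
        except Exception:
            if mode == "strict":
                raise
            failed.append(name)
            if mode == "fallback" and name in fallbacks:
                output[name] = fallbacks[name]
    return output, failed


row = {"name": " ada lovelace ", "age": None}
transforms = {
    "full_name": lambda r: r["name"].strip().title(),
    # Comparing None with an int raises TypeError, so this column fails.
    "age_group": lambda r: "adult" if r["age"] >= 18 else "minor",
}

partial_out, partial_failed = process_columns(row, transforms, mode="partial")
fallback_out, _ = process_columns(
    row, transforms, mode="fallback", fallbacks={"age_group": "unknown"}
)
print(partial_out)   # {'full_name': 'Ada Lovelace'}
print(fallback_out)  # {'full_name': 'Ada Lovelace', 'age_group': 'unknown'}
```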
Installation
pip install datatidy
For development installation:
git clone https://github.com/your-repo/datatidy.git
cd datatidy
pip install -e ".[dev]"
Quick Start
1. Create a sample configuration
datatidy sample config.yaml
2. Process your data
datatidy process config.yaml -i input.csv -o output.csv
3. Or use programmatically
from datatidy import DataTidy
# Initialize with configuration
dt = DataTidy('config.yaml')
# Standard processing
result = dt.process_data('input.csv')
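# Enhanced processing with fallback (the returned object exposes fields such as fallback_used)
fallback_result = dt.process_data_with_fallback('input.csv')
# Process the input and save the output in one step
dt.process_and_save('output.csv', 'input.csv')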
Configuration Structure
DataTidy uses YAML configuration files to define data processing pipelines:
input:
  type: csv # csv, excel, database
  source: "data/input.csv" # file path or SQL query
  options:
    encoding: utf-8
    delimiter: ","
output:
  columns:
    user_id:
      source: "id" # Source column name
      type: int # Data type conversion
      validation:
        required: true
        min_value: 1
    full_name:
      source: "name"
      type: string
      transformation: "str.title()" # Python expression
      validation:
        required: true
        min_length: 2
        max_length: 100
    age_group:
      transformation: "'adult' if age >= 18 else 'minor'"
      type: string
      validation:
        allowed_values: ["adult", "minor"]
  filters:
    - condition: "age >= 0"
      action: keep
  sort:
    - column: user_id
      ascending: true
global_settings:
  ignore_errors: false
  max_errors: 100
  # Enhanced fallback settings
  processing_mode: partial # strict, partial, or fallback
  enable_partial_processing: true
  enable_fallback: true
  max_column_failures: 5
  failure_threshold: 0.3 # 30% failure rate triggers fallback
  # Fallback transformations for problematic columns
  fallback_transformations:
    age_group:
      type: default_value
      value: "unknown"
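The `failure_threshold` and `max_column_failures` settings suggest a simple decision rule for when to give up on partial processing. The sketch below is one plausible reading, assuming the failure rate is measured as failed columns over total columns; the precise rule DataTidy applies is not documented here, and `should_use_fallback` is a hypothetical helper.

```python
def should_use_fallback(failed_columns, total_columns,
                        failure_threshold=0.3, max_column_failures=5):
    """Decide whether to abandon partial processing and fall back.

    Assumption: the failure rate is failed columns / total columns.
    """
    if total_columns == 0:
        return False
    failure_rate = failed_columns / total_columns
    return (failure_rate > failure_threshold
            or failed_columns > max_column_failures)


print(should_use_fallback(1, 10))   # False: 10% failure rate
print(should_use_fallback(4, 10))   # True: 40% exceeds the 30% threshold
print(should_use_fallback(6, 100))  # True: more than 5 failed columns
```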
Examples
Basic CSV Processing
from datatidy import DataTidy
config = {
    "input": {
        "type": "csv",
        "source": "users.csv"
    },
    "output": {
        "columns": {
            "clean_name": {
                "source": "name",
                "transformation": "str.strip().title()",
                "type": "string"
            },
            "age_category": {
                "transformation": "'senior' if age > 65 else ('adult' if age >= 18 else 'minor')",
                "type": "string"
            }
        }
    }
}
dt = DataTidy()
dt.load_config(config)
result = dt.process_data()
print(result)
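For readers who want to see what the two transformation expressions above compute, here is a plain-Python equivalent applied to a few sample records. It does not use DataTidy; the sample rows and the helper functions are made up for illustration.

```python
def clean_name(name):
    # Same effect as the "str.strip().title()" transformation on the name column.
    return name.strip().title()


def age_category(age):
    # Same conditional expression as in the configuration above.
    return "senior" if age > 65 else ("adult" if age >= 18 else "minor")


rows = [
    {"name": "  alice smith ", "age": 70},
    {"name": "BOB", "age": 30},
    {"name": "carol", "age": 12},
]

result = [
    {"clean_name": clean_name(r["name"]), "age_category": age_category(r["age"])}
    for r in rows
]
for record in result:
    print(record)
```

Note that an age of exactly 65 falls into `adult`, because the first comparison is a strict `>`.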
Database Processing
input:
  type: database
  source:
    query: "SELECT * FROM users WHERE active = true"
    connection_string: "postgresql://user:pass@localhost/db"
output:
  columns:
    user_email:
      source: "email"
      type: string
      validation:
        pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
    signup_date:
      source: "created_at"
      type: datetime
      format: "%Y-%m-%d"
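The `pattern` and `format` values in this configuration are ordinary Python regular-expression and `strptime` strings, so they can be tried out on their own before being placed in a config. A minimal standard-library sketch follows; the helper names are hypothetical, and whether DataTidy uses `format` for parsing, for output, or both is an assumption.

```python
import re
from datetime import datetime

# Same pattern as in the YAML above (the YAML "\\." is a single escaped dot).
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def is_valid_email(value):
    """Return True when the value matches the email pattern."""
    return EMAIL_PATTERN.match(value) is not None


def parse_signup_date(value, fmt="%Y-%m-%d"):
    """Parse a date string; raises ValueError if it does not match fmt."""
    return datetime.strptime(value, fmt)


print(is_valid_email("ada@example.com"))  # True
print(is_valid_email("user@host"))        # False: no top-level domain
print(parse_signup_date("2024-03-01"))    # 2024-03-01 00:00:00
```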
Excel Processing with Complex Transformations
input:
  type: excel
  source:
    path: "sales_data.xlsx"
    sheet_name: "Q1_Sales"
  options:
    header: 0
    skiprows: 2
output:
  columns:
    revenue_category:
      transformation: |
        'high' if revenue > 100000 else (
          'medium' if revenue > 50000 else 'low'
        )
      validation:
        allowed_values: ["high", "medium", "low"]
    formatted_date:
      source: "sale_date"
      type: datetime
      format: "%Y-%m-%d"
    clean_product_name:
      source: "product"
      transformation: "str.strip().upper().replace('_', ' ')"
      validation:
        min_length: 1
        max_length: 50
  filters:
    - condition: "revenue > 0"
      action: keep
    - condition: "product != 'DELETED'"
      action: keep
Enhanced Fallback Processing
Production-Ready Data Processing
import logging
import pandas as pd
from datatidy import DataTidy

logger = logging.getLogger(__name__)

# Initialize with fallback-enabled configuration
dt = DataTidy('config.yaml')

# Define fallback database query
# (db_connection is assumed to be an existing database connection)
def fallback_database_query():
    return pd.read_sql("SELECT * FROM facilities", db_connection)

# Process with guaranteed results
# (input_df is assumed to be a DataFrame loaded earlier)
result = dt.process_data_with_fallback(
    data=input_df,
    fallback_query_func=fallback_database_query
)

# Your application always gets data!
if result.fallback_used:
    logger.warning("DataTidy processing failed, using database fallback")

# Check processing results
summary = dt.get_processing_summary()
print(f"Success: {summary['success']}")
print(f"Successful columns: {summary['successful_columns']}")
print(f"Failed columns: {summary['failed_columns']}")

# Get improvement recommendations
recommendations = dt.get_processing_recommendations()
for rec in recommendations:
    print(f"💡 {rec}")

# Compare data quality when both available
if not result.fallback_used:
    fallback_data = fallback_database_query()
    quality = dt.compare_with_fallback(fallback_data)
    print(f"Overall quality score: {quality.overall_quality_score:.2f}")
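The pattern behind this example (try the full pipeline, then fall back to a simpler data source) can be sketched without DataTidy at all. The following is a generic illustration of that pattern, not the library's implementation; `FallbackOutcome`, `run_with_fallback`, and the two sample functions are hypothetical names.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class FallbackOutcome:
    """Hypothetical result holder; not DataTidy's processing result class."""
    data: list
    fallback_used: bool
    errors: list = field(default_factory=list)


def run_with_fallback(primary, fallback):
    """Run primary(); if it raises, log the error and return fallback() instead."""
    try:
        return FallbackOutcome(data=primary(), fallback_used=False)
    except Exception as exc:
        logger.warning("Primary processing failed (%s); using fallback", exc)
        return FallbackOutcome(
            data=fallback(), fallback_used=True, errors=[str(exc)]
        )


def failing_pipeline():
    raise ValueError("column 'age' is missing")


def raw_query():
    return [{"id": 1}, {"id": 2}]


outcome = run_with_fallback(failing_pipeline, raw_query)
print(outcome.fallback_used)  # True
print(outcome.data)           # [{'id': 1}, {'id': 2}]
```

Keeping the original error in the result lets the caller report why the fallback was needed instead of silently serving degraded data.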
Data Quality Monitoring
from datatidy.fallback.metrics import DataQualityMetrics
# Compare processing results
# (processed_data and fallback_data are assumed to be existing DataFrames)
comparison = DataQualityMetrics.compare_results(
    datatidy_df=processed_data,
    fallback_df=fallback_data,
    datatidy_time=2.3,
    fallback_time=0.8
)
# Print detailed comparison
DataQualityMetrics.print_comparison_summary(comparison)
# Export for analysis
DataQualityMetrics.export_comparison_report(
    comparison,
    'quality_report.json'
)
Command Line Usage
Enhanced Processing Modes
# Strict mode (default) - fails on any error
datatidy process config.yaml --mode strict
# Partial mode - skip problematic columns
datatidy process config.yaml --mode partial --show-summary
# Fallback mode - use fallback transformations
datatidy process config.yaml --mode fallback
# Development mode with detailed feedback
datatidy process config.yaml --mode partial \
  --show-summary \
  --show-recommendations \
  --error-log debug.json
Process Data
# Basic processing
datatidy process config.yaml
# With input/output files
datatidy process config.yaml -i input.csv -o output.csv
# Ignore validation errors
datatidy process config.yaml --ignore-errors
Validate Configuration
datatidy validate config.yaml
Create Sample Configuration
datatidy sample my_config.yaml
Expression System
DataTidy includes a safe expression parser that supports:
Basic Operations
- Arithmetic: +, -, *, /, //, %, **
- Comparison: ==, !=, <, <=, >, >=
- Logical: and, or, not
- Membership: in, not in
Functions
- Type conversion: str(), int(), float(), bool()
- Math: abs(), max(), min(), round()
- String methods: upper(), lower(), strip(), replace(), etc.
Examples
transformations:
  # Conditional expressions
  status: "'active' if last_login_days < 30 else 'inactive'"
  # String operations
  clean_name: "name.strip().title()"
  # Mathematical calculations
  bmi: "weight / (height / 100) ** 2"
  # Complex conditions
  risk_level: |
    'high' if (age > 65 and income < 30000) else (
      'medium' if age > 40 else 'low'
    )
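To illustrate what whitelist-based evaluation can look like, here is a small sketch built on Python's `ast` module. It is not DataTidy's parser: `safe_eval` is a hypothetical helper that handles only arithmetic, single comparisons, `and`/`or`, and conditional expressions, and it rejects everything else, including function and method calls.

```python
import ast
import operator

_BIN_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod, ast.Pow: operator.pow,
}
_CMP_OPS = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne, ast.Lt: operator.lt,
    ast.LtE: operator.le, ast.Gt: operator.gt, ast.GtE: operator.ge,
}


def safe_eval(expression, variables):
    """Evaluate an expression using only whitelisted AST node types."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            return variables[node.id]  # KeyError for unknown names
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](ev(node.left), ev(node.right))
        if (isinstance(node, ast.Compare) and len(node.ops) == 1
                and type(node.ops[0]) in _CMP_OPS):
            return _CMP_OPS[type(node.ops[0])](
                ev(node.left), ev(node.comparators[0]))
        if isinstance(node, ast.BoolOp):
            # Simplification: operands are evaluated eagerly, result is a bool.
            values = [ev(v) for v in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.IfExp):
            return ev(node.body) if ev(node.test) else ev(node.orelse)
        raise ValueError(f"Disallowed expression element: {type(node).__name__}")

    # strip() removes the trailing newline a YAML block scalar would add.
    return ev(ast.parse(expression.strip(), mode="eval"))


bmi = safe_eval("weight / (height / 100) ** 2", {"weight": 70, "height": 175})
risk = safe_eval(
    "'high' if (age > 65 and income < 30000) else (\n"
    "  'medium' if age > 40 else 'low'\n"
    ")\n",
    {"age": 50, "income": 80000},
)
print(round(bmi, 1), risk)  # 22.9 medium
```

Because anything outside the whitelist raises `ValueError`, an input such as `__import__('os')` is rejected before any code runs.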
Validation Rules
DataTidy supports comprehensive validation:
validation:
  required: true # Field must not be null
  nullable: false # Field cannot be null
  min_value: 0 # Minimum numeric value
  max_value: 100 # Maximum numeric value
  min_length: 2 # Minimum string length
  max_length: 50 # Maximum string length
  pattern: "^[A-Za-z]+$" # Regex pattern
  allowed_values: ["A", "B"] # Whitelist of values
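As a rough picture of how rules like these might be checked against a single value, here is a standard-library sketch. The rule names mirror the YAML keys above, but `validate_value` and its error messages are hypothetical and do not reflect DataTidy's internal validator.

```python
import re


def validate_value(value, rules):
    """Return a list of human-readable errors for one value (empty if valid)."""
    errors = []
    if value is None:
        if rules.get("required") or rules.get("nullable") is False:
            errors.append("value is required")
        return errors  # remaining rules do not apply to missing values
    if "min_value" in rules and value < rules["min_value"]:
        errors.append(f"{value} is below min_value {rules['min_value']}")
    if "max_value" in rules and value > rules["max_value"]:
        errors.append(f"{value} is above max_value {rules['max_value']}")
    if "min_length" in rules and len(str(value)) < rules["min_length"]:
        errors.append(f"length is below min_length {rules['min_length']}")
    if "max_length" in rules and len(str(value)) > rules["max_length"]:
        errors.append(f"length is above max_length {rules['max_length']}")
    if "pattern" in rules and not re.search(rules["pattern"], str(value)):
        errors.append(f"value does not match pattern {rules['pattern']}")
    if "allowed_values" in rules and value not in rules["allowed_values"]:
        errors.append(f"{value!r} is not an allowed value")
    return errors


print(validate_value(150, {"min_value": 0, "max_value": 100}))
print(validate_value("AB", {"min_length": 2, "pattern": "^[A-Za-z]+$"}))  # []
```

Collecting every error instead of stopping at the first one makes it possible to produce the kind of detailed error reporting described in the feature list.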
Error Handling
from datatidy import DataTidy
dt = DataTidy('config.yaml')
result = dt.process_data('input.csv')
# Check for errors
if dt.has_errors():
    for error in dt.get_errors():
        print(f"Error: {error['message']}")
API Reference
DataTidy Class
Core Methods
- load_config(config): Load configuration from file or dict
- process_data(data=None): Process data according to configuration
- process_and_save(output_path, data=None): Process and save data
- get_errors(): Get list of processing errors
- has_errors(): Check if errors occurred
Enhanced Fallback Methods
- process_data_with_fallback(data=None, fallback_query_func=None): Process with fallback capabilities
- get_processing_summary(): Get detailed processing summary with metrics
- get_error_report(): Get categorized error report with debugging info
- get_processing_recommendations(): Get actionable recommendations for improvements
- compare_with_fallback(fallback_df): Compare DataTidy results with fallback data
- export_error_log(file_path): Export detailed error log to JSON
- set_processing_mode(mode): Set processing mode (strict, partial, fallback)
Processing Result Class
Properties
- success: Boolean indicating overall processing success
- data: Processed DataFrame result
- processing_mode: Mode used for processing
- successful_columns: List of successfully processed columns
- failed_columns: List of failed columns
- fallback_used: Boolean indicating if fallback was activated
- processing_time: Time taken for processing
- error_log: Detailed list of processing errors
Data Quality Metrics
Static Methods
- DataQualityMetrics.compare_results(datatidy_df, fallback_df): Compare two DataFrames
- DataQualityMetrics.print_comparison_summary(comparison): Print formatted comparison
- DataQualityMetrics.export_comparison_report(comparison, file_path): Export report to JSON
Configuration Schema
See Configuration Reference for complete schema documentation.
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
MIT License - see LICENSE file for details.
Changelog
Version 0.1.0
- Initial release
- Basic CSV, Excel, and database support
- Safe expression engine
- Comprehensive validation system
- CLI interface