Cleaning Agent
Intelligent data cleaning agent for automated data quality improvement.
🚀 Features
- Automated Data Quality Analysis: Detect missing values, duplicates, outliers, and data type inconsistencies
- Intelligent Cleaning Strategies: AI-powered decision making for optimal cleaning approaches
- LLM-Driven Cleaning: Leverage Large Language Models to automatically generate and execute Python code for complex data cleaning tasks.
- Multiple Data Format Support: CSV, Excel, JSON, Parquet, and pandas DataFrames
- Comprehensive Reporting: Detailed cleaning reports with metrics and recommendations
- Configurable Parameters: Customize cleaning behavior and thresholds
- Command Line Interface: Easy-to-use CLI for batch processing
- Python API: Simple integration into existing workflows
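As an illustration of the multi-format support listed above, the following is a minimal sketch of how a file could be loaded by extension before cleaning. `READERS` and `load_table` are hypothetical names, not part of the package's documented API; only the pandas reader functions are real.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader. The agent's
# own loading code is not shown in this README and may differ.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
}


def load_table(path):
    """Load a supported file into a DataFrame, chosen by file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file format: {suffix!r}")
    return READERS[suffix](path)
```

Excel and Parquet reading need an optional engine (for example `openpyxl` or `pyarrow`) to be installed alongside pandas.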
🏗️ Architecture
The Cleaning Agent follows a modular architecture:
CleaningAgent
├── DataQualityAnalyzer # Analyzes data quality and detects issues
├── CleaningValidator # Validates cleaned data and provides assessment
├── Configuration # Manages agent settings and parameters
└── Models # Data structures for requests, responses, and reports
Data Quality Metrics
- Overall Quality Score: 0-1 scale based on multiple factors
- Missing Value Analysis: Per-column missing value statistics
- Duplicate Analysis: Duplicate row counts and percentages
- Data Type Analysis: Column data type distribution
- Uniqueness Analysis: Unique value counts per column
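Most of the metrics above map onto one-line pandas expressions. The sketch below shows one way to compute them; `quality_summary` is a hypothetical helper, not the agent's `DataQualityAnalyzer`, and the overall quality score is left out because this README does not define its formula.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> dict:
    """Compute simple per-column and per-table quality statistics."""
    n_rows = len(df)
    duplicate_rows = int(df.duplicated().sum())
    return {
        # Share of missing values per column, in percent.
        "missing_pct": (df.isna().mean() * 100).round(2).to_dict(),
        # Rows that exactly repeat an earlier row.
        "duplicate_rows": duplicate_rows,
        "duplicate_pct": round(100 * duplicate_rows / n_rows, 2) if n_rows else 0.0,
        # How many columns have each dtype.
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
        # Distinct non-missing values per column.
        "unique_counts": df.nunique().to_dict(),
    }


df = pd.DataFrame({"id": [1, 2, 2, 4], "city": ["Pune", "Oslo", "Oslo", None]})
summary = quality_summary(df)
print(summary["missing_pct"], summary["duplicate_rows"])
```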
🔍 Supported Data Quality Issues
Missing Values
- Detection: Automatic identification of columns with missing data
- Handling: Smart imputation strategies (median for numerical, mode for categorical)
- Thresholds: Configurable missing value percentage limits
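The median/mode strategy described above can be sketched in a few lines of pandas. This is an illustration only: `impute_missing` and its `max_missing_pct` parameter are assumed names, and the 50% default is not taken from the package.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_missing(df: pd.DataFrame, max_missing_pct: float = 50.0) -> pd.DataFrame:
    """Fill gaps with the median (numeric columns) or mode (other columns).

    Columns whose missing share exceeds max_missing_pct are left untouched
    so they can be reviewed separately. The default is illustrative.
    """
    out = df.copy()
    for col in out.columns:
        missing_pct = out[col].isna().mean() * 100
        if missing_pct == 0 or missing_pct > max_missing_pct:
            continue
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                # mode() can return several values; take the first one.
                out[col] = out[col].fillna(modes.iloc[0])
    return out


df = pd.DataFrame(
    {"age": [20.0, None, 40.0, 60.0], "city": ["Oslo", "Oslo", None, "Pune"]}
)
filled = impute_missing(df)
```

The median is less sensitive to extreme values than the mean, which is a common reason to prefer it for numerical columns.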
Duplicate Rows
- Detection: Identifies exact and near-duplicate rows
- Removal: Configurable duplicate removal strategies
- Analysis: Reports duplicate patterns and impact
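A minimal sketch of exact and near-duplicate removal follows. Here "near-duplicate" is assumed to mean rows that match after trimming whitespace and lowercasing text; the agent's actual definition is not stated in this README, and `normalize_text` and `drop_duplicate_rows` are hypothetical helpers.

```python
import pandas as pd


def normalize_text(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase and strip string cells so trivially different rows compare equal."""
    return df.apply(
        lambda col: col.map(lambda v: v.strip().lower() if isinstance(v, str) else v)
    )


def drop_duplicate_rows(
    df: pd.DataFrame, near: bool = False, keep: str = "first"
) -> pd.DataFrame:
    """Drop exact duplicates, or near-duplicates when near=True."""
    # Duplicates are found on the (optionally normalized) key, but the
    # surviving rows keep their original, unnormalized values.
    key = normalize_text(df) if near else df
    return df.loc[~key.duplicated(keep=keep)].reset_index(drop=True)


df = pd.DataFrame(
    {"name": ["Ada", "ada ", "Ada", "Bob"], "team": ["core", "core", "core", "ops"]}
)
exact = drop_duplicate_rows(df)
near = drop_duplicate_rows(df, near=True)
```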
Data Type Inconsistencies
- Detection: Identifies columns with mixed or inappropriate data types
- Standardization: Converts data types for consistency
- Validation: Ensures data type appropriateness
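One common case of mixed types is a numeric column stored as text. The sketch below converts such columns with `pd.to_numeric`; `coerce_numeric_columns` and its `min_success` threshold are assumptions made for illustration, not the agent's documented behavior.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def coerce_numeric_columns(df: pd.DataFrame, min_success: float = 0.9) -> pd.DataFrame:
    """Convert text columns to numbers when most non-missing values parse.

    min_success is an illustrative threshold: the share of non-missing values
    that must parse as numbers before the column is converted.
    """
    out = df.copy()
    for col in out.columns:
        # Only look at text-like columns; numeric and datetime columns are skipped.
        if not (is_object_dtype(out[col]) or is_string_dtype(out[col])):
            continue
        non_missing = out[col].dropna()
        if non_missing.empty:
            continue
        # Values that cannot be parsed become NaN instead of raising.
        parsed = pd.to_numeric(out[col], errors="coerce")
        if parsed.notna().sum() / len(non_missing) >= min_success:
            out[col] = parsed
    return out


df = pd.DataFrame(
    {"price": ["10", "12.5", "n/a", "8"], "sku": ["A1", "B2", "C3", "D4"]}
)
converted = coerce_numeric_columns(df, min_success=0.75)
```

Note that unparseable entries such as `"n/a"` turn into missing values, so this step would usually run before missing value handling.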
Outliers
- Detection: Statistical outlier detection using IQR method
- Handling: Configurable outlier treatment (capping, removal, investigation)
- Impact Assessment: Reports outlier impact on data quality
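The IQR method flags values that fall outside `[Q1 - k*IQR, Q3 + k*IQR]`, where `IQR = Q3 - Q1` and `k` is conventionally 1.5. The sketch below shows detection plus two of the treatments named above (capping and removal); the function names and the `method` parameter are hypothetical, not the agent's API.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the lower and upper IQR fences for a numeric series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def handle_outliers(series: pd.Series, method: str = "cap", k: float = 1.5) -> pd.Series:
    """Cap outliers to the fences, or remove them from the series."""
    lower, upper = iqr_bounds(series, k)
    if method == "cap":
        return series.clip(lower=lower, upper=upper)
    if method == "remove":
        # Keep missing values; they are not outliers and are handled elsewhere.
        return series[series.between(lower, upper) | series.isna()]
    raise ValueError(f"Unknown method: {method!r}")


values = pd.Series([10, 12, 11, 13, 12, 100])
lower, upper = iqr_bounds(values)
capped = handle_outliers(values, method="cap")
removed = handle_outliers(values, method="remove")
```

Capping keeps the row count stable, while removal changes it, so the choice affects the `rows_removed` figure in any downstream report.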
Developer Setup and Testing
Setup Instructions
- Clone the repository and check out the feature branch:
    git clone https://github.com/stepfnAI/cleaning_agent.git
    cd cleaning_agent
    git checkout review
- Install uv (if not already installed):
    # Option A: Using the standalone installer (recommended for macOS/Linux)
    curl -LsSf https://astral.sh/uv/install.sh | sh
    # Option B: Using pip (if pip is available in an existing environment)
    pip install uv
- Create and activate a virtual environment:
    uv venv --python=3.10 venv
    source venv/bin/activate
- Install the project in editable mode with development dependencies:
    uv pip install -e ".[dev]"
- Clone and set up the sfn_blueprint dependency:
    cd ..
    git clone https://github.com/stepfnAI/sfn_blueprint.git
    cd sfn_blueprint
    source ../cleaning_agent/venv/bin/activate
    git checkout dev
    uv pip install -e .
    cd ../cleaning_agent
- Set your OpenAI API key:
    export OPENAI_API_KEY='your-api-key-here'
Example
- Run the example script:
python examples/basic_usage.py
Running Tests
- Run the test suite:
    # Run all tests
    pytest tests/ -s
    # Run specific test files
    pytest tests/test_agent.py -s
    pytest tests/test_context_integration.py -s
    pytest tests/test_execution_validation.py -s
    pytest tests/test_llm_driven_cleaning.py -s
    pytest tests/test_llm_driven_cleaning_with_sql.py -s
Test Structure
tests/
├── test_agent.py # Agent functionality tests
├── test_context_integration.py # Context integration tests
├── test_execution_validation.py # Execution validation tests
├── test_llm_driven_cleaning.py # LLM-driven cleaning tests
└── test_llm_driven_cleaning_with_sql.py # SQL cleaning tests
Test Dependencies
The following testing dependencies are automatically installed:
- pytest>=7.0.0: Test framework
- pytest-cov>=4.0.0: Coverage reporting
- black>=23.0.0: Code formatting
- isort>=5.12.0: Import sorting
- flake8>=6.0.0: Linting
- mypy>=1.0.0: Type checking
📊 Output and Reporting
Cleaning Response
{
    "success": True,
    "cleaned_data": DataFrame,
    "report": {
        "report_id": "uuid",
        "timestamp": "2024-01-01T00:00:00Z",
        "data_summary": {
            "original_shape": (1000, 10),
            "cleaned_shape": (950, 10),
            "rows_removed": 50,
            "columns_processed": 10
        },
        "issues_detected": [...],
        "cleaning_operations": [...],
        "quality_metrics": {
            "original_quality_score": 0.65,
            "final_quality_score": 0.89,
            "improvement": 0.24
        },
        "recommendations": [...],
        "execution_time": 2.34
    },
    "message": "Data cleaning completed successfully",
    "errors": [],
    "metadata": {...}
}
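A caller would typically check `success` first and then read the report fields. The sketch below assumes the response is available as a plain dictionary with the keys shown above; the real object may be a model class with attribute access instead, and `summarize_response` is a hypothetical helper.

```python
def summarize_response(response: dict) -> str:
    """Build a one-line summary from a response shaped like the structure above."""
    if not response.get("success"):
        return "Cleaning failed: " + "; ".join(response.get("errors", []))
    report = response["report"]
    summary = report["data_summary"]
    metrics = report["quality_metrics"]
    return (
        f"{summary['rows_removed']} rows removed, quality "
        f"{metrics['original_quality_score']:.2f} -> "
        f"{metrics['final_quality_score']:.2f} "
        f"in {report['execution_time']:.2f}s"
    )


# Sample values copied from the structure above; cleaned_data is omitted.
example = {
    "success": True,
    "report": {
        "data_summary": {"rows_removed": 50},
        "quality_metrics": {
            "original_quality_score": 0.65,
            "final_quality_score": 0.89,
        },
        "execution_time": 2.34,
    },
    "errors": [],
}
print(summarize_response(example))
```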
Additional Information
- Python Version: 3.10+
- Dependencies: Managed through pyproject.toml
- Code Style: Follows PEP 8 with Black formatting