Skip to main content

Intelligent data cleaning agent for automated data quality improvement

Project description

Cleaning Agent

Intelligent data cleaning agent for automated data quality improvement.

🚀 Features

  • Automated Data Quality Analysis: Detect missing values, duplicates, outliers, and data type inconsistencies
  • Intelligent Cleaning Strategies: AI-powered decision making for optimal cleaning approaches
  • LLM-Driven Cleaning: Leverage Large Language Models to automatically generate and execute Python code for complex data cleaning tasks.
  • Multiple Data Format Support: CSV, Excel, JSON, Parquet, and pandas DataFrames
  • Comprehensive Reporting: Detailed cleaning reports with metrics and recommendations
  • Configurable Parameters: Customize cleaning behavior and thresholds
  • Command Line Interface: Easy-to-use CLI for batch processing
  • Python API: Simple integration into existing workflows

🏗️ Architecture

The Cleaning Agent follows a modular architecture:

CleaningAgent
├── DataQualityAnalyzer    # Analyzes data quality and detects issues
├── CleaningValidator      # Validates cleaned data and provides assessment
├── Configuration          # Manages agent settings and parameters
└── Models                 # Data structures for requests, responses, and reports

Data Quality Metrics

  • Overall Quality Score: 0-1 scale based on multiple factors
  • Missing Value Analysis: Per-column missing value statistics
  • Duplicate Analysis: Duplicate row counts and percentages
  • Data Type Analysis: Column data type distribution
  • Uniqueness Analysis: Unique value counts per column

🔍 Supported Data Quality Issues

Missing Values

  • Detection: Automatic identification of columns with missing data
  • Handling: Smart imputation strategies (median for numerical, mode for categorical)
  • Thresholds: Configurable missing value percentage limits

Duplicate Rows

  • Detection: Identifies exact and near-duplicate rows
  • Removal: Configurable duplicate removal strategies
  • Analysis: Reports duplicate patterns and impact

Data Type Inconsistencies

  • Detection: Identifies columns with mixed or inappropriate data types
  • Standardization: Converts data types for consistency
  • Validation: Ensures data type appropriateness

Outliers

  • Detection: Statistical outlier detection using IQR method
  • Handling: Configurable outlier treatment (capping, removal, investigation)
  • Impact Assessment: Reports outlier impact on data quality

Developer Setup and Testing

Setup Instructions

  1. Clone the repository and checkout the feature branch:

    git clone https://github.com/stepfnAI/cleaning_agent.git 
    cd cleaning_agent
    git checkout review
    
  2. Install uv (if not already installed):

    # Option A: Using the standalone installer (recommended for macOS/Linux)
    curl -LsSf https://astral.sh/uv/install.sh | sh
    
    # Option B: Using pip (if uv is already in an existing environment)
    pip install uv
    
  3. Create and activate a virtual environment:

    uv venv --python=3.10 venv
    source venv/bin/activate
    
  4. Install the project in editable mode with development dependencies:

    uv pip install -e ".[dev]"
    
  5. Clone and set up the sfn_blueprint dependency:

    cd ..
    git clone https://github.com/stepfnAI/sfn_blueprint.git
    cd sfn_blueprint
    source ../cleaning_agent/venv/bin/activate
    git checkout dev
    uv pip install -e .
    cd ../cleaning_agent
    
  6. Set your OpenAI API key:

    export OPENAI_API_KEY='your-api-key-here'
    

Example

  1. Run the example script:
    python examples/basic_usage.py
    

Running Tests

  1. Run the test suite:
    # Run all tests
    pytest tests/ -s
    
    # Run specific test files
    pytest tests/test_agent.py -s
    pytest tests/test_context_integration.py -s 
    pytest tests/test_execution_validation.py -s 
    pytest tests/test_llm_driven_cleaning.py -s
    pytest tests/test_llm_driven_cleaning_with_sql.py -s
    
Test Structure
tests/
├── test_agent.py                                        # Agent functionality tests
├── test_context_integration.py                          # Context integration tests
├── test_execution_validation.py                         # Execution validation tests
├── test_llm_driven_cleaning.py                          # LLM-driven cleaning tests
├── tests/test_llm_driven_cleaning_with_sql.py           # SQL cleaning tests
Test Dependencies

The following testing dependencies are automatically installed:

  • pytest>=7.0.0 - Test framework
  • pytest-cov>=4.0.0 - Coverage reporting
  • black>=23.0.0 - Code formatting
  • isort>=5.12.0 - Import sorting
  • flake8>=6.0.0 - Linting
  • mypy>=1.0.0 - Type checking

📊 Output and Reporting

Cleaning Response

{
    "success": True,
    "cleaned_data": DataFrame,
    "report": {
        "report_id": "uuid",
        "timestamp": "2024-01-01T00:00:00Z",
        "data_summary": {
            "original_shape": (1000, 10),
            "cleaned_shape": (950, 10),
            "rows_removed": 50,
            "columns_processed": 10
        },
        "issues_detected": [...],
        "cleaning_operations": [...],
        "quality_metrics": {
            "original_quality_score": 0.65,
            "final_quality_score": 0.89,
            "improvement": 0.24
        },
        "recommendations": [...],
        "execution_time": 2.34
    },
    "message": "Data cleaning completed successfully",
    "errors": [],
    "metadata": {...}
}

Additional Information

  • Python Version: 3.10+
  • Dependencies: Managed through pyproject.toml
  • Code Style: Follows PEP 8 with Black formatting

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cleaning_agent-0.1.4.tar.gz (50.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

cleaning_agent-0.1.4-py3-none-any.whl (42.9 kB view details)

Uploaded Python 3

File details

Details for the file cleaning_agent-0.1.4.tar.gz.

File metadata

  • Download URL: cleaning_agent-0.1.4.tar.gz
  • Upload date:
  • Size: 50.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.0

File hashes

Hashes for cleaning_agent-0.1.4.tar.gz
Algorithm Hash digest
SHA256 60d0b16dafcf543ed4f3f1bc4fd879e8b1f188adbe36671c36a32f01f53c91c3
MD5 33394a40d61a067f0d48f596c4a171bd
BLAKE2b-256 dca0740d21e3fc0255ec5ec2d4edd6a0651f7689d44d278af5c4380fc95949e4

See more details on using hashes here.

File details

Details for the file cleaning_agent-0.1.4-py3-none-any.whl.

File metadata

File hashes

Hashes for cleaning_agent-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 8d159478b43201704a5025194f18620404e4b609d536088a1b3c94cce1a8b881
MD5 4e6e9413fd4b3c31598c88a28a20d414
BLAKE2b-256 ec3342e80f9fbb9f5a1f464656543b19188a1d1781c360e76b5e5e8e0717fec7

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page