Intelligent ML preprocessing pipeline tool with AI agents

AgentPrep

AgentPrep is an intelligent machine learning preprocessing pipeline tool that combines AI agents with deterministic validation to automate data preparation tasks. It provides a guided, interactive experience for cleaning data, engineering features, and ensuring data quality and governance compliance.

Features

  • AI-Powered Agents: Leverages LLM agents (OpenAI, Anthropic, Gemini) for intelligent data quality improvements and feature engineering suggestions
  • Deterministic Validation: All agent proposals are validated and executed deterministically for reproducibility
  • Interactive CLI: User-friendly wizard interface - no configuration files required
  • Constraint Advisor: Heuristic-based suggestions for pipeline constraints based on your dataset
  • Governance & Policy: Built-in policy enforcement and data leakage detection
  • Artifact Management: Comprehensive artifact tracking, storage, and reporting
  • Metadata Tracking: Full provenance and metadata generation for auditability

Architecture

AgentPrep follows a multi-level pipeline architecture:

  • Level 0: Intent Validation - Validates user configuration and constraints
  • Level 1: Data Ingestion & Schema Normalization - Loads datasets, infers schemas, normalizes columns
  • Level 2: Data Quality Agent - Profiles data quality and applies cleaning actions
  • Level 3: Metadata & Profiling Persistence - Generates and stores comprehensive metadata
  • Level 4: Feature Engineering Agent - Proposes and generates ML features
  • Level 5: Governance & Policy - Enforces policies and detects data leakage
  • Level 6: Artifacts, Storage & Reporting - Captures and exports pipeline artifacts
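
The level ordering above can be pictured as a list of stage functions applied in sequence to a shared context. This is only a minimal sketch, not AgentPrep's orchestrator: `run_pipeline`, the context dictionary, and the two stand-in stages are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical context: a plain dict handed from one level to the next.
Context = Dict[str, object]
Stage = Tuple[str, Callable[[Context], Context]]


def validate_intent(ctx: Context) -> Context:
    # Level 0 stand-in: refuse to continue without a dataset path.
    if not ctx.get("dataset"):
        raise ValueError("intent is missing a dataset path")
    return ctx


def ingest(ctx: Context) -> Context:
    # Level 1 stand-in: pretend the columns were read from the file.
    return {**ctx, "columns": ["age", "income", "education"]}


def run_pipeline(stages: List[Stage], ctx: Context) -> Context:
    """Run each stage in order and record which levels completed."""
    completed: List[str] = []
    for name, stage in stages:
        ctx = stage(ctx)
        completed.append(name)
    return {**ctx, "completed": completed}


result = run_pipeline(
    [("level0_intent", validate_intent), ("level1_ingestion", ingest)],
    {"dataset": "data/my_dataset.csv"},
)
print(result["completed"])
```

Because a failing stage raises before later stages run, an invalid intent stops the run at Level 0 in this sketch.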

Installation

Prerequisites

  • Python 3.8+
  • pip

Setup

Method 1: Install via Pip (Recommended for Users)

pip install agentprep

Then run:

agentprep run

Method 2: Install from Source (For Developers)

  1. Clone the repository:
git clone <repository-url>
cd AgentPrep
  2. Run the setup script:
chmod +x setup_prod.sh
./setup_prod.sh
source venv/bin/activate

Or manually:

pip install .

Note: Always use a virtual environment to avoid dependency conflicts. See PRODUCTION_SETUP.md for detailed setup instructions.

  3. (Optional) Install LLM provider SDKs for AI agent features:
# For OpenAI
pip install openai

# For Anthropic
pip install anthropic

# For Google Gemini
pip install google-generativeai

Quick Start

Basic Usage

1. Set API Keys (Optional but Recommended)

For AI-powered features, set your API key as an environment variable:

export OPENAI_API_KEY="your-key-here"
# OR
export ANTHROPIC_API_KEY="your-key-here"
# OR
export GEMINI_API_KEY="your-key-here"

2. Run the Pipeline

Run the interactive wizard:

# If installed via pip:
agentprep run

# If running from source without installing:
python -m cli run

The interactive wizard will guide you through:

  1. LLM Provider (optional): Choose OpenAI, Anthropic, Gemini, or "None" (no LLM usage)
  2. Dataset Selection: Upload your CSV or Parquet file
  3. Task Configuration: Select task type (classification, regression, time series, clustering)
  4. Target Column: Choose your target variable from the dataset columns
  5. Model Family: Select your intended model type (tree-based, linear, neural)
  6. Constraint Suggestions: Get intelligent suggestions for pipeline constraints
  7. Output Path: Specify where to save pipeline outputs

Example Session

$ python -m cli run

============================================================
Welcome to AgentPrep!
============================================================

This interactive wizard will guide you through configuring your preprocessing pipeline.

Enter path to your dataset (CSV or Parquet): data/my_dataset.csv
✓ Dataset loaded: 10,000 rows, 15 columns

Available columns:
  1. age
  2. income
  3. education
  ...
  
Select target column (1-15): 3

Select task type:
  1. Classification
  2. Regression
  3. Time Series
  4. Clustering
Select (1-4) [1]: 1

...

✓ Intent validated successfully
Starting preprocessing pipeline...
✓ Pipeline completed successfully

Command-Line Options

# Run with verbose logging
python -m cli run --verbose

# Specify output directory
python -m cli run --output ./results

# Run with config file (legacy mode)
python -m cli run --config intent.yaml

Configuration

Environment Variables

API keys for LLM providers are optional. At runtime, the CLI asks which provider to use (or "None").

# OpenAI
export OPENAI_API_KEY="your-openai-api-key"

# Anthropic
export ANTHROPIC_API_KEY="your-anthropic-api-key"

# Google Gemini
export GEMINI_API_KEY="your-gemini-api-key"

Note: If no API keys are provided, agents will run in stub mode (no LLM proposals), but the pipeline will still execute deterministic operations.
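
A script that wraps the CLI might want to check in advance which provider key is available. The sketch below assumes nothing about AgentPrep's internals: the environment variable names come from this section, while `detect_provider` and its lookup order are hypothetical.

```python
import os
from typing import Mapping, Optional

# Variable names are the ones documented above; the order is an assumption.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def detect_provider(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first provider with a non-empty key, or None (stub mode)."""
    for provider, var in PROVIDER_ENV_VARS.items():
        if env.get(var, "").strip():
            return provider
    return None


print(detect_provider({"ANTHROPIC_API_KEY": "your-anthropic-api-key"}))
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.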

Intent Schema

The pipeline accepts configuration through an IntentSchema that includes:

  • Dataset: Path to your dataset file
  • Task: Task type and target column
  • Model: Model family and interpretability requirements
  • Constraints: Limits on features, interactions, and cardinality
  • Policies: Outlier handling and other data policies
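
To make the five groups above concrete, here is a rough sketch of what such a schema could look like. The project itself uses Pydantic; this version uses standard-library dataclasses to stay dependency-free, and every field name, default, and allowed string value is an assumption rather than the real IntentSchema.

```python
from dataclasses import dataclass
from typing import List, Optional

# Allowed values mirror the wizard's choices; the exact strings are assumed.
TASK_TYPES = {"classification", "regression", "time_series", "clustering"}
MODEL_FAMILIES = {"tree-based", "linear", "neural"}


@dataclass
class Intent:
    dataset: str                      # Dataset: path to the data file
    task_type: str                    # Task: type ...
    model_family: str                 # Model: family
    target_column: Optional[str] = None  # ... and target column
    max_features: int = 50            # Constraints (hypothetical field)
    outlier_policy: str = "clip"      # Policies (hypothetical field)

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty means valid."""
        errors: List[str] = []
        if not self.dataset.endswith((".csv", ".parquet")):
            errors.append("dataset must be a CSV or Parquet file")
        if self.task_type not in TASK_TYPES:
            errors.append(f"unknown task type: {self.task_type}")
        if self.model_family not in MODEL_FAMILIES:
            errors.append(f"unknown model family: {self.model_family}")
        # Assumption: only clustering may omit the target column.
        if self.task_type != "clustering" and not self.target_column:
            errors.append("a target column is required for this task type")
        if self.max_features < 1:
            errors.append("max_features must be at least 1")
        return errors


intent = Intent("data/my_dataset.csv", "classification", "tree-based", "education")
print(intent.validate())
```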

Pipeline Levels

Level 1: Data Ingestion & Schema Normalization

  • Loads datasets from CSV or Parquet files
  • Infers schema metadata (data types, nullability, distributions)
  • Normalizes column names and data types
  • Validates dataset against intent constraints
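
Column-name normalization is the easiest of these steps to illustrate. The rules below (snake_case, lower-case, numeric suffix for repeats) are one plausible convention, not necessarily the one AgentPrep applies, and `normalize_column_names` is a hypothetical helper.

```python
import re
from typing import Dict, List


def normalize_column_names(columns: List[str]) -> List[str]:
    """Lower-case and snake_case names, suffixing repeats with a counter."""
    seen: Dict[str, int] = {}
    normalized: List[str] = []
    for raw in columns:
        # Collapse every run of non-alphanumeric characters into one "_".
        name = re.sub(r"[^0-9a-zA-Z]+", "_", raw.strip()).strip("_").lower()
        name = name or "column"  # fallback for names with no usable characters
        count = seen.get(name, 0)
        seen[name] = count + 1
        normalized.append(name if count == 0 else f"{name}_{count}")
    return normalized


print(normalize_column_names(["Age ", "Annual Income ($)", "age"]))
# ['age', 'annual_income', 'age_1']
```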

Level 2: Data Quality Agent

  • Profiles dataset quality (missing values, outliers, duplicates)
  • LLM agent proposes data cleaning actions
  • Deterministic executor validates and applies safe actions
  • Tracks applied vs rejected actions
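
The propose-then-validate split described above can be sketched as a filter over the agent's proposals. Everything here is illustrative: the allowlist, the proposal shape (`action` and `column` keys), and `triage_proposals` are assumptions, not AgentPrep's actual executor.

```python
from typing import Dict, List, Tuple

Proposal = Dict[str, str]

# Hypothetical allowlist of actions the deterministic executor will run.
SAFE_ACTIONS = {"drop_duplicates", "impute_median", "clip_outliers"}


def triage_proposals(
    proposals: List[Proposal], columns: List[str], target: str
) -> Tuple[List[Proposal], List[Tuple[Proposal, str]]]:
    """Split proposals into applied ones and (proposal, reason) rejections."""
    applied: List[Proposal] = []
    rejected: List[Tuple[Proposal, str]] = []
    for proposal in proposals:
        action, column = proposal.get("action"), proposal.get("column")
        if action not in SAFE_ACTIONS:
            rejected.append((proposal, "action not in allowlist"))
        elif action != "drop_duplicates" and column not in columns:
            rejected.append((proposal, "unknown column"))
        elif column == target:
            rejected.append((proposal, "target column must not be modified"))
        else:
            applied.append(proposal)
    return applied, rejected


proposals = [
    {"action": "impute_median", "column": "income"},
    {"action": "drop_column", "column": "age"},
    {"action": "clip_outliers", "column": "education"},
]
applied, rejected = triage_proposals(
    proposals, ["age", "income", "education"], target="education"
)
print(len(applied), [reason for _, reason in rejected])
```

Since the checks never consult the LLM, the same proposals always yield the same applied and rejected lists.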

Level 3: Metadata & Profiling Persistence

  • Builds comprehensive pipeline metadata
  • Records schema, quality profiles, and applied actions
  • Writes metadata to disk for traceability
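
Writing metadata for traceability can be as simple as serializing a dictionary to JSON. In this sketch the file name `metadata.json`, the `run_001` directory, and the `write_metadata` helper are all hypothetical; only the row and column counts are borrowed from the example session.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(metadata: dict, output_dir: Path) -> Path:
    """Write metadata as stable, pretty-printed JSON and return the path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / "metadata.json"
    # sort_keys makes identical metadata produce identical file contents.
    path.write_text(
        json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8"
    )
    return path


with tempfile.TemporaryDirectory() as tmp:
    path = write_metadata({"rows": 10000, "columns": 15}, Path(tmp) / "run_001")
    restored = json.loads(path.read_text(encoding="utf-8"))
    print(restored)
```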

Level 4: Feature Engineering Agent

  • LLM agent proposes feature transformations
  • Validates features for safety and compliance
  • Generates features deterministically
  • Tracks feature provenance
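
One way to read "generates features deterministically" is that the agent only emits a declarative spec, and a fixed table of transforms does the computing. The sketch below follows that reading; the spec keys, the transform table, and `generate_feature` are assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]

# Hypothetical registry: only transforms listed here can ever be executed.
TRANSFORMS: Dict[str, Callable[[float], float]] = {
    "log1p": math.log1p,
    "square": lambda x: x * x,
}


def generate_feature(rows: List[Row], spec: Dict[str, str]) -> Tuple[List[Row], Dict[str, str]]:
    """Apply one registered transform and return new rows plus provenance."""
    source, transform = spec["source"], spec["transform"]
    if transform not in TRANSFORMS:
        raise ValueError(f"unsupported transform: {transform}")
    fn = TRANSFORMS[transform]
    name = f"{source}_{transform}"
    # Build new rows instead of mutating the input, so reruns are repeatable.
    new_rows = [{**row, name: fn(row[source])} for row in rows]
    provenance = {"feature": name, "source": source, "transform": transform}
    return new_rows, provenance


rows, provenance = generate_feature(
    [{"income": 2.0}, {"income": 3.0}],
    {"source": "income", "transform": "square"},
)
print(rows, provenance)
```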

Level 5: Governance & Policy

  • Enforces policy rules (constraint violations, data leakage)
  • Validates feature engineering proposals
  • Detects potential data leakage issues
  • Provides governance decisions
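
Leakage detection covers many cases; the simplest is a feature that is effectively a copy of the target. The check below is a deliberately naive stand-in for that single case, with a hypothetical function name and threshold, and should not be taken as AgentPrep's detector.

```python
from typing import Dict, List


def find_target_copies(
    rows: List[Dict[str, object]], target: str, threshold: float = 0.99
) -> List[str]:
    """Flag columns whose value equals the target in nearly every row."""
    if not rows:
        return []
    flagged: List[str] = []
    for column in rows[0]:
        if column == target:
            continue
        matches = sum(1 for row in rows if row[column] == row[target])
        if matches / len(rows) >= threshold:
            flagged.append(column)
    return flagged


data = [
    {"age": 31, "label": 1, "label_copy": 1},
    {"age": 45, "label": 0, "label_copy": 0},
    {"age": 1, "label": 1, "label_copy": 1},
]
print(find_target_copies(data, target="label"))
# ['label_copy']
```

Note that `age` matches the target in one row out of three, which stays well under the threshold and is not flagged.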

Level 6: Artifacts, Storage & Reporting

  • Captures all pipeline artifacts (datasets, schemas, features, metadata)
  • Stores artifacts in organized directory structure
  • Exports artifacts in multiple formats (JSON, CSV, Parquet, Markdown)
  • Generates human-readable reports
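
As an illustration of the Markdown export, the sketch below renders applied and rejected actions into a short report. The report layout and the `render_report` helper are hypothetical; the real report format is not documented here.

```python
from typing import List, Tuple


def render_report(
    title: str, applied: List[str], rejected: List[Tuple[str, str]]
) -> str:
    """Render a small Markdown report listing applied and rejected actions."""
    lines = [f"# {title}", "", "## Applied actions"]
    lines += [f"- {name}" for name in applied] or ["- none"]
    lines += ["", "## Rejected actions"]
    lines += [f"- {name}: {reason}" for name, reason in rejected] or ["- none"]
    return "\n".join(lines) + "\n"


report = render_report("AgentPrep run", ["drop_duplicates"], [])
print(report)
```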

Project Structure

AgentPrep/
├── cli/                    # CLI package (use: python -m cli)
│   ├── interactive.py     # Interactive prompts
│   └── constraint_advisor.py  # Constraint suggestions
├── core/                   # Core orchestration
│   └── orchestrator.py    # Pipeline orchestrator
├── intent/                 # Intent validation
│   ├── schema.py          # Intent schema definitions
│   └── validator.py       # Intent validation logic
├── level1_ingestion/       # Data loading & normalization
├── level2_quality/         # Data quality agent
├── level3_metadata/        # Metadata generation
├── level4_feature/         # Feature engineering agent
├── level5_governance/      # Governance & policies
├── level5_policy/          # Policy enforcement
├── level6_artifacts/       # Artifact management
└── utils/                  # Shared utilities
    ├── logging.py         # Logging setup
    ├── constants.py       # Application constants
    ├── file_helpers.py    # File utilities
    └── llm_client.py      # LLM client wrapper

Supported Formats

  • Datasets: CSV, Parquet
  • Configurations: YAML, JSON (via interactive mode)
  • Output Formats: JSON, CSV, Parquet, Markdown

Exit Codes

  • 0: Success
  • 1: Invalid intent configuration
  • 2: Policy violation detected
  • 3: Runtime error
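
When AgentPrep is called from another script or a CI job, the exit codes above can be mapped to names. The numeric values come from the list above; the enum, its member names, and `describe` are hypothetical conveniences.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes as documented above; member names are assumed."""
    SUCCESS = 0
    INVALID_INTENT = 1
    POLICY_VIOLATION = 2
    RUNTIME_ERROR = 3


def describe(returncode: int) -> str:
    """Translate a process return code into a readable label."""
    try:
        return ExitCode(returncode).name.lower()
    except ValueError:
        return f"unknown exit code {returncode}"


print(describe(2))
# policy_violation
```

A wrapper could pass the `returncode` of a `subprocess.run` call on `python -m cli run --config intent.yaml` to `describe` and fail the job on anything other than `success`.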

Development

Running Tests

Tests are located in the tests/ directory. To run tests:

# Install test dependencies
pip install -e ".[dev]"

# Run all tests
pytest

# Run with coverage
pytest --cov=. --cov-report=html

Code Quality

We use black for formatting and ruff for linting:

# Install package with development dependencies
pip install -e ".[dev]"

# Format code
black .

# Lint code
ruff check .


Code Structure

  • Modular Design: Each level is self-contained with clear interfaces
  • Type Safety: Uses Pydantic for schema validation and type hints throughout
  • Logging: Centralized logging via utils.logging
  • Error Handling: Comprehensive error handling with custom exception types

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Key points:

  1. Follow the existing code structure and naming conventions
  2. Add tests for new features
  3. Update documentation as needed
  4. Ensure all tests pass before submitting
  5. Format code with black and lint with ruff

Security

For security vulnerabilities, please see SECURITY.md. Do not open public issues for security concerns.

Support

For issues, questions, or contributions, please open an issue or create a pull request.

AgentPrep - Intelligent ML Preprocessing with AI Agents
