Intelligent ML preprocessing pipeline tool with AI agents

AgentPrep

AgentPrep is an intelligent machine learning preprocessing pipeline tool that combines AI agents with deterministic validation to automate data preparation tasks. It provides a guided, interactive experience for cleaning data, engineering features, and ensuring data quality and governance compliance.

Features

AI-Powered Agents: Leverages LLM agents (OpenAI, Anthropic, Gemini) for intelligent data quality improvements and feature engineering suggestions
Deterministic Validation: All agent proposals are validated and executed deterministically for reproducibility
Interactive CLI: User-friendly wizard interface - no configuration files required
Constraint Advisor: Heuristic suggestions for pipeline constraints, derived from your dataset
Governance & Policy: Built-in policy enforcement and data leakage detection
Artifact Management: Comprehensive artifact tracking, storage, and reporting
Metadata Tracking: Full provenance and metadata generation for auditability

Architecture

AgentPrep follows a multi-level pipeline architecture:

Level 0: Intent Validation - Validates user configuration and constraints
Level 1: Data Ingestion & Schema Normalization - Loads datasets, infers schemas, normalizes columns
Level 2: Data Quality Agent - Profiles data quality and applies cleaning actions
Level 3: Metadata & Profiling Persistence - Generates and stores comprehensive metadata
Level 4: Feature Engineering Agent - Proposes and generates ML features
Level 5: Governance & Policy - Enforces policies and detects data leakage
Level 6: Artifacts, Storage & Reporting - Captures and exports pipeline artifacts
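
The level ordering above can be pictured as a list of stages run in sequence over a shared context. The following is a minimal sketch under that assumption, not AgentPrep's actual orchestrator: `run_pipeline`, the `Level` type, the stage functions, and the context dictionary are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stage signature: each level receives the shared context and returns it.
Context = Dict[str, object]
Level = Tuple[str, Callable[[Context], Context]]


def validate_intent(ctx: Context) -> Context:
    # Level 0 stand-in: fail fast when a required setting is missing.
    if "dataset_path" not in ctx:
        raise ValueError("intent is missing 'dataset_path'")
    return ctx


def ingest(ctx: Context) -> Context:
    # Level 1 stand-in: a real implementation would load CSV or Parquet here.
    ctx["rows"] = [{"age": 31, "income": 52000}]
    return ctx


def run_pipeline(levels: List[Level], ctx: Context) -> Context:
    """Run each level in order and record which ones completed."""
    completed: List[str] = []
    for name, step in levels:
        ctx = step(ctx)
        completed.append(name)
    ctx["completed_levels"] = completed
    return ctx


levels: List[Level] = [
    ("level0_intent", validate_intent),
    ("level1_ingestion", ingest),
]
result = run_pipeline(levels, {"dataset_path": "data/my_dataset.csv"})
print(result["completed_levels"])
```

Because Level 0 raises before any data is touched, an invalid intent stops the run before later levels execute.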

Installation

Prerequisites

Python 3.8+
pip

Setup

Method 1: Install via Pip (Recommended for Users)

pip install agentprep

Then run:

agentprep run

Method 2: Install from Source (For Developers)

Clone the repository:

git clone <repository-url>
cd AgentPrep

Run the setup script:

chmod +x setup_prod.sh
./setup_prod.sh
source venv/bin/activate

Or manually:

pip install .

Note: Always use a virtual environment to avoid dependency conflicts. See PRODUCTION_SETUP.md for detailed setup instructions.

(Optional) Install LLM provider SDKs for AI agent features:

# For OpenAI
pip install openai

# For Anthropic
pip install anthropic

# For Google Gemini
pip install google-generativeai

Quick Start

Basic Usage

1. Set API Keys (Optional but Recommended)

For AI-powered features, set your API key as an environment variable:

export OPENAI_API_KEY="your-key-here"
# OR
export ANTHROPIC_API_KEY="your-key-here"
# OR
export GEMINI_API_KEY="your-key-here"

2. Run the Pipeline

Run the interactive wizard:

# If installed via pip:
agentprep run

# If running from source without installing:
python -m cli run

The interactive wizard will guide you through:

LLM Provider (optional): Choose OpenAI, Anthropic, Gemini, or "None" (no LLM usage)
Dataset Selection: Enter the path to your CSV or Parquet file
Task Configuration: Select task type (classification, regression, time series, clustering)
Target Column: Choose your target variable from the dataset columns
Model Family: Select your intended model type (tree-based, linear, neural)
Constraint Suggestions: Review heuristic suggestions for pipeline constraints
Output Path: Specify where to save pipeline outputs

Example Session

$ python -m cli run

============================================================
Welcome to AgentPrep!
============================================================

This interactive wizard will guide you through configuring your preprocessing pipeline.

Enter path to your dataset (CSV or Parquet): data/my_dataset.csv
✓ Dataset loaded: 10,000 rows, 15 columns

Available columns:
  1. age
  2. income
  3. education
  ...
  
Select target column (1-15): 3

Select task type:
  1. Classification
  2. Regression
  3. Time Series
  4. Clustering
Select (1-4) [1]: 1

...

✓ Intent validated successfully
Starting preprocessing pipeline...
✓ Pipeline completed successfully

Command-Line Options

# Run with verbose logging
python -m cli run --verbose

# Specify output directory
python -m cli run --output ./results

# Run with config file (legacy mode)
python -m cli run --config intent.yaml

Configuration

Environment Variables

API keys for LLM providers are optional. At runtime, the CLI asks which provider you want to use (or "None").

# OpenAI
export OPENAI_API_KEY="your-openai-api-key"

# Anthropic
export ANTHROPIC_API_KEY="your-anthropic-api-key"

# Google Gemini
export GEMINI_API_KEY="your-gemini-api-key"

Note: If no API keys are provided, agents will run in stub mode (no LLM proposals), but the pipeline will still execute deterministic operations.
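
As an illustration of the fallback described above, a script can check which key is available before choosing between an LLM provider and stub mode. This is a sketch, not AgentPrep's own logic: the environment variable names come from this README, while `detect_provider` and its "first key found wins" order are assumptions.

```python
import os
from typing import Mapping, Optional

# Variable names are taken from the README; the lookup order is an assumption.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def detect_provider(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first provider whose API key is set, or None for stub mode."""
    for provider, var in PROVIDER_ENV_VARS.items():
        if env.get(var, "").strip():
            return provider
    return None


provider = detect_provider()
mode = "stub mode (deterministic operations only)" if provider is None else provider
print(f"LLM provider: {mode}")
```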

Intent Schema

The pipeline accepts configuration through an IntentSchema that includes:

Dataset: Path to your dataset file
Task: Task type and target column
Model: Model family and interpretability requirements
Constraints: Limits on features, interactions, and cardinality
Policies: Outlier handling and other data policies
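
To make the shape of such a configuration concrete, here is a rough sketch of an intent object with basic validation. AgentPrep itself uses Pydantic; this example uses standard-library dataclasses to stay dependency-free. The class name `Intent`, every field name, the default values, and the string identifiers such as `tree_based` are assumptions, not the real IntentSchema.

```python
from dataclasses import dataclass
from typing import List

# Identifiers below mirror the wizard's choices; their exact spelling is assumed.
TASK_TYPES = {"classification", "regression", "time_series", "clustering"}
MODEL_FAMILIES = {"tree_based", "linear", "neural"}


@dataclass
class Intent:
    dataset_path: str
    task_type: str
    target_column: str
    model_family: str
    max_features: int = 50        # hypothetical constraint default
    max_cardinality: int = 100    # hypothetical constraint default
    outlier_policy: str = "clip"  # hypothetical policy name

    def problems(self) -> List[str]:
        """Return human-readable validation problems; an empty list means valid."""
        found: List[str] = []
        if not self.dataset_path.lower().endswith((".csv", ".parquet")):
            found.append("dataset must be a .csv or .parquet file")
        if self.task_type not in TASK_TYPES:
            found.append(f"unknown task type: {self.task_type}")
        if self.model_family not in MODEL_FAMILIES:
            found.append(f"unknown model family: {self.model_family}")
        if not self.target_column:
            found.append("target column is required")
        if self.max_features < 1 or self.max_cardinality < 1:
            found.append("constraints must be positive integers")
        return found


good = Intent("data/my_dataset.csv", "classification", "education", "tree_based")
bad = Intent("data.xlsx", "ranking", "", "linear")
print(good.problems())
print(bad.problems())
```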

Pipeline Levels

Level 1: Data Ingestion & Schema Normalization

Loads datasets from CSV or Parquet files
Infers schema metadata (data types, nullability, distributions)
Normalizes column names and data types
Validates dataset against intent constraints
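
The normalization and inference steps above might look roughly like the following. This is a dependency-free sketch over rows of raw text values, not the `level1_ingestion` code; the function names, the snake_case rule, and the type labels are assumptions.

```python
import re
from typing import Dict, List, Optional


def normalize_column_name(name: str) -> str:
    """Lower-case a name and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_")
    return cleaned.lower()


def infer_column_type(values: List[Optional[str]]) -> str:
    """Return 'integer', 'float', 'string', or 'unknown' (no non-blank values)."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return "unknown"
    # Try the strictest numeric type first, then fall back.
    for label, caster in (("integer", int), ("float", float)):
        try:
            for v in present:
                caster(v)
            return label
        except ValueError:
            continue
    return "string"


def infer_schema(rows: List[Dict[str, Optional[str]]]) -> Dict[str, Dict[str, object]]:
    """Map each normalized column name to its inferred type and nullability."""
    schema: Dict[str, Dict[str, object]] = {}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [row.get(col) for row in rows]
        schema[normalize_column_name(col)] = {
            "type": infer_column_type(values),
            "nullable": any(v in (None, "") for v in values),
        }
    return schema


rows = [
    {"Age ": "34", "Annual Income($)": "52000.5", "Education": "BSc"},
    {"Age ": "", "Annual Income($)": "61000", "Education": "MSc"},
]
print(infer_schema(rows))
```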

Level 2: Data Quality Agent

Profiles dataset quality (missing values, outliers, duplicates)
LLM agent proposes data cleaning actions
Deterministic executor validates and applies safe actions
Tracks applied vs rejected actions
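
The propose, validate, apply flow above can be sketched as a whitelist-based executor: proposals are plain data, and only known action types are ever run. The action names, the dictionary format, and `apply_actions` are hypothetical and do not reflect AgentPrep's actual action vocabulary.

```python
import statistics
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]
Action = Dict[str, str]

# Hypothetical whitelist: anything outside it is rejected, never executed.
ALLOWED_ACTIONS = {"drop_duplicates", "fill_missing_with_median"}


def apply_actions(
    rows: List[Row], proposals: List[Action]
) -> Tuple[List[Row], List[Action], List[Action]]:
    """Apply whitelisted actions in order; return (rows, applied, rejected)."""
    applied: List[Action] = []
    rejected: List[Action] = []
    for action in proposals:
        kind = action.get("type")
        if kind not in ALLOWED_ACTIONS:
            rejected.append(action)
        elif kind == "drop_duplicates":
            seen = set()
            unique: List[Row] = []
            for row in rows:
                key = tuple(sorted(row.items()))
                if key not in seen:
                    seen.add(key)
                    unique.append(row)
            rows = unique
            applied.append(action)
        else:  # fill_missing_with_median
            column = action.get("column")
            present = [r[column] for r in rows if r.get(column) is not None]
            if not present:
                rejected.append(action)  # unknown or entirely missing column
                continue
            fill = statistics.median(present)
            rows = [
                {**r, column: fill if r.get(column) is None else r[column]}
                for r in rows
            ]
            applied.append(action)
    return rows, applied, rejected


data: List[Row] = [
    {"age": 30.0, "income": 100.0},
    {"age": None, "income": 200.0},
    {"age": 50.0, "income": 300.0},
    {"age": 50.0, "income": 300.0},
]
proposals: List[Action] = [
    {"type": "drop_duplicates"},
    {"type": "fill_missing_with_median", "column": "age"},
    {"type": "exec_python", "code": "..."},  # not whitelisted, so rejected
]
cleaned, applied, rejected = apply_actions(data, proposals)
print(len(cleaned), len(applied), len(rejected))
```

Keeping proposals as data rather than code is what makes the run reproducible: the same proposals applied to the same dataset always yield the same result.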

Level 3: Metadata & Profiling Persistence

Builds comprehensive pipeline metadata
Records schema, quality profiles, and applied actions
Writes metadata to disk for traceability
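
A minimal sketch of the persistence step, assuming metadata is stored as a single JSON document. The file name `metadata.json`, the key layout, and `write_metadata` are assumptions made for illustration; the actual on-disk format is not described in this README.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List


def write_metadata(
    output_dir: Path,
    schema: Dict[str, Dict[str, object]],
    applied: List[Dict[str, str]],
    rejected: List[Dict[str, str]],
) -> Path:
    """Write schema and action history to <output_dir>/metadata.json."""
    metadata = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "schema": schema,
        "actions": {"applied": applied, "rejected": rejected},
    }
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / "metadata.json"  # hypothetical file name
    # sort_keys keeps the file stable across runs, which makes diffs readable.
    path.write_text(json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    written = write_metadata(
        Path(tmp) / "results",
        {"age": {"type": "integer", "nullable": True}},
        [{"type": "drop_duplicates"}],
        [],
    )
    loaded = json.loads(written.read_text(encoding="utf-8"))

print(sorted(loaded))
```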

Level 4: Feature Engineering Agent

LLM agent proposes feature transformations
Validates features for safety and compliance
Generates features deterministically
Tracks feature provenance
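
One way to picture deterministic feature generation with provenance: the agent proposes a feature as a small spec, and a fixed table of operations computes it. The spec format, the two operations, and `generate_feature` are hypothetical and chosen only to illustrate the idea.

```python
import math
from typing import Any, Callable, Dict, List

Row = Dict[str, float]

# Hypothetical whitelist of deterministic operations.
OPERATIONS: Dict[str, Callable[..., float]] = {
    "log1p": lambda x: math.log1p(x),
    "ratio": lambda a, b: a / b if b else 0.0,  # division by zero maps to 0.0
}


def generate_feature(rows: List[Row], spec: Dict[str, Any]) -> Dict[str, Any]:
    """Add the feature to each row in place and return a provenance record."""
    op = OPERATIONS.get(spec["op"])
    if op is None:
        raise ValueError(f"operation not allowed: {spec['op']}")
    missing = [c for c in spec["inputs"] if rows and c not in rows[0]]
    if missing:
        raise ValueError(f"unknown input columns: {missing}")
    for row in rows:
        row[spec["name"]] = op(*(row[c] for c in spec["inputs"]))
    # Provenance: enough information to recompute the feature later.
    return {"feature": spec["name"], "op": spec["op"], "inputs": list(spec["inputs"])}


rows: List[Row] = [{"income": 52000.0, "age": 26.0}]
spec = {"name": "income_per_age", "op": "ratio", "inputs": ["income", "age"]}
provenance = generate_feature(rows, spec)
print(rows[0]["income_per_age"], provenance)
```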

Level 5: Governance & Policy

Enforces policy rules (constraint violations, data leakage)
Validates feature engineering proposals
Detects potential data leakage issues
Provides governance decisions
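
The checks above could be sketched as a review function that returns a decision plus the reasons behind it. This example covers only two simple rules: a feature that reads the target column (a common source of leakage) and a cap on the number of features. The rule set, the spec format, and `review_proposals` are assumptions, not AgentPrep's policy engine.

```python
from typing import Any, Dict, List


def review_proposals(
    proposals: List[Dict[str, Any]], target_column: str, max_features: int
) -> Dict[str, Any]:
    """Return {'approved': bool, 'violations': [...]} for a batch of feature specs."""
    violations: List[str] = []
    for spec in proposals:
        # A feature derived from the target leaks label information into training.
        if target_column in spec["inputs"]:
            violations.append(
                f"{spec['name']}: uses target column '{target_column}' (possible leakage)"
            )
    if len(proposals) > max_features:
        violations.append(
            f"{len(proposals)} features proposed, limit is {max_features}"
        )
    return {"approved": not violations, "violations": violations}


proposals = [
    {"name": "income_per_age", "op": "ratio", "inputs": ["income", "age"]},
    {"name": "edu_x_income", "op": "ratio", "inputs": ["education", "income"]},
]
decision = review_proposals(proposals, target_column="education", max_features=5)
print(decision)
```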

Level 6: Artifacts, Storage & Reporting

Captures all pipeline artifacts (datasets, schemas, features, metadata)
Stores artifacts in organized directory structure
Exports artifacts in multiple formats (JSON, CSV, Parquet, Markdown)
Generates human-readable reports
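
As a rough idea of what a human-readable report might contain, the sketch below renders a short Markdown summary of applied and rejected actions. The layout and `render_report` are illustrative assumptions; the reports AgentPrep actually produces may look different.

```python
from typing import Dict, List


def render_report(
    dataset: str, applied: List[Dict[str, str]], rejected: List[Dict[str, str]]
) -> str:
    """Render a small Markdown report summarizing one pipeline run."""
    lines = [
        f"# Preprocessing report: {dataset}",
        "",
        f"- Applied actions: {len(applied)}",
        f"- Rejected actions: {len(rejected)}",
        "",
        "## Applied",
        "",
    ]
    lines += [f"- {a['type']}" for a in applied] or ["- (none)"]
    lines += ["", "## Rejected", ""]
    lines += [f"- {a['type']}" for a in rejected] or ["- (none)"]
    return "\n".join(lines) + "\n"


report = render_report("data/my_dataset.csv", [{"type": "drop_duplicates"}], [])
print(report)
```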

Project Structure

AgentPrep/
├── cli/                    # CLI package (use: python -m cli)
│   ├── interactive.py     # Interactive prompts
│   └── constraint_advisor.py  # Constraint suggestions
├── core/                   # Core orchestration
│   └── orchestrator.py    # Pipeline orchestrator
├── intent/                 # Intent validation
│   ├── schema.py          # Intent schema definitions
│   └── validator.py       # Intent validation logic
├── level1_ingestion/       # Data loading & normalization
├── level2_quality/         # Data quality agent
├── level3_metadata/        # Metadata generation
├── level4_feature/         # Feature engineering agent
├── level5_governance/      # Governance & policies
├── level5_policy/          # Policy enforcement
├── level6_artifacts/       # Artifact management
└── utils/                  # Shared utilities
    ├── logging.py         # Logging setup
    ├── constants.py       # Application constants
    ├── file_helpers.py    # File utilities
    └── llm_client.py      # LLM client wrapper

Supported Formats

Datasets: CSV, Parquet
Configurations: YAML, JSON (via interactive mode)
Output Formats: JSON, CSV, Parquet, Markdown

Exit Codes

0: Success
1: Invalid intent configuration
2: Policy violation detected
3: Runtime error
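
For scripts that wrap the CLI, the codes above map naturally onto an enumeration. The numeric values come from the list above; the `ExitCode` class, its member names, and `describe` are assumptions introduced for this sketch.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    # Values are from the README; member names are illustrative.
    SUCCESS = 0
    INVALID_INTENT = 1
    POLICY_VIOLATION = 2
    RUNTIME_ERROR = 3


def describe(code: int) -> str:
    """Translate a process return code into a short label."""
    try:
        return ExitCode(code).name.lower().replace("_", " ")
    except ValueError:
        return f"unknown exit code {code}"


for code in (0, 2, 9):
    print(code, "->", describe(code))
```

A wrapper could pass the `returncode` of a `subprocess.run(...)` call on the CLI to `describe` to decide whether a failure came from configuration, policy, or a runtime error.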

Development

Running Tests

Tests are located in the tests/ directory. To run tests:

# Install test dependencies
pip install -e ".[dev]"

# Run all tests
pytest

# Run with coverage
pytest --cov=. --cov-report=html

Code Quality

We use black for formatting and ruff for linting:

# Install package with development dependencies
pip install -e ".[dev]"

# Format code
black .

# Lint code
ruff check .


Code Structure

Modular Design: Each level is self-contained with clear interfaces
Type Safety: Uses Pydantic for schema validation and type hints throughout
Logging: Centralized logging via utils.logging
Error Handling: Comprehensive error handling with custom exception types

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Key points:
1. Follow the existing code structure and naming conventions
2. Add tests for new features
3. Update documentation as needed
4. Ensure all tests pass before submitting
5. Format code with black and lint with ruff

Security

For security vulnerabilities, please see SECURITY.md. Do not open public issues for security concerns.

Support

For issues, questions, or contributions, please open an issue or create a pull request.

AgentPrep - Intelligent ML Preprocessing with AI Agents

Version 0.1.0 (Jan 27, 2026)

