Intelligent ML preprocessing pipeline tool with AI agents
AgentPrep
AgentPrep is an intelligent machine learning preprocessing pipeline tool that combines AI agents with deterministic validation to automate data preparation tasks. It provides a guided, interactive experience for cleaning data, engineering features, and ensuring data quality and governance compliance.
Features
- AI-Powered Agents: Leverages LLM agents (OpenAI, Anthropic, Gemini) for intelligent data quality improvements and feature engineering suggestions
- Deterministic Validation: All agent proposals are validated and executed deterministically for reproducibility
- Interactive CLI: User-friendly wizard interface - no configuration files required
- Constraint Advisor: Heuristic-based suggestions for pipeline constraints based on your dataset
- Governance & Policy: Built-in policy enforcement and data leakage detection
- Artifact Management: Comprehensive artifact tracking, storage, and reporting
- Metadata Tracking: Full provenance and metadata generation for auditability
Architecture
AgentPrep follows a multi-level pipeline architecture:
- Level 0: Intent Validation - Validates user configuration and constraints
- Level 1: Data Ingestion & Schema Normalization - Loads datasets, infers schemas, normalizes columns
- Level 2: Data Quality Agent - Profiles data quality and applies cleaning actions
- Level 3: Metadata & Profiling Persistence - Generates and stores comprehensive metadata
- Level 4: Feature Engineering Agent - Proposes and generates ML features
- Level 5: Governance & Policy - Enforces policies and detects data leakage
- Level 6: Artifacts, Storage & Reporting - Captures and exports pipeline artifacts
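The level ordering above can be pictured as a list of stages run in sequence, each receiving the shared state left by the previous one. The sketch below is illustrative only: `Stage`, `make_stage`, `run_pipeline`, and the context dictionary are hypothetical names, not AgentPrep's actual orchestrator API, and the stage bodies are stand-ins that only record that they ran.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stage type: each level takes the shared context and returns it updated.
Stage = Callable[[Dict[str, object]], Dict[str, object]]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(context: Dict[str, object]) -> Dict[str, object]:
        completed = list(context.get("completed", []))
        completed.append(name)
        return {**context, "completed": completed}
    return stage


# Level numbers and names are taken from the list above.
LEVELS: List[Tuple[int, str]] = [
    (0, "Intent Validation"),
    (1, "Data Ingestion & Schema Normalization"),
    (2, "Data Quality Agent"),
    (3, "Metadata & Profiling Persistence"),
    (4, "Feature Engineering Agent"),
    (5, "Governance & Policy"),
    (6, "Artifacts, Storage & Reporting"),
]


def run_pipeline(context: Dict[str, object]) -> Dict[str, object]:
    """Run every level in ascending order, threading the context through."""
    for _, name in sorted(LEVELS):
        context = make_stage(name)(context)
    return context


result = run_pipeline({"dataset": "data/my_dataset.csv"})
print(result["completed"])
```

Passing one context object through every level is one simple way to let later levels (for example governance) see what earlier levels produced.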
Installation
Prerequisites
- Python 3.8+
- pip
Setup
Method 1: Install via Pip (Recommended for Users)
pip install agentprep
Then run:
agentprep run
Method 2: Install from Source (For Developers)
- Clone the repository:
git clone <repository-url>
cd AgentPrep
- Run the setup script:
chmod +x setup_prod.sh
./setup_prod.sh
source venv/bin/activate
Or manually:
pip install .
Note: Always use a virtual environment to avoid dependency conflicts. See PRODUCTION_SETUP.md for detailed setup instructions.
- (Optional) Install LLM provider SDKs for AI agent features:
# For OpenAI
pip install openai
# For Anthropic
pip install anthropic
# For Google Gemini
pip install google-generativeai
Quick Start
Basic Usage
1. Set API Keys (Optional but Recommended)
For AI-powered features, set your API key as an environment variable:
export OPENAI_API_KEY="your-key-here"
# OR
export ANTHROPIC_API_KEY="your-key-here"
# OR
export GEMINI_API_KEY="your-key-here"
2. Run the Pipeline
Run the interactive wizard:
# If installed via pip:
agentprep run
# If running from source without installing:
python -m cli run
The interactive wizard will guide you through:
- LLM Provider (optional): Choose OpenAI, Anthropic, Gemini, or "None" (no LLM usage)
- Dataset Selection: Enter the path to your CSV or Parquet file
- Task Configuration: Select task type (classification, regression, time series, clustering)
- Target Column: Choose your target variable from the dataset columns
- Model Family: Select your intended model type (tree-based, linear, neural)
- Constraint Suggestions: Get intelligent suggestions for pipeline constraints
- Output Path: Specify where to save pipeline outputs
Example Session
$ python -m cli run
============================================================
Welcome to AgentPrep!
============================================================
This interactive wizard will guide you through configuring your preprocessing pipeline.
Enter path to your dataset (CSV or Parquet): data/my_dataset.csv
✓ Dataset loaded: 10,000 rows, 15 columns
Available columns:
1. age
2. income
3. education
...
Select target column (1-15): 3
Select task type:
1. Classification
2. Regression
3. Time Series
4. Clustering
Select (1-4) [1]: 1
...
✓ Intent validated successfully
Starting preprocessing pipeline...
✓ Pipeline completed successfully
Command-Line Options
# Run with verbose logging
python -m cli run --verbose
# Specify output directory
python -m cli run --output ./results
# Run with config file (legacy mode)
python -m cli run --config intent.yaml
Configuration
Environment Variables
Set API keys for LLM providers (optional - agents work without them in stub mode). At runtime, the CLI will ask which provider you want to use (or "None").
# OpenAI
export OPENAI_API_KEY="your-openai-api-key"
# Anthropic
export ANTHROPIC_API_KEY="your-anthropic-api-key"
# Google Gemini
export GEMINI_API_KEY="your-gemini-api-key"
Note: If no API keys are provided, agents will run in stub mode (no LLM proposals), but the pipeline will still execute deterministic operations.
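As a rough illustration of the fallback described in the note, the sketch below decides between LLM mode and stub mode from the environment. Only the three environment variable names come from this document; `available_providers` and `resolve_mode` are hypothetical helpers, and the real CLI may decide this differently.

```python
import os
from typing import List, Mapping, Optional

# Environment variable names are the ones documented above.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def available_providers(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the providers whose API key is set to a non-empty value."""
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var)]


def resolve_mode(choice: Optional[str], env: Mapping[str, str] = os.environ) -> str:
    """Return "llm" when the chosen provider has a key, otherwise "stub"."""
    if choice is not None and choice in available_providers(env):
        return "llm"
    return "stub"


# Choosing "None" in the wizard, or a provider without a key, falls back to stub mode.
print(resolve_mode(None), available_providers())
```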
Intent Schema
The pipeline accepts configuration through an IntentSchema that includes:
- Dataset: Path to your dataset file
- Task: Task type and target column
- Model: Model family and interpretability requirements
- Constraints: Limits on features, interactions, and cardinality
- Policies: Outlier handling and other data policies
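To make the shape of such a schema concrete, here is a minimal sketch using standard-library dataclasses so that it runs without extra dependencies. Every field name, the allowed-value sets, and the `validate` method are assumptions made for illustration; the project's real IntentSchema may use different names and rules.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Allowed values mirror the wizard options listed in Quick Start (spelling is assumed).
TASK_TYPES = {"classification", "regression", "time_series", "clustering"}
MODEL_FAMILIES = {"tree_based", "linear", "neural"}


@dataclass
class Constraints:
    # Hypothetical limit names; None means "no limit".
    max_features: Optional[int] = None
    max_interactions: Optional[int] = None
    max_cardinality: Optional[int] = None


@dataclass
class Intent:
    dataset_path: str
    task_type: str
    target_column: Optional[str]
    model_family: str
    constraints: Constraints = field(default_factory=Constraints)
    policies: Dict[str, str] = field(default_factory=dict)  # e.g. {"outliers": "clip"}

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty when valid)."""
        errors = []
        if self.task_type not in TASK_TYPES:
            errors.append(f"unknown task type: {self.task_type}")
        if self.model_family not in MODEL_FAMILIES:
            errors.append(f"unknown model family: {self.model_family}")
        if self.task_type != "clustering" and not self.target_column:
            errors.append("a target column is required for this task type")
        return errors


intent = Intent("data/my_dataset.csv", "classification", "education", "tree_based")
print(intent.validate())
```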
Pipeline Levels
Level 1: Data Ingestion & Schema Normalization
- Loads datasets from CSV or Parquet files
- Infers schema metadata (data types, nullability, distributions)
- Normalizes column names and data types
- Validates dataset against intent constraints
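The normalization and schema inference steps can be sketched as follows. This is an assumption about what such a step might do, not a description of AgentPrep's implementation: `normalize_column_name` and `infer_column_type` are hypothetical helpers that work on raw string cells, as they would arrive from a CSV reader.

```python
import re
from typing import Dict, List, Optional


def normalize_column_name(name: str) -> str:
    """Lowercase a column name and collapse non-alphanumeric runs to underscores."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_")
    return cleaned.lower()


def infer_column_type(values: List[Optional[str]]) -> Dict[str, object]:
    """Infer a coarse type and nullability from raw string cells."""
    present = [v for v in values if v is not None and v.strip() != ""]
    nullable = len(present) < len(values)

    def all_parse(cast) -> bool:
        # True when every non-empty cell can be converted with `cast`.
        try:
            for v in present:
                cast(v)
            return True
        except ValueError:
            return False

    if present and all_parse(int):
        dtype = "integer"
    elif present and all_parse(float):
        dtype = "float"
    else:
        dtype = "string"
    return {"dtype": dtype, "nullable": nullable}


print(normalize_column_name("  Annual Income ($) "))
print(infer_column_type(["1", "2", ""]))
```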
Level 2: Data Quality Agent
- Profiles dataset quality (missing values, outliers, duplicates)
- LLM agent proposes data cleaning actions
- Deterministic executor validates and applies safe actions
- Tracks applied vs rejected actions
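One way to picture the split between proposal and execution is an allowlist check that runs before anything touches the data. The sketch below is a hypothetical illustration: the action names, the parameter names, the rejection reasons, and `review_proposals` itself are assumptions and do not reflect AgentPrep's real action vocabulary.

```python
from typing import Dict, List, Tuple

# Hypothetical allowlist: action name -> required parameter names.
ALLOWED_ACTIONS = {
    "drop_duplicates": set(),
    "fill_missing": {"column", "strategy"},
    "clip_outliers": {"column", "lower", "upper"},
}


def review_proposals(
    proposals: List[Dict[str, object]], columns: List[str], target: str
) -> Tuple[List[Dict[str, object]], List[Tuple[Dict[str, object], str]]]:
    """Split proposals into applied actions and (proposal, reason) rejections."""
    applied, rejected = [], []
    for proposal in proposals:
        action = proposal.get("action")
        params = proposal.get("params", {})
        if action not in ALLOWED_ACTIONS:
            rejected.append((proposal, "action is not on the allowlist"))
        elif not ALLOWED_ACTIONS[action] <= set(params):
            rejected.append((proposal, "missing required parameters"))
        elif "column" in params and params["column"] not in columns:
            rejected.append((proposal, "unknown column"))
        elif params.get("column") == target:
            rejected.append((proposal, "refusing to modify the target column"))
        else:
            applied.append(proposal)
    return applied, rejected


proposals = [
    {"action": "drop_duplicates"},
    {"action": "fill_missing", "params": {"column": "age", "strategy": "median"}},
    {"action": "run_shell", "params": {}},
    {"action": "fill_missing", "params": {"column": "label", "strategy": "mode"}},
]
applied, rejected = review_proposals(proposals, ["age", "label"], "label")
print(len(applied), [reason for _, reason in rejected])
```

Because the checks are plain rules over the proposal, the same proposals always yield the same applied and rejected lists, regardless of which LLM produced them.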
Level 3: Metadata & Profiling Persistence
- Builds comprehensive pipeline metadata
- Records schema, quality profiles, and applied actions
- Writes metadata to disk for traceability
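A minimal sketch of the persistence step, assuming the metadata is written as a single JSON document. The file name `metadata.json`, the key names, and `write_metadata` are hypothetical; the actual on-disk layout is not specified in this document.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_metadata(
    output_dir: Path,
    schema: Dict[str, str],
    quality_profile: Dict[str, float],
    applied_actions: List[str],
) -> Path:
    """Serialize pipeline metadata to <output_dir>/metadata.json and return the path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    metadata = {
        "schema": schema,
        "quality_profile": quality_profile,
        "applied_actions": applied_actions,
    }
    path = output_dir / "metadata.json"
    # sort_keys keeps the file byte-for-byte stable across runs with the same inputs.
    path.write_text(json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8")
    return path
```

Calling `write_metadata(Path("./results"), {"age": "integer"}, {"missing_ratio": 0.02}, ["drop_duplicates"])` would create `./results/metadata.json`.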
Level 4: Feature Engineering Agent
- LLM agent proposes feature transformations
- Validates features for safety and compliance
- Generates features deterministically
- Tracks feature provenance
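Deterministic generation with provenance could look roughly like the sketch below: the agent only supplies a small specification, and a fixed registry of transformations does the computation. The registry contents, the spec keys (`name`, `op`, `inputs`), and `build_feature` are assumptions for illustration, not AgentPrep's feature API.

```python
from typing import Callable, Dict, List

# Hypothetical registry of deterministic two-column transformations.
TRANSFORMS: Dict[str, Callable[[float, float], float]] = {
    "difference": lambda a, b: a - b,
    "product": lambda a, b: a * b,
}


def build_feature(rows: List[Dict[str, float]], spec: Dict[str, object]) -> Dict[str, object]:
    """Add the feature described by spec to every row and return its provenance record."""
    transform = TRANSFORMS[spec["op"]]  # unknown ops raise KeyError instead of being guessed
    left, right = spec["inputs"]
    for row in rows:
        row[spec["name"]] = transform(row[left], row[right])
    return {"feature": spec["name"], "op": spec["op"], "source_columns": [left, right]}


rows = [{"income": 10.0, "age": 2.0}, {"income": 3.0, "age": 4.0}]
spec = {"name": "income_x_age", "op": "product", "inputs": ["income", "age"]}
provenance = build_feature(rows, spec)
print(rows[0]["income_x_age"], provenance["source_columns"])
```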
Level 5: Governance & Policy
- Enforces policy rules (constraint violations, data leakage)
- Validates feature engineering proposals
- Detects potential data leakage issues
- Provides governance decisions
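As a simplified illustration of leakage detection, the sketch below flags two obvious cases: a feature whose recorded source columns include the target, and a column that equals the target in every row. Both heuristics and the name `find_leaky_features` are assumptions; the checks AgentPrep actually performs are not described here.

```python
from typing import Dict, List


def find_leaky_features(
    rows: List[Dict[str, object]],
    target: str,
    provenance: Dict[str, List[str]],
) -> Dict[str, str]:
    """Return {feature: reason} for features that look like target leakage."""
    flagged: Dict[str, str] = {}
    # Case 1: provenance says the feature was computed from the target.
    for feature, sources in provenance.items():
        if target in sources:
            flagged[feature] = "derived from the target column"
    # Case 2: a column that duplicates the target value in every row.
    if rows:
        for column in rows[0]:
            if column == target or column in flagged:
                continue
            if all(row[column] == row[target] for row in rows):
                flagged[column] = "identical to the target in every row"
    return flagged


rows = [{"y": 1, "copy": 1, "x": 5}, {"y": 0, "copy": 0, "x": 5}]
report = find_leaky_features(rows, "y", {"y_squared": ["y"], "x_scaled": ["x"]})
print(report)
```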
Level 6: Artifacts, Storage & Reporting
- Captures all pipeline artifacts (datasets, schemas, features, metadata)
- Stores artifacts in organized directory structure
- Exports artifacts in multiple formats (JSON, CSV, Parquet, Markdown)
- Generates human-readable reports
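For the Markdown export, a report can be assembled from plain data with a few lines of string handling. The function `render_report` and the section layout below are hypothetical and only show the idea; the reports AgentPrep generates may be structured differently.

```python
from typing import Dict, List


def render_report(title: str, sections: Dict[str, List[str]]) -> str:
    """Render a title and bulleted sections as a Markdown document."""
    lines = [f"# {title}", ""]
    for heading, items in sections.items():
        lines.append(f"## {heading}")
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    # Trim trailing blank lines and end with exactly one newline.
    return "\n".join(lines).rstrip() + "\n"


print(render_report("Pipeline run", {"Applied actions": ["drop_duplicates"]}))
```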
Project Structure
AgentPrep/
├── cli/ # CLI modules
│ ├── interactive.py # Interactive prompts
│ └── constraint_advisor.py # Constraint suggestions
├── core/ # Core orchestration
│ └── orchestrator.py # Pipeline orchestrator
├── intent/ # Intent validation
│ ├── schema.py # Intent schema definitions
│ └── validator.py # Intent validation logic
├── level1_ingestion/ # Data loading & normalization
├── level2_quality/ # Data quality agent
├── level3_metadata/ # Metadata generation
├── level4_feature/ # Feature engineering agent
├── level5_governance/ # Governance & policies
├── level5_policy/ # Policy enforcement
├── level6_artifacts/ # Artifact management
└── utils/ # Shared utilities
  ├── logging.py # Logging setup
  ├── constants.py # Application constants
  ├── file_helpers.py # File utilities
  └── llm_client.py # LLM client wrapper
Supported Formats
- Datasets: CSV, Parquet
- Configurations: YAML, JSON (via interactive mode)
- Output Formats: JSON, CSV, Parquet, Markdown
Exit Codes
- 0: Success
- 1: Invalid intent configuration
- 2: Policy violation detected
- 3: Runtime error
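Scripts that wrap the CLI can map these codes to names instead of comparing against bare integers. In the sketch below the numeric values come from the list above, while the `ExitCode` enum, its member names, and `describe_exit` are hypothetical helpers and not part of the package.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes documented above; member names are illustrative."""
    SUCCESS = 0
    INVALID_INTENT = 1
    POLICY_VIOLATION = 2
    RUNTIME_ERROR = 3


def describe_exit(returncode: int) -> str:
    """Translate a process return code into a short description."""
    try:
        return ExitCode(returncode).name.replace("_", " ").lower()
    except ValueError:
        return f"unknown exit code {returncode}"


print(describe_exit(2))
```

A wrapper could pass the `returncode` attribute of `subprocess.run(["agentprep", "run"])` to `describe_exit` to log why a run stopped.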
Development
Running Tests
Tests are located in the tests/ directory. To run tests:
# Install test dependencies
pip install -e ".[dev]"
# Run all tests
pytest
# Run with coverage
pytest --cov=. --cov-report=html
Code Quality
We use black for formatting and ruff for linting:
# Install package with development dependencies
pip install -e ".[dev]"
# Format code
black .
# Lint code
ruff check .
Code Structure
- Modular Design: Each level is self-contained with clear interfaces
- Type Safety: Uses Pydantic for schema validation and type hints throughout
- Logging: Centralized logging via utils.logging
- Error Handling: Comprehensive error handling with custom exception types
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Key points:
1. Follow the existing code structure and naming conventions
2. Add tests for new features
3. Update documentation as needed
4. Ensure all tests pass before submitting
5. Format code with black and lint with ruff
Security
For security vulnerabilities, please see SECURITY.md. Do not open public issues for security concerns.
Support
For issues, questions, or contributions, please open an issue or create a pull request.
AgentPrep - Intelligent ML Preprocessing with AI Agents
File details
Details for the file agentprep-0.1.0.tar.gz.
File metadata
- Download URL: agentprep-0.1.0.tar.gz
- Upload date:
- Size: 89.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e46db1e456e573fe313bbf6440250c45528b5933da6ee2e892e55c91c499e57f |
| MD5 | d49fd1477157f368fd446fd4de3b2f87 |
| BLAKE2b-256 | 2a107759c3905eccf6f026fdbf490c705e21c2c95b910affbc4af01fd2c7b9d5 |
File details
Details for the file agentprep-0.1.0-py3-none-any.whl.
File metadata
- Download URL: agentprep-0.1.0-py3-none-any.whl
- Upload date:
- Size: 105.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cbba11ebd3b88c60952302b48c5a1380146289f550c018560612d43e6588bcc1 |
| MD5 | 913620c3541207f4cd79b09264382c25 |
| BLAKE2b-256 | f7371fbe9e5909403dc2e4f3e84a525f3a5972e272e17980088cb513675e6ae7 |