One-line AutoML: from idea to trained model using Hugging Face + AutoGluon


🚀 AutoHF

One-line AutoML: from idea to trained model using Hugging Face + AutoGluon.

AutoHF is an autonomous machine learning pipeline that takes a natural-language description of a task (e.g., "sentiment analysis"), finds candidate datasets on the Hugging Face Hub, ranks them by quality signals, and trains a model on the best candidate using AutoGluon.


✨ Features

  • 🔍 Intent-to-Task: Automatically detects ML task types (classification, regression, etc.) and keywords from natural language.
  • 📦 Autonomous Dataset Discovery: Searches the Hugging Face Hub for relevant datasets using multi-strategy search.
  • 🏆 Intelligent Ranking: Ranks datasets based on quality signals like downloads, likes, and metadata completeness.
  • 🏋️ Automated Training: Leverages AutoGluon to train high-quality models with minimal configuration.
  • 🧬 Agentic Architecture: Inspired by patterns from AutoGen, LangGraph, and OpenHands for robust state management and collaboration.
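The Intent-to-Task step can be pictured with a small keyword matcher. This is only a sketch of the general idea, not AutoHF's TaskAgent: the `TASK_KEYWORDS` table, the task names, and the `detect_task` signature shown here are hypothetical, and the fuzzy fallback uses Python's standard `difflib` as a stand-in for whatever the project actually does.

```python
import difflib
import re
from typing import Optional

# Hypothetical keyword table; the real TaskAgent vocabulary is not shown in this README.
TASK_KEYWORDS = {
    "text-classification": ["sentiment", "spam", "review", "classification", "toxicity"],
    "question-answering": ["question", "answering", "qa"],
    "regression": ["price", "forecast", "regression"],
}


def detect_task(description: str) -> tuple[Optional[str], list[str]]:
    """Return (task_type, matched_keywords) for a free-text description."""
    words = re.findall(r"[a-z]+", description.lower())
    best_task, best_hits = None, []
    for task, vocab in TASK_KEYWORDS.items():
        hits = []
        for word in words:
            if word in vocab:
                hits.append(word)
            else:
                # Fuzzy fallback: tolerate small typos such as "sentimnt".
                hits.extend(difflib.get_close_matches(word, vocab, n=1, cutoff=0.8))
        # Keep the task with the most keyword hits; ties go to the earlier entry.
        if len(hits) > len(best_hits):
            best_task, best_hits = task, hits
    return best_task, best_hits
```

Calling `detect_task("sentiment analysis")` with this table selects the text-classification entry; a description with no recognizable keyword returns `(None, [])`, which a caller could treat as a cue to fall back to another router.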

🛠️ Internal Workflow

The following diagram shows how AutoHF orchestrates the pipeline from user input to a trained model:

graph TD
    User([User Input: 'sentiment analysis']) --> CLI[CLI / Python API]
    CLI --> Orchestrator[AutoHF Orchestrator]
    
    subgraph "Autonomous Pipeline (LangGraph-inspired States)"
        Orchestrator --> State1[Detecting Task]
        State1 --> TaskAgent[TaskAgent: Detects task type & keywords]
        
        TaskAgent --> State2[Searching Datasets]
        State2 --> DatasetAgent[DatasetAgent: Searches HF Hub]
        
        DatasetAgent --> State3[Ranking Datasets]
        State3 --> Ranker[DatasetRanker: Ranks by quality signals]
        
        Ranker --> State4[Loading & Profiling]
        State4 --> Loader[DatasetAgent: Loads best candidate & profiles]
        
        Loader --> State5[Training]
        State5 --> Trainer[AutoGluonTrainer: Trains & Optimizes]
    end
    
    Trainer --> State6[Completed]
    State6 --> Result[TrainResult: Model + Metrics]
    Result --> User

🏗️ Project Structure

AutoHF follows a modular, layered architecture organized into five core packages:

autohf/
├── __init__.py                 # Public API exports (AutoHF, AutoHFConfig, TrainResult, etc.)
├── cli/
│   ├── __init__.py
│   └── main.py                 # Typer CLI: train, search, info subcommands
├── core/
│   ├── __init__.py
│   ├── config.py               # Data models, presets, enums (PipelineState, TrainResult, DatasetCandidate...)
│   └── autohf.py               # AutoHF orchestrator: central state-machine coordinator
├── agents/
│   ├── __init__.py
│   ├── task_agent.py           # Intent-to-task detection (keyword / OpenAI router)
│   ├── dataset_agent.py        # Dataset discovery, loading, profiling (3-strategy HF Hub search)
│   └── model_agent.py          # Model search agent (Phase 2 preparation)
├── ranking/
│   ├── __init__.py
│   ├── dataset_ranker.py       # Keyword-based composite scoring (default)
│   ├── semantic_ranker.py      # Vector + Cross-Encoder semantic ranking (optional dep)
│   └── model_ranker.py         # Model ranking stub (Phase 2)
└── training/
    ├── __init__.py
    └── autogluon_trainer.py    # AutoGluon TabularPredictor wrapper (fit, eval, predict)

tests/
├── test_config.py              # Config defaults & preset validation
└── test_task_agent.py          # Keyword detection, fuzzy fallback, history

pyproject.toml                  # Build config, dependencies, CLI entry point, lint/test settings
README.md                       # This file

Module Responsibilities

| Package  | Responsibility                                                  | Key Classes / Functions                    |
|----------|-----------------------------------------------------------------|--------------------------------------------|
| core     | Configuration, data models, orchestration                       | AutoHFConfig, PipelineState, AutoHF        |
| agents   | External interaction: task detection, dataset/model discovery   | TaskAgent, DatasetAgent, ModelAgent        |
| ranking  | Relevance & quality scoring for datasets and models             | DatasetRanker, SemanticRanker, rank_models |
| training | Model training, evaluation, and inference                       | train_model, load_predictor, predict       |
| cli      | User-facing command-line interface                              | train, search, info                        |

🏗️ Architecture & Patterns

AutoHF is built using modern software engineering patterns for AI:

  • State Management: Uses a typed state machine (PipelineState) inspired by LangGraph to track progress and handle transitions through the pipeline.
  • Agent Collaboration: Employs specialized agents (TaskAgent, DatasetAgent, ModelAgent) similar to AutoGen to separate concerns and enable independent extensibility.
  • Autonomous Execution: Implements retry logic and multi-strategy discovery patterns found in OpenHands for resilient dataset sourcing.
  • Tabular Power: Uses AutoGluon as the underlying engine for robust, automated model selection and hyperparameter tuning.
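As an illustration of the retry and multi-strategy discovery pattern, the sketch below runs several search strategies in turn, retries each one on failure, and merges the unique results. It is an assumption-laden outline rather than AutoHF's DatasetAgent: `discover`, the `SearchStrategy` type, and the retry and backoff parameters are hypothetical names introduced here.

```python
import time
from typing import Callable, Iterable

# A strategy takes a query and yields dataset ids (hypothetical interface).
SearchStrategy = Callable[[str], Iterable[str]]


def discover(
    query: str,
    strategies: list[SearchStrategy],
    retries: int = 2,
    backoff_s: float = 0.0,
) -> list[str]:
    """Run every strategy, retrying failures, and merge unique dataset ids."""
    seen: dict[str, None] = {}  # dict preserves insertion order and deduplicates
    for strategy in strategies:
        for attempt in range(retries + 1):
            try:
                for dataset_id in strategy(query):
                    seen.setdefault(dataset_id, None)
                break  # this strategy succeeded; move on to the next one
            except Exception:
                if attempt == retries:
                    break  # give up on this strategy, keep results from the others
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return list(seen)
```

The design choice worth noting is that one failing strategy does not abort discovery: results from the remaining strategies are still returned, which is what makes the sourcing step resilient.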

🛠️ Internal Workflow

Pipeline State Machine

The following diagram shows how AutoHF orchestrates the pipeline from user input to a trained model, including retry logic and ranking selection:

graph TD
    User([User Input: 'sentiment analysis']) --> CLI[CLI / Python API]
    CLI --> Orchestrator[AutoHF Orchestrator]
    
    subgraph "Autonomous Pipeline (LangGraph-inspired States)"
        Orchestrator --> State1[IDLE]
        State1 --> State2[DETECTING_TASK]
        State2 --> TaskAgent[TaskAgent: keyword / OpenAI router]
        TaskAgent --> State3[SEARCHING_DATASETS]
        State3 --> DatasetAgent[DatasetAgent: 3-strategy HF Hub search]
        DatasetAgent --> State4[RANKING_DATASETS]
        State4 --> RankerDecision{Ranker?}
        RankerDecision -->|default| DatasetRanker[DatasetRanker: keyword composite scoring]
        RankerDecision -->|search extras| SemanticRanker[SemanticRanker: vector + Cross-Encoder]
        DatasetRanker --> State5[LOADING_DATASET]
        SemanticRanker --> State5
        State5 --> LoadRetry{Load OK?}
        LoadRetry -->|No| DatasetAgent
        LoadRetry -->|Yes| State6[PROFILING_DATASET]
        State6 --> Profile[profile_dataset: stats + samples]
        Profile --> State7[TRAINING]
        State7 --> Trainer[AutoGluonTrainer: TabularPredictor.fit]
    end
    
    Trainer --> State8[EVALUATING]
    State8 --> State9[COMPLETED]
    State9 --> Result[TrainResult: model + metrics + paths]
    Result --> User
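
The state names in the diagram above suggest a typed state machine along the following lines. This is a minimal sketch, not the contents of `core/config.py`: the state names come from the diagram, but the `TRANSITIONS` table, the `Pipeline` class, and the choice to model a failed load as a step back to `SEARCHING_DATASETS` are assumptions made for illustration.

```python
from enum import Enum, auto


class PipelineState(Enum):
    IDLE = auto()
    DETECTING_TASK = auto()
    SEARCHING_DATASETS = auto()
    RANKING_DATASETS = auto()
    LOADING_DATASET = auto()
    PROFILING_DATASET = auto()
    TRAINING = auto()
    EVALUATING = auto()
    COMPLETED = auto()


S = PipelineState

# Allowed transitions (assumed from the diagram).
TRANSITIONS = {
    S.IDLE: {S.DETECTING_TASK},
    S.DETECTING_TASK: {S.SEARCHING_DATASETS},
    S.SEARCHING_DATASETS: {S.RANKING_DATASETS},
    S.RANKING_DATASETS: {S.LOADING_DATASET},
    # A failed load loops back to dataset search (the "Load OK? -> No" edge).
    S.LOADING_DATASET: {S.PROFILING_DATASET, S.SEARCHING_DATASETS},
    S.PROFILING_DATASET: {S.TRAINING},
    S.TRAINING: {S.EVALUATING},
    S.EVALUATING: {S.COMPLETED},
    S.COMPLETED: set(),
}


class Pipeline:
    """Tracks the current state and rejects transitions the table does not allow."""

    def __init__(self) -> None:
        self.state = S.IDLE
        self.history = [S.IDLE]

    def advance(self, target: PipelineState) -> None:
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {target.name}")
        self.state = target
        self.history.append(target)
```

Keeping the transition table as data makes the retry edge explicit and lets an orchestrator fail fast on an out-of-order step instead of silently skipping a stage.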

Class / Module Dependency Diagram

classDiagram
    class AutoHF {
        -config: AutoHFConfig
        -task_agent: TaskAgent
        -dataset_agent: DatasetAgent
        +train(task_description) TrainResult
        +search(task_description) list[DatasetCandidate]
    }
    
    class AutoHFConfig {
        +preset: Preset
        +time_limit: int
        +max_rows: int
        +problem_type: ProblemType
    }
    
    class TaskAgent {
        +detect_task(description) TaskInfo
        +list_supported_tasks()
    }
    
    class DatasetAgent {
        +find_datasets(task_type, keywords) list[DatasetCandidate]
        +load(dataset_id, config) DataFrame + cols
        +profile_dataset(df, text_col, label_col) DatasetProfile
    }
    
    class DatasetRanker {
        +rank_datasets(candidates, keywords) list[DatasetCandidate]
    }
    
    class SemanticRanker {
        +rank(candidates, problem_statement, keywords) list[DatasetCandidate]
    }
    
    class AutoGluonTrainer {
        +train_model(df, config, label) TrainResult
        +load_predictor(path) TabularPredictor
        +predict(predictor, df) Series
    }
    
    class TrainResult {
        +best_model_name: str
        +metrics: dict
        +model_path: str
        +leaderboard: DataFrame
    }
    
    class DatasetCandidate {
        +id: str
        +description: str
        +downloads: int
        +likes: int
        +tags: list[str]
        +score: float
    }

    CLI --> AutoHF : Uses
    AutoHF --> AutoHFConfig : Configures
    AutoHF --> TaskAgent : Orchestrates
    AutoHF --> DatasetAgent : Orchestrates
    AutoHF --> DatasetRanker : Uses
    AutoHF --> SemanticRanker : Uses [optional]
    AutoHF --> AutoGluonTrainer : Triggers
    AutoHF --> TrainResult : Returns
    DatasetAgent --> DatasetCandidate : Produces
    DatasetRanker --> DatasetCandidate : Ranks
    SemanticRanker --> DatasetCandidate : Ranks
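
To make the relationship between `DatasetCandidate` and the default ranker more concrete, here is a hedged sketch of keyword-based composite scoring. The field names follow the class diagram, but the scoring formula, the weights, and the function body are illustrative assumptions; the README does not state how DatasetRanker actually combines its signals.

```python
import math
from dataclasses import dataclass, field


@dataclass
class DatasetCandidate:
    # Fields mirror the class diagram; the defaults are assumptions.
    id: str
    description: str = ""
    downloads: int = 0
    likes: int = 0
    tags: list[str] = field(default_factory=list)
    score: float = 0.0


def rank_datasets(
    candidates: list[DatasetCandidate], keywords: list[str]
) -> list[DatasetCandidate]:
    """Score each candidate and return them sorted best-first."""
    wanted = {k.lower() for k in keywords}
    for c in candidates:
        text = f"{c.id} {c.description} {' '.join(c.tags)}".lower()
        # Fraction of query keywords that appear anywhere in the metadata.
        relevance = sum(1 for k in wanted if k in text) / max(len(wanted), 1)
        # Log scaling keeps very popular datasets from dominating the score.
        popularity = math.log1p(c.downloads) + math.log1p(c.likes)
        # Metadata completeness: has a description, has tags.
        completeness = (bool(c.description) + bool(c.tags)) / 2
        # Hypothetical weights chosen only for this example.
        c.score = 3.0 * relevance + 0.1 * popularity + 0.5 * completeness
    return sorted(candidates, key=lambda c: c.score, reverse=True)
```

With these example weights, a modestly popular dataset whose metadata matches the query outranks a far more popular but unrelated one, which is the behavior a relevance-first ranker is meant to have.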

Installation

# Basic installation
pip install autohf

# With training support (recommended)
pip install "autohf[train]"

CLI Usage

Train a model with a single command:

# Quick prototype
autohf train "sentiment analysis"

# Higher quality training
autohf train "spam detection" --preset high_quality

# Just search for datasets
autohf search "question answering" --models

Python API

from autohf import AutoHF

# Initialize and train
hf = AutoHF.from_preset("medium_quality")
result = hf.train("customer review classification")

# Access results
print(f"Best model: {result.best_model_name}")
print(f"Accuracy: {result.metrics['accuracy']}")
print(f"Model saved at: {result.model_path}")

📋 Presets

AutoHF provides several presets inspired by AutoGluon to balance speed and quality:

| Preset                  | Time Limit | Focus                                      |
|-------------------------|------------|--------------------------------------------|
| quick_prototype         | 60s        | Fast iteration, small datasets             |
| medium_quality          | 300s       | Default: good balance of speed and quality |
| high_quality            | 600s       | Better results, longer training            |
| best_quality            | 3600s      | Maximum performance                        |
| optimize_for_deployment | 300s       | Small model size, fast inference           |
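
One way to picture how a preset name resolves to a time limit is a simple lookup table. The preset names and limits below are taken from the table above; the `PresetConfig` class and `resolve_preset` helper are hypothetical and are not part of the documented AutoHF API.

```python
from dataclasses import dataclass

# Time limits in seconds, copied from the presets table.
PRESET_TIME_LIMITS = {
    "quick_prototype": 60,
    "medium_quality": 300,
    "high_quality": 600,
    "best_quality": 3600,
    "optimize_for_deployment": 300,
}


@dataclass(frozen=True)
class PresetConfig:
    preset: str
    time_limit: int  # seconds


def resolve_preset(name: str = "medium_quality") -> PresetConfig:
    """Look up a preset by name, failing loudly on an unknown one."""
    try:
        return PresetConfig(name, PRESET_TIME_LIMITS[name])
    except KeyError:
        valid = ", ".join(sorted(PRESET_TIME_LIMITS))
        raise ValueError(f"unknown preset {name!r}; expected one of: {valid}") from None
```

Rejecting unknown names up front avoids silently training with a default budget when a preset is misspelled.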

🗺️ Project Roadmap

Here is the planned development roadmap for AutoHF. Contributions and suggestions are welcome!

Phase 1: Core Pipeline (Completed / In Progress)

  • Intent-to-Task detection with keyword extraction
  • Autonomous Hugging Face dataset search with multi-strategy discovery
  • Intelligent dataset ranking (downloads, likes, metadata)
  • AutoGluon-based automated training integration
  • CLI and Python API entry points
  • Configuration presets (quick/medium/high/best quality)
  • Agentic architecture with TaskAgent, DatasetAgent, and DatasetRanker

Phase 2: Enhanced Model Hub

  • Support for custom model fine-tuning (beyond AutoGluon tabular models)
  • Integration with Hugging Face Model Hub for downloading pre-trained models
  • Multi-modal support (image, audio, text classification)
  • Model versioning and experiment tracking

Phase 3: Advanced Dataset Management

  • Dataset quality validation (missing values, class imbalance detection)
  • Automatic dataset cleaning and preprocessing recommendations
  • Train/validation/test split optimization
  • Dataset caching and local mirror support

Phase 4: Deployment & Serving

  • Model export to ONNX, TorchScript, and CoreML formats
  • REST API serving with FastAPI
  • Docker containerization for easy deployment
  • Batch prediction pipelines

Phase 5: Observability & Collaboration

  • Training metrics dashboard
  • Pipeline execution logs and audit trails
  • Team collaboration features (shared datasets, model registry)
  • CI/CD integration for model retraining

Phase 6: Enterprise Features

  • Private Hugging Face Hub / AWS S3 / Azure Blob Storage support
  • Role-based access control (RBAC)
  • Scalable distributed training support
  • Compliance and governance tooling

📜 License

MIT License. See LICENSE for details.


🤖 Auto-Push Scripts

AutoHF includes scripts for automated git pushing:

PowerShell (Windows)

.\git-auto-push.ps1 "Your commit message"
.\git-auto-push.ps1 "Your commit message" -Push:$false  # Skip push

Batch (Windows)

git-auto-push.bat "Your commit message"
git-auto-push.bat "Your commit message" nopush  # Skip push

Shell/Bash (Linux/macOS/WSL)

./git-auto-push.sh "Your commit message"
./git-auto-push.sh "Your commit message" nopush  # Skip push

These scripts automatically:

  1. Stage all changes (git add -A)
  2. Check for changes
  3. Commit with your message
  4. Push to the remote repository
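
The four steps can be expressed in Python as follows. The scripts themselves are not reproduced in this README, so this is a sketch of the same sequence rather than a port of them: `auto_push` and its injectable `run` parameter are hypothetical, and using `git status --porcelain` for the change check is an assumption.

```python
import subprocess
from typing import Callable, Sequence


def _run(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    # check=True stops the sequence as soon as any git command fails.
    return subprocess.run(cmd, capture_output=True, text=True, check=True)


def auto_push(message: str, push: bool = True, run: Callable = _run) -> bool:
    """Stage, commit, and optionally push. Returns False if there was nothing to commit."""
    run(["git", "add", "-A"])                       # 1. stage all changes
    status = run(["git", "status", "--porcelain"])  # 2. check for changes
    if not status.stdout.strip():
        return False
    run(["git", "commit", "-m", message])           # 3. commit with the message
    if push:
        run(["git", "push"])                        # 4. push to the remote
    return True
```

Passing `push=False` corresponds to the `nopush` and `-Push:$false` options shown above; the `run` parameter exists only so the sequence can be exercised without touching a real repository.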


Download files


Source Distribution

autohf-0.1.0.tar.gz (39.3 kB)

Uploaded Source

Built Distribution


autohf-0.1.0-py3-none-any.whl (40.0 kB)

Uploaded Python 3

File details

Details for the file autohf-0.1.0.tar.gz.

File metadata

  • Download URL: autohf-0.1.0.tar.gz
  • Upload date:
  • Size: 39.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for autohf-0.1.0.tar.gz
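| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | e84a0cdd74f13a069468b216286e0475ded8fad7db5efa6d46980bdfb89e5c3e |
| MD5         | 414ca8ee981524cff3f0ed483fc4fcb0                                 |
| BLAKE2b-256 | 481821c996c7101692c7b37d8aa06f9b353b17038de31d21528f9894a69565ba |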


Provenance

The following attestation bundles were made for autohf-0.1.0.tar.gz:

Publisher: publish.yml on teambugbusters00/automl-pipeine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file autohf-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: autohf-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 40.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for autohf-0.1.0-py3-none-any.whl
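| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 069fabf4933a19fb590fdff0bed00c3ee42e4e26807e001490d035d0dd5f537d |
| MD5         | ed7916652e6f1d93551acb8996e7b18e                                 |
| BLAKE2b-256 | 762dbb039e0df1fd2701ba48ae0e75be76e73da4238129dff74e3dbb0755fe69 |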


Provenance

The following attestation bundles were made for autohf-0.1.0-py3-none-any.whl:

Publisher: publish.yml on teambugbusters00/automl-pipeine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
