One-line AutoML: from idea to trained model using Hugging Face + AutoGluon
Project description
AutoHF
One-line AutoML: from idea to trained model using Hugging Face + AutoGluon.
AutoHF is an autonomous machine learning pipeline that takes a natural language description of a task (e.g., "sentiment analysis") and automatically finds the best datasets on Hugging Face, ranks them by quality, and trains a state-of-the-art model using AutoGluon.
Features
- Intent-to-Task: Automatically detects ML task types (classification, regression, etc.) and keywords from natural language.
- Autonomous Dataset Discovery: Searches the Hugging Face Hub for relevant datasets using multi-strategy search.
- Intelligent Ranking: Ranks datasets based on quality signals like downloads, likes, and metadata completeness.
- Automated Training: Leverages AutoGluon to train high-quality models with minimal configuration.
- Agentic Architecture: Inspired by patterns from AutoGen, LangGraph, and OpenHands for robust state management and collaboration.
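To make the ranking idea concrete, here is a minimal sketch that scores candidates from downloads, likes, and a metadata-completeness proxy. The weights, the log scaling, and the `completeness` helper are assumptions chosen for illustration; they are not AutoHF's actual `DatasetRanker` formula. The `DatasetCandidate` field names follow the class diagram in this README.

```python
import math
from dataclasses import dataclass, field


@dataclass
class DatasetCandidate:
    # Field names follow the README's class diagram; the defaults are assumptions.
    id: str
    description: str = ""
    downloads: int = 0
    likes: int = 0
    tags: list[str] = field(default_factory=list)
    score: float = 0.0


def completeness(c: DatasetCandidate) -> float:
    """Hypothetical metadata-completeness proxy: share of filled-in fields."""
    return (bool(c.description.strip()) + bool(c.tags)) / 2


def rank_datasets(candidates: list[DatasetCandidate]) -> list[DatasetCandidate]:
    """Score each candidate and return them best-first (illustrative weights)."""
    for c in candidates:
        popularity = math.log1p(c.downloads)  # dampen very large download counts
        endorsement = math.log1p(c.likes)
        c.score = 0.5 * popularity + 0.3 * endorsement + 2.0 * completeness(c)
    return sorted(candidates, key=lambda c: c.score, reverse=True)


candidates = [
    DatasetCandidate("user/sparse", downloads=50),
    DatasetCandidate("org/popular", "Movie review sentiment", 100_000, 300, ["sentiment"]),
]
ranked = rank_datasets(candidates)
print([c.id for c in ranked])
```

Log scaling keeps a single very popular dataset from dominating the score, so a well-documented but less downloaded dataset can still rank competitively.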
Internal Workflow
The following diagram shows how AutoHF orchestrates the pipeline from user input to a trained model:
graph TD
User([User Input: 'sentiment analysis']) --> CLI[CLI / Python API]
CLI --> Orchestrator[AutoHF Orchestrator]
subgraph "Autonomous Pipeline (LangGraph-inspired States)"
Orchestrator --> State1[Detecting Task]
State1 --> TaskAgent[TaskAgent: Detects task type & keywords]
TaskAgent --> State2[Searching Datasets]
State2 --> DatasetAgent[DatasetAgent: Searches HF Hub]
DatasetAgent --> State3[Ranking Datasets]
State3 --> Ranker[DatasetRanker: Ranks by quality signals]
Ranker --> State4[Loading & Profiling]
State4 --> Loader[DatasetAgent: Loads best candidate & profiles]
Loader --> State5[Training]
State5 --> Trainer[AutoGluonTrainer: Trains & Optimizes]
end
Trainer --> State6[Completed]
State6 --> Result[TrainResult: Model + Metrics]
Result --> User
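The task-detection step at the start of the workflow can be sketched with a keyword lookup plus a fuzzy fallback for typos. The keyword table and the `detect_task` return shape below are hypothetical; AutoHF's real `TaskAgent` mapping is not shown in this README, and the standard-library `difflib` is used here only as a stand-in for whatever fuzzy matching the project applies.

```python
import difflib

# Hypothetical keyword table; AutoHF's real mapping is not shown in this README.
TASK_KEYWORDS = {
    "sentiment": "classification",
    "spam": "classification",
    "classification": "classification",
    "price": "regression",
    "forecast": "regression",
    "regression": "regression",
}


def detect_task(description: str) -> tuple[str, list[str]]:
    """Return (task_type, matched_keywords) for a free-text description."""
    words = description.lower().split()
    matched = [w for w in words if w in TASK_KEYWORDS]
    if not matched:
        # Fuzzy fallback: tolerate small typos such as "sentimant".
        for w in words:
            close = difflib.get_close_matches(w, list(TASK_KEYWORDS), n=1, cutoff=0.8)
            matched.extend(close)
    if not matched:
        return "unknown", []
    return TASK_KEYWORDS[matched[0]], matched


print(detect_task("sentiment analysis"))
print(detect_task("sentimant analysis"))
```

Exact matches are tried first so that the fuzzy pass only runs when nothing matches, which keeps the common case cheap and predictable.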
Project Structure
AutoHF follows a modular, layered architecture organized into five core packages:
autohf/
├── __init__.py               # Public API exports (AutoHF, AutoHFConfig, TrainResult, etc.)
├── cli/
│   ├── __init__.py
│   └── main.py               # Typer CLI: train, search, info subcommands
├── core/
│   ├── __init__.py
│   ├── config.py             # Data models, presets, enums (PipelineState, TrainResult, DatasetCandidate...)
│   └── autohf.py             # AutoHF orchestrator: central state-machine coordinator
├── agents/
│   ├── __init__.py
│   ├── task_agent.py         # Intent-to-task detection (keyword / OpenAI router)
│   ├── dataset_agent.py      # Dataset discovery, loading, profiling (3-strategy HF Hub search)
│   └── model_agent.py        # Model search agent (Phase 2 preparation)
├── ranking/
│   ├── __init__.py
│   ├── dataset_ranker.py     # Keyword-based composite scoring (default)
│   ├── semantic_ranker.py    # Vector + Cross-Encoder semantic ranking (optional dep)
│   └── model_ranker.py       # Model ranking stub (Phase 2)
└── training/
    ├── __init__.py
    └── autogluon_trainer.py  # AutoGluon TabularPredictor wrapper (fit, eval, predict)
tests/
├── test_config.py            # Config defaults & preset validation
└── test_task_agent.py        # Keyword detection, fuzzy fallback, history
pyproject.toml                # Build config, dependencies, CLI entry point, lint/test settings
README.md                     # This file
Module Responsibilities
| Package | Responsibility | Key Classes / Functions |
|---|---|---|
| `core` | Configuration, data models, orchestration | AutoHFConfig, PipelineState, AutoHF |
| `agents` | External interaction: task detection, dataset/model discovery | TaskAgent, DatasetAgent, ModelAgent |
| `ranking` | Relevance & quality scoring for datasets and models | DatasetRanker, SemanticRanker, rank_models |
| `training` | Model training, evaluation, and inference | train_model, load_predictor, predict |
| `cli` | User-facing command-line interface | train, search, info |
Architecture & Patterns
AutoHF is built using modern software engineering patterns for AI:
- State Management: Uses a typed state machine (`PipelineState`) inspired by LangGraph to track progress and handle transitions through the pipeline.
- Agent Collaboration: Employs specialized agents (`TaskAgent`, `DatasetAgent`, `ModelAgent`) similar to AutoGen to separate concerns and enable independent extensibility.
- Autonomous Execution: Implements retry logic and multi-strategy discovery patterns found in OpenHands for resilient dataset sourcing.
- Tabular Power: Uses AutoGluon as the underlying engine for robust, automated model selection and hyperparameter tuning.
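The retry and multi-strategy pattern named in the list above can be sketched as follows. This is an illustration under stated assumptions, not AutoHF's implementation: `search_with_fallback`, its parameters, and the two placeholder strategies are hypothetical names, and the real strategies would query the Hugging Face Hub rather than return canned values.

```python
import time
from typing import Callable

SearchStrategy = Callable[[str], list[str]]


def search_with_fallback(
    query: str,
    strategies: list[SearchStrategy],
    retries: int = 2,
    backoff_s: float = 0.0,
) -> list[str]:
    """Try each strategy in order; retry a failing strategy before moving on."""
    for strategy in strategies:
        for attempt in range(retries + 1):
            try:
                results = strategy(query)
            except Exception:  # broad on purpose: any failure triggers a retry
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
                continue
            if results:
                return results
            break  # strategy ran but found nothing: try the next one
    return []


# Placeholder strategies standing in for real Hub searches.
def exact_match(query: str) -> list[str]:
    raise ConnectionError("simulated transient failure")


def keyword_match(query: str) -> list[str]:
    return [f"demo/{query.replace(' ', '-')}"]


print(search_with_fallback("sentiment analysis", [exact_match, keyword_match]))
```

Separating "the strategy failed" (retry) from "the strategy found nothing" (move on) avoids wasting retries on queries that simply have no matches.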
Internal Workflow
Pipeline State Machine
The following diagram shows how AutoHF orchestrates the pipeline from user input to a trained model, including retry logic and ranking selection:
graph TD
User([User Input: 'sentiment analysis']) --> CLI[CLI / Python API]
CLI --> Orchestrator[AutoHF Orchestrator]
subgraph "Autonomous Pipeline (LangGraph-inspired States)"
Orchestrator --> State1[IDLE]
State1 --> State2[DETECTING_TASK]
State2 --> TaskAgent[TaskAgent: keyword / OpenAI router]
TaskAgent --> State3[SEARCHING_DATASETS]
State3 --> DatasetAgent[DatasetAgent: 3-strategy HF Hub search]
DatasetAgent --> State4[RANKING_DATASETS]
State4 --> RankerDecision{Ranker?}
RankerDecision -->|default| DatasetRanker[DatasetRanker: keyword composite scoring]
RankerDecision -->|search extras| SemanticRanker[SemanticRanker: vector + Cross-Encoder]
DatasetRanker --> State5[LOADING_DATASET]
SemanticRanker --> State5
State5 --> LoadRetry{Load OK?}
LoadRetry -->|No| DatasetAgent
LoadRetry -->|Yes| State6[PROFILING_DATASET]
State6 --> Profile[profile_dataset: stats + samples]
Profile --> State7[TRAINING]
State7 --> Trainer[AutoGluonTrainer: TabularPredictor.fit]
end
Trainer --> State8[EVALUATING]
State8 --> State9[COMPLETED]
State9 --> Result[TrainResult: model + metrics + paths]
Result --> User
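The states in the diagram above map naturally onto an enum with an explicit transition table. The sketch below uses the state names from the diagram; the `TRANSITIONS` table and the `advance` helper are assumptions for illustration, and the retry edge is modelled as a step from `LOADING_DATASET` back to `SEARCHING_DATASETS`, which only approximates the diagram's edge back to the dataset agent.

```python
from enum import Enum, auto


class PipelineState(Enum):
    # State names are taken from the pipeline diagram in this README.
    IDLE = auto()
    DETECTING_TASK = auto()
    SEARCHING_DATASETS = auto()
    RANKING_DATASETS = auto()
    LOADING_DATASET = auto()
    PROFILING_DATASET = auto()
    TRAINING = auto()
    EVALUATING = auto()
    COMPLETED = auto()


S = PipelineState

# Allowed transitions (an assumption), including the retry edge after a failed load.
TRANSITIONS: dict[PipelineState, set[PipelineState]] = {
    S.IDLE: {S.DETECTING_TASK},
    S.DETECTING_TASK: {S.SEARCHING_DATASETS},
    S.SEARCHING_DATASETS: {S.RANKING_DATASETS},
    S.RANKING_DATASETS: {S.LOADING_DATASET},
    S.LOADING_DATASET: {S.PROFILING_DATASET, S.SEARCHING_DATASETS},
    S.PROFILING_DATASET: {S.TRAINING},
    S.TRAINING: {S.EVALUATING},
    S.EVALUATING: {S.COMPLETED},
    S.COMPLETED: set(),
}


def advance(current: PipelineState, target: PipelineState) -> PipelineState:
    """Return the target state, or raise if the transition is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.name} -> {target.name}")
    return target


state = S.IDLE
for nxt in (S.DETECTING_TASK, S.SEARCHING_DATASETS, S.RANKING_DATASETS, S.LOADING_DATASET):
    state = advance(state, nxt)
print(state.name)
```

Keeping the transitions in one table makes illegal jumps (for example, from `IDLE` straight to `TRAINING`) fail loudly instead of silently skipping stages.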
Class / Module Dependency Diagram
classDiagram
class AutoHF {
-config: AutoHFConfig
-task_agent: TaskAgent
-dataset_agent: DatasetAgent
+train(task_description) TrainResult
+search(task_description) list[DatasetCandidate]
}
class AutoHFConfig {
+preset: Preset
+time_limit: int
+max_rows: int
+problem_type: ProblemType
}
class TaskAgent {
+detect_task(description) TaskInfo
+list_supported_tasks()
}
class DatasetAgent {
+find_datasets(task_type, keywords) list[DatasetCandidate]
+load(dataset_id, config) DataFrame + cols
+profile_dataset(df, text_col, label_col) DatasetProfile
}
class DatasetRanker {
+rank_datasets(candidates, keywords) list[DatasetCandidate]
}
class SemanticRanker {
+rank(candidates, problem_statement, keywords) list[DatasetCandidate]
}
class AutoGluonTrainer {
+train_model(df, config, label) TrainResult
+load_predictor(path) TabularPredictor
+predict(predictor, df) Series
}
class TrainResult {
+best_model_name: str
+metrics: dict
+model_path: str
+leaderboard: DataFrame
}
class DatasetCandidate {
+id: str
+description: str
+downloads: int
+likes: int
+tags: list[str]
+score: float
}
CLI --> AutoHF : Uses
AutoHF --> AutoHFConfig : Configures
AutoHF --> TaskAgent : Orchestrates
AutoHF --> DatasetAgent : Orchestrates
AutoHF --> DatasetRanker : Uses
AutoHF --> SemanticRanker : Uses [optional]
AutoHF --> AutoGluonTrainer : Triggers
AutoHF --> TrainResult : Returns
DatasetAgent --> DatasetCandidate : Produces
DatasetRanker --> DatasetCandidate : Ranks
SemanticRanker --> DatasetCandidate : Ranks
Installation
# Basic installation
pip install autohf
# With training support (recommended)
pip install "autohf[train]"
CLI Usage
Train a model with a single command:
# Quick prototype
autohf train "sentiment analysis"
# Higher quality training
autohf train "spam detection" --preset high_quality
# Just search for datasets
autohf search "question answering" --models
Python API
from autohf import AutoHF
# Initialize and train
hf = AutoHF.from_preset("medium_quality")
result = hf.train("customer review classification")
# Access results
print(f"Best model: {result.best_model_name}")
print(f"Accuracy: {result.metrics['accuracy']}")
print(f"Model saved at: {result.model_path}")
Presets
AutoHF provides several presets inspired by AutoGluon to balance speed and quality:
| Preset | Time Limit | Focus |
|---|---|---|
| `quick_prototype` | 60s | Fast iteration, small datasets |
| `medium_quality` | 300s | Default - Good balance of speed/quality |
| `high_quality` | 600s | Better results, longer training |
| `best_quality` | 3600s | Maximum performance |
| `optimize_for_deployment` | 300s | Small model size, fast inference |
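For readers who want to reason about presets programmatically, the table can be expressed as a simple lookup. The preset names and time limits come from the table; `PresetSpec`, `PRESETS`, and `resolve_preset` are hypothetical names used only for this sketch and are not part of AutoHF's documented API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PresetSpec:
    time_limit_s: int  # training time budget in seconds
    focus: str


# Values copied from the presets table; the structure itself is an assumption.
PRESETS = {
    "quick_prototype": PresetSpec(60, "Fast iteration, small datasets"),
    "medium_quality": PresetSpec(300, "Good balance of speed/quality"),
    "high_quality": PresetSpec(600, "Better results, longer training"),
    "best_quality": PresetSpec(3600, "Maximum performance"),
    "optimize_for_deployment": PresetSpec(300, "Small model size, fast inference"),
}


def resolve_preset(name: str) -> PresetSpec:
    """Look up a preset by name, with a helpful error for typos."""
    try:
        return PRESETS[name]
    except KeyError:
        raise ValueError(f"Unknown preset {name!r}; choose from {sorted(PRESETS)}") from None


print(resolve_preset("high_quality").time_limit_s)
```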
Project Roadmap
Here is the planned development roadmap for AutoHF. Contributions and suggestions are welcome!
Phase 1: Core Pipeline (Completed / In Progress)
- Intent-to-Task detection with keyword extraction
- Autonomous Hugging Face dataset search with multi-strategy discovery
- Intelligent dataset ranking (downloads, likes, metadata)
- AutoGluon-based automated training integration
- CLI and Python API entry points
- Configuration presets (quick/medium/high/best quality)
- Agentic architecture with TaskAgent, DatasetAgent, and DatasetRanker
Phase 2: Enhanced Model Hub
- Support for custom model fine-tuning (beyond AutoGluon tabular models)
- Integration with Hugging Face Model Hub for downloading pre-trained models
- Multi-modal support (image, audio, text classification)
- Model versioning and experiment tracking
Phase 3: Advanced Dataset Management
- Dataset quality validation (missing values, class imbalance detection)
- Automatic dataset cleaning and preprocessing recommendations
- Train/validation/test split optimization
- Dataset caching and local mirror support
Phase 4: Deployment & Serving
- Model export to ONNX, TorchScript, and CoreML formats
- REST API serving with FastAPI
- Docker containerization for easy deployment
- Batch prediction pipelines
Phase 5: Observability & Collaboration
- Training metrics dashboard
- Pipeline execution logs and audit trails
- Team collaboration features (shared datasets, model registry)
- CI/CD integration for model retraining
Phase 6: Enterprise Features
- Private Hugging Face Hub / AWS S3 / Azure Blob Storage support
- Role-based access control (RBAC)
- Scalable distributed training support
- Compliance and governance tooling
License
MIT License. See LICENSE for details.
Auto-Push Scripts
AutoHF includes scripts for automated git pushing:
PowerShell (Windows)
.\git-auto-push.ps1 "Your commit message"
.\git-auto-push.ps1 "Your commit message" -Push:$false # Skip push
Batch (Windows)
git-auto-push.bat "Your commit message"
git-auto-push.bat "Your commit message" nopush # Skip push
Shell/Bash (Linux/macOS/WSL)
./git-auto-push.sh "Your commit message"
./git-auto-push.sh "Your commit message" nopush # Skip push
These scripts automatically:
- Stage all changes (`git add -A`)
- Check for changes
- Commit with your message
- Push to the remote repository
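The four steps listed above can be sketched in Python with the standard `subprocess` module. This is not the content of the bundled scripts, which are not shown here; `plan_commands` and `auto_push` are hypothetical names, and using `git status --porcelain` to detect changes is an assumption about how the check could be done.

```python
import subprocess


def plan_commands(message: str, push: bool = True) -> list[list[str]]:
    """Build the git commands to run once changes are confirmed."""
    commands = [["git", "commit", "-m", message]]
    if push:
        commands.append(["git", "push"])
    return commands


def auto_push(message: str, push: bool = True) -> bool:
    """Stage, check, commit, and optionally push. Returns False if nothing changed."""
    subprocess.run(["git", "add", "-A"], check=True)
    status = subprocess.run(
        ["git", "status", "--porcelain"], check=True, capture_output=True, text=True
    )
    if not status.stdout.strip():
        return False  # nothing to commit, so skip commit and push
    for cmd in plan_commands(message, push):
        subprocess.run(cmd, check=True)
    return True


print(plan_commands("Your commit message", push=False))
```

Calling `auto_push("Your commit message")` inside a git working tree would run the full sequence; passing `push=False` mirrors the skip-push option of the scripts.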