
A unified FastAPI server for Hugging Face model inference across 31+ tasks

Project description

HF Inference

A small FastAPI server and CLI that make trying Hugging Face models consistent. One endpoint. Many tasks. Fewer surprises.

🚧 Status: HEAVY DEVELOPMENT – not stable yet

⚠️ Security disclaimer: by default this project loads models with transformers trust_remote_code=True in multiple runners/utilities. Remote model repos may execute arbitrary Python when loading. Do not run untrusted models in production. Prefer sandboxing (containers/VMs), pin models, audit code, and isolate credentials/network.

Quick links

  • Install
  • Quick start
  • Docker
  • API overview
  • Supported tasks
  • Examples
  • Testing (read this)
  • Security notes
  • Performance notes
  • Development
  • Contributing

Why this exists 🧭

Trying different HF models is great until each one “speaks” a slightly different API dialect. This project gives you:

  • One simple REST endpoint for 31+ tasks
  • Consistent request/response shapes across models
  • A quick model catalog to discover candidates

In short: less boilerplate, fewer notebook tabs, more experiments per minute.

What you get 🧰

  • Single POST /inference endpoint that handles text, image, audio, and video tasks
  • 31+ tasks across text, vision, audio, and multimodal
  • Minimal model catalog API and a lightweight HTML UI for discovery
  • CLI: hf-inference to start the server quickly

Installation

Prereqs (recommended):

  • Python 3.12+
  • ffmpeg (video)
  • tesseract-ocr, libtesseract-dev, libleptonica-dev (OCR tasks)

Install from PyPI:

pip install hf-inference

Note: this package pulls in large dependencies (PyTorch, Transformers, Diffusers). Expect hefty downloads and build times.

Quick start 🚀

# Start the server (default 0.0.0.0:8000):
hf-inference

# From a repo checkout, use the dev task instead:
# uv run poe dev
# Health check:
curl http://localhost:8000/healthz

# First request (text generation):
curl -X POST http://localhost:8000/inference \
  -F 'spec={"model_id":"gpt2","task":"text-generation","payload":{"prompt":"Hello HF"}}'

Docker 🐳

# Pull image (GPU):
docker pull ghcr.io/megazord-studio/hf-inference:gpu-latest

# Run (GPU, NVIDIA runtime required):
docker run --rm -it --gpus all -p 8000:8000 \
  ghcr.io/megazord-studio/hf-inference:gpu-latest \
  hf-inference --host 0.0.0.0 --port 8000

# Persist model/cache data (recommended):
docker run --rm -it --gpus all -p 8000:8000 \
  -v hf-cache:/root/.cache/huggingface \
  ghcr.io/megazord-studio/hf-inference:gpu-latest \
  hf-inference --host 0.0.0.0 --port 8000

Notes:

  • First run may download large models inside the container; use a volume to avoid redownloading. 🗂️
  • A CPU-only image may be published separately; the GPU image requires the NVIDIA Container Toolkit.
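
Gated or private models need a Hub token inside the container. A sketch assuming the standard HF_TOKEN environment variable honored by huggingface_hub (the token value is a placeholder):

# Pass a Hub token for gated/private models:
docker run --rm -it --gpus all -p 8000:8000 \
  -e HF_TOKEN=hf_xxx \
  -v hf-cache:/root/.cache/huggingface \
  ghcr.io/megazord-studio/hf-inference:gpu-latest \
  hf-inference --host 0.0.0.0 --port 8000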

API overview 🔌

POST /inference

Multipart form accepting:

  • spec: JSON string
    • model_id: str (e.g., "gpt2" or "google/vit-base-patch16-224")
    • task: str (pipeline tag; see Supported tasks)
    • payload: object (task-specific kwargs)
  • image: optional file
  • audio: optional file
  • video: optional file

Responses:

  • JSON for textual results
  • Streaming file for binary outputs when applicable (with Content-Disposition)

Example specs:

  • Text generation: {"model_id":"gpt2","task":"text-generation","payload":{"prompt":"..."}}
  • Image classification: {"model_id":"google/vit-base-patch16-224","task":"image-classification","payload":{}}
  • ASR (speech): {"model_id":"openai/whisper-tiny","task":"automatic-speech-recognition","payload":{}}
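
Since payload carries task-specific kwargs, pipeline arguments can ride along in the spec. A sketch (max_new_tokens is a standard transformers generation kwarg; that the server forwards it untouched is an assumption):

# Passing extra kwargs via payload:
curl -X POST http://localhost:8000/inference \
  -F 'spec={"model_id":"gpt2","task":"text-generation","payload":{"prompt":"Hello","max_new_tokens":32}}'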

GET /healthz

  • Returns { status, device }

GET /models?task=...

  • Returns minimal metadata for public models of a task: id, likes, trendingScore, downloads, gated
  • If task is missing, returns available_tasks (the supported tasks on this server)
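
For example (response shapes as described above; jq is optional):

# Candidate models for one task:
curl -s "http://localhost:8000/models?task=text-generation"

# Tasks this server supports:
curl -s http://localhost:8000/models | jq '.available_tasks'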

GET /

  • Lightweight HTML table for quick sorting/filtering of a task’s models, backed by /models

Supported tasks (examples) 📋

  • Text: text-generation, text2text-generation, fill-mask, summarization, translation, question-answering, sentiment-analysis, token-classification
  • Vision: image-classification, object-detection, image-segmentation, image-to-text, image-to-image, mask-generation, zero-shot-image-classification, zero-shot-object-detection
  • Audio: audio-classification, automatic-speech-recognition, zero-shot-audio-classification, text-to-speech, text-to-audio
  • Multimodal: image-text-to-text, visual-question-answering, table-question-answering, document-question-answering, depth-estimation, video-classification

Tip: GET /models without a task returns the exact list your server supports.

Examples

# Text generation:
curl -X POST http://localhost:8000/inference \
  -F 'spec={"model_id":"gpt2","task":"text-generation","payload":{"prompt":"Hello world"}}'

# Image classification:
curl -X POST http://localhost:8000/inference \
  -F 'spec={"model_id":"google/vit-base-patch16-224","task":"image-classification","payload":{}}' \
  -F 'image=@/path/to/image.jpg'

# Speech recognition:
curl -X POST http://localhost:8000/inference \
  -F 'spec={"model_id":"openai/whisper-tiny","task":"automatic-speech-recognition","payload":{}}' \
  -F 'audio=@/path/to/audio.wav'
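
For binary outputs, curl can save the streamed file under the server-supplied Content-Disposition name via -OJ. A sketch; the TTS model and the payload key text are assumptions, so check your model's expected inputs:

# Text-to-speech (binary response saved to disk):
curl -X POST http://localhost:8000/inference \
  -F 'spec={"model_id":"facebook/mms-tts-eng","task":"text-to-speech","payload":{"text":"Hello there"}}' \
  -OJ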

Testing 🧪 (read before running)

Running the full pytest suite will download and execute many real, heavy models. Expect hundreds of GB of disk usage and around 32GB of VRAM for smooth runs. Your SSD will hear about it.

Tips:

  • Run a subset: uv run pytest -k "text_generation" (or "image_classification", etc.)
  • Faster downloads: set HF_HUB_ENABLE_HF_TRANSFER=1
  • Put the cache on a big disk: set HF_HOME or HUGGINGFACE_HUB_CACHE to a roomy path (see the sketch below)
  • Inspect/clean the cache: huggingface-cli scan-cache and huggingface-cli delete-cache
  • GPU memory: prefer -tiny/-base model variants if you hit OOM
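
A minimal shell sketch combining these tips (the path is an assumption; HF_HOME and HF_HUB_ENABLE_HF_TRANSFER are standard huggingface_hub environment variables, and the latter also needs the hf_transfer package installed):

# Big-disk cache, faster downloads, focused test run:
export HF_HOME=/mnt/bigdisk/huggingface
export HF_HUB_ENABLE_HF_TRANSFER=1
uv run pytest -k "text_generation"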

Note: some tests assume online access to Hugging Face Hub and may be slow on first run while weights download.

Configuration

CLI flags (also available via env):

  • --host (HF_INF_HOST): default 0.0.0.0
  • --port (HF_INF_PORT): default 8000
  • --reload: enable auto-reload for development
  • --log-level (HF_INF_LOG_LEVEL): default info
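
Putting the flags together, a sample invocation (log-level names are assumed to follow uvicorn's debug/info/warning/error):

# Bind to localhost on a custom port with verbose logging:
hf-inference --host 127.0.0.1 --port 9000 --log-level debug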

Dev server (repo):

  • uv run uvicorn app.main:app --reload

Security notes (important) ⚠️

This project currently defaults to trust_remote_code=True in several loaders/pipelines, across utilities and multiple runners. Treat model loading as code execution. Recommended:

  • Pin model revisions (commit hashes)
  • Audit model repositories before use
  • Run inside hardened containers/VMs with minimal privileges
  • Isolate network and secrets from the runtime process
  • Prefer official models from trusted orgs for production
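
A sketch of one way to apply several of these at once with stock Docker flags; --read-only may need extra --tmpfs mounts depending on what the runners write, so treat this as a starting point, not a guarantee:

# Drop capabilities, forbid privilege escalation, keep the filesystem read-only:
docker run --rm -it --gpus all -p 8000:8000 \
  --cap-drop ALL --security-opt no-new-privileges \
  --read-only --tmpfs /tmp \
  -v hf-cache:/root/.cache/huggingface \
  ghcr.io/megazord-studio/hf-inference:gpu-latest \
  hf-inference --host 0.0.0.0 --port 8000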

We plan to add a global toggle to disable trust_remote_code by default and allow explicit opt-in per request.

Performance notes ⚡

  • GPU recommended. CPU works but can be slow depending on the model.
  • VRAM matters; some models require 8–16GB+. Smaller “-tiny/-base” variants help.
  • Mixed precision often helps; some internal runners already opt into float16 where it’s safe.

Development

The repo uses uv for dependency management and poe (poethepoet) for tasks.

  • Install deps: uv sync
  • Dev extras: uv sync --extra dev

Poe tasks (run with uv run poe <task>):

  • test: run the test suite
    • uv run poe test
  • format: format+lint with ruff
    • uv run poe format
  • types: mypy type-checking
    • uv run poe types
  • dev: start the dev server with auto-reload
    • uv run poe dev
  • security: run safety and bandit
    • uv run poe security
  • complexity: check code complexity with radon
    • uv run poe complexity
  • deadcode: find unused code with vulture
    • uv run poe deadcode

Contributing

See CONTRIBUTING.md

Changelog

See CHANGELOG.md

License

GPL-3.0-only. See LICENSE.


If this project saves you from writing one more one-off preprocessing script for “just this model,” it’s already doing its job. A little less glue code; a lot more model poking. 😉

Download files

  • Source distribution: hf_inference-0.1.0.tar.gz (3.4 MB)
  • Built distribution: hf_inference-0.1.0-py3-none-any.whl (79.2 kB)

File details: hf_inference-0.1.0.tar.gz

  • Size: 3.4 MB
  • Uploaded via: uv/0.9.2 (Trusted Publishing: No)
  • SHA256: 8f8290be3e28d527aa243b4103a303796680caa5f6e428b1de4751db5cbbe59e
  • MD5: 7017014a2559679b787283262d4ac68d
  • BLAKE2b-256: e99f4952069b451ffa84f675e93a65453fcd92e0cc160d26ef6335da8ab91d67

File details: hf_inference-0.1.0-py3-none-any.whl

  • SHA256: 0f456bd502b07c0e5670eee238c7250f775d123751785e97f582dd3d5012cb4b
  • MD5: 0c4de0c2727641eeb4a69cd1d0e433cb
  • BLAKE2b-256: ee4e364df1ebc390f8c5292d77ccbdb3f06ff9e53e4c2c19081b2fff7bb9c04b
