
A unified FastAPI server for Hugging Face model inference across 31+ tasks

Project description

HF Inference

A small FastAPI server and CLI that make trying Hugging Face models consistent. One endpoint. Many tasks. Fewer surprises.

🚧 Status: HEAVY DEVELOPMENT – not stable yet

⚠️ Security disclaimer: by default this project loads models with transformers trust_remote_code=True in multiple runners/utilities. Remote model repos may execute arbitrary Python when loading. Do not run untrusted models in production. Prefer sandboxing (containers/VMs), pin models, audit code, and isolate credentials/network.

Quick links

  • Install
  • Quick start
  • Docker
  • API overview
  • Supported tasks
  • Examples
  • Testing (read this)
  • Security notes
  • Performance notes
  • Development
  • Contributing

Why this exists 🧭

Trying different HF models is great until each one “speaks” a slightly different API dialect. This project gives you:

  • One simple REST endpoint for 31+ tasks
  • Consistent request/response shapes across models
  • A quick model catalog to discover candidates

In short: less boilerplate, fewer notebook tabs, more experiments per minute.

What you get 🧰

  • Single POST /inference endpoint that handles text, image, audio, and video tasks
  • 31+ tasks across text, vision, audio, and multimodal
  • Minimal model catalog API and a lightweight HTML UI for discovery
  • CLI: hf-inference to start the server quickly

Installation

Prereqs (recommended):

  • Python 3.12+
  • ffmpeg (video)
  • tesseract-ocr, libtesseract-dev, libleptonica-dev (OCR tasks)

Install from PyPI:

pip install hf-inference

Note: this package pulls in large dependencies (PyTorch, Transformers, Diffusers). Expect hefty downloads and build times.

Quick start 🚀

# Start the server (default 0.0.0.0:8000):
hf-inference

# From a repo checkout, use the dev task instead:
# uv run poe dev
# Health check:
curl http://localhost:8000/healthz

# First request (text generation):
curl -X POST http://localhost:8000/inference \
  -F 'spec={"model_id":"gpt2","task":"text-generation","payload":{"prompt":"Hello HF"}}'

Docker 🐳

# Pull image (GPU):
docker pull ghcr.io/megazord-studio/hf-inference:gpu-latest

# Run (GPU, NVIDIA runtime required):
docker run --rm -it --gpus all -p 8000:8000 \
  ghcr.io/megazord-studio/hf-inference:gpu-latest \
  hf-inference --host 0.0.0.0 --port 8000

# Persist model/cache data (recommended):
docker run --rm -it --gpus all -p 8000:8000 \
  -v hf-cache:/root/.cache/huggingface \
  ghcr.io/megazord-studio/hf-inference:gpu-latest \
  hf-inference --host 0.0.0.0 --port 8000

Notes:

  • First run may download large models inside the container; use a volume to avoid redownloading. 🗂️
  • A CPU-only image may be published separately; the GPU image requires the NVIDIA Container Toolkit.
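
Gated or private models need a Hub token inside the container. A sketch assuming the standard HF_TOKEN environment variable honored by huggingface_hub (the token value is a placeholder):

# Pass a Hub token for gated/private models:
docker run --rm -it --gpus all -p 8000:8000 \
  -e HF_TOKEN=hf_xxx \
  -v hf-cache:/root/.cache/huggingface \
  ghcr.io/megazord-studio/hf-inference:gpu-latest \
  hf-inference --host 0.0.0.0 --port 8000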

API overview 🔌

POST /inference

Multipart form accepting:

  • spec: JSON string
    • model_id: str (e.g., "gpt2" or "google/vit-base-patch16-224")
    • task: str (pipeline tag; see Supported tasks)
    • payload: object (task-specific kwargs)
  • image: optional file
  • audio: optional file
  • video: optional file

Responses:

  • JSON for textual results
  • Streaming file for binary outputs when applicable (with Content-Disposition)

Example specs:

  • Text generation: {"model_id":"gpt2","task":"text-generation","payload":{"prompt":"..."}}
  • Image classification: {"model_id":"google/vit-base-patch16-224","task":"image-classification","payload":{}}
  • ASR (speech): {"model_id":"openai/whisper-tiny","task":"automatic-speech-recognition","payload":{}}
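
Since payload carries task-specific kwargs, pipeline arguments can ride along in the spec. A sketch (max_new_tokens is a standard transformers generation kwarg; that the server forwards it untouched is an assumption):

# Passing extra kwargs via payload:
curl -X POST http://localhost:8000/inference \
  -F 'spec={"model_id":"gpt2","task":"text-generation","payload":{"prompt":"Hello","max_new_tokens":32}}'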

GET /healthz

  • Returns { status, device }

GET /models?task=...

  • Returns minimal metadata for public models of a task: id, likes, trendingScore, downloads, gated
  • If task is missing, returns available_tasks (the supported tasks on this server)
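
For example (response shapes as described above; jq is optional):

# Candidate models for one task:
curl -s "http://localhost:8000/models?task=text-generation"

# Tasks this server supports:
curl -s http://localhost:8000/models | jq '.available_tasks'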

GET /

  • Lightweight HTML table for quick sorting/filtering of a task’s models, backed by /models

Supported tasks (examples) 📋

  • Text: text-generation, text2text-generation, fill-mask, summarization, translation, question-answering, sentiment-analysis, token-classification
  • Vision: image-classification, object-detection, image-segmentation, image-to-text, image-to-image, mask-generation, zero-shot-image-classification, zero-shot-object-detection
  • Audio: audio-classification, automatic-speech-recognition, zero-shot-audio-classification, text-to-speech, text-to-audio
  • Multimodal: image-text-to-text, visual-question-answering, table-question-answering, document-question-answering, depth-estimation, video-classification

Tip: GET /models without a task returns the exact list your server supports.

Examples

# Text generation:
curl -X POST http://localhost:8000/inference \
  -F 'spec={"model_id":"gpt2","task":"text-generation","payload":{"prompt":"Hello world"}}'

# Image classification:
curl -X POST http://localhost:8000/inference \
  -F 'spec={"model_id":"google/vit-base-patch16-224","task":"image-classification","payload":{}}' \
  -F 'image=@/path/to/image.jpg'

# Speech recognition:
curl -X POST http://localhost:8000/inference \
  -F 'spec={"model_id":"openai/whisper-tiny","task":"automatic-speech-recognition","payload":{}}' \
  -F 'audio=@/path/to/audio.wav'
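
For binary outputs, curl can save the streamed file under the server-supplied Content-Disposition name via -OJ. A sketch; the TTS model and the payload key text are assumptions, so check your model's expected inputs:

# Text-to-speech (binary response saved to disk):
curl -X POST http://localhost:8000/inference \
  -F 'spec={"model_id":"facebook/mms-tts-eng","task":"text-to-speech","payload":{"text":"Hello there"}}' \
  -OJ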

Testing 🧪 (read before running)

Running the full pytest suite will download and execute many real, heavy models. Expect hundreds of GB of disk usage and around 32GB of VRAM for smooth runs. Your SSD will hear about it.

Tips:

  • Run a subset: uv run pytest -k "text_generation" (or "image_classification", etc.)
  • Faster downloads: set HF_HUB_ENABLE_HF_TRANSFER=1
  • Put the cache on a big disk: set HF_HOME or HUGGINGFACE_HUB_CACHE to a roomy path (see the sketch below)
  • Inspect/clean the cache: huggingface-cli scan-cache and huggingface-cli delete-cache
  • GPU memory: prefer -tiny/-base model variants if you hit OOM
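
A minimal shell sketch combining these tips (the path is an assumption; HF_HOME and HF_HUB_ENABLE_HF_TRANSFER are standard huggingface_hub environment variables, and the latter also needs the hf_transfer package installed):

# Big-disk cache, faster downloads, focused test run:
export HF_HOME=/mnt/bigdisk/huggingface
export HF_HUB_ENABLE_HF_TRANSFER=1
uv run pytest -k "text_generation"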

Note: some tests assume online access to Hugging Face Hub and may be slow on first run while weights download.

Configuration

CLI flags (also available via env):

  • --host (HF_INF_HOST): default 0.0.0.0
  • --port (HF_INF_PORT): default 8000
  • --reload: enable auto-reload for development
  • --log-level (HF_INF_LOG_LEVEL): default info
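
Putting the flags together, a sample invocation (log-level names are assumed to follow uvicorn's debug/info/warning/error):

# Bind to localhost on a custom port with verbose logging:
hf-inference --host 127.0.0.1 --port 9000 --log-level debug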

Dev server (repo):

  • uv run uvicorn app.main:app --reload

Security notes (important) ⚠️

This project currently defaults to trust_remote_code=True in several loaders/pipelines, across utilities and multiple runners. Treat model loading as code execution. Recommended:

  • Pin model revisions (commit hashes)
  • Audit model repositories before use
  • Run inside hardened containers/VMs with minimal privileges
  • Isolate network and secrets from the runtime process
  • Prefer official models from trusted orgs for production
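
A sketch of one way to apply several of these at once with stock Docker flags; --read-only may need extra --tmpfs mounts depending on what the runners write, so treat this as a starting point, not a guarantee:

# Drop capabilities, forbid privilege escalation, keep the filesystem read-only:
docker run --rm -it --gpus all -p 8000:8000 \
  --cap-drop ALL --security-opt no-new-privileges \
  --read-only --tmpfs /tmp \
  -v hf-cache:/root/.cache/huggingface \
  ghcr.io/megazord-studio/hf-inference:gpu-latest \
  hf-inference --host 0.0.0.0 --port 8000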

We plan to add a global toggle to disable trust_remote_code by default and allow explicit opt-in per request.

Performance notes ⚡

  • GPU recommended. CPU works but can be slow depending on the model.
  • VRAM matters; some models require 8–16GB+. Smaller “-tiny/-base” variants help.
  • Mixed precision often helps; some internal runners already opt into float16 where it’s safe.

Development

The repo uses uv for dependency management and poe (poethepoet) for tasks.

  • Install deps: uv sync
  • Dev extras: uv sync --extra dev

Poe tasks (run with uv run poe <task>):

  • test: run the test suite
    • uv run poe test
  • format: format+lint with ruff
    • uv run poe format
  • types: mypy type-checking
    • uv run poe types
  • dev: start the dev server with auto-reload
    • uv run poe dev
  • security: run safety and bandit
    • uv run poe security
  • complexity: check code complexity with radon
    • uv run poe complexity
  • deadcode: find unused code with vulture
    • uv run poe deadcode

Contributing

See CONTRIBUTING.md

Changelog

See CHANGELOG.md

License

GPL-3.0-only. See LICENSE.


If this project saves you from writing one more one-off preprocessing script for “just this model,” it’s already doing its job. A little less glue code; a lot more model poking. 😉

Download files

  • Source distribution: hf_inference-0.1.0.tar.gz (3.4 MB)
  • Built distribution: hf_inference-0.1.0-py3-none-any.whl (79.2 kB)

File details: hf_inference-0.1.0.tar.gz

  • Size: 3.4 MB
  • Uploaded via: uv/0.9.2 (Trusted Publishing: No)
  • SHA256: 8f8290be3e28d527aa243b4103a303796680caa5f6e428b1de4751db5cbbe59e
  • MD5: 7017014a2559679b787283262d4ac68d
  • BLAKE2b-256: e99f4952069b451ffa84f675e93a65453fcd92e0cc160d26ef6335da8ab91d67

File details: hf_inference-0.1.0-py3-none-any.whl

  • SHA256: 0f456bd502b07c0e5670eee238c7250f775d123751785e97f582dd3d5012cb4b
  • MD5: 0c4de0c2727641eeb4a69cd1d0e433cb
  • BLAKE2b-256: ee4e364df1ebc390f8c5292d77ccbdb3f06ff9e53e4c2c19081b2fff7bb9c04b
