Serving LLMs at Scale

These details have not been verified by PyPI

Project links

Project description

Breathing Life into Language

aphrodite

Aphrodite is an inference engine that optimizes the serving of HuggingFace-compatible models at scale. Built on vLLM's Paged Attention technology, it delivers high-performance model inference for multiple concurrent users. Developed through a collaboration between PygmalionAI and Ruliad, Aphrodite serves as the backend engine powering both organizations' chat platforms and API infrastructure.

Aphrodite builds upon and integrates the exceptional work from various projects, primarily vLLM.

Features

Continuous Batching
Efficient K/V management with PagedAttention from vLLM
Optimized CUDA kernels for improved inference
Quantization support via AQLM, AutoRound, AWQ, BitNet, Bitsandbytes, EETQ, GGUF, GPTQ, QuIP#, SqueezeLLM, Marlin, FP2-FP12 [1] [2] [3], NVIDIA ModelOpt, TorchAO, VPTQ, compressed_tensors, MXFP4, and more.
Distributed inference
8-bit KV Cache for higher context lengths and throughput, at both FP8 E5M3 and E4M3 formats
Support for modern samplers such as DRY, XTC, Mirostat, and more
Disaggregated inference
Speculative decoding
Multimodal support
Multi-LoRA support

Quickstart

Install the engine:

pip install -U aphrodite-engine --extra-index-url https://downloads.pygmalion.chat/whl

Then launch a model:

aphrodite run Qwen/Qwen3-0.6B

If you're not serving at scale, you can append the --single-user-mode flag to limit memory usage.

This will create a OpenAI-compatible API server that can be accessed at port 2242 of the localhost. You can plug in the API into a UI that supports OpenAI, such as SillyTavern.

Please refer to the documentation for the full list of arguments and flags you can pass to the engine, or simply run aphrodite run -h to see the full list of arguments.

You can play around with the engine in the demo here:

Docker

Additionally, we provide a Docker image for easy deployment. Here's a basic command to get you started:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    #--env "CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7" \
    -p 2242:2242 \
    --ipc=host \
    alpindale/aphrodite-openai:latest \
    --model NousResearch/Meta-Llama-3.1-8B-Instruct \
    --tensor-parallel-size 8 \
    --api-key "sk-empty"

This will pull the Aphrodite Engine image, and launch the engine with the Llama-3.1-8B-Instruct model at port 2242.

Requirements

Operating System: Linux, Windows (WSL2)
Python: 3.9 to 3.12

Build Requirements:

CUDA >= 12

For supported devices, see here. Generally speaking, all semi-modern GPUs are supported - down to Pascal (GTX 10xx, P40, etc.) We also support AMD GPUs, Intel CPUs and GPUs, Google TPU, and AWS Inferentia.

Notes

By design, Aphrodite takes up 90% of your GPU's VRAM. If you're not serving an LLM at scale, you may want to limit the amount of memory it takes up. You can do this in the API example by launching the server with the --gpu-memory-utilization 0.6 (0.6 means 60%), or --single-user-mode to only allocate as much memory as needed for a single sequence.
You can view the full list of commands by running aphrodite run --help.

Acknowledgements

Aphrodite Engine would have not been possible without the phenomenal work of other open-source projects. A (non-exhaustive) list:

Contributing

Everyone is welcome to contribute. You can support the project by opening Pull Requests for new features, fixes, or general UX improvements.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.10.0

Nov 8, 2025

0.6.5

Dec 22, 2024

0.6.4.post1

Dec 4, 2024

0.6.3.post1

Nov 2, 2024

0.6.3

Nov 2, 2024

0.6.2.post1

Oct 16, 2024

0.6.2

Sep 22, 2024

0.6.1.post1

Sep 13, 2024

0.6.1

Sep 12, 2024

0.6.0.post1

Sep 6, 2024

0.6.0

Sep 3, 2024

0.5.1

Mar 15, 2024

0.5.0

Mar 11, 2024

0.4.9

Feb 3, 2024

0.4.8

Feb 3, 2024

0.4.7

Feb 3, 2024

0.4.6

Jan 14, 2024

0.4.5

Dec 19, 2023

0.4.3

Dec 12, 2023

0.4.2

Nov 13, 2023

0.4.1

Nov 3, 2023

0.4

Nov 3, 2023

0.3.7

Oct 24, 2023

0.3.6

Oct 13, 2023

0.3.5

Oct 9, 2023

0.3.4

Oct 6, 2023

0.3.3

Oct 2, 2023

0.3.2

Sep 30, 2023

0.3.1

Sep 29, 2023

0.3

Sep 28, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aphrodite_engine-0.10.0.tar.gz (24.4 MB view details)

Uploaded Nov 8, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

aphrodite_engine-0.10.0-py3-none-any.whl (16.8 MB view details)

Uploaded Nov 8, 2025 Python 3

File details

Details for the file aphrodite_engine-0.10.0.tar.gz.

File metadata

Download URL: aphrodite_engine-0.10.0.tar.gz
Upload date: Nov 8, 2025
Size: 24.4 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.7

File hashes

Hashes for aphrodite_engine-0.10.0.tar.gz
Algorithm	Hash digest
SHA256	`0fa00409bb05f419ce9ac0b0fb48986923d35aea3e61a7ae643c0259ddb4009f`
MD5	`943b137d0a6ee4a7e95d3032a49acace`
BLAKE2b-256	`40377bb3b296e96620b90ab7d869f4175c0aa32da93f967700949cc2d9506e90`

See more details on using hashes here.

File details

Details for the file aphrodite_engine-0.10.0-py3-none-any.whl.

File metadata

Download URL: aphrodite_engine-0.10.0-py3-none-any.whl
Upload date: Nov 8, 2025
Size: 16.8 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.7

File hashes

Hashes for aphrodite_engine-0.10.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a0dda9b5d69345e7328abed05bdcc23125ed24cf7cdcefe03028aaf8a5ae7b21`
MD5	`79719e9098850d239ddd9565bf3f8314`
BLAKE2b-256	`8e7d53c7b0aac3bd4a9481c543a47286c078a5d0771cef7cac8635daf7f3ae11`

See more details on using hashes here.

aphrodite-engine 0.10.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Breathing Life into Language

Features

Quickstart

Docker

Requirements

Build Requirements:

Notes

Acknowledgements

Sponsors

Contributing

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes