High-performance Llama inference engine

These details have not been verified by PyPI

Project description

Cottus Runtime


High-performance C++/CUDA LLM inference engine with Python bindings.

Cottus Runtime is a custom inference engine built from scratch for Llama architectures, prioritizing low latency and strict memory management. It implements its own Transformer execution pipeline, KV cache management, and attention kernels.

Features

Core: Custom C++20 Transformer implementation.
Memory: PagedAttention with BlockAllocator for efficient KV cache management.
Compute: CUDA-accelerated kernels for Attention, RoPE, and GEMM (cuBLAS).
Parity: Token-for-token match with Hugging Face Transformers output (verified).
Interface: Clean Python API via PyBind11.
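
To make the Memory feature above more concrete, here is a minimal Python sketch of the idea behind a paged KV cache: the cache is split into fixed-size blocks, and each sequence keeps a table of the physical blocks it owns. This is an illustration only, not Cottus's C++ implementation; the class layout, method names, and block size are assumptions chosen for the example.

```python
from typing import Dict, List


class BlockAllocator:
    """Toy free-list allocator for fixed-size KV cache blocks (illustrative only)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size                     # tokens per block
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[int, List[int]] = {}     # seq id -> physical block ids
        self.seq_lens: Dict[int, int] = {}               # seq id -> tokens stored

    def append_token(self, seq_id: int) -> int:
        """Reserve room for one more token; return the physical block that holds it."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:                # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(20):                                      # 20 tokens need ceil(20 / 16) = 2 blocks
    allocator.append_token(seq_id=0)
blocks_used = len(allocator.block_tables[0])
allocator.free(seq_id=0)
```

Because blocks are allocated only when a sequence actually grows into them, memory is not reserved up front for the maximum context length, which is the main appeal of the paged approach.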

Current Limitations (v0.1)

Single‑GPU only
Limited model family support (LLaMA‑style)
CPU backend not optimized (minor numerical divergence vs CUDA)
No quantization support

These constraints are intentional for the initial release.

What to expect in the future

Here is what I plan to add in later iterations. Send a PR if you would like to contribute.

Multi‑GPU & Distributed Execution – Enable scaling across multiple GPUs and clusters for larger models.
Expanded Model Support – Add native support for Mistral, Falcon, and other non‑LLaMA families.
Optimized CPU Backend – Introduce a high‑performance CPU path (vectorized kernels, OpenMP) and enable CPU‑only inference.
Quantization & INT8 – Provide post‑training quantization pipelines and INT8 kernels for reduced memory and faster inference.
FlashAttention‑style Kernels – Integrate memory‑efficient, tiled (IO‑aware) attention kernels to cut latency and improve throughput.
Plugin System – Allow community‑contributed extensions (custom ops, alternative KV‑cache strategies).
Better Tooling – CLI utilities for model conversion, benchmarking, and profiling.
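
For readers unfamiliar with the quantization item on this roadmap, the sketch below shows symmetric per-tensor INT8 quantization in plain Python. It is a generic textbook scheme, not the design planned for Cottus; the function names and sample weights are made up for illustration.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]                 # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each value is stored in one byte instead of two or four, at the cost of a rounding error of at most half a quantization step (`scale / 2`).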

Installation

Prerequisites

NVIDIA GPU with CUDA 11/12 (Recommended)
C++ Compiler (GCC 10+ or Clang 12+)
CMake 3.18+
Python 3.8+
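
Before building from source, it can help to confirm that the tools listed above are visible on your PATH. The following is a small, hedged helper using only the Python standard library; it checks for presence only, not for the minimum versions, and the executable names (`g++`, `clang++`, `c++`, `nvcc`) are assumptions about a typical Linux setup.

```python
import shutil
import sys


def check_prerequisites() -> dict:
    """Report which build prerequisites are visible from this environment."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        "cmake": shutil.which("cmake") is not None,
        "c++ compiler": any(shutil.which(c) for c in ("g++", "clang++", "c++")),
        "nvcc (optional)": shutil.which("nvcc") is not None,
    }


report = check_prerequisites()
for name, found in report.items():
    print(f"{name:18} {'ok' if found else 'missing'}")
```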

Install from Source

# Clone repository
git clone https://github.com/cottus-ai/cottus-runtime.git
cd cottus-runtime

# Create virtual environment (Recommended)
python3 -m venv .venv
source .venv/bin/activate

# Install in editable mode
pip install -e .

Quick Start

1. Basic Inference (Tiny Random Model)

python examples/1_basic_inference.py --device cuda

2. CPU Fallback

No GPU? No problem.

python examples/2_cpu_inference.py

3. Real Chat (TinyLlama-1.1B)

Requires ~2.2GB download.

python examples/3_tinyllama_real.py

Usage

The best way to get started is to look at the examples/ directory, which contains complete scripts for various use cases.

Basic Example

from cottus import Engine, EngineConfig
from cottus.model import load_hf_model

# 1. Load Model Weights
weights, _, _, tokenizer, _ = load_hf_model("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device="cuda")

# 2. Config
config = EngineConfig()
config.model_type = "llama"
config.hidden_dim = 2048
config.num_layers = 22
config.num_heads = 32
config.num_kv_heads = 4
config.head_dim = 64
config.intermediate_dim = 5632
config.device = "cuda"

# 3. Create the engine and tokenize the prompt
engine = Engine(config, weights)
input_ids = tokenizer.encode("Hello!")

# 4. Generate
output_ids = engine.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output_ids))
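
The configuration values in the example above are related to one another, and they also determine how much memory the KV cache needs. The sketch below checks the head arithmetic and estimates the cache size per token. It assumes a 2-byte cache element (fp16 or bf16) and a context length of 2048 tokens; neither figure is stated in this README.

```python
# Values copied from the EngineConfig in the example above.
hidden_dim, num_layers = 2048, 22
num_heads, num_kv_heads, head_dim = 32, 4, 64

# The attention heads must tile the hidden dimension exactly.
assert num_heads * head_dim == hidden_dim

# Grouped-query attention: each KV head is shared by this many query heads.
queries_per_kv_head = num_heads // num_kv_heads

# Assumption: 2 bytes per element (fp16/bf16 cache). Both K and V are stored.
bytes_per_element = 2
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

context_len = 2048  # hypothetical context length, for illustration only
kv_mib = kv_bytes_per_token * context_len / 2**20

print(queries_per_kv_head, kv_bytes_per_token, kv_mib)  # 8 22528 44.0
```

With only 4 KV heads instead of 32, the cache is one eighth the size it would be under standard multi-head attention with the same head dimension.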

License

Cottus Runtime is licensed under the Apache License Version 2.0. By contributing to the project, you agree to the license and copyright terms therein and release your contribution under these terms.

Project details


Release history

This version: 0.1.0 (Jan 4, 2026)

Download files

Download the file for your platform.

Source Distribution

cottus-0.1.0.tar.gz (553.7 kB view details)

Uploaded Jan 4, 2026 Source

Built Distribution

cottus-0.1.0-cp312-cp312-manylinux_2_34_x86_64.whl (371.1 kB view details)

Uploaded Jan 4, 2026 CPython 3.12, manylinux: glibc 2.34+ x86-64

File details

Details for the file cottus-0.1.0.tar.gz.

File metadata

Download URL: cottus-0.1.0.tar.gz
Upload date: Jan 4, 2026
Size: 553.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for cottus-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`1ecd785197eca1d809789424e514033bac186c1a6f7e59e766f815bf76c16c8a`
MD5	`cf09461b822fbc3def7db3be26d395ad`
BLAKE2b-256	`b4ffb0d7d6e7c972db41a0fd4eca4c717b54a3e7623c9373ac0aac99440906fe`
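
One way to use the digests above is to hash the downloaded archive locally and compare. The sketch below does this with Python's standard `hashlib`; the local file path is an assumption (the archive is expected in the current directory), and the helper name `sha256_of` is made up for this example.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for cottus-0.1.0.tar.gz.
EXPECTED_SHA256 = "1ecd785197eca1d809789424e514033bac186c1a6f7e59e766f815bf76c16c8a"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


archive = Path("cottus-0.1.0.tar.gz")  # assumed to be in the current directory
if archive.exists():
    print("match" if sha256_of(archive) == EXPECTED_SHA256 else "MISMATCH")
else:
    print(f"{archive} not found; download it first")
```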


Provenance

The following attestation bundles were made for cottus-0.1.0.tar.gz:

Publisher: publish.yml on cottus-ai/cottus-runtime

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cottus-0.1.0.tar.gz
- Subject digest: 1ecd785197eca1d809789424e514033bac186c1a6f7e59e766f815bf76c16c8a
- Sigstore transparency entry: 790374291
- Sigstore integration time: Jan 4, 2026
Source repository:
- Permalink: cottus-ai/cottus-runtime@3cedc6238b5f4efc9968e5448b5f54c7168f05b6
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/cottus-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3cedc6238b5f4efc9968e5448b5f54c7168f05b6
- Trigger Event: release

File details

Details for the file cottus-0.1.0-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

Download URL: cottus-0.1.0-cp312-cp312-manylinux_2_34_x86_64.whl
Upload date: Jan 4, 2026
Size: 371.1 kB
Tags: CPython 3.12, manylinux: glibc 2.34+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for cottus-0.1.0-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm	Hash digest
SHA256	`ead2c92c114a2655549bb423cbea7d94ef304f614a439e9dfa0dbfcca0f3230a`
MD5	`ca337dd83107035e031d4adeee640f6d`
BLAKE2b-256	`b46602791cfa2e45b5c745de14cd249ee518f3675d1416e61d797d18e16db70d`


Provenance

The following attestation bundles were made for cottus-0.1.0-cp312-cp312-manylinux_2_34_x86_64.whl:

Publisher: publish.yml on cottus-ai/cottus-runtime

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cottus-0.1.0-cp312-cp312-manylinux_2_34_x86_64.whl
- Subject digest: ead2c92c114a2655549bb423cbea7d94ef304f614a439e9dfa0dbfcca0f3230a
- Sigstore transparency entry: 790374293
- Sigstore integration time: Jan 4, 2026
Source repository:
- Permalink: cottus-ai/cottus-runtime@3cedc6238b5f4efc9968e5448b5f54c7168f05b6
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/cottus-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3cedc6238b5f4efc9968e5448b5f54c7168f05b6
- Trigger Event: release
