Analyze CUDA binary sizes in .so files - bloaty for CUDA kernels

cubloaty

Ever wondered what's making your CUDA binary big?

Cubloaty is a size profiler for CUDA binaries. It analyzes .so files and .cubin files to show you the size of each kernel, broken down by architecture (sm_70, sm_80, sm_90, etc.).

Think of it as bloaty, but for CUDA kernels.

Quick Example

$ cubloaty sampling.so

╭─────────────────────────────────────╮
│ 📊 CUDA Kernel Size Analysis Report │
╰─────────────────────────────────────╯
                Architecture Summary
╭─────────────────┬────────────┬─────────────────┬──────────────╮
│ Architecture    │    Kernels │      Total Size │   Percentage │
├─────────────────┼────────────┼─────────────────┼──────────────┤
│ SM_89           │        361 │           5.5MB │       100.0% │
├─────────────────┼────────────┼─────────────────┼──────────────┤
│ TOTAL           │        361 │           5.5MB │       100.0% │
╰─────────────────┴────────────┴─────────────────┴──────────────╯

                      Section Breakdown                       
╭───────────────────────────┬─────────────────┬──────────────╮
│ Section Type              │      Total Size │   % of Total │
├───────────────────────────┼─────────────────┼──────────────┤
│ Code Sections             │           4.3MB │        78.9% │
│ Metadata                  │         567.5KB │        10.1% │
│ Data Sections             │         510.4KB │         9.1% │
│ Debug Info                │          39.8KB │         0.7% │
├───────────────────────────┼─────────────────┼──────────────┤
│ TOTAL                     │           5.5MB │       100.0% │
╰───────────────────────────┴─────────────────┴──────────────╯

                              Top CUDA Kernels (All Architectures) - 361 Total
╭────────┬────────────────────────────────────────────────────────────────────────┬──────────────┬────────────╮
│   Rank │ Kernel Name                                                            │    Code Size │  % of Code │
├────────┼────────────────────────────────────────────────────────────────────────┼──────────────┼────────────┤
│      1 │ void flashinfer::sampling::TopKTopPSamplingFromProbKernel<1024u, (c... │       55.8KB │       1.2% │
│      2 │ void flashinfer::sampling::TopKSamplingFromProbKernel<1024u, (cub::... │       55.5KB │       1.2% │
│      3 │ void flashinfer::sampling::TopKTopPSamplingFromProbKernel<1024u, (c... │       52.9KB │       1.2% │
│      4 │ void flashinfer::sampling::TopKSamplingFromProbKernel<1024u, (cub::... │       52.6KB │       1.2% │
│      5 │ void flashinfer::sampling::TopKTopPSamplingFromProbKernel<512u, (cu... │       51.5KB │       1.1% │
│      6 │ void flashinfer::sampling::TopKSamplingFromProbKernel<512u, (cub::C... │       51.2KB │       1.1% │
│      7 │ void flashinfer::sampling::TopKTopPSamplingFromProbKernel<512u, (cu... │       46.4KB │       1.0% │
│      8 │ void flashinfer::sampling::TopKSamplingFromProbKernel<512u, (cub::C... │       46.2KB │       1.0% │
│      9 │ void flashinfer::sampling::TopPSamplingFromProbKernel<1024u, (cub::... │       46.0KB │       1.0% │
│     10 │ void flashinfer::sampling::ChainSpeculativeSampling<1024u, (cub::CU... │       45.5KB │       1.0% │
│     11 │ void flashinfer::sampling::ChainSpeculativeSampling<512u, (cub::CUB... │       43.0KB │       1.0% │
│     12 │ void flashinfer::sampling::TopPSamplingFromProbKernel<1024u, (cub::... │       43.0KB │       1.0% │
│     13 │ void flashinfer::sampling::TopPSamplingFromProbKernel<512u, (cub::C... │       42.9KB │       1.0% │
│     14 │ void flashinfer::sampling::ChainSpeculativeSampling<1024u, (cub::CU... │       42.4KB │       0.9% │
│     15 │ void flashinfer::sampling::MinPSamplingFromProbKernel<1024u, (cub::... │       39.4KB │       0.9% │
│     16 │ void flashinfer::sampling::ChainSpeculativeSampling<512u, (cub::CUB... │       38.8KB │       0.9% │
│     17 │ void flashinfer::sampling::TopPRenormProbKernel<1024u, (cub::CUB_30... │       38.4KB │       0.9% │
│     18 │ void flashinfer::sampling::TopKTopPSamplingFromProbKernel<1024u, (c... │       38.1KB │       0.8% │
│     19 │ void flashinfer::sampling::TopPSamplingFromProbKernel<512u, (cub::C... │       38.0KB │       0.8% │
│     20 │ void flashinfer::sampling::TopKSamplingFromProbKernel<1024u, (cub::... │       37.9KB │       0.8% │
│     21 │ void flashinfer::sampling::MinPSamplingFromProbKernel<512u, (cub::C... │       36.9KB │       0.8% │
│     22 │ void flashinfer::sampling::TopKTopPSamplingFromProbKernel<1024u, (c... │       36.4KB │       0.8% │
│     23 │ void flashinfer::sampling::TopKSamplingFromProbKernel<1024u, (cub::... │       36.2KB │       0.8% │
│     24 │ void flashinfer::sampling::MinPSamplingFromProbKernel<1024u, (cub::... │       36.1KB │       0.8% │
│     25 │ void flashinfer::sampling::TopPRenormProbKernel<512u, (cub::CUB_300... │       34.5KB │       0.8% │
│     26 │ void flashinfer::sampling::TopKMaskLogitsKernel<1024u, (cub::CUB_30... │       34.2KB │       0.8% │
│     27 │ void flashinfer::sampling::TopKTopPSamplingFromProbKernel<512u, (cu... │       33.9KB │       0.8% │
│     28 │ void flashinfer::sampling::TopKSamplingFromProbKernel<512u, (cub::C... │       33.8KB │       0.7% │
│     29 │ void flashinfer::sampling::MinPSamplingFromProbKernel<512u, (cub::C... │       31.9KB │       0.7% │
│     30 │ void flashinfer::sampling::ChainSpeculativeSampling<1024u, (cub::CU... │       31.8KB │       0.7% │
│    ... │ (331 more kernels)                                                     │              │            │
├────────┼────────────────────────────────────────────────────────────────────────┼──────────────┼────────────┤
│        │ TOTAL KERNEL CODE                                                      │        4.4MB │   80.1% of │
│        │                                                                        │              │       file │
╰────────┴────────────────────────────────────────────────────────────────────────┴──────────────┴────────────╯

✓ Analysis complete!

Features

📊 Multi-architecture analysis - See kernel sizes across sm_70, sm_80, sm_90, etc.
🔍 Regex filtering - Filter kernels by name pattern
📦 Multiple formats - .so libraries and standalone .cubin files
🎨 Rich output - Beautiful tables or JSON for scripting
⚡ Fast - Analyzes binaries in seconds

Dependencies

Cubloaty requires the following tools to be installed and available in your PATH:

CUDA Toolkit - for cuobjdump (part of the CUDA installation)
binutils - for objdump, objcopy, readelf, and c++filt (symbol demangling; c++filt ships with binutils, not gcc)

On Ubuntu/Debian:

sudo apt-get install binutils

CUDA Toolkit can be downloaded from NVIDIA's website.
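
Before running cubloaty, it can help to confirm that each of these tools is reachable on PATH. The snippet below is a minimal sketch and not part of cubloaty; the function name `missing_tools` is hypothetical, and the tool list is simply the set of executables named in this section.

```python
import shutil

# External executables named in the dependency list above.
REQUIRED_TOOLS = ["cuobjdump", "objdump", "objcopy", "readelf", "c++filt"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.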

Installation

Install the package from PyPI:

pip install cubloaty

Or clone the repository and install from source:

git clone https://github.com/flashinfer-ai/cubloaty.git
cd cubloaty
pip install -e . -v  # editable mode

Usage

Analyze a shared library

cubloaty libmykernel.so

Analyze a cubin file

cubloaty kernel.sm_90.cubin

Show top 50 kernels

cubloaty libmykernel.so --top 50

Filter by architecture

cubloaty libmykernel.so --arch sm_90

Filter kernels by name (regex)

# Find all GEMM kernels
cubloaty libmykernel.so --filter "gemm"

# Find attention-related kernels
cubloaty libmykernel.so --filter "attention|flash"
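
To preview what a pattern would select, the matching can be approximated in Python. This is a sketch, not cubloaty's implementation: the options list confirms only that matching is case-insensitive, so matching anywhere in the name (`re.search`) is an assumption, and the kernel names are made up for illustration.

```python
import re


def filter_kernels(names, pattern):
    """Keep names that contain a case-insensitive match for `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [name for name in names if regex.search(name)]


# Hypothetical kernel names, for illustration only.
names = ["sgemm_128x64", "FlashAttentionFwd", "reduce_sum", "GEMM_splitk"]

print(filter_kernels(names, "gemm"))             # ['sgemm_128x64', 'GEMM_splitk']
print(filter_kernels(names, "attention|flash"))  # ['FlashAttentionFwd']
```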

Output as JSON

cubloaty libmykernel.so --format json > analysis.json

Show full kernel names without truncation

cubloaty libmykernel.so --full-names

Combine filters

# Show top 20 GEMM kernels for sm_90 in JSON format
cubloaty lib.so --arch sm_90 --filter "gemm" --top 20 --format json
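
When the same combination is needed from a script, the command line can be assembled and run from Python. The sketch below uses only the flags documented here; the helper names `build_command` and `run_cubloaty` are hypothetical, and it assumes that `--format json` writes nothing but the JSON document to standard output.

```python
import json
import subprocess


def build_command(path, arch=None, pattern=None, top=None):
    """Assemble a cubloaty command line that requests JSON output."""
    cmd = ["cubloaty", path, "--format", "json"]
    if arch is not None:
        cmd += ["--arch", arch]
    if pattern is not None:
        cmd += ["--filter", pattern]
    if top is not None:
        cmd += ["--top", str(top)]
    return cmd


def run_cubloaty(path, **options):
    """Run cubloaty and return the parsed report (needs cubloaty on PATH)."""
    result = subprocess.run(
        build_command(path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)
```

For example, `run_cubloaty("lib.so", arch="sm_90", pattern="gemm", top=20)` issues the same command as the shell example above.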

Advanced Examples

Compare kernel sizes across architectures

# Show per-architecture breakdown
cubloaty libmykernel.so --verbose

Find the largest kernels

# Show just the top 10
cubloaty libmykernel.so --top 10

Export for further analysis

# Get JSON output and process with jq
cubloaty lib.so --format json | jq '.kernels[] | select(.size > 100000)'
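
The same selection can be done in Python once the report has been saved or captured. This sketch relies only on what the jq filter implies, namely a top-level `kernels` array whose entries have a numeric `size`. The `name` field and the sample values are assumptions made for illustration.

```python
import json


def kernels_larger_than(report, min_size):
    """Return the entries of report["kernels"] whose "size" exceeds min_size."""
    return [kernel for kernel in report["kernels"] if kernel["size"] > min_size]


# Hypothetical report; only "kernels" and "size" are implied by the jq example.
sample = json.loads(
    '{"kernels": [{"name": "a", "size": 150000}, {"name": "b", "size": 4096}]}'
)

for kernel in kernels_larger_than(sample, 100000):
    print(kernel["name"], kernel["size"])  # a 150000
```

To run it against real output, load the file written by `--format json > analysis.json` with `json.load` and pass the result in place of `sample`.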

Options

  file                    Path to .so or .cubin file to analyze
  --top N, -n N          Show top N kernels (default: 30)
  --arch ARCH, -a ARCH   Filter by architecture (e.g., sm_90, sm_80)
  --filter REGEX, -r     Filter kernel names by regex (case-insensitive)
  --format {table,json}  Output format (default: table)
  --full-names           Show full kernel names without truncation
  --no-color             Disable colored output
  --verbose, -v          Show detailed processing information
  --version              Show version number

How It Works

Cubloaty extracts CUDA fatbinary sections from shared libraries using objdump and objcopy, then uses cuobjdump to extract individual cubins for each architecture. It analyzes each cubin with readelf to extract kernel symbols and their sizes, and uses c++filt to demangle C++ symbol names.
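
The last two steps (reading symbol sizes with readelf and demangling with c++filt) can be sketched as follows for a single, already extracted cubin. This is an illustration under stated assumptions, not cubloaty's actual code: the function names are hypothetical, kernels are assumed to appear as `FUNC` symbols, and the fatbinary extraction with objdump, objcopy, and cuobjdump is left out.

```python
import subprocess


def parse_function_symbols(readelf_output):
    """Extract (name, size) pairs for FUNC symbols from `readelf -sW` text."""
    symbols = []
    for line in readelf_output.splitlines():
        parts = line.split()
        # Symbol rows look like: Num: Value Size Type Bind Vis Ndx Name
        if len(parts) < 8 or not parts[0].rstrip(":").isdigit():
            continue
        if parts[3] != "FUNC":
            continue
        # readelf prints large sizes in hex (0x...), so let int() pick the base.
        symbols.append((parts[-1], int(parts[2], 0)))
    return sorted(symbols, key=lambda item: item[1], reverse=True)


def demangle(names):
    """Demangle C++ symbol names by piping them through c++filt."""
    result = subprocess.run(
        ["c++filt"],
        input="\n".join(names) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def kernel_sizes(cubin_path):
    """List (demangled name, size) for one cubin, largest first."""
    output = subprocess.run(
        ["readelf", "-sW", cubin_path],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    symbols = parse_function_symbols(output)
    names = demangle([name for name, _ in symbols])
    return [(name, size) for name, (_, size) in zip(names, symbols)]
```

Keeping the parser separate from the subprocess calls makes it possible to check the parsing logic against captured readelf output without a CUDA toolchain installed.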

Contributing

Issues and pull requests are welcome!

Release history

0.1.0b3 (pre-release) - Jan 15, 2026
0.1.0b2 (pre-release) - Oct 7, 2025
0.1.0b1 (pre-release) - Oct 7, 2025
0.1.0b0 (pre-release) - Oct 6, 2025

Download files

Source Distribution

cubloaty-0.1.0b3.tar.gz (19.7 kB)

File metadata

Upload date: Jan 15, 2026
Size: 19.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.10.13

File hashes

Hashes for cubloaty-0.1.0b3.tar.gz
Algorithm	Hash digest
SHA256	`6c79738ba3cdcc0fb71cdcdfe78bc23ce95df57992124f458d802f5bb70736fc`
MD5	`64b4ce593dee41421f25e33f55694330`
BLAKE2b-256	`541565102857a24c1c452494f9d70882c0acc4faf7fb0bf68b710005595213e5`
