Analyze CUDA binary sizes in .so files - bloaty for CUDA kernels

cubloaty

Ever wondered what's making your CUDA binary big?

Cubloaty is a size profiler for CUDA binaries. It analyzes .so files and .cubin files to show you the size of each kernel, broken down by architecture (sm_70, sm_80, sm_90, etc.).

Think of it as bloaty, but for CUDA kernels.

Quick Example

$ cd $(python -c "import torch; print(torch.__path__[0] + '/lib')")
$ cubloaty libtorch_cuda_linalg.so                   

╭─────────────────────────────────────╮
│ 📊 CUDA Kernel Size Analysis Report │
╰─────────────────────────────────────╯
                Architecture Summary                
╭─────────────────┬─────────────────┬──────────────╮
│ Architecture          Total Size    Percentage │
├─────────────────┼─────────────────┼──────────────┤
│ SM_100                    55.2MB         18.6% │
│ SM_120                    78.5MB         26.4% │
│ SM_80                     54.3MB         18.3% │
│ SM_86                     54.2MB         18.3% │
│ SM_90                     54.7MB         18.4% │
├─────────────────┼─────────────────┼──────────────┤
│ TOTAL                    296.8MB        100.0% │
╰─────────────────┴─────────────────┴──────────────╯

                                       Top Kernels (All Architectures)                                       
╭────────┬────────────────────────────────────────────────────────────────────────┬──────────────┬──────────╮
│   Rank  Kernel Name                                                               Total Size         % │
├────────┼────────────────────────────────────────────────────────────────────────┼──────────────┼──────────┤
│      1                                                                                 5.2MB      1.7% │
│      2  void zgetf2_fused_batched_kernel<32>(int, double2**, int, int, int,...         2.3MB      0.8% │
│      3  void zgetf2_fused_batched_kernel<31>(int, double2**, int, int, int,...         2.0MB      0.7% │
│      4  void zgetf2_fused_batched_kernel<30>(int, double2**, int, int, int,...         1.9MB      0.6% │
│      5  void zgetf2_fused_batched_kernel<29>(int, double2**, int, int, int,...         1.8MB      0.6% │
│      6  void zgetf2_fused_batched_kernel<28>(int, double2**, int, int, int,...         1.7MB      0.6% │
│      7  void cgetf2_fused_batched_kernel<32>(int, float2**, int, int, int, ...         1.7MB      0.6% │
│      8  void zgetf2_fused_batched_kernel<27>(int, double2**, int, int, int,...         1.6MB      0.5% │
│      9  void cgetf2_fused_batched_kernel<31>(int, float2**, int, int, int, ...         1.6MB      0.5% │
│     10  void zgetf2_fused_batched_kernel<26>(int, double2**, int, int, int,...         1.5MB      0.5% │
│     11  void zgetf2_fused_batched_kernel<25>(int, double2**, int, int, int,...         1.5MB      0.5% │
│     12  void cgetf2_fused_batched_kernel<30>(int, float2**, int, int, int, ...         1.4MB      0.5% │
│     13  void cgetf2_fused_batched_kernel<29>(int, float2**, int, int, int, ...         1.4MB      0.5% │
│     14  void zgetf2_fused_batched_kernel<24>(int, double2**, int, int, int,...         1.4MB      0.5% │
│     15  void dgetf2_fused_batched_kernel<31>(int, double**, int, int, int, ...         1.3MB      0.4% │
│     16  void zgetf2_fused_batched_kernel<23>(int, double2**, int, int, int,...         1.3MB      0.4% │
│     17  void cgetf2_fused_batched_kernel<28>(int, float2**, int, int, int, ...         1.3MB      0.4% │
│     18  void dgetf2_fused_batched_kernel<32>(int, double**, int, int, int, ...         1.3MB      0.4% │
│     19  void cgetf2_fused_batched_kernel<27>(int, float2**, int, int, int, ...         1.3MB      0.4% │
│     20  void zgetf2_fused_batched_kernel<22>(int, double2**, int, int, int,...         1.2MB      0.4% │
│     21  void cgetf2_fused_batched_kernel<26>(int, float2**, int, int, int, ...         1.2MB      0.4% │
│     22  void cgetf2_fused_batched_kernel<25>(int, float2**, int, int, int, ...         1.2MB      0.4% │
│     23  void zgetrf_batched_smallsq_noshfl_kernel<32, 32>(double2**, int, i...         1.2MB      0.4% │
│     24  void dgetf2_fused_batched_kernel<30>(int, double**, int, int, int, ...         1.2MB      0.4% │
│     25  void zgetf2_fused_batched_kernel<21>(int, double2**, int, int, int,...         1.2MB      0.4% │
│     26  void dgetf2_fused_batched_kernel<29>(int, double**, int, int, int, ...         1.1MB      0.4% │
│     27  void zgetrf_batched_smallsq_noshfl_kernel<31, 32>(double2**, int, i...         1.1MB      0.4% │
│     28  void sgetf2_fused_batched_kernel<31>(int, float**, int, int, int, i...         1.1MB      0.4% │
│     29  void zgetf2_fused_batched_kernel<20>(int, double2**, int, int, int,...         1.1MB      0.4% │
│     30  void cgetf2_fused_batched_kernel<24>(int, float2**, int, int, int, ...         1.1MB      0.4% │
│    ...  (2038 more kernels)                                                                            │
├────────┼────────────────────────────────────────────────────────────────────────┼──────────────┼──────────┤
│         TOTAL                                                                        296.8MB    100.0% │
╰────────┴────────────────────────────────────────────────────────────────────────┴──────────────┴──────────╯

                                             Kernels for SM_100                                              
╭────────┬────────────────────────────────────────────────────────────────────────┬──────────────┬──────────╮
│   Rank  Kernel Name                                                                     Size         % │
├────────┼────────────────────────────────────────────────────────────────────────┼──────────────┼──────────┤
│      1                                                                                 1.6MB      2.8% │
│      2  void zgetf2_fused_batched_kernel<32>(int, double2**, int, int, int,...       397.8KB      0.7% │
│      3  void cgetf2_fused_batched_kernel<32>(int, float2**, int, int, int, ...       380.0KB      0.7% │
│      4  void zgetf2_fused_batched_kernel<31>(int, double2**, int, int, int,...       326.9KB      0.6% │
│      5  void cgetf2_fused_batched_kernel<31>(int, float2**, int, int, int, ...       323.5KB      0.6% │
│      6  void zgetf2_fused_batched_kernel<30>(int, double2**, int, int, int,...       313.6KB      0.6% │
│      7  void zgetf2_fused_batched_kernel<29>(int, double2**, int, int, int,...       299.8KB      0.5% │
│      8  void cgetf2_fused_batched_kernel<29>(int, float2**, int, int, int, ...       294.4KB      0.5% │
│      9  void cgetf2_fused_batched_kernel<30>(int, float2**, int, int, int, ...       289.0KB      0.5% │
│     10  void zgetf2_fused_batched_kernel<28>(int, double2**, int, int, int,...       286.2KB      0.5% │
│     11  void dgetf2_fused_batched_kernel<31>(int, double**, int, int, int, ...       283.6KB      0.5% │
│     12  void dgetf2_fused_batched_kernel<32>(int, double**, int, int, int, ...       282.4KB      0.5% │
│     13  void zgetf2_fused_batched_kernel<27>(int, double2**, int, int, int,...       272.6KB      0.5% │
│     14  void cgetf2_fused_batched_kernel<27>(int, float2**, int, int, int, ...       268.0KB      0.5% │
│     15  void cgetf2_fused_batched_kernel<28>(int, float2**, int, int, int, ...       264.1KB      0.5% │
│    ...  (1944 more kernels)                                                                            │
├────────┼────────────────────────────────────────────────────────────────────────┼──────────────┼──────────┤
│         TOTAL                                                                         55.2MB    100.0% │
╰────────┴────────────────────────────────────────────────────────────────────────┴──────────────┴──────────╯

                                             Kernels for SM_120                                              
╭────────┬────────────────────────────────────────────────────────────────────────┬──────────────┬──────────╮
│   Rank  Kernel Name                                                                     Size         % │
├────────┼────────────────────────────────────────────────────────────────────────┼──────────────┼──────────┤
│      1                                                                                 1.4MB      1.8% │
│      2  void zgetf2_fused_batched_kernel<32>(int, double2**, int, int, int,...       878.5KB      1.1% │
│      3  void zgetf2_fused_batched_kernel<31>(int, double2**, int, int, int,...       712.0KB      0.9% │
│      4  void zgetf2_fused_batched_kernel<30>(int, double2**, int, int, int,...       676.2KB      0.8% │
│      5  void zgetf2_fused_batched_kernel<29>(int, double2**, int, int, int,...       642.5KB      0.8% │
│      6  void zgetf2_fused_batched_kernel<28>(int, double2**, int, int, int,...       609.2KB      0.8% │
│      7  void zgetf2_fused_batched_kernel<27>(int, double2**, int, int, int,...       577.1KB      0.7% │
│      8  void zgetf2_fused_batched_kernel<26>(int, double2**, int, int, int,...       544.0KB      0.7% │
│      9  void zgetf2_fused_batched_kernel<25>(int, double2**, int, int, int,...       513.6KB      0.6% │
│     10  void zgetrf_batched_smallsq_noshfl_kernel<32, 32>(double2**, int, i...       485.5KB      0.6% │
│     11  void zgetf2_fused_batched_kernel<24>(int, double2**, int, int, int,...       484.9KB      0.6% │
│     12  void zgetrf_batched_smallsq_noshfl_kernel<31, 32>(double2**, int, i...       461.5KB      0.6% │
│     13  void zgetf2_fused_batched_kernel<23>(int, double2**, int, int, int,...       455.6KB      0.6% │
│     14  void zgetrf_batched_smallsq_noshfl_kernel<30, 32>(double2**, int, i...       437.2KB      0.5% │
│     15  void zgetf2_fused_batched_kernel<22>(int, double2**, int, int, int,...       425.1KB      0.5% │
│    ...  (1944 more kernels)                                                                            │
├────────┼────────────────────────────────────────────────────────────────────────┼──────────────┼──────────┤
│         TOTAL                                                                         78.5MB    100.0% │
╰────────┴────────────────────────────────────────────────────────────────────────┴──────────────┴──────────╯

                                              Kernels for SM_80                                              
╭────────┬────────────────────────────────────────────────────────────────────────┬──────────────┬──────────╮
│   Rank  Kernel Name                                                                     Size         % │
├────────┼────────────────────────────────────────────────────────────────────────┼──────────────┼──────────┤
│      1                                                                               709.4KB      1.3% │
│      2  void zgetf2_fused_batched_kernel<32>(int, double2**, int, int, int,...       343.8KB      0.6% │
│      3  void zgetf2_fused_batched_kernel<31>(int, double2**, int, int, int,...       328.9KB      0.6% │
│      4  void cgetf2_fused_batched_kernel<32>(int, float2**, int, int, int, ...       323.5KB      0.6% │
│      5  void cgetf2_fused_batched_kernel<31>(int, float2**, int, int, int, ...       320.5KB      0.6% │
│      6  void zgetf2_fused_batched_kernel<30>(int, double2**, int, int, int,...       315.1KB      0.6% │
│      7  void zgetf2_fused_batched_kernel<29>(int, double2**, int, int, int,...       299.8KB      0.5% │
│      8  void cgetf2_fused_batched_kernel<30>(int, float2**, int, int, int, ...       295.8KB      0.5% │
│      9  void cgetf2_fused_batched_kernel<29>(int, float2**, int, int, int, ...       291.2KB      0.5% │
│     10  void zgetf2_fused_batched_kernel<28>(int, double2**, int, int, int,...       285.6KB      0.5% │
│     11  void sgetf2_native_kernel<512, 47>(int, int, float*, int, int volat...       283.4KB      0.5% │
│     12  void zgetf2_fused_batched_kernel<27>(int, double2**, int, int, int,...       271.9KB      0.5% │
│     13  void cgetf2_fused_batched_kernel<28>(int, float2**, int, int, int, ...       269.2KB      0.5% │
│     14  void cgetf2_fused_batched_kernel<27>(int, float2**, int, int, int, ...       264.6KB      0.5% │
│     15  void sgetf2_native_kernel<512, 45>(int, int, float*, int, int volat...       261.5KB      0.5% │
│    ...  (2051 more kernels)                                                                            │
├────────┼────────────────────────────────────────────────────────────────────────┼──────────────┼──────────┤
│         TOTAL                                                                         54.3MB    100.0% │
╰────────┴────────────────────────────────────────────────────────────────────────┴──────────────┴──────────╯

                                              Kernels for SM_86                                              
╭────────┬────────────────────────────────────────────────────────────────────────┬──────────────┬──────────╮
│   Rank  Kernel Name                                                                     Size         % │
├────────┼────────────────────────────────────────────────────────────────────────┼──────────────┼──────────┤
│      1                                                                               710.2KB      1.3% │
│      2  void zgetf2_fused_batched_kernel<32>(int, double2**, int, int, int,...       343.8KB      0.6% │
│      3  void zgetf2_fused_batched_kernel<31>(int, double2**, int, int, int,...       328.9KB      0.6% │
│      4  void cgetf2_fused_batched_kernel<32>(int, float2**, int, int, int, ...       323.5KB      0.6% │
│      5  void cgetf2_fused_batched_kernel<31>(int, float2**, int, int, int, ...       320.5KB      0.6% │
│      6  void zgetf2_fused_batched_kernel<30>(int, double2**, int, int, int,...       315.1KB      0.6% │
│      7  void zgetf2_fused_batched_kernel<29>(int, double2**, int, int, int,...       299.8KB      0.5% │
│      8  void cgetf2_fused_batched_kernel<30>(int, float2**, int, int, int, ...       295.8KB      0.5% │
│      9  void cgetf2_fused_batched_kernel<29>(int, float2**, int, int, int, ...       291.9KB      0.5% │
│     10  void zgetf2_fused_batched_kernel<28>(int, double2**, int, int, int,...       285.6KB      0.5% │
│     11  void zgetf2_fused_batched_kernel<27>(int, double2**, int, int, int,...       271.9KB      0.5% │
│     12  void cgetf2_fused_batched_kernel<28>(int, float2**, int, int, int, ...       269.2KB      0.5% │
│     13  void cgetf2_fused_batched_kernel<27>(int, float2**, int, int, int, ...       264.8KB      0.5% │
│     14  void zgetf2_fused_batched_kernel<26>(int, double2**, int, int, int,...       258.4KB      0.5% │
│     15  void zgetf2_fused_batched_kernel<25>(int, double2**, int, int, int,...       246.0KB      0.4% │
│    ...  (2051 more kernels)                                                                            │
├────────┼────────────────────────────────────────────────────────────────────────┼──────────────┼──────────┤
│         TOTAL                                                                         54.2MB    100.0% │
╰────────┴────────────────────────────────────────────────────────────────────────┴──────────────┴──────────╯

                                              Kernels for SM_90                                              
╭────────┬────────────────────────────────────────────────────────────────────────┬──────────────┬──────────╮
│   Rank  Kernel Name                                                                     Size         % │
├────────┼────────────────────────────────────────────────────────────────────────┼──────────────┼──────────┤
│      1                                                                               827.4KB      1.5% │
│      2  void zgetf2_fused_batched_kernel<32>(int, double2**, int, int, int,...       343.6KB      0.6% │
│      3  void zgetf2_fused_batched_kernel<31>(int, double2**, int, int, int,...       328.5KB      0.6% │
│      4  void cgetf2_fused_batched_kernel<32>(int, float2**, int, int, int, ...       323.8KB      0.6% │
│      5  void cgetf2_fused_batched_kernel<31>(int, float2**, int, int, int, ...       320.8KB      0.6% │
│      6  void zgetf2_fused_batched_kernel<30>(int, double2**, int, int, int,...       315.4KB      0.6% │
│      7  void zgetf2_fused_batched_kernel<29>(int, double2**, int, int, int,...       300.6KB      0.5% │
│      8  void cgetf2_fused_batched_kernel<30>(int, float2**, int, int, int, ...       296.0KB      0.5% │
│      9  void cgetf2_fused_batched_kernel<29>(int, float2**, int, int, int, ...       292.5KB      0.5% │
│     10  void zgetf2_fused_batched_kernel<28>(int, double2**, int, int, int,...       285.9KB      0.5% │
│     11  void zgetf2_fused_batched_kernel<27>(int, double2**, int, int, int,...       273.2KB      0.5% │
│     12  void cgetf2_fused_batched_kernel<28>(int, float2**, int, int, int, ...       269.5KB      0.5% │
│     13  void cgetf2_fused_batched_kernel<27>(int, float2**, int, int, int, ...       265.0KB      0.5% │
│     14  void zgetf2_fused_batched_kernel<26>(int, double2**, int, int, int,...       259.4KB      0.5% │
│     15  void zgetf2_fused_batched_kernel<25>(int, double2**, int, int, int,...       246.0KB      0.4% │
│    ...  (1945 more kernels)                                                                            │
├────────┼────────────────────────────────────────────────────────────────────────┼──────────────┼──────────┤
│         TOTAL                                                                         54.7MB    100.0% │
╰────────┴────────────────────────────────────────────────────────────────────────┴──────────────┴──────────╯

✓ Analysis complete!
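
The Percentage column in the Architecture Summary is each architecture's share of the combined size. As a rough check, the sketch below recomputes those shares from the rounded per-architecture totals shown above. The function name `arch_percentages` is made up for this illustration and is not part of cubloaty. Because the inputs are already rounded to 0.1MB, their sum (296.9) differs slightly from the reported TOTAL (296.8), but the rounded percentages still agree with the table.

```python
# Per-architecture totals in MB, copied from the Architecture Summary table.
arch_sizes_mb = {
    "SM_100": 55.2,
    "SM_120": 78.5,
    "SM_80": 54.3,
    "SM_86": 54.2,
    "SM_90": 54.7,
}


def arch_percentages(sizes):
    """Return each architecture's share of the combined size, in percent."""
    total = sum(sizes.values())
    return {arch: round(100.0 * size / total, 1) for arch, size in sizes.items()}


shares = arch_percentages(arch_sizes_mb)
for arch, pct in shares.items():
    print(f"{arch:<8}{arch_sizes_mb[arch]:>8.1f}MB{pct:>8.1f}%")
```

In this report SM_120 accounts for roughly a quarter of the device code, while the other four architectures are close to evenly split.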

Features

  • 📊 Multi-architecture analysis - See kernel sizes across sm_70, sm_80, sm_90, etc.
  • 🔍 Regex filtering - Filter kernels by name pattern
  • 📦 Multiple formats - .so libraries and standalone .cubin files
  • 🎨 Rich output - Beautiful tables or JSON for scripting
  • Fast - Analyzes binaries in seconds

Dependencies

Cubloaty requires the following tools to be installed and available in your PATH:

  • CUDA Toolkit - for cuobjdump (part of the CUDA installation)
  • binutils - for objdump, objcopy, and readelf
  • gcc/g++ - for c++filt (symbol demangling); note that on most Linux distributions c++filt itself ships in the binutils package

On Ubuntu/Debian:

sudo apt-get install binutils gcc

CUDA Toolkit can be downloaded from NVIDIA's website.
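
Before running cubloaty, it can help to confirm that the external tools are actually reachable. The following is a minimal sketch using Python's standard `shutil.which`; the helper name `missing_tools` is hypothetical and this check is not something cubloaty itself is documented to provide.

```python
import shutil

# Tool names taken from the dependency list above.
REQUIRED_TOOLS = ["cuobjdump", "objdump", "objcopy", "readelf", "c++filt"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

If cuobjdump is reported missing, the CUDA Toolkit's bin directory is probably not on PATH.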

Installation

# From the root of a cloned repository
pip install -e .

Or install directly from git:

pip install git+https://github.com/flashinfer-ai/cubloaty.git

Usage

Analyze a shared library

cubloaty libmykernel.so

Analyze a cubin file

cubloaty kernel.sm_90.cubin

Show top 50 kernels

cubloaty libmykernel.so --top 50

Filter by architecture

cubloaty libmykernel.so --arch sm_90

Filter kernels by name (regex)

# Find all GEMM kernels
cubloaty libmykernel.so --filter "gemm"

# Find attention-related kernels
cubloaty libmykernel.so --filter "attention|flash"
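
To make the filter semantics concrete, here is a small sketch of case-insensitive regex filtering over kernel names. It assumes the pattern may match anywhere in the name (`re.search`), which is a guess about cubloaty's behaviour rather than something the documentation states. The kernel names are invented for illustration.

```python
import re


def filter_kernels(names, pattern):
    """Keep names that contain a case-insensitive match for the pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [name for name in names if rx.search(name)]


# Hypothetical kernel names, not taken from any real library.
names = [
    "void sgemm_kernel<128>(float*)",
    "void FlashAttentionFwd(half*)",
    "void reduce_sum(float*)",
]

print(filter_kernels(names, "gemm"))
print(filter_kernels(names, "attention|flash"))
```

Under this assumption, `attention|flash` selects any kernel whose name contains either word regardless of case.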

Output as JSON

cubloaty libmykernel.so --format json > analysis.json

Show full kernel names without truncation

cubloaty libmykernel.so --full-names

Combine filters

# Show top 20 GEMM kernels for sm_90 in JSON format
cubloaty lib.so --arch sm_90 --filter "gemm" --top 20 --format json

Advanced Examples

Compare kernel sizes across architectures

# Show per-architecture breakdown
cubloaty libmykernel.so --verbose

Find the largest kernels

# Show just the top 10
cubloaty libmykernel.so --top 10

Export for further analysis

# Get JSON output and process with jq
cubloaty lib.so --format json | jq '.kernels[] | select(.size > 100000)'
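
The same selection can be done in Python when jq is not available. This sketch assumes the JSON has a top-level `kernels` list whose entries carry a numeric `size`, as the jq expression above implies. The `name` field and the sample data are assumptions made for illustration, so check them against real output.

```python
import json


def large_kernels(report, min_size=100_000):
    """Return kernel entries larger than min_size bytes, largest first."""
    kernels = [k for k in report.get("kernels", []) if k.get("size", 0) > min_size]
    return sorted(kernels, key=lambda k: k["size"], reverse=True)


# Hypothetical sample with the assumed shape; real output may carry more fields.
sample = json.loads(
    '{"kernels": ['
    '{"name": "a", "size": 250000}, '
    '{"name": "b", "size": 4096}, '
    '{"name": "c", "size": 120000}]}'
)

for kernel in large_kernels(sample):
    print(kernel.get("name", "<unnamed>"), kernel["size"])
```

For a real run, replace `sample` with `json.load(open("analysis.json"))` after writing the file with `--format json`.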

Options

  file                   Path to .so or .cubin file to analyze
  --top N, -n N          Show top N kernels (default: 30)
  --arch ARCH, -a ARCH   Filter by architecture (e.g., sm_90, sm_80)
  --filter REGEX, -r     Filter kernel names by regex (case-insensitive)
  --format {table,json}  Output format (default: table)
  --full-names           Show full kernel names without truncation
  --no-color             Disable colored output
  --verbose, -v          Show detailed processing information
  --version              Show version number

How It Works

Cubloaty extracts CUDA fatbinary sections from shared libraries using objdump and objcopy, then uses cuobjdump to extract individual cubins for each architecture. It analyzes each cubin with readelf to extract kernel symbols and their sizes, and uses c++filt to demangle C++ symbol names.
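
The symbol-size step can be sketched as follows. This is not cubloaty's actual code: it is a minimal illustration that assumes the default column layout of `readelf -sW` (Num, Value, Size, Type, Bind, Vis, Ndx, Name) and treats the last field as the symbol name. The function name and the sample text are invented. readelf may print large sizes in hexadecimal with a `0x` prefix, so the size is parsed with base auto-detection.

```python
def parse_readelf_symbols(text):
    """Map FUNC symbol names to sizes from `readelf -sW` style output."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        # Symbol rows start with an index such as "12:"; skip headers,
        # blank lines, and rows without a name.
        if len(fields) < 8 or not fields[0].endswith(":"):
            continue
        if not fields[0][:-1].isdigit() or fields[3] != "FUNC":
            continue
        size = int(fields[2], 0)  # base 0 accepts both "384" and "0x1a2b3"
        name = fields[-1]
        sizes[name] = sizes.get(name, 0) + size
    return sizes


# Hypothetical readelf output for a cubin with two kernels.
SAMPLE = """\
Symbol table '.symtab' contains 4 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 SECTION LOCAL  DEFAULT    1 .text._Z3addPfS_
     2: 0000000000000000   384 FUNC    GLOBAL DEFAULT    1 _Z3addPfS_
     3: 0000000000000000 0x1a2b3 FUNC  GLOBAL DEFAULT    2 _Z5scalePff
"""

symbols = parse_readelf_symbols(SAMPLE)
for name, size in sorted(symbols.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{size:>8}  {name}")
```

The names here are still mangled. In the pipeline described above they would be passed through c++filt before being reported.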

Contributing

Issues and pull requests are welcome!

