Lightning Core: macOS-first CUDA-style runtime with Metal backend

These details have been verified by PyPI

Maintainers

fhzk1022

These details have not been verified by PyPI

Project description

1. Title

Lightning Core: Metal-First Runtime for Attention, MatMul, and Fused Inference Pipelines

2. Badges

3. One-line Summary

Lightning Core is a macOS-first, Metal-backed runtime that exposes low-level control (resident IO, policy routing, fused paths) through simple Python APIs.

4. Abstract

Lightning Core targets high-iteration experimentation on Apple Silicon by combining:

custom C++ kernels and runtime scheduling,
Metal fastpaths with CPU fallback/crossover,
pybind-based Python APIs for rapid operator and pipeline testing.

The project is positioned between a research runtime and a production-oriented operator engine. It emphasizes repeatable benchmarking, explicit execution policy control, and practical end-to-end pipeline composition (conv -> attention, FFN, LN -> projection).

5. Motivation / Problem Statement

Most deep-learning tooling assumes CUDA-first execution, while many practical local environments are macOS + Apple Silicon. This creates a gap:

kernel-level optimization ideas are hard to test quickly on macOS,
launch/memory overhead dominates small and repeated workloads,
framework-level abstractions can hide runtime policy decisions.

Lightning Core addresses this by exposing runtime scheduling primitives and fastpaths directly.

6. Key Idea

Treat execution policy as a first-class runtime object:

choose upload/download/sync behavior per call,
persist resident sessions for repeated loops,
auto-tune per-shape kernel/mode choices,
fuse where useful, and fall back or cross over to the CPU when launch overhead dominates.
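
To make the idea concrete, here is a minimal sketch of a policy carried as a plain value. The ExecPolicy class, its field names, and choose_device are hypothetical illustrations and are not part of the lightning_core API; the 4096-element threshold is an arbitrary placeholder.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExecPolicy:
    """Hypothetical per-call policy object; not a lightning_core type."""
    upload: bool = True      # copy inputs host -> device before launch
    download: bool = True    # copy outputs device -> host after launch
    sync: bool = True        # block until the device has finished
    device: str = "metal"    # "metal" or "cpu"


def choose_device(policy: ExecPolicy, num_elements: int, crossover: int = 4096) -> ExecPolicy:
    """Route tiny workloads to the CPU, where launch overhead would dominate."""
    if num_elements < crossover:
        return replace(policy, device="cpu")
    return policy


# One-shot call: upload, run, sync, download.
one_shot = ExecPolicy()
# Steady-state loop step: data is already resident, so transfers are skipped.
resident_step = replace(one_shot, upload=False, download=False, sync=False)

print(choose_device(one_shot, 8 * 16).device)        # cpu
print(choose_device(one_shot, 1024 * 1024).device)   # metal
```

Keeping the policy immutable means a loop can derive a resident variant from a one-shot policy without mutating shared state.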

7. Contributions

Metal-first runtime for selected tensor/ops and attention workloads.
Resident execution model for amortizing transfer/sync overhead.
Auto-tuned matmul and attention mode selection with persisted cache.
High-level integrated APIs for conv/attention pipeline composition.
Python-friendly convenience APIs (matmul2d, attention2d, tensor constructors) without changing fast-path kernels.
Benchmark harnesses and reproducibility artifacts.

8. System Architecture

flowchart LR
    A[Python User API] --> B[pybind Bindings]
    B --> C[Runtime Core]
    C --> D[Ops Layer: matmul/conv/attention/vector]
    D --> E[Metal Fastpath]
    D --> F[CPU Fallback/Crossover]
    C --> G[Policy Engine]
    C --> H[Auto-tune Cache]
    C --> I[Resident Sessions]

9. Execution Model

sequenceDiagram
    participant P as Python Caller
    participant R as Runtime
    participant M as Metal Queue

    P->>R: op(input, policy)
    R->>R: validate + pick path (tuned/policy)
    alt resident start
        R->>M: upload once + launch
    else resident run
        R->>M: launch only
    else finish/sync
        R->>M: launch + sync + optional download
    end
    M-->>R: completion
    R-->>P: output/status
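
The amortization behind the resident phases can be sketched with a toy cost model. The function names and the millisecond costs below are illustrative placeholders, not measurements or lightning_core APIs; the model simply assumes that a one-shot call pays for transfer and sync every time, while a resident loop pays for them once.

```python
def one_shot_total(n_iters, upload_ms, launch_ms, sync_ms, download_ms):
    """Every call pays upload + launch + sync + download."""
    return n_iters * (upload_ms + launch_ms + sync_ms + download_ms)


def resident_total(n_iters, upload_ms, launch_ms, sync_ms, download_ms):
    """Upload once, launch n_iters times, then sync and download once."""
    return upload_ms + n_iters * launch_ms + sync_ms + download_ms


# Placeholder costs in milliseconds (illustrative, not measured).
costs = dict(upload_ms=0.20, launch_ms=0.05, sync_ms=0.10, download_ms=0.20)

for n in (1, 8, 64):
    a = one_shot_total(n, **costs)
    b = resident_total(n, **costs)
    print(f"n={n:3d}  one-shot={a:6.2f} ms  resident={b:6.2f} ms  ratio={a / b:4.2f}x")
```

With a single iteration the two totals are identical; the gap grows with the number of repeated launches, which is the regime resident sessions are meant for.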

10. Fused Pipeline Design

flowchart TD
    X[Conv2D + ReLU] --> Y[QKV Arrange]
    Y --> Z[Attention]
    Z --> W[Projection / Next Block]

    subgraph Optimization Axes
      A1[Into buffers]
      A2[Resident reuse]
      A3[Fused or split policy]
      A4[Auto-tuned mode]
    end

    A1 --> X
    A2 --> X
    A2 --> Z
    A3 --> Y
    A4 --> Z

11. Benchmark Setup

Latest README snapshot setup (local run, 2026-03-30):

Device: Apple Silicon macOS (Metal enabled)
Runtime: lightning_core editable build
Torch: 2.11.0 (MPS available)
Bench suites:
- ai_model_all_bench.py
- ml_all_bench.py
- dl_all_bench.py

12. Benchmark Results

Representative rows from the latest local run:

Suite	Case	Lightning Core (ms)	Torch MPS (ms)	Best vs MPS (speedup)
Kernel	attention `seq=8,dim=16`	0.00083	0.26909	324.33x
Kernel	conv `1x3x16x16 -> 16, k=3`	0.18535	0.25783	1.39x
Kernel	gemm `1024^3`	0.16867	0.81028	4.80x
Pipeline	conv->attn `seq=192,dim=48`	0.43108	0.42548	1.02x (best path)
Pipeline	FFN `batch=1024`	0.37234	1.45200	3.90x
Pipeline	LN->Proj `batch=2048`	0.31055	1.21542	3.91x
ML	Linear `4096x1024 -> 1024`	0.63256	3.89491	6.16x
DL Sweep	GEMM `4096x1024x4096`	2.46957	9.74384	3.95x

Snapshot speedup visualization (ours_best_vs_mps, higher is better):

xychart-beta
    title "Lightning Core Speedup vs Torch MPS (2026-03-30)"
    x-axis ["Attn Micro Avg","Conv Avg","GEMM Avg","Conv->Attn Avg","FFN Avg","LN->Proj Avg","ML Avg","DL Avg"]
    y-axis "x Faster" 0 --> 260
    bar [247.31,1.14,6.23,1.05,3.85,3.76,12.91,3.30]

13. Key Findings / Insights

Tiny attention shapes are launch-bound on the GPU, so crossover/routing is critical.
Resident execution strongly improves throughput for repeated matmul and pipeline calls.
For large GEMM, tuned mode selection avoids one-size-fits-all regressions.
The "best path" metric is useful because the integrated and direct paths can each win on different shapes.

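Why tiny shapes are launch-bound can be seen with a simple linear cost model: the GPU pays a fixed launch cost plus a per-element cost, while the CPU pays only a (larger) per-element cost. The sketch below solves for the size where the two are equal. All rates and the launch cost are made-up placeholders, and the model is an assumption for illustration, not the routing rule the runtime uses.

```python
def gpu_ms(n, launch_ms, gpu_elems_per_ms):
    """GPU latency: fixed launch overhead plus per-element work."""
    return launch_ms + n / gpu_elems_per_ms


def cpu_ms(n, cpu_elems_per_ms):
    """CPU latency: per-element work only."""
    return n / cpu_elems_per_ms


def crossover_size(launch_ms, cpu_elems_per_ms, gpu_elems_per_ms):
    """Size n* where both paths cost the same: n/cpu = launch + n/gpu."""
    if gpu_elems_per_ms <= cpu_elems_per_ms:
        raise ValueError("GPU must be faster per element for a crossover to exist")
    return launch_ms / (1.0 / cpu_elems_per_ms - 1.0 / gpu_elems_per_ms)


# Placeholder numbers (illustrative only).
launch, cpu_rate, gpu_rate = 0.25, 1_000.0, 20_000.0
n_star = crossover_size(launch, cpu_rate, gpu_rate)
print(f"crossover at roughly {n_star:.0f} elements")
print("n=128   ->", "cpu" if cpu_ms(128, cpu_rate) < gpu_ms(128, launch, gpu_rate) else "gpu")
print("n=10000 ->", "cpu" if cpu_ms(10_000, cpu_rate) < gpu_ms(10_000, launch, gpu_rate) else "gpu")
```

Below the crossover size the fixed launch cost outweighs the GPU's throughput advantage, which is the situation the first finding describes.
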
14. Limitations

Scope is selective operators/pipelines, not a full DL framework.
Performance can vary by thermals, OS/driver version, and benchmark ordering.
Some APIs are still low-level by design.
Parity with full multi-head transformer frameworks is not yet a goal.

15. Future Work

Expanded fused kernels (attention and projection blocks).
Better mixed-precision controls and calibration tooling.
Broader operator coverage and shape-specialized kernels.
More stable cross-device benchmark CI baselines.

16. Installation

From PyPI:

python -m pip install -U lightning-core

From source:

git clone https://github.com/wnsgus00114-droid/lightning-core.git
cd lightning-core
python -m pip install .

17. Quick Start

import numpy as np
import lightning_core as lc

print("backend:", lc.backend_name())

a = np.random.rand(128, 256).astype(np.float32)
b = np.random.rand(256, 64).astype(np.float32)
y = lc.matmul2d(a, b, "metal")
print(y.shape)

18. Core API Overview

Core categories:

Runtime: backend_name, metal_available, cuda_available
Tensor: Tensor, Tensor64, TensorView
Ops: matmul/conv/vector/matrix (+ resident sessions)
Attention: forward/train + policy + session
Integrated: high-level conv/attention pipeline APIs
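
The runtime probes listed above can drive device selection. The sketch below assumes metal_available and cuda_available are zero-argument callables returning a bool; that signature is an assumption, and pick_device is a hypothetical helper, not a lightning_core function. A stand-in object is used so the example runs without the package installed.

```python
from types import SimpleNamespace


def pick_device(runtime) -> str:
    """Prefer Metal, then CUDA, else CPU.

    Assumes the runtime object exposes zero-argument, bool-returning
    metal_available() and cuda_available() probes.
    """
    if runtime.metal_available():
        return "metal"
    if runtime.cuda_available():
        return "cuda"
    return "cpu"


# Stand-in for the real module so the sketch is runnable anywhere.
fake_runtime = SimpleNamespace(metal_available=lambda: False, cuda_available=lambda: False)
print(pick_device(fake_runtime))  # cpu

# With the real package (assuming the probes are callable as sketched):
# import lightning_core as lc
# device = pick_device(lc)
```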

19. Input Rules

Use float32 NumPy arrays for fast paths.
Prefer contiguous arrays (np.ascontiguousarray).
For *_into APIs, output buffer shape must exactly match expected shape.
Device string must be one of: "metal", "cpu", "cuda" (if available).
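
These rules can be enforced before a call with a few NumPy-only helpers. The helper names below are hypothetical and not part of lightning_core; the dtype and contiguity check on the output buffer goes slightly beyond the stated shape rule and is an extra precaution assumed here.

```python
import numpy as np

VALID_DEVICES = ("metal", "cpu", "cuda")


def prepare_input(x) -> np.ndarray:
    """Return a C-contiguous float32 array, copying only when needed."""
    return np.ascontiguousarray(x, dtype=np.float32)


def check_out_buffer(out: np.ndarray, expected_shape) -> None:
    """Raise if an *_into output buffer does not match what the call expects."""
    if out.shape != tuple(expected_shape):
        raise ValueError(f"output shape {out.shape} != expected {tuple(expected_shape)}")
    if out.dtype != np.float32 or not out.flags["C_CONTIGUOUS"]:
        raise ValueError("output buffer should be contiguous float32")


def check_device(device: str) -> str:
    """Raise on an unknown device string, otherwise return it unchanged."""
    if device not in VALID_DEVICES:
        raise ValueError(f"device must be one of {VALID_DEVICES}, got {device!r}")
    return device


# A transposed float64 view is neither float32 nor C-contiguous.
a = prepare_input(np.arange(12, dtype=np.float64).reshape(3, 4).T)  # shape (4, 3)
b = prepare_input(np.ones((3, 5)))                                  # shape (3, 5)
out = np.empty((a.shape[0], b.shape[1]), dtype=np.float32)
check_out_buffer(out, (4, 5))
print(a.dtype, a.flags["C_CONTIGUOUS"], out.shape, check_device("metal"))
```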

20. MatMul Usage

import numpy as np
import lightning_core as lc

a = np.random.rand(512, 1024).astype(np.float32)
b = np.random.rand(1024, 512).astype(np.float32)

# easy API (shape inferred)
out = lc.matmul2d(a, b, "metal")

# into API (avoid re-allocation)
out2 = np.empty((512, 512), dtype=np.float32)
lc.matmul2d_into(a, b, out2, "metal")

21. Attention Usage

import numpy as np
import lightning_core as lc

q = np.random.rand(8, 16).astype(np.float32)
k = np.random.rand(8, 16).astype(np.float32)
v = np.random.rand(8, 16).astype(np.float32)

out = lc.attention2d(q, k, v, False, "metal")
out_into = np.empty_like(q)
lc.attention2d_into(q, k, v, out_into, False, "metal")

22. Convolution Usage

import numpy as np
import lightning_core as lc

x = np.random.rand(1, 3, 16, 16).astype(np.float32)
w = np.random.rand(16, 3, 3, 3).astype(np.float32)
b = np.random.rand(16).astype(np.float32)

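# The four integers presumably map to stride_h, stride_w, pad_h, pad_w
# (inferred from the keyword names of lc.api.conv_relu_nchw; check the bindings).
y = lc.conv2d_nchw(x, w, b, 1, 1, 1, 1, "metal")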

23. Resident Blocks

Resident sessions reduce repeated IO/sync overhead:

import numpy as np
import lightning_core as lc

a = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024, 1024).astype(np.float32)
out = np.empty((1024, 1024), dtype=np.float32)

sess = lc.matmul2d_resident_session(a, b)
sess.start_into(a, b, out)
sess.run_batch_sync_no_download_into(a, b, out, 8)

24. Pipeline Usage

Integrated APIs are exposed in both lightning_core (legacy-prefixed names) and lightning_core.api (clean names).

import numpy as np
import lightning_core as lc

x = np.random.rand(1, 3, 8, 8).astype(np.float32)
w = np.random.rand(16, 3, 3, 3).astype(np.float32)
b = np.random.rand(16).astype(np.float32)

# High-level conv+relu
y = lc.api.conv_relu_nchw(x, w, b, stride_h=1, stride_w=1, pad_h=1, pad_w=1, device="metal")

# Integrated conv->attention path
seq_len, head_dim = 96, 48
z = lc.api.conv_attention_torchstrong_nchw(
    x, w, b, seq_len=seq_len, head_dim=head_dim, stride_h=1, stride_w=1, pad_h=1, pad_w=1, device="metal"
)
print(y.shape, z.shape)

Typical optimization pattern:

use *_into to reuse preallocated output buffers,
keep data contiguous in float32,
avoid host/device round-trips between connected blocks.

25. Performance Tips

Reuse output buffers with *_into APIs.
Use resident sessions for repeated loops.
Keep inputs contiguous float32.
Separate one-shot latency benchmarks and steady-state throughput benchmarks.
Warm up before measurement.

26. API Examples

More examples:

docs/quickstart.md
docs/advanced.md
examples/ and benchmark source files under benchmarks/

27. Benchmark Overview

Lightning Core includes:

Native C++ benchmark binaries in benchmarks/
Python benchmark scripts in the benchmark workspace (when available)
CSV/JSON artifacts for reproducibility and comparison

28. Benchmark Directory Structure

benchmarks/
  bench_attention.cpp
  bench_vector_add.cpp
  bench_matmul.cpp
  bench_matrix_ops.cpp
  bench_transformer.cpp
  bench_lstm_rnn.cpp
  bench_cnn_dnn.cpp
  bench_vlm.cpp
  sweep_matrix_ops.sh
  large_gemm_auto_sweep.py
  generate_cross_suite_summary.py

Workspace-level scripts used for the README snapshot (kept outside the repository root in this environment):

ai_model_all_bench.py
ml_all_bench.py
dl_all_bench.py

29. How to Run Benchmarks

Native C++ benchmarks:

cmake -S . -B build -DCJ_ENABLE_METAL=ON -DCJ_BUILD_BENCHMARKS=ON -DCJ_BUILD_PYTHON=ON
cmake --build build -j

./build/benchmarks/bench_vector_add
./build/benchmarks/bench_attention
./build/benchmarks/bench_matmul
./build/benchmarks/bench_matrix_ops
./build/benchmarks/bench_transformer
./build/benchmarks/bench_lstm_rnn
./build/benchmarks/bench_cnn_dnn
./build/benchmarks/bench_vlm

Workspace Python benchmark harness (if present):

python ../ai_model_all_bench.py
python ../ml_all_bench.py
python ../dl_all_bench.py

30. Benchmark Output Files

Typical outputs:

benchmark_results/kernel_bench.csv
benchmark_results/pipeline_bench.csv
benchmark_results/ml_all_bench.csv
benchmark_results/large_gemm_auto_sweep.csv
corresponding .json files for each suite

Native build outputs can also appear under build/benchmarks/*.csv.

31. How to Read the Results

Common columns:

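lightning_core_ms: Lightning Core latency in milliseconds
torch_mps_ms: Torch MPS latency in milliseconds
integrated_api_ms: latency of the higher-level integrated API in milliseconds
ours_best_vs_mps: speedup of the faster Lightning Core path (direct or integrated) over Torch MPS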
- > 1.0: ours is faster
- < 1.0: Torch MPS is faster
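
As a worked example, the speedup column can be recomputed from the latency columns. The formula below (Torch MPS latency divided by the faster of the two Lightning Core latencies) is an assumed reading of the column description, and the function is a hypothetical helper rather than part of the benchmark scripts.

```python
def ours_best_vs_mps(lightning_core_ms, torch_mps_ms, integrated_api_ms=None):
    """Torch MPS latency divided by the faster Lightning Core path (assumed definition)."""
    ours = [t for t in (lightning_core_ms, integrated_api_ms) if t is not None]
    return torch_mps_ms / min(ours)


# gemm 1024^3 row from the results table (no integrated latency is listed there).
print(round(ours_best_vs_mps(0.16867, 0.81028), 2))  # 4.8

# FFN batch=1024 row from the same table.
print(round(ours_best_vs_mps(0.37234, 1.45200), 2))  # 3.9
```

Both values agree with the 4.80x and 3.90x entries reported in the table.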

32. Reproducing README Numbers

Numbers in this README were refreshed on 2026-03-30 with:

# from workspace root (recommended in this repo layout)
python ai_model_all_bench.py
python ml_all_bench.py
python dl_all_bench.py

Alternative (from lightning-core/ directory):

python ../ai_model_all_bench.py
python ../ml_all_bench.py
python ../dl_all_bench.py

The results were then checked by scanning the ours_best_vs_mps column in:

benchmark_results/kernel_bench.csv
benchmark_results/pipeline_bench.csv
benchmark_results/ml_all_bench.csv
benchmark_results/large_gemm_auto_sweep.csv
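
A scan of this kind can be scripted with the standard csv module. The sketch below assumes the files are comma-separated with a header row containing ours_best_vs_mps; the "case" column name and the sample rows are made up so the example runs without the artifacts.

```python
import csv
import io


def slower_than_mps(csv_file, column="ours_best_vs_mps"):
    """Return rows whose speedup is below 1.0, i.e. where Torch MPS was faster."""
    rows = []
    for row in csv.DictReader(csv_file):
        value = row.get(column)
        if value in (None, ""):
            continue  # skip rows that do not carry the metric
        if float(value) < 1.0:
            rows.append(row)
    return rows


# In-memory stand-in for a results file; the second row is hypothetical.
sample = io.StringIO(
    "case,lightning_core_ms,torch_mps_ms,ours_best_vs_mps\n"
    "gemm_1024,0.16867,0.81028,4.80\n"
    "hypothetical_case,1.00000,0.97000,0.97\n"
)
for row in slower_than_mps(sample):
    print(row["case"], row["ours_best_vs_mps"])

# Against a real artifact from the list above:
# with open("benchmark_results/kernel_bench.csv", newline="") as f:
#     print(slower_than_mps(f))
```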

33. Benchmark Methodology Notes

Warmup iterations are used before timed iterations.
Some suites run repeated trials and report the median or another robust central estimate.
Thermal state can affect absolute numbers; compare relative metrics and rerun if needed.
For fair comparison, synchronize MPS paths and separate one-shot vs resident scenarios.
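
A minimal timing helper following these notes might look like the sketch below. It is not the harness used by the benchmark scripts; bench_ms and its parameters are hypothetical, and the optional sync callback stands in for whatever device synchronization the measured path needs.

```python
import statistics
import time


def bench_ms(fn, warmup=5, iters=20, sync=None):
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # untimed warmup iterations
    if sync is not None:
        sync()  # drain pending device work before timing starts
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # include device completion in the timed region
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def work():
    """CPU-only placeholder workload."""
    return sum(i * i for i in range(10_000))


print(f"{bench_ms(work):.4f} ms")
```

For a Torch MPS path, passing sync=torch.mps.synchronize keeps asynchronous dispatch from making the GPU look faster than it is.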

34. Repository Structure

include/lightning_core/         # public wrapper headers
include/lightning_core/core/    # canonical core headers
include/cudajun/                # compatibility forwarding headers
src/                            # runtime + op implementations
python/bindings/                # pybind11 bindings
benchmarks/                     # native benchmark sources/scripts
tests/                          # C++ unit tests
docs/                           # quickstart/advanced/contributor docs

35. Roadmap

Expand fused operator coverage for end-to-end pipelines.
Improve precision/performance control surfaces.
Harden reproducibility tooling and CI benchmarking.
Continue Python ergonomics while preserving low-level controls.

36. Citation

If you use Lightning Core in research, please cite it as software:

@software{lightning_core,
  title = {Lightning Core: Metal-First Runtime for Attention and Fused Pipelines},
  author = {Beak, JunHyeon},
  year = {2026},
  url = {https://github.com/wnsgus00114-droid/lightning-core}
}

37. License

This project is licensed under Kwangwoon University License 1.0 (KWU-1.0). See LICENSE.

38. Contributing

Please read:

docs/contributor.md

General flow:

Open an issue/discussion for major changes.
Keep PRs focused and benchmark-backed when performance-sensitive.
Include reproduction steps for behavior/perf changes.

39. Project Status

Active development (beta).

Lightning Core is stable enough for experimentation and benchmarking, while APIs and internals continue to evolve quickly.

Release history

Version	Release date
0.5.7	Apr 15, 2026
0.5.6	Apr 14, 2026
0.4.7	Apr 12, 2026
0.3.4	Apr 11, 2026
0.2.17	Apr 10, 2026
0.2.8	Apr 9, 2026
0.2.7	Apr 9, 2026
0.2.2	Apr 9, 2026
0.1.32	Apr 8, 2026
0.1.27	Apr 8, 2026
0.1.22	Apr 8, 2026
0.1.20	Apr 8, 2026
0.1.19	Apr 8, 2026
0.1.18	Apr 8, 2026
0.1.17	Apr 7, 2026
0.1.16	Apr 2, 2026
0.1.15	Apr 2, 2026
0.1.14	Apr 1, 2026
0.1.13	Apr 1, 2026
0.1.10	Mar 31, 2026
0.1.9	Mar 31, 2026
0.1.8	Mar 31, 2026
0.1.7	Mar 31, 2026
0.1.6	Mar 30, 2026
0.1.5	Mar 29, 2026
0.1.4 (this version)	Mar 29, 2026
0.1.3	Mar 28, 2026
0.1.2	Mar 28, 2026

Download files

Download the file for your platform.

Source Distribution

lightning_core-0.1.4.tar.gz (116.8 kB)

Uploaded Mar 29, 2026. Tag: Source

Built Distribution

lightning_core-0.1.4-cp312-cp312-macosx_14_0_arm64.whl (267.7 kB)

Uploaded Mar 29, 2026. Tags: CPython 3.12, macOS 14.0+ ARM64

File details

Details for the file lightning_core-0.1.4.tar.gz.

File metadata

Download URL: lightning_core-0.1.4.tar.gz
Upload date: Mar 29, 2026
Size: 116.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lightning_core-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`deac7edc273df61335501c2add98bc2a9b8ef08143bd6f92f88c15312115a4a6`
MD5	`06ecb26965864793e5c85623c62ab432`
BLAKE2b-256	`9fa39ad02cbb370398a4b2a11baa3c1fcc4336af1f4619b1691ba5ec19147f60`
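
A downloaded file can be checked against the SHA256 digest above with the standard hashlib module. The helper names are hypothetical; the self-check uses a temporary file containing b"abc", whose SHA-256 digest is a well-known test vector, so the sketch runs without downloading anything.

```python
import hashlib
import os
import tempfile

# SHA256 digest of lightning_core-0.1.4.tar.gz, copied from the table above.
SDIST_SHA256 = "deac7edc273df61335501c2add98bc2a9b8ef08143bd6f92f88c15312115a4a6"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True when the file's SHA-256 digest equals the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Self-check on a temporary file with a known digest (SHA-256 of b"abc").
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_path = tmp.name
print(matches(demo_path, "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"))
os.remove(demo_path)

# After downloading the source distribution:
# print(matches("lightning_core-0.1.4.tar.gz", SDIST_SHA256))
```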

Provenance

The following attestation bundles were made for lightning_core-0.1.4.tar.gz:

Publisher: python-wheel-publish.yml on wnsgus00114-droid/lightning-core

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lightning_core-0.1.4.tar.gz
- Subject digest: deac7edc273df61335501c2add98bc2a9b8ef08143bd6f92f88c15312115a4a6
- Sigstore transparency entry: 1195243539
- Sigstore integration time: Mar 29, 2026
Source repository:
- Permalink: wnsgus00114-droid/lightning-core@1d16c7a660d08a539a775c9a5060cb0a65d3981a
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/wnsgus00114-droid
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-wheel-publish.yml@1d16c7a660d08a539a775c9a5060cb0a65d3981a
- Trigger Event: push

File details

Details for the file lightning_core-0.1.4-cp312-cp312-macosx_14_0_arm64.whl.

File metadata

Download URL: lightning_core-0.1.4-cp312-cp312-macosx_14_0_arm64.whl
Upload date: Mar 29, 2026
Size: 267.7 kB
Tags: CPython 3.12, macOS 14.0+ ARM64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for lightning_core-0.1.4-cp312-cp312-macosx_14_0_arm64.whl
Algorithm	Hash digest
SHA256	`9cff1aafe017cffdf3dc577f3b55de5fcf8a7527f9d1d3603cc19d6eb3db2e1c`
MD5	`e5e63813df5d654f5ed35591775e55eb`
BLAKE2b-256	`ba02c3f4f309919819df698b22f34a7006e9bcb22fd1581fe7ce1324c9833979`

Provenance

The following attestation bundles were made for lightning_core-0.1.4-cp312-cp312-macosx_14_0_arm64.whl:

Publisher: python-wheel-publish.yml on wnsgus00114-droid/lightning-core

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lightning_core-0.1.4-cp312-cp312-macosx_14_0_arm64.whl
- Subject digest: 9cff1aafe017cffdf3dc577f3b55de5fcf8a7527f9d1d3603cc19d6eb3db2e1c
- Sigstore transparency entry: 1195243557
- Sigstore integration time: Mar 29, 2026
Source repository:
- Permalink: wnsgus00114-droid/lightning-core@1d16c7a660d08a539a775c9a5060cb0a65d3981a
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/wnsgus00114-droid
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-wheel-publish.yml@1d16c7a660d08a539a775c9a5060cb0a65d3981a
- Trigger Event: push
