pyfwht - Fast Walsh-Hadamard Transform for Python
Python bindings for the high-performance libfwht library, providing the fast Walsh-Hadamard transform (FWHT) with NumPy integration and CPU (SIMD), OpenMP, and CUDA backends.
Features
- Zero-copy NumPy integration: Direct operation on NumPy arrays without data copying
- Multiple backends: Automatic selection or explicit choice of CPU (SIMD), OpenMP, or GPU (CUDA)
- Multi-precision GPU: fp64 (cryptographic), fp32 (balanced), fp16 (maximum speed, up to 54× faster with PyTorch DLPack)
- All data types: Support for `int8`, `int32`, and `float64` with overflow protection
- Boolean function analysis: Convenience functions for cryptographic applications
- Bit-packed Boolean WHT: Compute the WHT directly from `uint64`-packed truth tables via `fwht.boolean_packed()`; set `backend=fwht.Backend.GPU` to expand on the device for n ≤ 65536 (see the sketch after this list)
- GPU-resident Boolean contexts: Repeated `fwht.boolean_packed(..., backend=fwht.Backend.GPU)` calls automatically reuse CUDA buffers via `fwht_gpu_boolean_context_*`, eliminating per-call allocations in S-box workloads
- High performance:
- GPU fp16: Up to 1115 GOps/s on RTX 4090 with PyTorch DLPack (zero-copy, exceeds Meta by 38%)
- GPU fp32: Up to 625 GOps/s with perfect accuracy
- Tensor Core kernels: n=256, 512, 1024, 2048, 4096, 8192, 16384, 32768 (Meta-inspired implementation)
- DLPack support: 81× faster than NumPy for fp16 batch operations
- Recursive cache-efficient algorithm (512-element L1-optimized base case)
- Task-based OpenMP parallelism (2-3× speedup on 4-8 cores)
- Software prefetching and cache-aligned memory allocation
- SIMD optimization (AVX2/SSE2/NEON auto-detection)
- Easy to use: Pythonic API with comprehensive error handling and numerical documentation
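For the bit-packed Boolean path mentioned above, here is a minimal sketch. The positional signature of `fwht.boolean_packed()` (packed words plus the variable count), the LSB-first bit order, and the little-endian host assumption are all illustrative; check `help(fwht.boolean_packed)` and the `boolean_packed.py` example for the authoritative API.

```python
import numpy as np
import pyfwht as fwht

# Truth table of a 6-variable Boolean function: 2^6 = 64 bits = one uint64 word
n_vars = 6
truth_table = np.random.randint(0, 2, size=2**n_vars, dtype=np.uint8)

# Pack bits LSB-first into uint64 words (bit order and little-endian host
# byte order are assumptions here; see boolean_packed.py for the real layout)
packed = np.packbits(truth_table, bitorder='little').view(np.uint64)

# Assumed positional signature: packed words plus the number of variables
wht = fwht.boolean_packed(packed, n_vars)

# Repeated GPU calls reuse a persistent CUDA Boolean context (see above)
if fwht.has_gpu():
    wht_gpu = fwht.boolean_packed(packed, n_vars, backend=fwht.Backend.GPU)
```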
Installation
Requirements
- Python 3.8+
- NumPy >= 1.20.0
- C99 compiler (gcc, clang, msvc)
- Optional: OpenMP-capable compiler for multi-threading
- Optional: CUDA toolkit (nvcc) for GPU support
From PyPI
```bash
# Install (automatically enables CUDA if nvcc is found)
pip install pyfwht

# On Linux, you may need to build from source for CUDA support
pip install pyfwht --no-binary :all:

# Disable CUDA even if available
USE_CUDA=0 pip install pyfwht --no-binary :all:
```
From Source
```bash
git clone https://github.com/hadipourh/fwht
cd fwht/python
pip install -e .              # Auto-detects CUDA if nvcc is available

# To enable GPU features with PyTorch (required for the GPU DLPack API):
pip install -e ".[gpu]"       # Installs torch for DLPack GPU support

# Force CUDA on/off
USE_CUDA=1 pip install -e .   # Force enable (fails if nvcc not found)
USE_CUDA=0 pip install -e .   # Force disable
```
Quick Start
Basic Transform
```python
import numpy as np
import pyfwht as fwht

# Create data (length must be a power of 2)
data = np.array([1, -1, -1, 1, -1, 1, 1, -1], dtype=np.int32)

# In-place transform
fwht.transform(data)
print(data)  # Transformed coefficients
```
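For out-of-place use, `fwht.compute` (shown later in the accuracy section) returns a new array, and the transform inverts itself up to scaling. A minimal sketch, assuming `compute` accepts the same `backend` keyword as `transform`:

```python
import numpy as np
import pyfwht as fwht

x = np.random.randn(1 << 10)

# Out-of-place transform; the backend keyword is assumed to match transform()
y = fwht.compute(x, backend=fwht.Backend.CPU)

# The WHT is self-inverse up to scaling: applying it twice yields n * x
x_back = fwht.compute(y) / len(x)
assert np.allclose(x, x_back)
```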
Boolean Function Analysis
```python
import numpy as np
import pyfwht as fwht

# Example: f(x) for x in {0,1}^4
truth_table = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0], dtype=np.uint8)

# Compute Walsh-Hadamard coefficients
# (signed convention: 0 → +1, 1 → -1)
wht = fwht.from_bool(truth_table, signed=True)

# Find the best linear approximation
n = int(np.log2(len(truth_table)))  # Number of input bits
best_idx = np.argmax(np.abs(wht))
best_wht = wht[best_idx]

# Correlation: cor(f, ℓ_u) = W_f(u) / 2^n
correlation = best_wht / (2**n)

# Bias: ε = W_f(u) / 2^(n+1)
bias = best_wht / (2**(n+1))

print(f"Best linear mask u: {best_idx:0{n}b}")
print(f"WHT coefficient: {best_wht}")
print(f"Correlation: {correlation:.4f}")
print(f"Bias: {bias:.4f}")
print(f"Linear probability: {0.5 + bias:.4f}")
```
Batch Processing: Computing Nonlinearity
Analyze cryptographic properties of Boolean functions:
```python
import numpy as np
import pyfwht as fwht

# Generate 1000 random Boolean functions
num_vars = 8        # Number of input variables
num_functions = 1000
functions = np.random.randint(0, 2, size=(num_functions, 2**num_vars), dtype=np.uint8)

# Compute the nonlinearity of every function
nonlinearities = []
for func in functions:
    # from_bool computes WHT coefficients with the signed convention
    wht = fwht.from_bool(func, signed=True)
    # Nonlinearity: NL(f) = 2^(n-1) - (1/2)·max|W_f(u)|
    # (for an n-variable function, the truth table has length 2^n)
    max_abs_wht = np.max(np.abs(wht))
    nl = 2**(num_vars - 1) - max_abs_wht // 2
    nonlinearities.append(nl)

print(f"Average nonlinearity: {np.mean(nonlinearities):.2f}")
print(f"Max nonlinearity: {max(nonlinearities)}")
print(f"Min nonlinearity: {min(nonlinearities)}")
print(f"Theoretical max for {num_vars}-bit functions: {2**(num_vars-1) - 2**(num_vars//2 - 1)}")
```
Vectorial S-box Analysis
Use the high-level helper to compute Boolean component spectra and full LAT metrics:
```python
import numpy as np
import pyfwht as fwht

# 3-bit identity S-box
table = np.arange(8, dtype=np.uint32)

analysis = fwht.analyze_sbox(
    table,
    backend=fwht.Backend.CPU,
    profile_timings=True,
    return_spectra=True,
    return_lat=True,
)

print("m_bits", analysis.components.m_bits)
print("n_bits", analysis.components.n_bits)
print("max_walsh", analysis.components.max_walsh)
print("lat_max", analysis.lat.lat_max if analysis.lat else None)

# Spectra shape: (n_bits, 2^m)
spectra = analysis.components.spectra

# LAT matrix shape: (2^m, 2^n)
lat_matrix = analysis.lat.lat if analysis.lat else None
```
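Continuing the snippet above, the worst-case component nonlinearity follows directly from the spectrum; this assumes `max_walsh` is the maximum absolute Walsh coefficient over the nontrivial component masks:

```python
# Worst-case component nonlinearity from the Walsh spectrum:
# NL = 2^(m-1) - max|W| / 2
m = analysis.components.m_bits
nl = 2**(m - 1) - analysis.components.max_walsh // 2
print("Worst-case component nonlinearity:", nl)
```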
Performance Comparison: Backend Selection
```python
import numpy as np
import pyfwht as fwht
import time

def benchmark_backends(size, num_repeats=10, num_warmup=2):
    """Compare performance across different backends with statistical rigor."""
    data = np.random.randn(size).astype(np.float64)
    results = {}
    backends = [
        (fwht.Backend.CPU, "CPU (SIMD)"),
        (fwht.Backend.OPENMP, "OpenMP"),
    ]
    if fwht.has_gpu():
        backends.append((fwht.Backend.GPU, "GPU (CUDA)"))
    for backend, name in backends:
        timings = []
        # Warmup runs to stabilize caches and CPU frequency
        for _ in range(num_warmup):
            test_data = data.copy()
            fwht.transform(test_data, backend=backend)
        # Benchmark runs
        for _ in range(num_repeats):
            test_data = data.copy()
            start = time.perf_counter()
            fwht.transform(test_data, backend=backend)
            elapsed = time.perf_counter() - start
            timings.append(elapsed * 1000)  # Convert to ms
        # Statistical analysis
        timings_array = np.array(timings)
        mean_time = np.mean(timings_array)
        std_time = np.std(timings_array)
        median_time = np.median(timings_array)
        min_time = np.min(timings_array)
        # Use the minimum time for throughput (best case, least noise)
        throughput = (size * np.log2(size)) / (min_time / 1000) / 1e9
        results[name] = {
            'mean': mean_time,
            'std': std_time,
            'median': median_time,
            'min': min_time,
            'throughput': throughput,  # GOps/s based on minimum time
        }
    return results

# Test different sizes with sufficient repetitions
print("FWHT Performance Benchmark (Python)")
print("=" * 80)
print("Warmup runs: 2, Benchmark runs: 10 per configuration")
print(f"GPU available: {fwht.has_gpu()}")
print(f"OpenMP available: {fwht.has_openmp()}")
print(f"Version: {fwht.version()}")
print()

for k in range(20, 30, 2):
    size = 2**k
    print(f"\nSize: {size:,} (2^{k})")
    print("-" * 80)
    results = benchmark_backends(size, num_repeats=10, num_warmup=2)
    for name, metrics in results.items():
        print(f"  {name:15s}: {metrics['min']:7.2f} ms (min)  "
              f"{metrics['mean']:7.2f} ± {metrics['std']:5.2f} ms (mean±std)  "
              f"[{metrics['throughput']:5.2f} GOps/s]")
```
Numerical Accuracy Validation
```python
import numpy as np
import pyfwht as fwht

def test_orthogonality(n):
    """Verify the WHT self-inverse property: WHT(WHT(x)) = n * x."""
    x = np.random.randn(n)
    # Forward transform
    y = fwht.compute(x)
    # Inverse transform (forward again, then divide by n)
    x_reconstructed = fwht.compute(y) / n
    # Check the reconstruction error
    error = np.linalg.norm(x - x_reconstructed)
    rel_error = error / np.linalg.norm(x)
    print(f"Size {n}: Relative error = {rel_error:.2e}")
    return rel_error < 1e-10

# Test various sizes
for k in range(4, 16):
    assert test_orthogonality(2**k), f"Failed for size 2^{k}"
print("All orthogonality tests passed!")
```
Benchmark Results
Performance Comparison: CPU vs OpenMP vs GPU
Benchmarks were performed on a GPU server with statistical rigor (10 runs per configuration, 2 warmup runs).
System Configuration:
- GPU: NVIDIA GeForce RTX 5090 (32 GB GDDR7)
- CPU: AMD EPYC 9334 32-Core Processor (64 threads with SMT)
- System RAM: 377 GB
- CUDA: Version 13.0 (driver 580.95.05, nvcc V13.0.88)
- Library Version: pyfwht 1.1.4
```text
FWHT Performance Benchmark (Python)
================================================================================
Warmup runs: 2, Benchmark runs: 10 per configuration
GPU available: True
OpenMP available: True
Version: 1.1.4

Size: 1,048,576 (2^20)
--------------------------------------------------------------------------------
  CPU (SIMD)     :    4.11 ms (min)     4.15 ±  0.03 ms (mean±std) [ 5.11 GOps/s]
  OpenMP         :    1.60 ms (min)    21.53 ± 31.49 ms (mean±std) [13.09 GOps/s]
  GPU (CUDA)     :    0.86 ms (min)     0.87 ±  0.01 ms (mean±std) [24.45 GOps/s]

Size: 4,194,304 (2^22)
--------------------------------------------------------------------------------
  CPU (SIMD)     :   20.78 ms (min)    21.50 ±  0.24 ms (mean±std) [ 4.44 GOps/s]
  OpenMP         :    6.02 ms (min)    45.35 ± 38.26 ms (mean±std) [15.32 GOps/s]
  GPU (CUDA)     :    3.55 ms (min)     3.58 ±  0.02 ms (mean±std) [26.00 GOps/s]

Size: 16,777,216 (2^24)
--------------------------------------------------------------------------------
  CPU (SIMD)     :   90.52 ms (min)    90.68 ±  0.16 ms (mean±std) [ 4.45 GOps/s]
  OpenMP         :   23.61 ms (min)    26.43 ±  2.96 ms (mean±std) [17.06 GOps/s]
  GPU (CUDA)     :   16.32 ms (min)    16.35 ±  0.02 ms (mean±std) [24.67 GOps/s]

Size: 67,108,864 (2^26)
--------------------------------------------------------------------------------
  CPU (SIMD)     :  447.80 ms (min)   448.15 ±  0.17 ms (mean±std) [ 3.90 GOps/s]
  OpenMP         :  178.32 ms (min)   219.37 ± 19.93 ms (mean±std) [ 9.78 GOps/s]
  GPU (CUDA)     :   66.09 ms (min)    66.15 ±  0.04 ms (mean±std) [26.40 GOps/s]

Size: 268,435,456 (2^28)
--------------------------------------------------------------------------------
  CPU (SIMD)     : 2348.43 ms (min)  2350.81 ±  1.95 ms (mean±std) [ 3.20 GOps/s]
  OpenMP         : 1178.15 ms (min)  1220.09 ± 26.66 ms (mean±std) [ 6.38 GOps/s]
  GPU (CUDA)     :  268.10 ms (min)   268.44 ±  0.19 ms (mean±std) [28.03 GOps/s]
```
Key Observations:
- GPU performance: 24-28 GOps/s across all sizes, with extremely low variance (std < 0.2 ms even for large transforms)
- RTX 5090 advantage: the latest-generation GPU with GDDR7 memory provides excellent bandwidth for this memory-bound algorithm
- OpenMP scaling: 2-4× speedup over the single-threaded CPU backend on a 32-core system
- CPU SIMD: consistent ~4-5 GOps/s throughput with AVX2/SSE2/NEON optimizations (AVX2 on this host)
- Speedup summary (based on minimum times):
  - GPU vs CPU: 4.8× at 2^20, rising to 8.8× at 2^28
  - GPU vs OpenMP: 1.9× at 2^20, rising to 4.4× at 2^28
- Python overhead: negligible; performance matches the C library within measurement variance
Run your own benchmark:
```bash
cd python
# Extract and run the benchmark snippet from the examples section of this README
python3 -c "$(sed -n '/def benchmark_backends/,/GOps\/s\]\")$/p' README.md)"
```
Multi-Precision GPU Performance (fp64/fp32/fp16)
The pyfwht GPU backend supports multiple precision modes for different speed/accuracy trade-offs. The benchmarks below were run on an NVIDIA RTX 4090:
System Configuration:
- GPU: NVIDIA GeForce RTX 4090 (SM 8.9, 128 SMs, 24 GB GDDR6X)
- CUDA: Version 12.x
- Library Version: pyfwht 1.2.0+
Correctness Results (with precision-matched references)
| Kernel | Max Error | Status |
|---|---|---|
| pyfwht fp64 | 0.00e+00 | PASS |
| pyfwht fp32 | 0.00e+00 | PASS |
| pyfwht fp16 | 1.25e-01 | PASS |

Boolean ±1 test vectors match the CPU exactly for fp16, fp32, and fp64. The fp16 number above comes from a Gaussian floating-point workload (`benchmark_all_precisions_fixed.py`) and reflects intrinsic fp16 rounding.
Performance Results
Single Transform (batch=1, dtype=float16):
| Size | pyfwht GPU fp16 (ms / GOps/s) | Meta GPU fp16 (ms / GOps/s) | pyfwht Speedup |
|---|---|---|---|
| 1024 | 0.032 ms / 0.64 GOps/s | 0.051 ms / 0.40 GOps/s | 1.6× |
| 2048 | 0.034 ms / 1.31 GOps/s | 0.054 ms / 0.84 GOps/s | 1.6× |
| 4096 | 0.036 ms / 2.76 GOps/s | 0.049 ms / 2.02 GOps/s | 1.4× |
Batched Transforms (batch=100, dtype=float16):
| Size | pyfwht GPU fp16 (ms / GOps/s) | Meta GPU fp16 (ms / GOps/s) | pyfwht Speedup |
|---|---|---|---|
| 1024 | 0.030 ms / 68.01 GOps/s | 0.049 ms / 41.65 GOps/s | 1.6× |
| 2048 | 0.030 ms / 148.59 GOps/s | 0.049 ms / 91.59 GOps/s | 1.6× |
| 4096 | 0.031 ms / 314.83 GOps/s | 0.049 ms / 199.43 GOps/s | 1.6× |
Key Highlights:
- Tests run on NVIDIA RTX 4090 (driver 560.35.03, CUDA 12.6.85) using PyTorch tensors (zero-copy DLPack path).
- pyfwht beats Meta's CUDA kernel by 1.4–1.6× at every measured size and batch depth.
- Boolean workloads are now bit-exact on fp16; Gaussian floats exhibit max error ≈1.3e-1 (mean ≈2.5e-2, relative <6e-4).
- fp32 stays the balanced option (~1e-6 error) if you do not need fp16's throughput.
- Tensor Core kernels cover power-of-two sizes 256–32768; larger sizes fall back to the general CUDA backend.
Usage Example:
```python
import numpy as np
import torch
import pyfwht

# For maximum performance: use PyTorch tensors with DLPack (zero-copy)
data_fp64 = torch.randn(100, 4096, dtype=torch.float64, device='cuda')  # Max precision
data_fp32 = torch.randn(100, 4096, dtype=torch.float32, device='cuda')  # Balanced
data_fp16 = torch.randn(100, 4096, dtype=torch.float16, device='cuda')  # Max speed

# DLPack API (recommended): zero-copy, eliminates H2D/D2H overhead
pyfwht.gpu.batch_transform_dlpack(data_fp64)  # fp64 kernel
pyfwht.gpu.batch_transform_dlpack(data_fp32)  # fp32 kernel (balanced speed/accuracy)
pyfwht.gpu.batch_transform_dlpack(data_fp16)  # fp16 kernel (maximum Tensor Core throughput)

# NumPy API (convenience): includes H2D/D2H transfers
data_np = np.random.randn(100, 4096).astype(np.float16)
result = pyfwht.fwht(data_np, backend='cuda')  # Transfers to GPU, computes, transfers back
```
Performance Comparison:
- DLPack (PyTorch): keeps tensors on the GPU (see the tables above; up to 315 GOps/s at n=4096, batch=100)
- NumPy: includes H2D/D2H transfers and is typically 10–100× slower for fp16 workloads
- Recommendation: use DLPack for any high-throughput batch job; reserve NumPy for convenience scripts (a rough timing sketch follows)
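To gauge the gap on your own hardware, here is a minimal timing sketch using only the calls shown above; absolute numbers depend heavily on the GPU, driver, and batch shape:

```python
import time

import numpy as np
import torch
import pyfwht

# DLPack path: data already resident on the GPU
batch = torch.randn(100, 4096, dtype=torch.float16, device='cuda')
torch.cuda.synchronize()
t0 = time.perf_counter()
pyfwht.gpu.batch_transform_dlpack(batch)   # zero-copy, stays on device
torch.cuda.synchronize()
print(f"DLPack path: {(time.perf_counter() - t0) * 1e3:.3f} ms")

# NumPy path: every call pays H2D + D2H transfer costs
host = np.random.randn(100, 4096).astype(np.float16)
t0 = time.perf_counter()
pyfwht.fwht(host, backend='cuda')
print(f"NumPy path:  {(time.perf_counter() - t0) * 1e3:.3f} ms")
```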
Run the multi-precision benchmark:
```bash
cd python/tests
python benchmark_all_precisions_fixed.py
```
FP16 Precision Characteristics (Important!)
FP16 provides up to ~35× speedup (with PyTorch DLPack) but has limited precision (an 11-bit significand). This is the expected trade-off for Tensor Core acceleration.
Measured Accuracy (RTX 4090, CUDA 12.6):
- Boolean truth tables (±1 inputs): `max|error| = 0` (fp16 matches the CPU exactly)
- Gaussian fp32 inputs (`n=1024`, `batch=10`): `max|error| = 1.25e-01`, `mean = 2.47e-02`, relative error `< 6e-4`
FP16 round-off is therefore inconsequential for Boolean cryptanalysis, and it remains bounded and predictable for ML workloads; a quick self-check follows.
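The Boolean bit-exactness claim is easy to verify yourself. A minimal sketch, assuming a CUDA build with PyTorch installed (only APIs shown elsewhere in this README are used):

```python
import numpy as np
import torch
import pyfwht as fwht

# ±1 inputs: WHT coefficients are integers bounded by n, so for n = 1024
# they are all exactly representable in fp16 (integers up to 2048 are exact)
n = 1024
signs = np.where(np.random.randint(0, 2, size=(8, n)) == 0, 1.0, -1.0)

# CPU float64 reference, row by row (fwht.transform works in place)
ref = signs.copy()
for row in ref:
    fwht.transform(row)

# GPU fp16 Tensor Core path via DLPack
batch = torch.tensor(signs, dtype=torch.float16, device='cuda')
fwht.gpu.batch_transform_dlpack(batch)

assert np.array_equal(ref, batch.double().cpu().numpy())
print("fp16 GPU matches fp64 CPU bit-exactly on ±1 inputs")
```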
When to Use Each Precision:
| Precision | Speed (relative) | Accuracy snapshot | Best For |
|---|---|---|---|
| fp64 | 1× (baseline) | Bit-exact | Cryptanalysis, validation |
| fp32 | ~30× faster | ~1e-6 error | Balanced workloads |
| fp16 | ~35× faster | Boolean exact, ≤1.3e-1 abs. err on floats | ML/AI with PyTorch DLPack |
Important: For fp16 throughput, keep data on the GPU (PyTorch DLPack). The NumPy path includes H2D/D2H copies and is orders of magnitude slower.
Supported Tensor Core Sizes:
All power-of-2 sizes from 256 to 32768 use the optimized Meta-inspired Tensor Core kernels (8 warps per block):

| n | chunks_per_warp |
|---|---|
| 256, 512, 1024, 2048 | 1 |
| 4096, 8192 | 2 |
| 16384 | 4 |
| 32768 | 8 |
Sizes outside this range fall back to standard GPU kernels (slower but still faster than CPU).
Runtime Warning:
When you first use fp16, you'll see a one-time warning:
```text
╔═══════════════════════════════════════════════════════════════════════════╗
║                    FP16 Tensor Core Precision Notice                       ║
╠═══════════════════════════════════════════════════════════════════════════╣
║  Using float16 Tensor Cores provides 25-36× speedup                        ║
║                                                                             ║
║  Observed behavior (RTX 4090, CUDA 12.6):                                   ║
║    • Boolean {-1,+1} inputs remain bit-exact                                ║
║    • Random fp32/fp64 data: max|error| ≈ 1.3e-1, mean ≈ 2.5e-2              ║
║    • Relative error stays < 6e-4 for values in the ±4k range                ║
║                                                                             ║
║  Recommended use cases:                                                     ║
║    • Machine learning / signal processing (PyTorch DLPack)                  ║
║    • Boolean cryptanalysis                                                  ║
║    • High-precision floating workloads (use fp32/fp64 instead)              ║
║                                                                             ║
║  To suppress this warning: set FWHT_SILENCE_FP16_WARNING=1                  ║
╚═══════════════════════════════════════════════════════════════════════════╝
```
To suppress:
```python
import os
os.environ['FWHT_SILENCE_FP16_WARNING'] = '1'

import pyfwht
```
Or in bash:
```bash
export FWHT_SILENCE_FP16_WARNING=1
python your_script.py
```
Verification:
GPU fp32 results are bit-exact against the CPU int32 reference, which shows that the fp16 deviations above come purely from FP16 quantization, not from algorithmic bugs.
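A sketch of that cross-check, under the same assumptions as above (CUDA build, PyTorch available):

```python
import numpy as np
import torch
import pyfwht as fwht

# Random truth table on 12 variables (length 4096, a Tensor Core size)
tt = np.random.randint(0, 2, size=1 << 12, dtype=np.uint8)

# CPU integer reference (signed convention: 0 -> +1, 1 -> -1)
ref = fwht.from_bool(tt, signed=True)

# GPU fp32 path on the same ±1 vector, as a single-row batch
batch = torch.tensor(np.where(tt == 0, 1.0, -1.0)[None, :],
                     dtype=torch.float32, device='cuda')
fwht.gpu.batch_transform_dlpack(batch)

# |W| <= 4096, and integers of that size are exactly representable in fp32
assert np.array_equal(np.asarray(ref, dtype=np.float64),
                      batch.double().cpu().numpy()[0])
print("GPU fp32 matches CPU int32 exactly")
```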
Examples
The `examples/` directory contains comprehensive demonstrations:
- `basic_usage.py` - Core API: transforms, backends, Boolean functions, utilities
- `boolean_packed.py` - Bit-packed Boolean WHT for memory-efficient cryptanalysis
- `gpu_batch.py` - GPU batch transforms, profiling, and persistent contexts
- `gpu_multi_precision.py` - fp16/fp32/fp64 with PyTorch DLPack integration
- `compare_fast_hadamard.py` - Benchmarks vs. Dao-AILab's fast-hadamard-transform
- `sbox_analysis.py` - S-box cryptanalysis with LAT computation (includes a 16-bit GPU example)
Run any example:
```bash
cd examples
python basic_usage.py
python sbox_analysis.py   # AES S-box analysis, LAT computation, and a 16-bit GPU demo
```
Development
Running Tests
The `tests/` directory contains comprehensive Python test suites and benchmarks:

```bash
# Install the package in development mode
pip install -e .

# Install testing dependencies
pip install pytest pytest-cov

# Run all tests
pytest tests/ -v

# Run with a coverage report
pytest tests/ --cov=pyfwht --cov-report=html

# Run specific test files
pytest tests/test_correctness.py -v   # Core correctness tests
pytest tests/test_gpu.py -v           # GPU tests (requires CUDA)

# Run benchmarks (not part of the pytest suite)
python tests/benchmark_all_precisions_fixed.py  # Multi-precision GPU benchmarks
python tests/benchmark_compare_meta.py          # Compare vs the Dao-AILab implementation
```
Available test files:
- `test_correctness.py` - Core API validation (transforms, Boolean functions, backends, batch processing)
- `test_gpu.py` - GPU-specific tests (Tensor Cores, fp16/fp32/fp64, batch operations, DLPack)
- `benchmark_all_precisions_fixed.py` - Multi-precision performance benchmarks
- `benchmark_compare_meta.py` - Comparative benchmarks against fast-hadamard-transform
GPU Tests: Tests in `test_gpu.py` automatically skip if CUDA is not available. Build with `USE_CUDA=1 pip install -e .` to enable GPU support.
Building Distribution Packages
```bash
pip install build
python -m build   # Creates both an sdist and a wheel in dist/
```
Relation to C Library
This package wraps the libfwht C library. All computation happens in highly optimized C/CUDA code; Python provides only a thin interface layer.
For C/C++ projects, use the C library directly. For Python workflows, this package provides seamless NumPy integration.
License
GNU General Public License v3.0 or later (GPL-3.0-or-later)
See LICENSE file for full text.
Citation
If you use this library in your research, please cite:
```bibtex
@software{libfwht,
  author = {Hadipour, Hosein},
  title  = {libfwht: Fast Walsh-Hadamard Transform Library},
  year   = {2025},
  url    = {https://github.com/hadipourh/fwht}
}
```
Support
- Issues: https://github.com/hadipourh/fwht/issues
- Email: hsn.hadipour@gmail.com
- Documentation: https://github.com/hadipourh/fwht
Contributing
Contributions welcome! Please open an issue or pull request on GitHub.