Fast Walsh-Hadamard Transform with CPU, OpenMP, and CUDA backends

These details have not been verified by PyPI

Project links

Project description

pyfwht - Fast Walsh-Hadamard Transform for Python

Python bindings for the high-performance libfwht library, providing Fast Walsh-Hadamard Transform with NumPy integration and support for CPU (SIMD), OpenMP, and CUDA backends.

Features

Zero-copy NumPy integration: Direct operation on NumPy arrays without data copying
Multiple backends: Automatic selection or explicit choice of CPU (SIMD), OpenMP, or GPU (CUDA)
All data types: Support for int8, int32, and float64 with overflow protection
Boolean function analysis: Convenience functions for cryptographic applications
High performance:
- Recursive cache-efficient algorithm (512-element L1-optimized base case)
- Task-based OpenMP parallelism (2-3× speedup on 4-8 cores)
- Software prefetching and cache-aligned memory allocation
- SIMD optimization (AVX2/SSE2/NEON auto-detection)
- 19% faster CPU, 89% better OpenMP scaling vs v1.0.0
Easy to use: Pythonic API with comprehensive error handling and numerical documentation

Installation

Requirements

Python 3.8+
NumPy >= 1.20.0
C99 compiler (gcc, clang, msvc)
Optional: OpenMP-capable compiler for multi-threading
Optional: CUDA toolkit (nvcc) for GPU support

From PyPI

# Install (automatically enables CUDA if nvcc is found)
pip install pyfwht

# On Linux, you may need to build from source for CUDA support
pip install pyfwht --no-binary :all:

# Disable CUDA even if available
USE_CUDA=0 pip install pyfwht --no-binary :all:

From Source

git clone https://github.com/hadipourh/fwht
cd fwht/python
pip install -e .  # Auto-detects CUDA if nvcc is available

# Force CUDA on/off
USE_CUDA=1 pip install -e .  # Force enable (fails if nvcc not found)
USE_CUDA=0 pip install -e .  # Force disable

Quick Start

Basic Transform

import numpy as np
import pyfwht as fwht

# Create data (must be power of 2 length)
data = np.array([1, -1, -1, 1, -1, 1, 1, -1], dtype=np.int32)

# In-place transform
fwht.transform(data)
print(data)  # Transformed coefficients

Boolean Function Analysis

# XOR function: f(x,y) = x ⊕ y
truth_table = np.array([0, 1, 1, 0], dtype=np.uint8)

# Compute WHT coefficients (0→+1, 1→-1 convention)
wht_coeffs = fwht.from_bool(truth_table, signed=True)

# Compute correlations with all linear functions
correlations = fwht.correlations(truth_table)
max_correlation = np.max(np.abs(correlations))
print(f"Maximum absolute correlation: {max_correlation}")

Backend Selection

# Automatic backend selection (recommended)
fwht.transform(data)  # or backend=fwht.Backend.AUTO

# Check availability first
print("OpenMP available:", fwht.has_openmp())
print("GPU/CUDA available:", fwht.has_gpu())

# Explicit backend
fwht.transform(data, backend=fwht.Backend.CPU)     # Single-threaded SIMD
fwht.transform(data, backend=fwht.Backend.OPENMP)  # Multi-threaded

# Use GPU only if available
if fwht.has_gpu():
    fwht.transform(data, backend=fwht.Backend.GPU)

Efficient Repeated Transforms

For computing many WHTs, use Context to reuse resources:

with fwht.Context(backend=fwht.Backend.OPENMP) as ctx:
    for _ in range(1000):
        data = generate_data()
        ctx.transform(data)  # Faster than repeated fwht.transform()

API Reference

Main Functions

`transform(data, backend=None)`

In-place Walsh-Hadamard Transform with automatic dtype dispatch.

Parameters:
- data: 1-D NumPy array of int8, int32, or float64
- backend: Optional Backend enum (AUTO, CPU, OPENMP, GPU)
Modifies: data in-place
Complexity: O(n log n) where n = 2^k is the array length

data = np.array([1, -1, -1, 1], dtype=np.int32)
fwht.transform(data)  # data is modified

`compute(data, backend=None)`

Out-of-place transform (input unchanged, returns new array).

Parameters:
- data: 1-D NumPy array of int32 or float64
- backend: Optional backend selection
Returns: New NumPy array with transform result

original = np.array([1, -1, -1, 1], dtype=np.int32)
result = fwht.compute(original)
# original is unchanged, result contains WHT

`from_bool(truth_table, signed=True)`

Compute WHT coefficients from Boolean function truth table.

Parameters:
- truth_table: 1-D array of 0s and 1s (length = 2^k)
- signed: If True, uses 0→+1, 1→-1 conversion (cryptographic convention)
Returns: int32 array of WHT coefficients

# AND function
and_table = np.array([0, 0, 0, 1], dtype=np.uint8)
wht = fwht.from_bool(and_table, signed=True)

`correlations(truth_table)`

Compute correlations between Boolean function and all linear functions.

Parameters:
- truth_table: 1-D array of 0s and 1s
Returns: float64 array of correlation values in [-1, 1]

corr = fwht.correlations(truth_table)
# corr[u] = correlation with linear function ℓ_u(x) = popcount(u & x) mod 2

Context API (Advanced)

For applications computing many WHTs, use Context to amortize setup costs:

Context Parameters:

backend: Backend selection (Backend enum)
num_threads: Number of OpenMP threads (0 = auto)
gpu_device: GPU device ID for CUDA
normalize: If True, divide by sqrt(n) after transform

Methods:

ctx.transform(data): In-place transform (same as module-level function)
ctx.close(): Explicitly release resources (or use with statement)

Backend Enum

class Backend(enum.Enum):
    AUTO = 0    # Automatic selection (recommended)
    CPU = 1     # Single-threaded SIMD (AVX2/SSE2/NEON)
    OPENMP = 2  # Multi-threaded CPU
    GPU = 3     # CUDA-accelerated

Utility Functions

fwht.is_power_of_2(n)        # Check if n is power of 2
fwht.log2(n)                 # Compute log₂(n) for power of 2
fwht.recommend_backend(n)    # Get recommended backend for size n
fwht.has_openmp()            # Check OpenMP availability
fwht.has_gpu()               # Check GPU/CUDA availability
fwht.version()               # Get library version

Data Types

NumPy dtype	C type	Notes
`np.int32`	`int32_t`	Recommended for Boolean functions
`np.float64`	`double`	For numerical applications
`np.int8`	`int8_t`	Memory-efficient; may overflow for large n

Performance Tips

Choose the right backend for your data size:
- Small arrays (< 2^16): CPU backend (SIMD-optimized)
- Medium arrays (2^16 - 2^22): OPENMP (multi-threaded)
- Large arrays (> 2^22): GPU if available
- Unsure? Use Backend.AUTO or recommend_backend(n)

Reuse contexts for batch processing:

ctx = fwht.Context(backend=fwht.Backend.OPENMP)
for data in dataset:
    ctx.transform(data)
ctx.close()

Use int32 for Boolean functions (exact arithmetic, no overflow for n ≤ 2^30)
Use int8 for memory-constrained applications (but beware overflow for n > 2^7)

Advanced Examples

Cryptographic Linear Cryptanalysis

Find the best linear approximation of a Boolean function:

import numpy as np
import pyfwht as fwht

# S-box or Boolean function (e.g., 4-bit input, 1-bit output)
# Example: f(x) for x in {0,1}^4
truth_table = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0], dtype=np.uint8)

# Compute Walsh-Hadamard coefficients
# For signed convention: 0 → +1, 1 → -1
wht = fwht.from_bool(truth_table, signed=True)

# Find best linear approximation
n = int(np.log2(len(truth_table)))  # Number of input bits
best_idx = np.argmax(np.abs(wht))
best_wht = wht[best_idx]

# Compute correlation: cor(f, ℓ_u) = W_f(u) / 2^n
correlation = best_wht / (2**n)

# Compute bias: ε = W_f(u) / 2^(n+1)
bias = best_wht / (2**(n+1))

print(f"Best linear mask u: {best_idx:0{n}b}")
print(f"WHT coefficient: {best_wht}")
print(f"Correlation: {correlation:.4f}")
print(f"Bias: {bias:.4f}")
print(f"Linear probability: {0.5 + bias:.4f}")

Batch Processing: Computing Nonlinearity

Analyze cryptographic properties of Boolean functions:

import numpy as np
import pyfwht as fwht

# Generate 1000 random Boolean functions
num_vars = 8  # Number of input variables
num_functions = 1000
functions = np.random.randint(0, 2, size=(num_functions, 2**num_vars), dtype=np.uint8)

# Compute nonlinearity for all functions
nonlinearities = []
for func in functions:
    # from_bool computes WHT coefficients with signed convention
    wht = fwht.from_bool(func, signed=True)
    
    # Nonlinearity: NL(f) = 2^(n-1) - (1/2)·max|W_f(u)|
    # For n-variable function, length of truth table is 2^n
    max_abs_wht = np.max(np.abs(wht))
    nl = 2**(num_vars - 1) - max_abs_wht // 2
    nonlinearities.append(nl)

print(f"Average nonlinearity: {np.mean(nonlinearities):.2f}")
print(f"Max nonlinearity: {max(nonlinearities)}")
print(f"Min nonlinearity: {min(nonlinearities)}")
print(f"Theoretical max for {num_vars}-bit functions: {2**(num_vars-1) - 2**(num_vars//2 - 1)}")

Performance Comparison: Backend Selection

import numpy as np
import pyfwht as fwht
import time

def benchmark_backends(size, num_repeats=10, num_warmup=2):
    """Compare performance across different backends with statistical rigor."""
    data = np.random.randn(size).astype(np.float64)
    results = {}
    
    backends = [
        (fwht.Backend.CPU, "CPU (SIMD)"),
        (fwht.Backend.OPENMP, "OpenMP"),
    ]
    
    if fwht.has_gpu():
        backends.append((fwht.Backend.GPU, "GPU (CUDA)"))
    
    for backend, name in backends:
        timings = []
        
        # Warmup runs to stabilize cache and CPU frequency
        for _ in range(num_warmup):
            test_data = data.copy()
            fwht.transform(test_data, backend=backend)
        
        # Benchmark runs
        for _ in range(num_repeats):
            test_data = data.copy()
            start = time.perf_counter()
            fwht.transform(test_data, backend=backend)
            elapsed = time.perf_counter() - start
            timings.append(elapsed * 1000)  # Convert to ms
        
        # Statistical analysis
        timings_array = np.array(timings)
        mean_time = np.mean(timings_array)
        std_time = np.std(timings_array)
        median_time = np.median(timings_array)
        min_time = np.min(timings_array)
        
        # Use minimum time for throughput (best-case scenario, least noise)
        throughput = (size * fwht.log2(size)) / (min_time / 1000) / 1e9
        
        results[name] = {
            'mean': mean_time,
            'std': std_time,
            'median': median_time,
            'min': min_time,
            'throughput': throughput  # GOps/s based on minimum time
        }
    
    return results

# Test different sizes with sufficient repetitions
print("FWHT Performance Benchmark (Python)")
print("=" * 80)
print(f"Warmup runs: 2, Benchmark runs: 10 per configuration")
print(f"GPU available: {fwht.has_gpu()}")
print(f"OpenMP available: {fwht.has_openmp()}")
print(f"Version: {fwht.version()}")
print()

for k in range(20, 30, 2):
    size = 2**k
    print(f"\nSize: {size:,} (2^{k})")
    print("-" * 80)
    results = benchmark_backends(size, num_repeats=10, num_warmup=2)
    
    for name, metrics in results.items():
        print(f"  {name:15s}: {metrics['min']:7.2f} ms (min)  "
              f"{metrics['mean']:7.2f} ± {metrics['std']:5.2f} ms (mean±std)  "
              f"[{metrics['throughput']:5.2f} GOps/s]")

Numerical Accuracy Validation

import numpy as np
import pyfwht as fwht

def test_orthogonality(n):
    """
    Verify WHT orthogonality: WHT(WHT(x)) = n * x
    """
    x = np.random.randn(n)
    
    # Forward transform
    y = fwht.compute(x)
    
    # Inverse transform (forward again, then divide by n)
    x_reconstructed = fwht.compute(y) / n
    
    # Check reconstruction error
    error = np.linalg.norm(x - x_reconstructed)
    rel_error = error / np.linalg.norm(x)
    
    print(f"Size {n}: Relative error = {rel_error:.2e}")
    return rel_error < 1e-10

# Test for various sizes
for k in range(4, 16):
    assert test_orthogonality(2**k), f"Failed for size 2^{k}"

print("All orthogonality tests passed!")

Benchmark Results

Performance Comparison: CPU vs OpenMP vs GPU

Benchmark performed on GPU server with statistical rigor (10 runs per configuration, 2 warmup runs).

System Configuration:

GPU: NVIDIA GeForce RTX 5090 (32 GB GDDR7)
CPU: AMD EPYC 9334 32-Core Processor (64 threads with SMT)
System RAM: 377 GB
CUDA: Version 13.0 (driver 580.95.05, nvcc V13.0.88)
Library Version: pyfwht 1.1.4

FWHT Performance Benchmark (Python)
================================================================================
Warmup runs: 2, Benchmark runs: 10 per configuration
GPU available: True
OpenMP available: True
Version: 1.1.4


Size: 1,048,576 (2^20)
--------------------------------------------------------------------------------
  CPU (SIMD)     :    4.11 ms (min)     4.15 ±  0.03 ms (mean±std)  [ 5.11 GOps/s]
  OpenMP         :    1.60 ms (min)    21.53 ± 31.49 ms (mean±std)  [13.09 GOps/s]
  GPU (CUDA)     :    0.86 ms (min)     0.87 ±  0.01 ms (mean±std)  [24.45 GOps/s]

Size: 4,194,304 (2^22)
--------------------------------------------------------------------------------
  CPU (SIMD)     :   20.78 ms (min)    21.50 ±  0.24 ms (mean±std)  [ 4.44 GOps/s]
  OpenMP         :    6.02 ms (min)    45.35 ± 38.26 ms (mean±std)  [15.32 GOps/s]
  GPU (CUDA)     :    3.55 ms (min)     3.58 ±  0.02 ms (mean±std)  [26.00 GOps/s]

Size: 16,777,216 (2^24)
--------------------------------------------------------------------------------
  CPU (SIMD)     :   90.52 ms (min)    90.68 ±  0.16 ms (mean±std)  [ 4.45 GOps/s]
  OpenMP         :   23.61 ms (min)    26.43 ±  2.96 ms (mean±std)  [17.06 GOps/s]
  GPU (CUDA)     :   16.32 ms (min)    16.35 ±  0.02 ms (mean±std)  [24.67 GOps/s]

Size: 67,108,864 (2^26)
--------------------------------------------------------------------------------
  CPU (SIMD)     :  447.80 ms (min)   448.15 ±  0.17 ms (mean±std)  [ 3.90 GOps/s]
  OpenMP         :  178.32 ms (min)   219.37 ± 19.93 ms (mean±std)  [ 9.78 GOps/s]
  GPU (CUDA)     :   66.09 ms (min)    66.15 ±  0.04 ms (mean±std)  [26.40 GOps/s]

Size: 268,435,456 (2^28)
--------------------------------------------------------------------------------
  CPU (SIMD)     : 2348.43 ms (min)  2350.81 ±  1.95 ms (mean±std)  [ 3.20 GOps/s]
  OpenMP         : 1178.15 ms (min)  1220.09 ± 26.66 ms (mean±std)  [ 6.38 GOps/s]
  GPU (CUDA)     :  268.10 ms (min)   268.44 ±  0.19 ms (mean±std)  [28.03 GOps/s]

Key Observations:

GPU Performance: Achieves 24-28 GOps/s consistently, with extremely low variance (std < 0.2 ms even for large transforms)
RTX 5090 Advantage: Latest generation GPU with GDDR7 memory provides excellent bandwidth for this memory-bound algorithm
OpenMP Scaling: 2-4x speedup over single-threaded CPU on 32-core system
CPU SIMD: Consistent ~4-5 GOps/s throughput with NEON/AVX2 optimizations
Speedup Summary:
- GPU vs CPU: 5.9x for small sizes (2^20), up to 8.8x for large sizes (2^28)
- GPU vs OpenMP: 2.8x for small sizes, up to 4.4x for large sizes
Python Overhead: Negligible - performance matches C library within measurement variance

Run your own benchmark:

cd python
# Use the improved benchmark from README examples section
python3 -c "$(sed -n '/def benchmark_backends/,/GOps\/s\]\")$/p' README.md)"

Examples

See examples/basic_usage.py for comprehensive usage demonstrations.

Development

Running Tests

First, install the package in development mode:

pip install -e .  # Install pyfwht in editable mode
pip install pytest
pytest tests/ -v

# With coverage
pip install pytest-cov
pytest tests/ --cov=pyfwht

Building Distribution Packages

pip install build
python -m build  # Creates both sdist and wheel in dist/

Relation to C Library

This package wraps the libfwht C library. All computation happens in highly-optimized C/CUDA code; Python provides only a thin interface layer.

For C/C++ projects, use the C library directly. For Python workflows, this package provides seamless NumPy integration.

License

GNU General Public License v3.0 or later (GPL-3.0-or-later)

See LICENSE file for full text.

Citation

If you use this library in your research, please cite:

@software{libfwht,
  author = {Hadipour, Hosein},
  title = {libfwht: Fast Walsh-Hadamard Transform Library},
  year = {2025},
  url = {https://github.com/hadipourh/fwht}
}

Support

Issues: https://github.com/hadipourh/fwht/issues
Email: hsn.hadipour@gmail.com
Documentation: https://github.com/hadipourh/fwht

Contributing

Contributions welcome! Please open an issue or pull request on GitHub.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

2.0.1

Nov 27, 2025

2.0.0

Nov 27, 2025

This version

1.1.4

Nov 14, 2025

1.1.3

Nov 13, 2025

1.1.2

Nov 13, 2025

1.1.1

Nov 13, 2025

1.1.0

Nov 13, 2025

1.0.0

Nov 12, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyfwht-1.1.4.tar.gz (54.3 kB view details)

Uploaded Nov 14, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pyfwht-1.1.4-cp313-cp313-macosx_15_0_arm64.whl (161.6 kB view details)

Uploaded Nov 14, 2025 CPython 3.13macOS 15.0+ ARM64

File details

Details for the file pyfwht-1.1.4.tar.gz.

File metadata

Download URL: pyfwht-1.1.4.tar.gz
Upload date: Nov 14, 2025
Size: 54.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for pyfwht-1.1.4.tar.gz
Algorithm	Hash digest
SHA256	`b48db1a6dc42ed123ca82d8a5c2bd97725cc5650530588a17ba773e6200bec78`
MD5	`b5bf07172f2b36492117303048b787d5`
BLAKE2b-256	`27a6213487a3441f50ecfdabf9dc9863e0fd686d43ee5297aacc9d6b16903f2f`

See more details on using hashes here.

File details

Details for the file pyfwht-1.1.4-cp313-cp313-macosx_15_0_arm64.whl.

File metadata

Download URL: pyfwht-1.1.4-cp313-cp313-macosx_15_0_arm64.whl
Upload date: Nov 14, 2025
Size: 161.6 kB
Tags: CPython 3.13, macOS 15.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for pyfwht-1.1.4-cp313-cp313-macosx_15_0_arm64.whl
Algorithm	Hash digest
SHA256	`3d603fae6675bb5b61c0492bb4fae23090f1740e80ba7015231454ab1693bc12`
MD5	`e61a151924c2bccc4a0f4652452d86f3`
BLAKE2b-256	`0be7dc86fa8fefe8caa2443f7404582cb15421e206b6698f3dc6c7779c45403d`

See more details on using hashes here.

pyfwht 1.1.4

Navigation

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Project description

pyfwht - Fast Walsh-Hadamard Transform for Python

Features

Installation

Requirements

From PyPI

From Source

Quick Start

Basic Transform

Boolean Function Analysis

Backend Selection

Efficient Repeated Transforms

API Reference

Main Functions

transform(data, backend=None)

compute(data, backend=None)

from_bool(truth_table, signed=True)

correlations(truth_table)

Context API (Advanced)

Backend Enum

Utility Functions

Data Types

Performance Tips

Advanced Examples

Cryptographic Linear Cryptanalysis

Batch Processing: Computing Nonlinearity

Performance Comparison: Backend Selection

Numerical Accuracy Validation

Benchmark Results

Performance Comparison: CPU vs OpenMP vs GPU

Examples

Development

Running Tests

Building Distribution Packages

Relation to C Library

License

Citation

Support

Contributing

Project details

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`transform(data, backend=None)`

`compute(data, backend=None)`

`from_bool(truth_table, signed=True)`

`correlations(truth_table)`