A pytest plugin that prevents crashes from killing your test suite, with execution tracing
pytest-fkit
Fix Krashes In Tests - A pytest plugin that prevents crashes from killing your entire test suite.
When a test crashes the Python interpreter (SIGABRT, SIGSEGV, etc.), pytest-fkit catches the crash and reports it as a normal pytest ERROR instead of letting it kill your entire test run.
Features:
- Parallel workers with GPU affinity
- Dynamic work queue scheduling (tests go to first available worker)
- Automatic GPU error detection and retry on different workers
- Fault tolerance (workers can fail without stopping the test run)
The Problem
When running large test suites (like HuggingFace Transformers), sometimes a test causes Python to crash with a signal like SIGABRT:
Fatal Python error: Aborted
Thread 0x0000799e2ea00640 (most recent call first):
File "/transformers/src/transformers/models/dots1/modeling_dots1.py", line 331 in forward
...
This kills pytest entirely, so none of the remaining tests in your suite ever run.
The Solution
pytest-fkit runs each test in an isolated subprocess. If a test crashes:
- ✅ The crash is caught and reported as a pytest ERROR
- ✅ The remaining tests continue running
- ✅ You get a full report with all test results, including which ones crashed
Installation
cd pytest-fkit
pip install -e .
Or install from PyPI (for example, by adding it to your test requirements):
pip install pytest-fkit
Usage
Basic Usage
Just add the --fkit flag to your pytest command:
pytest --fkit
With Timeout
Set a timeout per test (default is 600 seconds / 10 minutes):
pytest --fkit --fkit-timeout=300 # 5 minute timeout per test
Parallel Workers with Dynamic Scheduling
Run tests in parallel with automatic work distribution:
# Auto-detect workers based on GPU count
pytest --fkit --fkit-workers=auto
# Specific number of workers
pytest --fkit --fkit-workers=4
# Control GPUs per worker (for multi-GPU tests)
pytest --fkit --fkit-workers=4 --fkit-gpus-per-worker=2
Dynamic Scheduling: Tests are NOT pre-assigned to workers. Instead:
- All tests go into a shared work queue
- Workers pull tests as they become available
- First available worker gets the next test
- Automatic load balancing across workers
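The shared-queue behavior described above can be sketched with the standard library. The function name `run_with_shared_queue` and the `run_test` callback are assumptions made for illustration; the plugin's internals may differ, and this sketch leaves out retries and GPU handling.

```python
import queue
import threading


def run_with_shared_queue(tests, num_workers, run_test):
    """Dispatch tests from one shared queue to whichever worker is free.

    Hypothetical sketch; `run_test(test, worker_id)` returns an outcome string.
    """
    work = queue.Queue()
    for test in tests:
        work.put(test)

    results = {}
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                # No pre-assignment: each worker takes the next queued test.
                test = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            outcome = run_test(test, worker_id)
            with lock:
                results[test] = (worker_id, outcome)

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A worker that finishes a fast test immediately pulls the next one, so a single slow test occupies only one worker while the others keep draining the queue.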
GPU Allocation Examples
8 GPUs with multi-GPU tests (need 2 GPUs each):
pytest --fkit --fkit-workers=4 --fkit-gpus-per-worker=2
# Worker 0: GPU 0,1
# Worker 1: GPU 2,3
# Worker 2: GPU 4,5
# Worker 3: GPU 6,7
8 GPUs with single-GPU tests:
pytest --fkit --fkit-workers=8 --fkit-gpus-per-worker=1
# Worker 0: GPU 0
# Worker 1: GPU 1
# ...
# Worker 7: GPU 7
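The allocations in these examples amount to splitting the GPU list into contiguous groups. A minimal sketch follows, assuming a hypothetical `allocate_gpus` helper; raising an error when there are too few GPUs is this sketch's choice, not documented plugin behavior.

```python
def allocate_gpus(gpu_ids, num_workers, gpus_per_worker):
    """Split GPU IDs into contiguous groups, one comma-joined string per worker.

    Hypothetical helper for illustration; not pytest-fkit's API.
    """
    needed = num_workers * gpus_per_worker
    if needed > len(gpu_ids):
        # Assumption: the real plugin may handle this case differently.
        raise ValueError(f"need {needed} GPUs, only {len(gpu_ids)} available")
    groups = []
    for i in range(num_workers):
        chunk = gpu_ids[i * gpus_per_worker:(i + 1) * gpus_per_worker]
        groups.append(",".join(str(g) for g in chunk))
    return groups


if __name__ == "__main__":
    print(allocate_gpus(list(range(8)), 4, 2))
```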
Fault Tolerance
pytest-fkit handles GPU failures gracefully:
- GPU Error Detection: Automatically detects GPU-related errors (CUDA OOM, HIP errors, etc.)
- Automatic Retry: If a test fails due to GPU errors, it's automatically retried on a different worker
- Worker Disabling: If a worker encounters 3+ consecutive GPU errors, it's disabled and remaining tests are scheduled to healthy workers
- No Test Loss: Even if GPUs are missing or workers fail, all tests will eventually run on available workers
Example scenario:
Worker 0 (GPU 0,1): Running tests...
Worker 1 (GPU 2,3): Running tests...
Worker 2 (GPU 4,5): ⚠️ GPU 4 missing - CUDA error
→ Test retried on Worker 0
Worker 3 (GPU 6,7): Running tests...
# Worker 2 disabled after 3 consecutive GPU errors
# Remaining tests automatically go to Workers 0, 1, 3
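The disable-after-three-consecutive-errors rule can be sketched as a small streak counter. `WorkerHealth` and `pick_retry_worker` are hypothetical names, and choosing the lowest-numbered healthy worker for a retry is a simplification; with dynamic scheduling the retry would presumably go to whichever healthy worker frees up first.

```python
class WorkerHealth:
    """Track consecutive GPU errors for one worker (illustrative sketch)."""

    def __init__(self, max_consecutive_errors=3):
        self.max_consecutive_errors = max_consecutive_errors
        self.consecutive_errors = 0
        self.disabled = False

    def record(self, gpu_error):
        """Update the streak; return True if the worker is still usable."""
        if gpu_error:
            self.consecutive_errors += 1
            if self.consecutive_errors >= self.max_consecutive_errors:
                self.disabled = True
        else:
            # Any result that is not a GPU error resets the streak.
            self.consecutive_errors = 0
        return not self.disabled


def pick_retry_worker(failed_id, health_by_worker):
    """Return a healthy worker other than the one that failed, or None."""
    for worker_id in sorted(health_by_worker):
        if worker_id != failed_id and not health_by_worker[worker_id].disabled:
            return worker_id
    return None
```

Counting only consecutive errors means one transient GPU failure does not take a worker out of rotation, while a persistently broken GPU does.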
Skip Crash Isolation for Specific Tests
If you have tests that don't play well with subprocess isolation, mark them:
import pytest

@pytest.mark.fkit_skip
def test_something_special():
    # This test will run normally without subprocess isolation
    pass
Mark GPU Requirements
Mark tests with their GPU requirements. These markers currently serve as documentation only; a future version may use them for optimal GPU scheduling:
import pytest

@pytest.mark.fkit_multi_gpu
def test_distributed_training():
    # This test needs multiple GPUs
    pass

@pytest.mark.fkit_single_gpu
def test_simple_forward():
    # This test needs only one GPU
    pass
How It Works
Architecture
┌─────────────────────────────────────┐
│ Shared Work Queue │
│ [test1, test2, test3, test4, ...] │
└──────────────┬──────────────────────┘
│
┌──────────────────────┼──────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Worker 0 │ │ Worker 1 │ │ Worker 2 │
│ GPU 0,1 │ │ GPU 2,3 │ │ GPU 4,5 │
│ │ │ │ │ │
│ Pull next │ │ Pull next │ │ Pull next │
│ available │ │ available │ │ available │
│ test │ │ test │ │ test │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Subprocess │ │ Subprocess │ │ Subprocess │
│ (isolated) │ │ (isolated) │ │ (isolated) │
└───────────────┘ └───────────────┘ └───────────────┘
Flow
- GPU Detection: Automatically detects AMD (ROCm) or NVIDIA GPUs
- Worker Creation: Creates N worker threads, each with dedicated GPUs
- Queue Population: All tests go into a shared work queue
- Dynamic Dispatch: Workers pull tests from the queue as they finish
- Subprocess Isolation: Each test runs in its own subprocess
- Error Handling: GPU errors trigger retry on different workers
- Result Reporting: Results stream back to pytest as tests complete
Example Output
🚀 pytest-fkit: 4 workers, 8 AMD GPUs, 2 GPU(s)/worker
GPU allocations: ['0,1', '2,3', '4,5', '6,7']
Dynamic scheduling: tests assigned to first available worker
🔄 Running 1000 tests across 4 workers (dynamic scheduling)...
tests/models/bert/test_modeling_bert.py::BertModelTest::test_forward PASSED
tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_forward PASSED
🔄 Retrying test_model_15b on another worker (attempt 2)
tests/models/whisper/test_modeling_whisper.py::WhisperModelTest::test_forward PASSED
⚠️ Worker 2 (GPUs: 4,5) disabled after 3 consecutive GPU errors
======================================================================
✅ Completed 1000 tests
Passed: 950, Failed: 45, Skipped: 5
💥 Crashes: 2
🎮 GPU errors: 8 (retries: 5)
⚠️ Workers disabled: 1
======================================================================
=============== pytest-fkit summary ===============
💥 2 test(s) CRASHED (converted to ERROR by pytest-fkit):
- tests/models/dots1/test_modeling_dots1.py::Dots1ModelTest::test_model_15b
✅ pytest-fkit prevented 2 crashes from killing your test suite!
Command Line Options
| Option | Default | Description |
|---|---|---|
| --fkit | False | Enable crash isolation |
| --fkit-timeout | 600 | Timeout per test in seconds |
| --fkit-workers | 1 | Number of parallel workers (auto for GPU-based) |
| --fkit-gpus-per-worker | 2 | GPUs assigned to each worker |
Environment Variables Set Per Worker
| Variable | Description |
|---|---|
| CUDA_VISIBLE_DEVICES | GPU IDs for NVIDIA / compatibility |
| HIP_VISIBLE_DEVICES | GPU IDs for AMD ROCm |
| ROCR_VISIBLE_DEVICES | GPU IDs for AMD ROCm runtime |
| FKIT_WORKER_ID | Worker index (0, 1, 2, ...) |
| FKIT_GPU_IDS | Assigned GPU IDs string |
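A per-worker environment like the one in the table could be assembled as follows. `build_worker_env` is a hypothetical helper, and the exact values the plugin writes into each variable are an assumption here; only the variable names come from the table.

```python
import os


def build_worker_env(worker_id, gpu_ids, base_env=None):
    """Return a copy of the environment with per-worker GPU variables set.

    Hypothetical helper; `gpu_ids` is a comma-joined string such as "2,3".
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            # Assumption: every visibility variable gets the same ID string.
            "CUDA_VISIBLE_DEVICES": gpu_ids,
            "HIP_VISIBLE_DEVICES": gpu_ids,
            "ROCR_VISIBLE_DEVICES": gpu_ids,
            "FKIT_WORKER_ID": str(worker_id),
            "FKIT_GPU_IDS": gpu_ids,
        }
    )
    return env
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run`, so each test subprocess sees only its worker's GPUs. A test can read `FKIT_WORKER_ID` from `os.environ` if it needs a per-worker scratch directory or port.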
GPU Error Patterns Detected
The following error patterns trigger automatic retry on a different worker:
- CUDA out of memory
- CUDA error / HIP error
- hipErrorNoBinaryForGpu
- hipErrorOutOfMemory
- NCCL error
- device-side assert
- GPU not found / no GPU
- cudaErrorNoDevice / hipErrorNoDevice
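Classifying a failure as a GPU error can be as simple as scanning the captured test output for the strings above. The sketch below assumes case-sensitive substring matching; how the plugin actually matches (regular expressions, case handling, which streams it inspects) is not stated in this README.

```python
# Patterns taken from the list above; the matching strategy is an assumption.
GPU_ERROR_PATTERNS = (
    "CUDA out of memory",
    "CUDA error",
    "HIP error",
    "hipErrorNoBinaryForGpu",
    "hipErrorOutOfMemory",
    "NCCL error",
    "device-side assert",
    "GPU not found",
    "no GPU",
    "cudaErrorNoDevice",
    "hipErrorNoDevice",
)


def is_gpu_error(output):
    """Return True if the captured output contains a known GPU error string."""
    return any(pattern in output for pattern in GPU_ERROR_PATTERNS)


if __name__ == "__main__":
    print(is_gpu_error("RuntimeError: CUDA out of memory. Tried to allocate"))
    print(is_gpu_error("AssertionError: expected 1, got 2"))
```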
Performance Considerations
- Overhead: ~100-500ms per test for subprocess spawning
- Parallelism: N workers = ~N× throughput (minus overhead)
- GPU Memory: Each worker has dedicated GPUs - no memory contention
- Dynamic Balancing: Fast tests don't block slow tests
- Fault Tolerance: Workers can fail without stopping the suite
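As a rough worked example of the overhead and parallelism figures above, the sketch below estimates wall-clock time under idealized assumptions: perfectly balanced workers, a fixed spawn cost per test, and no retries. `estimate_wall_time` is a hypothetical helper, and the 0.3-second default is simply a value inside the ~100-500ms range quoted above.

```python
def estimate_wall_time(num_tests, avg_test_seconds, num_workers,
                       spawn_overhead_seconds=0.3):
    """Idealized wall-clock estimate: total work divided evenly across workers."""
    per_test = avg_test_seconds + spawn_overhead_seconds
    return num_tests * per_test / num_workers


if __name__ == "__main__":
    # 1000 tests averaging 5 s each on 4 workers.
    print(f"{estimate_wall_time(1000, 5.0, 4):.0f} s with isolation")
    print(f"{estimate_wall_time(1000, 5.0, 1, 0.0):.0f} s serial, no isolation")
```

Under these assumptions the isolated parallel run takes about 1325 seconds against 5000 seconds serially. The spawn overhead matters most for suites made up of very short tests.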
Recommended Configurations
| Scenario | Workers | GPUs/Worker | Command |
|---|---|---|---|
| 8 GPUs, multi-GPU tests | 4 | 2 | --fkit-workers=4 --fkit-gpus-per-worker=2 |
| 8 GPUs, single-GPU tests | 8 | 1 | --fkit-workers=8 --fkit-gpus-per-worker=1 |
| 4 GPUs, mixed tests | 2 | 2 | --fkit-workers=2 --fkit-gpus-per-worker=2 |
| No GPUs (CPU tests) | auto | - | --fkit-workers=auto |
| Unreliable GPUs | 4+ | 2 | Enable retry with more workers |
Configuration File
Enable pytest-fkit in pytest.ini or pyproject.toml:
# pytest.ini
[pytest]
addopts = --fkit --fkit-timeout=600 --fkit-workers=auto
# pyproject.toml
[tool.pytest.ini_options]
addopts = ["--fkit", "--fkit-timeout=600", "--fkit-workers=auto"]
Comparison with pytest-xdist
| Feature | pytest-fkit | pytest-xdist |
|---|---|---|
| Crash isolation | ✅ Yes | ❌ No |
| GPU affinity | ✅ Yes | ❌ Manual |
| Parallel execution | ✅ Yes | ✅ Yes |
| Dynamic scheduling | ✅ Yes | ✅ Yes |
| GPU error retry | ✅ Yes | ❌ No |
| Worker fault tolerance | ✅ Yes | ⚠️ Limited |
| Memory isolation | ✅ Per-test | ⚠️ Per-worker |
| Overhead | Higher | Lower |
Use pytest-fkit when:
- Tests can crash Python (GPU drivers, C extensions)
- You need automatic GPU affinity
- You need per-test isolation
- GPU availability is unreliable
- You want automatic retry on GPU errors
Use pytest-xdist when:
- Tests are stable (no crashes)
- You need minimal overhead
- Tests don't use GPUs
Compatibility
- Python 3.8+
- pytest 6.0+
- Linux, macOS (Windows support TBD)
- AMD ROCm GPUs (detected via rocm-smi)
- NVIDIA GPUs (detected via nvidia-smi)
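A simplified way to tell which GPU stack is present is to check which of these tools is on the PATH. This sketch only identifies the vendor and does not count GPUs; `detect_gpu_vendor` is a hypothetical helper, and checking AMD before NVIDIA is an arbitrary choice rather than documented plugin behavior.

```python
import shutil


def detect_gpu_vendor(which=shutil.which):
    """Guess the GPU vendor from the management tools available on PATH.

    `which` is injectable so the function can be exercised without GPUs.
    """
    if which("rocm-smi"):
        return "AMD"
    if which("nvidia-smi"):
        return "NVIDIA"
    return None


if __name__ == "__main__":
    print(detect_gpu_vendor())
```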
License
MIT