A pytest plugin that prevents crashes from killing your test suite

pytest-fkit

Fix Krashes In Tests - A pytest plugin that prevents crashes from killing your entire test suite.

When a test crashes the Python interpreter (SIGABRT, SIGSEGV, etc.), pytest-fkit detects the crash and reports it as a normal pytest ERROR, so the rest of the run proceeds.

Features:

  • Parallel workers with GPU affinity
  • Sliced test distribution (default) - tests are pre-distributed across workers for deterministic, efficient execution
  • Crash isolation - each test runs in its own subprocess
  • Automatic GPU error detection and retry
  • Fault tolerance (workers can fail without stopping the test run)

The Problem

When running large test suites (like HuggingFace Transformers), sometimes a test causes Python to crash with a signal like SIGABRT:

Fatal Python error: Aborted
Thread 0x0000799e2ea00640 (most recent call first):
  File "/transformers/src/transformers/models/dots1/modeling_dots1.py", line 331 in forward
  ...

This kills pytest entirely, and all remaining tests in your suite never run.

The Solution

pytest-fkit runs each test in an isolated subprocess. If a test crashes:

  • ✅ The crash is caught and reported as a pytest ERROR
  • ✅ The remaining tests continue running
  • ✅ You get a full report with all test results, including which ones crashed

Installation

cd pytest-fkit
pip install -e .

Or install the published package (for example, as part of your test requirements):

pip install pytest-fkit

Usage

Basic Usage

Just add the --fkit flag to your pytest command:

pytest --fkit

With Timeout

Set a timeout per test (default is 600 seconds / 10 minutes):

pytest --fkit --fkit-timeout=300  # 5 minute timeout per test

Parallel Workers with Sliced Distribution

Run tests in parallel with automatic slicing:

# Auto-detect workers based on GPU count
pytest --fkit --fkit-workers=auto

# Specific number of workers
pytest --fkit --fkit-workers=4

# Control GPUs per worker (for multi-GPU tests)
pytest --fkit --fkit-workers=4 --fkit-gpus-per-worker=2

Sliced Scheduling (default): Tests are pre-distributed across workers:

  1. Tests are sorted by nodeid for reproducibility
  2. Round-robin distribution: test[i] goes to worker[i % num_workers]
  3. Each worker runs its slice with crash isolation (subprocess per test)
  4. Workers run in parallel for maximum throughput

Example with 4 workers and 100 tests:

  • Worker 0: tests 0, 4, 8, 12, ... (25 tests)
  • Worker 1: tests 1, 5, 9, 13, ... (25 tests)
  • Worker 2: tests 2, 6, 10, 14, ... (25 tests)
  • Worker 3: tests 3, 7, 11, 15, ... (25 tests)
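The slicing rule above can be sketched in a few lines of Python. This is a minimal illustration of sort-then-round-robin, not the plugin's actual code; the function name `slice_tests` and the generated test IDs are hypothetical.

```python
from typing import List, Sequence


def slice_tests(nodeids: Sequence[str], num_workers: int) -> List[List[str]]:
    """Distribute test IDs across workers: sort first, then deal round-robin."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    ordered = sorted(nodeids)  # sorting makes the split reproducible
    # Worker w receives items w, w + n, w + 2n, ... of the sorted list
    return [list(ordered[w::num_workers]) for w in range(num_workers)]


# Hypothetical test IDs, zero-padded so that sorting keeps numeric order
tests = [f"test_{i:03d}" for i in range(100)]
slices = slice_tests(tests, 4)
print([len(s) for s in slices])  # [25, 25, 25, 25]
print(slices[1][:4])             # ['test_001', 'test_005', 'test_009', 'test_013']
```

Because the input is sorted before slicing, the same collection always yields the same slices, regardless of the order in which the tests were collected.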

Execution Modes

# Batch mode (default) - pre-sliced, deterministic distribution
pytest --fkit --fkit-workers=4 --fkit-mode=batch

# Isolate mode - dynamic queue, on-demand assignment
pytest --fkit --fkit-workers=4 --fkit-mode=isolate

| Mode | Description | Best For |
|------|-------------|----------|
| batch | Tests pre-sliced to workers | Most use cases, reproducible |
| isolate | Dynamic work queue | Highly variable test durations |
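To make the difference between the two modes concrete, here is a rough sketch of a dynamic work queue of the kind isolate mode is described as using: each worker takes the next test only when it becomes free, so slow tests do not hold up a pre-assigned slice. The names `run_dynamic` and `run_one` are hypothetical, and the real plugin may be structured differently.

```python
import queue
import threading
from typing import Callable, Dict, List


def run_dynamic(nodeids: List[str], num_workers: int,
                run_one: Callable[[str], str]) -> Dict[str, str]:
    """Each worker pulls the next test from a shared queue when it is free."""
    pending = queue.Queue()
    for nodeid in nodeids:
        pending.put(nodeid)
    results: Dict[str, str] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                nodeid = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            outcome = run_one(nodeid)
            with lock:  # protect the shared results dict
                results[nodeid] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Stand-in runner that "passes" every test
outcomes = run_dynamic(["test_a", "test_b", "test_c"], 2, lambda nodeid: "passed")
print(sorted(outcomes))  # ['test_a', 'test_b', 'test_c']
```

Which worker runs which test depends on timing here, which is why this style trades the reproducibility of batch mode for better load balancing.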

GPU Allocation Examples

8 GPUs with multi-GPU tests (need 2 GPUs each):

pytest --fkit --fkit-workers=4 --fkit-gpus-per-worker=2
# Worker 0: GPU 0,1
# Worker 1: GPU 2,3
# Worker 2: GPU 4,5
# Worker 3: GPU 6,7

8 GPUs with single-GPU tests:

pytest --fkit --fkit-workers=8 --fkit-gpus-per-worker=1
# Worker 0: GPU 0
# Worker 1: GPU 1
# ...
# Worker 7: GPU 7
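The allocations in both examples follow a simple pattern: worker w receives a contiguous block of GPU IDs starting at w × gpus_per_worker. A small sketch of that mapping, assuming contiguous assignment as shown above (the helper `allocate_gpus` is hypothetical):

```python
from typing import List


def allocate_gpus(num_gpus: int, num_workers: int, gpus_per_worker: int) -> List[str]:
    """Give each worker a contiguous block of GPU IDs as a comma-separated string."""
    needed = num_workers * gpus_per_worker
    if needed > num_gpus:
        raise ValueError(f"need {needed} GPUs but only {num_gpus} are available")
    return [
        ",".join(str(g) for g in range(w * gpus_per_worker, (w + 1) * gpus_per_worker))
        for w in range(num_workers)
    ]


print(allocate_gpus(8, 4, 2))  # ['0,1', '2,3', '4,5', '6,7']
print(allocate_gpus(8, 8, 1))  # ['0', '1', '2', '3', '4', '5', '6', '7']
```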

Crash Isolation

Each test runs in its own subprocess, so crashes are contained:

  1. Crash Detection: SIGABRT, SIGSEGV, and other signals are caught
  2. Error Conversion: Crashes are converted to pytest ERROR results
  3. Suite Continuation: Remaining tests continue running on the worker
  4. Full Results: You get a complete report even if some tests crash

Example scenario:

Worker 0 (GPU 0,1): test_bert PASSED → test_llama PASSED → test_crash 💥 CRASH → test_gpt2 PASSED
Worker 1 (GPU 2,3): test_vit PASSED → test_whisper PASSED → test_t5 PASSED
Worker 2 (GPU 4,5): test_clip PASSED → test_blip PASSED → test_stable PASSED
Worker 3 (GPU 6,7): test_sam PASSED → test_dino PASSED → test_mae PASSED

# Crash on Worker 0 is isolated - other tests continue
# Final report shows 1 crash, 12 passed
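The detection and conversion steps can be illustrated with the standard library alone. On POSIX systems, `subprocess` reports a child killed by a signal as a negative return code, which is enough to tell a crash from an ordinary failure. This is a sketch of the general technique, not the plugin's implementation; `run_isolated` and its return strings are hypothetical.

```python
import signal
import subprocess
import sys
from typing import List, Optional


def run_isolated(cmd: List[str], timeout: Optional[float] = 600.0) -> str:
    """Run one command in a child process and classify how it ended."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "error: timeout"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal
        name = signal.Signals(-proc.returncode).name
        return f"error: crashed with {name}"
    return "passed" if proc.returncode == 0 else "failed"


# A child that aborts itself stands in for a crashing test
crash = [sys.executable, "-c", "import os; os.abort()"]
ok = [sys.executable, "-c", "pass"]
print(run_isolated(crash))  # on Linux/macOS: error: crashed with SIGABRT
print(run_isolated(ok))     # passed
```

The parent process survives the child's abort, so it can record the result and move on to the next test.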

Skip Crash Isolation for Specific Tests

If you have tests that don't play well with subprocess isolation, mark them:

import pytest

@pytest.mark.fkit_skip
def test_something_special():
    # This test will run normally without subprocess isolation
    pass

Mark GPU Requirements

These markers currently serve as documentation only; they are intended to drive optimal GPU scheduling in a future release:

import pytest

@pytest.mark.fkit_multi_gpu
def test_distributed_training():
    # This test needs multiple GPUs
    pass

@pytest.mark.fkit_single_gpu
def test_simple_forward():
    # This test needs only one GPU
    pass

How It Works

Architecture (Batch Mode - Default)

              ┌─────────────────────────────────────────────┐
              │           Test Collection (sorted)          │
              │  [test0, test1, test2, test3, test4, ...]   │
              └────────────────────┬────────────────────────┘
                                   │
                        Round-Robin Slicing
                                   │
         ┌─────────────────────────┼─────────────────────────┐
         │                         │                         │
         ▼                         ▼                         ▼
 ┌───────────────┐         ┌───────────────┐         ┌───────────────┐
 │   Worker 0    │         │   Worker 1    │         │   Worker 2    │
 │   GPU 0,1     │         │   GPU 2,3     │         │   GPU 4,5     │
 ├───────────────┤         ├───────────────┤         ├───────────────┤
 │ Slice:        │         │ Slice:        │         │ Slice:        │
 │  test0        │         │  test1        │         │  test2        │
 │  test3        │         │  test4        │         │  test5        │
 │  test6        │         │  test7        │         │  test8        │
 │  ...          │         │  ...          │         │  ...          │
 └───────┬───────┘         └───────┬───────┘         └───────┬───────┘
         │                         │                         │
         ▼                         ▼                         ▼
 ┌───────────────┐         ┌───────────────┐         ┌───────────────┐
 │  Subprocess   │         │  Subprocess   │         │  Subprocess   │
 │  per test     │         │  per test     │         │  per test     │
 │  (isolated)   │         │  (isolated)   │         │  (isolated)   │
 └───────────────┘         └───────────────┘         └───────────────┘

Flow

  1. GPU Detection: Automatically detects AMD (ROCm) or NVIDIA GPUs
  2. Worker Creation: Creates N worker threads, each with dedicated GPUs
  3. Test Slicing: Tests sorted and distributed via round-robin
  4. Parallel Execution: Each worker runs its slice independently
  5. Subprocess Isolation: Each test runs in its own subprocess (crash protection)
  6. Result Reporting: Results stream back to pytest as tests complete

Example Output

🚀 pytest-fkit: 4 workers, 8 AMD GPUs, 2 GPU(s)/worker
   GPU allocations: ['0,1', '2,3', '4,5', '6,7']
   Mode: batch - sliced scheduling (tests pre-distributed to workers)

🔄 Running 1000 tests across 4 workers (sliced scheduling - each worker gets 1/4 of tests)...

📊 Test distribution across 4 workers:
   Worker 0: 250 tests
   Worker 1: 250 tests
   Worker 2: 250 tests
   Worker 3: 250 tests
   Worker 0 (GPUs: 0,1): 250 tests
   Worker 1 (GPUs: 2,3): 250 tests
   Worker 2 (GPUs: 4,5): 250 tests
   Worker 3 (GPUs: 6,7): 250 tests

tests/models/bert/test_modeling_bert.py::BertModelTest::test_forward PASSED
tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_forward PASSED
tests/models/whisper/test_modeling_whisper.py::WhisperModelTest::test_forward PASSED

======================================================================
✅ Completed 1000 tests
   Passed: 950, Failed: 45, Skipped: 5
   💥 Crashes: 2
======================================================================

=============== pytest-fkit summary ===============
💥 2 test(s) CRASHED (converted to ERROR by pytest-fkit):
  - tests/models/dots1/test_modeling_dots1.py::Dots1ModelTest::test_model_15b

✅ pytest-fkit prevented 2 crashes from killing your test suite!

Command Line Options

| Option | Default | Description |
|--------|---------|-------------|
| --fkit | False | Enable crash isolation |
| --fkit-timeout | 600 | Timeout per test in seconds |
| --fkit-workers | 1 | Number of parallel workers (auto for GPU-based) |
| --fkit-gpus-per-worker | 2 | GPUs assigned to each worker |
| --fkit-mode | batch | batch (pre-sliced) or isolate (dynamic queue) |
| --fkit-threads-per-worker | auto | CPU threads per worker (auto = cores/workers) |
| --fkit-max-retries | 3 | Max retries for transient errors |

Environment Variables Set Per Worker

| Variable | Description |
|----------|-------------|
| CUDA_VISIBLE_DEVICES | GPU IDs for NVIDIA / compatibility |
| HIP_VISIBLE_DEVICES | GPU IDs for AMD ROCm (0-based within ROCR set) |
| ROCR_VISIBLE_DEVICES | Physical GPU IDs for AMD ROCm runtime |
| FKIT_WORKER_ID | Worker index (0, 1, 2, ...) |
| FKIT_GPU_IDS | Assigned physical GPU IDs string |
| MASTER_PORT | Per-worker NCCL port (29500 + worker_id) |
| MASTER_ADDR | NCCL address (127.0.0.1) |
| NCCL_ASYNC_ERROR_HANDLING | Enabled (prevents NCCL hangs) |
| NCCL_SOCKET_IFNAME | Loopback interface (avoids NIC issues) |
| OMP_NUM_THREADS | CPU threads per worker |
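As a rough illustration of how such an environment might be assembled for one worker, the sketch below builds a dict suitable for passing as `env=` to `subprocess.run`. The helper `worker_env` is hypothetical. The NCCL variables are omitted because their exact values are not stated above, and setting CUDA_VISIBLE_DEVICES only on the NVIDIA path is an assumption.

```python
import os
from typing import Dict, List


def worker_env(worker_id: int, gpu_ids: List[int], threads: int, amd: bool) -> Dict[str, str]:
    """Build the environment for one worker's test subprocesses."""
    env = dict(os.environ)  # start from the parent environment
    physical = ",".join(str(g) for g in gpu_ids)
    env.update({
        "FKIT_WORKER_ID": str(worker_id),
        "FKIT_GPU_IDS": physical,
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": str(29500 + worker_id),  # distinct port per worker
        "OMP_NUM_THREADS": str(threads),
    })
    if amd:
        # ROCR selects the physical devices; HIP then indexes from 0 within that set
        env["ROCR_VISIBLE_DEVICES"] = physical
        env["HIP_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(len(gpu_ids)))
    else:
        env["CUDA_VISIBLE_DEVICES"] = physical
    return env


env = worker_env(worker_id=1, gpu_ids=[2, 3], threads=8, amd=True)
print(env["MASTER_PORT"], env["ROCR_VISIBLE_DEVICES"], env["HIP_VISIBLE_DEVICES"])
# 29501 2,3 0,1
```

Giving each worker its own MASTER_PORT keeps distributed tests on different workers from colliding on the same rendezvous port.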

Crash Recovery

After a test crash (SIGABRT, SIGSEGV, etc.):

  1. 5s cooldown for GPU driver to reclaim resources
  2. GPU health probe - spawns subprocess to allocate a tensor and sync
  3. If probe fails, 10s extended cooldown + second probe
  4. If still unhealthy, worker disabled and remaining tests redistributed to healthy workers
  5. If healthy, continue with next test

This prevents the cascade where one crash leaves the GPU unusable and all subsequent tests on that worker fail with "No HIP GPUs are available".
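The recovery sequence can be expressed as a short decision function. The sketch below takes the health probe and the sleep function as parameters so the logic can be exercised without a GPU; `recover_after_crash` is a hypothetical name, and the real probe (allocating a tensor in a subprocess) is not shown.

```python
import time
from typing import Callable


def recover_after_crash(probe: Callable[[], bool],
                        sleep: Callable[[float], None] = time.sleep,
                        cooldown: float = 5.0,
                        extended_cooldown: float = 10.0) -> bool:
    """Return True if the worker may continue, False if it should be disabled."""
    sleep(cooldown)           # give the driver time to reclaim resources
    if probe():
        return True           # healthy: continue with the next test
    sleep(extended_cooldown)  # longer wait before the second attempt
    return probe()            # still unhealthy means the worker is disabled


# Simulated probe: fails once, then succeeds; waits are recorded instead of slept
answers = iter([False, True])
waits = []
print(recover_after_crash(lambda: next(answers), sleep=waits.append))  # True
print(waits)  # [5.0, 10.0]
```

When the function returns False, the caller would disable the worker and hand its remaining tests to the healthy workers, as described in step 4.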

GPU Error Patterns Detected

The following error patterns trigger automatic retry:

  • No HIP GPUs are available / No CUDA GPUs are available
  • CUDA out of memory / hipErrorOutOfMemory / HIP out of memory
  • hipErrorNoDevice / cudaErrorNoDevice
  • NCCL Error 2: unhandled system error / NCCL error
  • Network/DNS errors (DNS resolution, connection refused, timeouts)
  • HuggingFace Hub HTTP errors (502, 503, 504)
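One plausible way to implement this kind of detection is to match the captured test output against a set of regular expressions and rerun the test while the output looks transient. The sketch below covers only the GPU and NCCL messages listed above; the network and HTTP patterns are left out because their exact wording is not given, and case-insensitive matching is an assumption. The names `is_retryable` and `run_with_retry` are hypothetical.

```python
import re
from typing import Callable

# Patterns taken from the list above (GPU and NCCL messages only)
RETRYABLE_PATTERNS = [
    r"No (HIP|CUDA) GPUs are available",
    r"(CUDA|HIP) out of memory",
    r"hipErrorOutOfMemory",
    r"(hip|cuda)ErrorNoDevice",
    r"NCCL error",
]
RETRYABLE = re.compile("|".join(RETRYABLE_PATTERNS), re.IGNORECASE)


def is_retryable(output: str) -> bool:
    """True if the captured output contains a known transient error message."""
    return RETRYABLE.search(output) is not None


def run_with_retry(run_once: Callable[[], str], max_retries: int = 3) -> str:
    """Call run_once, repeating up to max_retries times while output looks transient."""
    output = run_once()
    for _ in range(max_retries):
        if not is_retryable(output):
            break
        output = run_once()
    return output


print(is_retryable("RuntimeError: No HIP GPUs are available"))  # True
print(is_retryable("AssertionError: 1 != 2"))                   # False
```

Ordinary assertion failures do not match any pattern, so genuine test failures are reported immediately instead of being retried.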

Performance Considerations

  • Overhead: ~100-500ms per test for subprocess spawning
  • Parallelism: N workers = ~N× throughput (minus overhead)
  • GPU Memory: Each worker has dedicated GPUs - no memory contention
  • Deterministic: Same test distribution every run (batch mode)
  • Crash Isolation: One crash doesn't affect other tests
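To get a feel for the subprocess overhead, a back-of-the-envelope estimate helps: each worker pays the spawn cost once per test in its slice. The helper below is purely illustrative arithmetic using the ~100-500ms range quoted above; `added_wall_time` is a hypothetical name.

```python
def added_wall_time(num_tests: int, num_workers: int, spawn_cost_s: float) -> float:
    """Extra wall-clock seconds from subprocess spawning on the busiest worker."""
    per_worker = -(-num_tests // num_workers)  # ceiling division
    return per_worker * spawn_cost_s


# 1000 tests on 4 workers, at the low and high ends of the quoted range
low = added_wall_time(1000, 4, 0.1)
high = added_wall_time(1000, 4, 0.5)
print(f"{low:.0f} s to {high:.0f} s")  # 25 s to 125 s
```

For suites whose tests each take seconds or more, this overhead is small relative to total run time; for suites of very fast tests it can dominate.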

Recommended Configurations

| Scenario | Workers | GPUs/Worker | Mode | Command |
|----------|---------|-------------|------|---------|
| 8 GPUs, multi-GPU tests | 4 | 2 | batch | --fkit-workers=4 --fkit-gpus-per-worker=2 |
| 8 GPUs, single-GPU tests | 8 | 1 | batch | --fkit-workers=8 --fkit-gpus-per-worker=1 |
| 4 GPUs, mixed tests | 2 | 2 | batch | --fkit-workers=2 --fkit-gpus-per-worker=2 |
| No GPUs (CPU tests) | auto | - | batch | --fkit-workers=auto |
| Highly variable durations | 4 | 2 | isolate | --fkit-workers=4 --fkit-mode=isolate |

Configuration File

Enable pytest-fkit in pytest.ini or pyproject.toml:

# pytest.ini
[pytest]
addopts = --fkit --fkit-timeout=600 --fkit-workers=auto

# pyproject.toml
[tool.pytest.ini_options]
addopts = ["--fkit", "--fkit-timeout=600", "--fkit-workers=auto"]

Comparison with pytest-xdist

| Feature | pytest-fkit | pytest-xdist |
|---------|-------------|--------------|
| Crash isolation | ✅ Yes (per-test subprocess) | ❌ No |
| GPU affinity | ✅ Yes (automatic) | ❌ Manual |
| Parallel execution | ✅ Yes | ✅ Yes |
| Sliced scheduling | ✅ Yes (round-robin) | ✅ Yes (load-based) |
| GPU error retry | ✅ Yes (isolate mode) | ❌ No |
| Worker fault tolerance | ✅ Yes | ⚠️ Limited |
| Memory isolation | ✅ Per-test | ⚠️ Per-worker |
| Reproducible distribution | ✅ Yes (deterministic) | ⚠️ Varies |
| Overhead | Higher | Lower |

Use pytest-fkit when:

  • Tests can crash Python (GPU drivers, C extensions)
  • You need automatic GPU affinity
  • You need per-test isolation
  • GPU availability is unreliable
  • You want automatic retry on GPU errors

Use pytest-xdist when:

  • Tests are stable (no crashes)
  • You need minimal overhead
  • Tests don't use GPUs

Compatibility

  • Python 3.8+
  • pytest 6.0+
  • Linux, macOS (Windows support TBD)
  • AMD ROCm GPUs (detected via rocm-smi)
  • NVIDIA GPUs (detected via nvidia-smi)

License

MIT
