A pytest plugin that prevents crashes from killing your test suite, with execution tracing

pytest-fkit

Fix Krashes In Tests - A pytest plugin that prevents crashes from killing your entire test suite.

When a test crashes the Python interpreter (SIGABRT, SIGSEGV, etc.), pytest-fkit catches the crash and reports it as a normal pytest ERROR instead of letting it kill the entire test run.

Features:

Parallel workers with GPU affinity
Dynamic work queue scheduling (tests go to first available worker)
Automatic GPU error detection and retry on different workers
Fault tolerance (workers can fail without stopping the test run)

The Problem

When running large test suites (like HuggingFace Transformers), sometimes a test causes Python to crash with a signal like SIGABRT:

Fatal Python error: Aborted
Thread 0x0000799e2ea00640 (most recent call first):
  File "/transformers/src/transformers/models/dots1/modeling_dots1.py", line 331 in forward
  ...

This kills pytest entirely, and all remaining tests in your suite never run.

The Solution

pytest-fkit runs each test in an isolated subprocess. If a test crashes:

✅ The crash is caught and reported as a pytest ERROR
✅ The remaining tests continue running
✅ You get a full report with all test results, including which ones crashed
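
The isolation idea can be illustrated with a minimal sketch. This is not pytest-fkit's actual implementation: `run_isolated` is a hypothetical helper, and it assumes a POSIX system, where a child killed by a signal reports a negative return code.

```python
import signal
import subprocess
import sys


def run_isolated(code: str, timeout: float = 600.0) -> dict:
    """Run `code` in a child interpreter and classify the outcome.

    Hypothetical helper for illustration only; the real plugin runs
    pytest test items, not code strings.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before this is raised.
        return {"outcome": "error", "reason": f"timeout after {timeout}s"}

    if proc.returncode < 0:
        # On POSIX, a negative return code means "killed by signal -N".
        try:
            name = signal.Signals(-proc.returncode).name
        except ValueError:
            name = f"signal {-proc.returncode}"
        return {"outcome": "error", "reason": f"crashed with {name}"}

    if proc.returncode == 0:
        return {"outcome": "passed", "reason": ""}
    return {"outcome": "failed", "reason": proc.stderr.strip()}
```

Because the crash happens in the child process, the parent only sees a return code and can keep going with the next test.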

Installation

From a source checkout:

cd pytest-fkit
pip install -e .

Or install the released package (the same requirement can go in your test requirements file):

pip install pytest-fkit

Usage

Basic Usage

Just add the --fkit flag to your pytest command:

pytest --fkit

With Timeout

Set a timeout per test (default is 600 seconds / 10 minutes):

pytest --fkit --fkit-timeout=300  # 5 minute timeout per test

Parallel Workers with Dynamic Scheduling

Run tests in parallel with automatic work distribution:

# Auto-detect workers based on GPU count
pytest --fkit --fkit-workers=auto

# Specific number of workers
pytest --fkit --fkit-workers=4

# Control GPUs per worker (for multi-GPU tests)
pytest --fkit --fkit-workers=4 --fkit-gpus-per-worker=2

Dynamic Scheduling: Tests are NOT pre-assigned to workers. Instead:

All tests go into a shared work queue
Workers pull tests as they become available
First available worker gets the next test
Automatic load balancing across workers
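
The scheduling model described above can be sketched with a shared `queue.Queue` and worker threads. The names `run_dynamic` and `run_test` are hypothetical and chosen for illustration; the plugin's internals may be organised differently.

```python
import queue
import threading


def run_dynamic(tests, num_workers, run_test):
    """Run `tests` on `num_workers` threads that pull from one shared queue.

    `run_test(test, worker_id)` is a caller-supplied callable (an
    assumption of this sketch) that returns the outcome for one test.
    """
    work = queue.Queue()
    for test in tests:
        work.put(test)

    results = {}
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                # No pre-assignment: whichever worker is free takes the next test.
                test = work.get_nowait()
            except queue.Empty:
                return
            outcome = run_test(test, worker_id)
            with lock:
                results[test] = (worker_id, outcome)

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

A worker stuck on a slow test simply stops pulling from the queue, so the remaining tests flow to the other workers.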

GPU Allocation Examples

8 GPUs with multi-GPU tests (need 2 GPUs each):

pytest --fkit --fkit-workers=4 --fkit-gpus-per-worker=2
# Worker 0: GPU 0,1
# Worker 1: GPU 2,3
# Worker 2: GPU 4,5
# Worker 3: GPU 6,7

8 GPUs with single-GPU tests:

pytest --fkit --fkit-workers=8 --fkit-gpus-per-worker=1
# Worker 0: GPU 0
# Worker 1: GPU 1
# ...
# Worker 7: GPU 7
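
The contiguous layout in these examples can be reproduced with a few lines of arithmetic. `allocate_gpus` is a hypothetical function, and raising an error when there are too few GPUs is an assumption of this sketch, not documented plugin behaviour.

```python
def allocate_gpus(num_gpus, workers, gpus_per_worker):
    """Return one comma-separated GPU ID string per worker.

    Worker w gets the contiguous block starting at w * gpus_per_worker.
    """
    needed = workers * gpus_per_worker
    if needed > num_gpus:
        # Assumption: this sketch refuses to oversubscribe GPUs.
        raise ValueError(f"need {needed} GPUs but only {num_gpus} available")
    return [
        ",".join(
            str(gpu)
            for gpu in range(w * gpus_per_worker, (w + 1) * gpus_per_worker)
        )
        for w in range(workers)
    ]


print(allocate_gpus(8, 4, 2))  # ['0,1', '2,3', '4,5', '6,7']
```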

Fault Tolerance

pytest-fkit handles GPU failures gracefully:

GPU Error Detection: Automatically detects GPU-related errors (CUDA OOM, HIP errors, etc.)
Automatic Retry: If a test fails due to GPU errors, it's automatically retried on a different worker
Worker Disabling: If a worker encounters 3+ consecutive GPU errors, it's disabled and remaining tests are scheduled to healthy workers
No Test Loss: Even if GPUs are missing or workers fail, all tests will eventually run on available workers

Example scenario:

Worker 0 (GPU 0,1): Running tests...
Worker 1 (GPU 2,3): Running tests...
Worker 2 (GPU 4,5): ⚠️ GPU 4 missing - CUDA error
                   → Test retried on Worker 0
Worker 3 (GPU 6,7): Running tests...

# Worker 2 disabled after 3 GPU errors
# Remaining tests automatically go to Workers 0, 1, 3
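
One way to picture the worker-disabling rule is a per-worker counter that resets on any result that is not a GPU error. The sketch below is illustrative only: `WorkerHealth` and `pick_retry_worker` are hypothetical names, and only the threshold of 3 consecutive errors comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class WorkerHealth:
    worker_id: int
    max_consecutive_gpu_errors: int = 3  # threshold stated in the docs
    consecutive_gpu_errors: int = 0
    disabled: bool = False

    def record(self, gpu_error: bool) -> None:
        """Update the streak after one test result on this worker."""
        if gpu_error:
            self.consecutive_gpu_errors += 1
            if self.consecutive_gpu_errors >= self.max_consecutive_gpu_errors:
                self.disabled = True
        else:
            # Any non-GPU-error result breaks the streak.
            self.consecutive_gpu_errors = 0


def pick_retry_worker(workers, failed_worker_id):
    """Return a healthy worker other than the one that just failed, or None."""
    for worker in workers:
        if not worker.disabled and worker.worker_id != failed_worker_id:
            return worker
    return None
```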

Skip Crash Isolation for Specific Tests

If you have tests that don't play well with subprocess isolation, mark them:

import pytest

@pytest.mark.fkit_skip
def test_something_special():
    # This test will run normally without subprocess isolation
    pass

Mark GPU Requirements

Mark tests with their GPU requirements. These markers currently serve as documentation only; optimal GPU scheduling based on them is planned for the future:

import pytest

@pytest.mark.fkit_multi_gpu
def test_distributed_training():
    # This test needs multiple GPUs
    pass

@pytest.mark.fkit_single_gpu
def test_simple_forward():
    # This test needs only one GPU
    pass

How It Works

Architecture

                    ┌─────────────────────────────────────┐
                    │          Shared Work Queue          │
                    │  [test1, test2, test3, test4, ...]  │
                    └──────────────┬──────────────────────┘
                                   │
            ┌──────────────────────┼──────────────────────┐
            │                      │                      │
            ▼                      ▼                      ▼
    ┌───────────────┐      ┌───────────────┐      ┌───────────────┐
    │   Worker 0    │      │   Worker 1    │      │   Worker 2    │
    │  GPU 0,1      │      │  GPU 2,3      │      │  GPU 4,5      │
    │               │      │               │      │               │
    │  Pull next    │      │  Pull next    │      │  Pull next    │
    │  available    │      │  available    │      │  available    │
    │  test         │      │  test         │      │  test         │
    └───────┬───────┘      └───────┬───────┘      └───────┬───────┘
            │                      │                      │
            ▼                      ▼                      ▼
    ┌───────────────┐      ┌───────────────┐      ┌───────────────┐
    │  Subprocess   │      │  Subprocess   │      │  Subprocess   │
    │  (isolated)   │      │  (isolated)   │      │  (isolated)   │
    └───────────────┘      └───────────────┘      └───────────────┘

Flow

GPU Detection: Automatically detects AMD (ROCm) or NVIDIA GPUs
Worker Creation: Creates N worker threads, each with dedicated GPUs
Queue Population: All tests go into a shared work queue
Dynamic Dispatch: Workers pull tests from the queue as they finish
Subprocess Isolation: Each test runs in its own subprocess
Error Handling: GPU errors trigger retry on different workers
Result Reporting: Results stream back to pytest as tests complete

Example Output

🚀 pytest-fkit: 4 workers, 8 AMD GPUs, 2 GPU(s)/worker
   GPU allocations: ['0,1', '2,3', '4,5', '6,7']
   Dynamic scheduling: tests assigned to first available worker

🔄 Running 1000 tests across 4 workers (dynamic scheduling)...

tests/models/bert/test_modeling_bert.py::BertModelTest::test_forward PASSED
tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_forward PASSED
   🔄 Retrying test_model_15b on another worker (attempt 2)
tests/models/whisper/test_modeling_whisper.py::WhisperModelTest::test_forward PASSED
⚠️  Worker 2 (GPUs: 4,5) disabled after 3 consecutive GPU errors

======================================================================
✅ Completed 1000 tests
   Passed: 950, Failed: 45, Skipped: 5
   💥 Crashes: 2
   🎮 GPU errors: 8 (retries: 5)
   ⚠️  Workers disabled: 1
======================================================================

=============== pytest-fkit summary ===============
💥 2 test(s) CRASHED (converted to ERROR by pytest-fkit):
  - tests/models/dots1/test_modeling_dots1.py::Dots1ModelTest::test_model_15b

✅ pytest-fkit prevented 2 crashes from killing your test suite!

Command Line Options

Option	Default	Description
`--fkit`	`False`	Enable crash isolation
`--fkit-timeout`	`600`	Timeout per test in seconds
`--fkit-workers`	`1`	Number of parallel workers (`auto` for GPU-based)
`--fkit-gpus-per-worker`	`2`	GPUs assigned to each worker

Environment Variables Set Per Worker

Variable	Description
`CUDA_VISIBLE_DEVICES`	GPU IDs for NVIDIA / compatibility
`HIP_VISIBLE_DEVICES`	GPU IDs for AMD ROCm
`ROCR_VISIBLE_DEVICES`	GPU IDs for AMD ROCm runtime
`FKIT_WORKER_ID`	Worker index (0, 1, 2, ...)
`FKIT_GPU_IDS`	Assigned GPU IDs string
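
A test or fixture could read these variables to find out where it is running. The helper below is a hedged sketch: `fkit_worker_info` is a hypothetical name, and it assumes `FKIT_GPU_IDS` is a comma-separated string such as `0,1`, which matches the allocation strings in the example output but is not spelled out in the table.

```python
import os


def fkit_worker_info(environ=None):
    """Return the worker index and GPU IDs exported for this process.

    Returns worker_id None and an empty GPU list when the variables
    are absent (for example, when running without --fkit).
    """
    env = os.environ if environ is None else environ
    raw_worker = env.get("FKIT_WORKER_ID")
    raw_gpus = env.get("FKIT_GPU_IDS", "")
    # Assumption: IDs are comma-separated integers, e.g. "0,1".
    gpu_ids = [int(part) for part in raw_gpus.split(",") if part.strip()]
    return {
        "worker_id": int(raw_worker) if raw_worker else None,
        "gpu_ids": gpu_ids,
    }
```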

GPU Error Patterns Detected

The following error patterns trigger automatic retry on a different worker:

CUDA out of memory
CUDA error / HIP error
hipErrorNoBinaryForGpu
hipErrorOutOfMemory
NCCL error
device-side assert
GPU not found / no GPU
cudaErrorNoDevice / hipErrorNoDevice
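
As an illustration, the listed patterns can be combined into a single regular expression and matched against a test's captured output. The pattern strings come from the list above; the case-insensitive matching and the `is_gpu_error` name are assumptions of this sketch, and the plugin's own matching may differ.

```python
import re

# Pattern strings taken from the list above.
GPU_ERROR_PATTERNS = [
    r"CUDA out of memory",
    r"CUDA error",
    r"HIP error",
    r"hipErrorNoBinaryForGpu",
    r"hipErrorOutOfMemory",
    r"NCCL error",
    r"device-side assert",
    r"GPU not found",
    r"no GPU",
    r"cudaErrorNoDevice",
    r"hipErrorNoDevice",
]

# Assumption: matching is case-insensitive.
_GPU_ERROR_RE = re.compile("|".join(GPU_ERROR_PATTERNS), re.IGNORECASE)


def is_gpu_error(output: str) -> bool:
    """Return True if the captured output looks like a GPU-related failure."""
    return _GPU_ERROR_RE.search(output) is not None
```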

Performance Considerations

Overhead: ~100-500ms per test for subprocess spawning
Parallelism: N workers = ~N× throughput (minus overhead)
GPU Memory: Each worker has dedicated GPUs - no memory contention
Dynamic Balancing: Fast tests don't block slow tests
Fault Tolerance: Workers can fail without stopping the suite

Recommended Configurations

Scenario	Workers	GPUs/Worker	Command
8 GPUs, multi-GPU tests	4	2	`--fkit-workers=4 --fkit-gpus-per-worker=2`
8 GPUs, single-GPU tests	8	1	`--fkit-workers=8 --fkit-gpus-per-worker=1`
4 GPUs, mixed tests	2	2	`--fkit-workers=2 --fkit-gpus-per-worker=2`
No GPUs (CPU tests)	auto	-	`--fkit-workers=auto`
Unreliable GPUs	4+	2	Enable retry with more workers

Configuration File

Enable pytest-fkit in pytest.ini or pyproject.toml:

# pytest.ini
[pytest]
addopts = --fkit --fkit-timeout=600 --fkit-workers=auto

# pyproject.toml
[tool.pytest.ini_options]
addopts = ["--fkit", "--fkit-timeout=600", "--fkit-workers=auto"]

Comparison with pytest-xdist

Feature	pytest-fkit	pytest-xdist
Crash isolation	✅ Yes	❌ No
GPU affinity	✅ Yes	❌ Manual
Parallel execution	✅ Yes	✅ Yes
Dynamic scheduling	✅ Yes	✅ Yes
GPU error retry	✅ Yes	❌ No
Worker fault tolerance	✅ Yes	⚠️ Limited
Memory isolation	✅ Per-test	⚠️ Per-worker
Overhead	Higher	Lower

Use pytest-fkit when:

Tests can crash Python (GPU drivers, C extensions)
You need automatic GPU affinity
You need per-test isolation
GPU availability is unreliable
You want automatic retry on GPU errors

Use pytest-xdist when:

Tests are stable (no crashes)
You need minimal overhead
Tests don't use GPUs

Compatibility

Python 3.8+
pytest 6.0+
Linux, macOS (Windows support TBD)
AMD ROCm GPUs (detected via rocm-smi)
NVIDIA GPUs (detected via nvidia-smi)

License

MIT

Release history

Version	Released
0.11.0	Apr 21, 2026
0.10.0	Apr 19, 2026
0.9.6	Apr 16, 2026
0.9.5	Feb 19, 2026
0.9.4	Feb 19, 2026
0.9.3	Feb 18, 2026
0.9.2	Feb 17, 2026
0.9.1	Feb 17, 2026
0.9.0	Feb 17, 2026
0.8.0	Feb 16, 2026
0.7.0	Feb 16, 2026
0.6.1	Feb 16, 2026
0.6.0	Feb 14, 2026
0.5.0	Feb 13, 2026
0.3.4	Feb 12, 2026
0.3.3	Feb 12, 2026
0.3.2	Feb 11, 2026
0.3.1	Feb 10, 2026
0.3.0	Feb 3, 2026
0.2.0 (this version)	Feb 3, 2026
0.1.0	Jan 5, 2026

Download files

Source Distribution

pytest_fkit-0.2.0.tar.gz (80.4 kB)
Upload date: Feb 3, 2026
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

Algorithm	Hash digest
SHA256	`cf1f5b7268881f5a473597fcfc979a37f987f42192933757f3f022f60c1fd0e9`
MD5	`2e8e8eb0ea533a3d1e0a1fbdca0ab7eb`
BLAKE2b-256	`ccdfa2d3aa99813d98346b20256c166b1641d46fc4039af29e13cadce74ab09b`

Built Distribution

pytest_fkit-0.2.0-py3-none-any.whl (86.5 kB)
Upload date: Feb 3, 2026
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

Algorithm	Hash digest
SHA256	`62665efaa88347bc5dfbb0a05fb9da1f43e60999065189b813db001989bfd723`
MD5	`fc7503b493e539ae87904c7c3adedd40`
BLAKE2b-256	`0c1da3d788999243490850095eefc37834583c36de966fa49f6979fcfae995d7`