# Prosit — PROcess SImulation Tool
Prosit is a Python library for rule-aware business process simulation. Given an event log in XES format and a Petri net process model, it automatically discovers simulation parameters (arrival rates, execution times, waiting times, resource assignments, routing probabilities) and runs discrete-event simulations that reproduce the statistical behaviour of the original process.
Unlike basic simulation tools, Prosit builds conditional models — decision trees that learn when each resource is preferred, how long an activity takes depending on the case context, and which path is taken at decision points.
## Table of Contents
- Installation
- Quick Start
- Core Concepts
- API Reference
- Discovery Options
- Simulation Options
- Save and Load Parameters
- Advanced Usage
- Citation
## Installation

Requirements: Python 3.10

### Option 1 — Conda (recommended)

```bash
git clone https://github.com/franvinci/prosit
cd prosit
conda env create -f environment.yml
conda activate prosit
```

### Option 2 — pip

```bash
git clone https://github.com/franvinci/prosit
cd prosit
pip install -r requirements.txt
```
### Dependencies

| Package | Version | Purpose |
|---|---|---|
| `pm4py` | 2.4.1 | Event log parsing, Petri net discovery and conformance |
| `scikit-learn` | 1.1.3 | Decision tree models (batch discovery) |
| `river` | 0.22.0 | Hoeffding Adaptive Tree (incremental discovery) |
| `scipy` | 1.14.1 | Distribution fitting and sampling |
| `numpy` | 1.26.4 | Numerical operations |
| `pandas` | 2.2.3 | Feature DataFrame construction |
| `tqdm` | 4.64.1 | Progress bars |
| `graphviz` | 0.20.3 | Decision tree visualisation |
## Quick Start

```python
import sys
sys.path.append("src/")
import warnings
warnings.filterwarnings("ignore")

import pm4py
import pm4py.objects.log.importer.xes.importer as xes_importer

from prosit.simulator import SimulatorParameters, SimulatorEngine

# 1. Load event log
log = xes_importer.apply("data/logs/purchasing.xes")

# 2. Discover Petri net
net, im, fm = pm4py.discover_petri_net_inductive(log)

# 3. Create and populate simulation parameters
params = SimulatorParameters(net, im, fm)
params.discover_from_eventlog(log, max_depth_tree=3)

# 4. Save for later reuse
params.to_json("params_purchasing.json")

# 5. Simulate 500 cases
sim_engine = SimulatorEngine(params)
sim_log = sim_engine.apply(n_traces=500)
print(sim_log.head())
```

The output is a `pandas.DataFrame` with columns `case:concept:name`, `concept:name`, `org:resource`, `enabled:timestamp`, `start:timestamp`, `time:timestamp`, plus any case-level data attributes found in the original log.
## Core Concepts

### What is discovered

Prosit extracts the following simulation parameters from an event log:
| Parameter | What it models |
|---|---|
| Arrival time | Inter-arrival time between consecutive cases, conditional on hour and weekday |
| Execution time | Working-hours duration of each activity, conditional on resource and case history |
| Waiting time | Queue delay after a resource becomes free, conditional on workload and case context |
| Control flow | Routing probability at each decision point, conditional on case history and attributes |
| Resource selection | Which eligible resource executes each activity — one dedicated classifier per (activity, candidate resource) conditional on resource-usage history and case attributes |
| Calendars | Working hours per resource and for case arrivals |
| Multitasking | Maximum concurrent tasks per resource, derived from observed concurrent workload |
| Data attributes | Joint or per-attribute distribution of case-level data attributes (e.g. case type, priority) |
### Rules mode vs. no-rules mode

- **Rules mode** (`max_depth_tree >= 1`): parameters are learned as decision trees. Time models (arrival, execution, waiting) use regression trees — each leaf has its own fitted distribution. Routing and resource selection use classification trees that score each candidate transition or resource and sample proportionally.
- **No-rules mode** (`max_depth_tree=0`): each parameter collapses to a single fitted distribution or frequency weight — simpler, faster, less expressive.
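The rules-mode idea can be sketched in a few lines. This is an illustration of the concept, not Prosit's internals — the binary context feature and the exponential family are invented for the example: a shallow regression tree splits cases by context, and each leaf carries its own fitted distribution that is sampled at run time.

```python
# Toy illustration of rules mode (not Prosit's internals): a regression
# tree partitions cases by context; each leaf gets its own distribution.
import numpy as np
from scipy import stats
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 1))            # one binary context feature
y = np.where(X[:, 0] == 1,
             rng.exponential(10.0, 1000),         # slow context: mean 10
             rng.exponential(2.0, 1000))          # fast context: mean 2

tree = DecisionTreeRegressor(max_depth=1).fit(X, y)

# Fit one exponential distribution per leaf
leaf_ids = tree.apply(X)
leaf_dists = {}
for leaf in np.unique(leaf_ids):
    loc, scale = stats.expon.fit(y[leaf_ids == leaf], floc=0)
    leaf_dists[leaf] = stats.expon(loc=loc, scale=scale)

def sample_duration(x_row, seed=None):
    """Route the case to its leaf and sample that leaf's distribution."""
    leaf = tree.apply(np.asarray(x_row).reshape(1, -1))[0]
    return leaf_dists[leaf].rvs(random_state=seed)
```

In no-rules mode the tree degenerates to a single root leaf, i.e. one unconditional distribution.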
### Batch vs. incremental discovery

- **Batch discovery** (default): trains decision trees using `scikit-learn` with cross-validated hyperparameters (`max_depth`, `min_samples_leaf`, `max_features`). Classification models (control flow, resource selection) are also compared against a prior-only `DummyClassifier` inside the same CV grid — when no tree beats the prior, the model collapses to a single marginal probability.
- **Incremental discovery** (`incremental_discovery=True`): uses Hoeffding Adaptive Trees from the `river` library. Gives more weight to recent traces, suitable for concept drift or evolving processes.
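The prior-collapse check can be pictured with scikit-learn directly. This is a sketch of the assumed mechanics, not the library's exact CV setup; the noise data is invented to show the case where the tree loses:

```python
# Sketch: compare a decision tree against a prior-only DummyClassifier
# under cross-validation; if the tree is no better, keep the prior.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = rng.integers(0, 2, size=300)      # pure noise: no feature is predictive

tree_score = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=5).mean()
prior_score = cross_val_score(
    DummyClassifier(strategy="prior"), X, y, cv=5).mean()

# On noise the tree rarely beats the prior, so the model would
# collapse to a single marginal probability.
use_tree = tree_score > prior_score
```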
## API Reference

### SimulatorParameters

```python
SimulatorParameters(net: PetriNet, initial_marking: Marking, final_marking: Marking)
```

Holds all simulation parameters. Initialise with the Petri net (typically discovered via `pm4py.discover_petri_net_inductive`).
#### discover_from_eventlog

```python
params.discover_from_eventlog(
    log,
    max_depth_tree: int = 5,
    min_samples_leaf_cv: list = [50, 100, 200],
    multitasking_thr: float = 0.05,
    enable_multitasking: bool = False,
    arrival_calendar_min_confidence: float = 0.1,
    arrival_calendar_min_support: float = 0.7,
    res_calendar_min_confidence: float = 0.1,
    res_calendar_min_support: float = 0.1,
    res_calendar_min_participation: float = 0.4,
    attribute_mode: str = 'distribution',
    incremental_discovery: bool = False,
    grace_period: int = 1000,
    random_state: int = 72,
    verbose: bool = True,
    use_workload_features: bool = False,
)
```

Extracts all simulation parameters from the event log.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `log` | `EventLog` | — | pm4py event log in XES format |
| `max_depth_tree` | `int` | `5` | Maximum depth of decision trees. Higher = more expressive rules. Set to `0` to disable rules (pure distributions) |
| `min_samples_leaf_cv` | `list` | `[50, 100, 200]` | Candidate values for `min_samples_leaf` in cross-validation. Controls the minimum number of samples per leaf — critical for reliable per-leaf distribution fitting |
| `multitasking_thr` | `float` | `0.05` | Minimum fraction of events with concurrent workload > 0 for a resource to be considered multitasking. Below this, capacity is set to 1 |
| `enable_multitasking` | `bool` | `False` | If `True`, resources whose log exhibits concurrent workload above `multitasking_thr` get a capacity > 1 (parallel task execution). Default `False`: all resources have capacity 1 |
| `arrival_calendar_min_confidence` | `float` | `0.1` | Minimum per-slot confidence required to keep a (weekday, hour) slot in the arrival calendar |
| `arrival_calendar_min_support` | `float` | `0.7` | Minimum fraction of arrivals that the accepted slots must cover; slots are greedily added until this is met |
| `res_calendar_min_confidence` | `float` | `0.1` | Per-slot confidence threshold for each resource's calendar |
| `res_calendar_min_support` | `float` | `0.1` | Minimum fraction of the resource's events that the accepted slots must cover |
| `res_calendar_min_participation` | `float` | `0.4` | Minimum per-resource participation share; below it the resource falls back to a 24/7 calendar |
| `attribute_mode` | `str` | `'distribution'` | How to model case-level data attributes. `'distribution'`: fits each attribute independently (categorical → frequency table, continuous → best-fitting scipy distribution). `'empirical'`: samples from the joint observed distribution (preserves correlations) |
| `incremental_discovery` | `bool` | `False` | Use Hoeffding Adaptive Trees instead of scikit-learn. Gives more weight to recent traces |
| `grace_period` | `int` | `1000` | (Incremental only) Number of observations before the tree considers splitting a node |
| `random_state` | `int` | `72` | Seed for all random operations (reproducibility) |
| `verbose` | `bool` | `True` | Print discovery progress |
| `use_workload_features` | `bool` | `False` | If `True`, resource-selection and waiting-time models receive two extra features per candidate resource: current `workload` (concurrent tasks) and `queue_length` (tasks scheduled but not yet started) at the enabling time |
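The greedy calendar construction behind `arrival_calendar_min_support` can be made concrete with a small sketch. These are assumed mechanics for illustration — the library's exact confidence definition may differ — with slots ranked by arrival share and accepted until the required coverage is reached:

```python
# Sketch (assumed mechanics) of greedy calendar slot selection:
# keep the strongest (weekday, hour) slots until they cover at
# least `min_support` of all observed arrivals.
def build_calendar(slot_counts, min_confidence=0.1, min_support=0.7):
    total = sum(slot_counts.values())
    accepted, covered = [], 0.0
    # Add slots in decreasing order of arrival share
    for slot, count in sorted(slot_counts.items(), key=lambda kv: -kv[1]):
        share = count / total
        if share < min_confidence:
            break                      # remaining slots are too weak to keep
        accepted.append(slot)
        covered += share
        if covered >= min_support:
            break
    return accepted, covered

counts = {("Mon", 9): 40, ("Mon", 10): 30, ("Tue", 9): 20,
          ("Sat", 3): 5, ("Sun", 4): 5}
slots, covered = build_calendar(counts)
```

With these toy counts the weekend slots fall below the confidence threshold and the first two weekday slots already reach 70% support.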
#### to_json / from_json

```python
params.to_json("params.json")              # save to file

params2 = SimulatorParameters(net, im, fm)
params2.from_json("params.json")           # restore from file
```

Serialises and deserialises all discovered parameters. Allows discovering once and simulating many times without re-running discovery.
### SimulatorEngine

```python
SimulatorEngine(simulation_parameters: SimulatorParameters)
```

Discrete-event simulation engine. Takes a `SimulatorParameters` object and runs the simulation.
#### apply

```python
sim_log = sim_engine.apply(
    n_traces: int = 1,
    t_start: datetime = None,
    deterministic_time: bool = False
) -> pd.DataFrame
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `n_traces` | `int` | `1` | Number of process instances (cases) to simulate |
| `t_start` | `datetime` | `datetime.now()` | Start timestamp of the simulation. Cases arrive from this point onward |
| `deterministic_time` | `bool` | `False` | If `True`, uses the mean value of each distribution instead of sampling. Useful for deterministic analysis or debugging |
Returns a pandas.DataFrame sorted by start time, with one row per simulated event.
## Discovery Options

### Choosing max_depth_tree

The `max_depth_tree` parameter controls the complexity of conditional models:

```python
# No rules — single distribution per activity/resource
params.discover_from_eventlog(log, max_depth_tree=0)

# Shallow rules — fast, interpretable, good for small logs
params.discover_from_eventlog(log, max_depth_tree=2)

# Standard — balances expressiveness and overfitting
params.discover_from_eventlog(log, max_depth_tree=3)

# Deep rules — for complex processes with large logs
params.discover_from_eventlog(log, max_depth_tree=5)
```
Cross-validation automatically selects the best depth up to max_depth_tree for each individual model.
### Controlling leaf size (min_samples_leaf_cv)

This parameter is especially important for the time models (execution, waiting, arrival). Each leaf of the regression tree has its own fitted distribution — if a leaf contains too few samples, the distribution fit is unreliable.

```python
# More conservative — larger leaves, smoother distributions
params.discover_from_eventlog(log, min_samples_leaf_cv=[10, 20, 30, 50])

# Less conservative — allows finer segmentation with small logs
params.discover_from_eventlog(log, min_samples_leaf_cv=[1, 5, 10])
```
### Incremental discovery

Use incremental discovery when the process has evolved over time and recent behaviour should dominate the model:

```python
params.discover_from_eventlog(
    log,
    incremental_discovery=True,
    grace_period=500,   # fewer events needed before the first split
    max_depth_tree=3
)
```
The grace_period controls how quickly the tree adapts: lower values make the model react faster to changes, higher values produce more stable trees.
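For intuition on why more observations make splits safer, here is the generic Hoeffding bound that this family of trees uses to decide when a split is statistically justified. This is standard Hoeffding-tree machinery, not Prosit-specific code:

```python
# Generic Hoeffding bound (illustration): a node splits only when the
# observed merit gap between the two best split candidates exceeds
# epsilon = sqrt(R^2 * ln(1/delta) / (2 * n)).
import math

def hoeffding_bound(value_range, delta, n):
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# More observations -> tighter bound -> the tree can commit to a split
eps_small_n = hoeffding_bound(value_range=1.0, delta=1e-7, n=200)
eps_large_n = hoeffding_bound(value_range=1.0, delta=1e-7, n=2000)
```

The `grace_period` sets how many observations accumulate between checks of this bound, which is why lower values react faster and higher values give more stable trees.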
## Simulation Options

### Basic simulation

```python
from datetime import datetime

sim_engine = SimulatorEngine(params)

# Simulate 200 cases starting from a specific date
sim_log = sim_engine.apply(
    n_traces=200,
    t_start=datetime(2024, 1, 1, 8, 0, 0)
)
```
### Deterministic simulation

Uses the mean of each distribution instead of sampling — useful for benchmarking or debugging:

```python
sim_log = sim_engine.apply(n_traces=100, deterministic_time=True)
```
### Multitasking

When `enable_multitasking=True`, resources that handled concurrent work in the log are assigned a maximum concurrency capacity. Multitasking is off by default; to keep all resources serialised explicitly:

```python
params.discover_from_eventlog(log, enable_multitasking=False)
```

To inspect the discovered capacities:

```python
# {resource_name: max_concurrent_tasks}
print(params.max_concurrency)
```

Resources with fewer than a `multitasking_thr` fraction of events under concurrent load are treated as non-multitasking (capacity 1) even if `enable_multitasking=True`.
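The multitasking check can be sketched for a single resource. These are assumed mechanics for illustration — the fraction of a resource's events that overlap in time with another of its events is compared against the threshold:

```python
# Sketch (assumed mechanics): fraction of one resource's events that
# overlap in time with another of its events.
from datetime import datetime, timedelta

def multitasking_fraction(intervals):
    """intervals: list of (start, end) for a single resource."""
    overlapping = 0
    for i, (s1, e1) in enumerate(intervals):
        if any(s1 < e2 and s2 < e1
               for j, (s2, e2) in enumerate(intervals) if j != i):
            overlapping += 1
    return overlapping / len(intervals)

t0 = datetime(2024, 1, 1, 9)
events = [(t0, t0 + timedelta(hours=1)),
          (t0 + timedelta(minutes=30), t0 + timedelta(hours=2)),  # overlaps first
          (t0 + timedelta(hours=3), t0 + timedelta(hours=4))]     # isolated

frac = multitasking_fraction(events)
capacity = 2 if frac > 0.05 else 1     # 0.05 plays the role of multitasking_thr
```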
### Output format

```python
print(sim_log.columns.tolist())
# ['case:concept:name', 'concept:name', 'org:resource',
#  'enabled:timestamp', 'start:timestamp', 'time:timestamp',
#  ... (any case-level data attributes from the original log)]

print(sim_log.dtypes)
# All timestamps are datetime objects
# case:concept:name is a string like "case_1", "case_2", ...
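Because the output is a plain DataFrame, standard pandas idioms apply. A common follow-up, sketched here on a toy frame with the column names listed above:

```python
# Cycle time per case from the simulated log: last completion minus
# first start (toy data using the documented column names).
import pandas as pd

sim_log = pd.DataFrame({
    "case:concept:name": ["case_1", "case_1", "case_2"],
    "start:timestamp": pd.to_datetime(["2024-01-01 09:00", "2024-01-01 10:00",
                                       "2024-01-01 09:30"]),
    "time:timestamp": pd.to_datetime(["2024-01-01 09:30", "2024-01-01 11:00",
                                      "2024-01-01 10:30"]),
})

grouped = sim_log.groupby("case:concept:name")
cycle_times = grouped["time:timestamp"].max() - grouped["start:timestamp"].min()
```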
## Save and Load Parameters

Discovered parameters can be saved and reused without re-running the (potentially slow) discovery phase:

```python
# --- Discover once ---
params = SimulatorParameters(net, im, fm)
params.discover_from_eventlog(log, max_depth_tree=3)
params.to_json("my_params.json")

# --- Load and simulate later ---
params2 = SimulatorParameters(net, im, fm)
params2.from_json("my_params.json")

engine = SimulatorEngine(params2)
sim_log = engine.apply(n_traces=1000)
```
## Advanced Usage

### Full workflow with evaluation

```python
import sys
sys.path.append("src/")
import warnings
warnings.filterwarnings("ignore")

import pm4py
import pm4py.objects.log.importer.xes.importer as xes_importer
from prosit.simulator import SimulatorParameters, SimulatorEngine
from datetime import datetime

# Load log
log = xes_importer.apply("data/logs/purchasing.xes")

# Split: use 80% for discovery, compare simulation against the remaining 20%
n_cases = len(log)
train_log = log[:int(n_cases * 0.8)]
test_log = log[int(n_cases * 0.8):]

# Discover from the training set
net, im, fm = pm4py.discover_petri_net_inductive(train_log)
params = SimulatorParameters(net, im, fm)
params.discover_from_eventlog(
    train_log,
    max_depth_tree=3,
    min_samples_leaf_cv=[50, 100, 200],
    random_state=42,
    verbose=True
)

# Simulate the same number of cases as the test set
engine = SimulatorEngine(params)
sim_log = engine.apply(
    n_traces=len(test_log),
    t_start=datetime(2024, 1, 1, 8, 0, 0)
)

print(f"Simulated {len(sim_log)} events across {sim_log['case:concept:name'].nunique()} cases")
```
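One simple way to score the simulation against the held-out traces — our suggestion, not a built-in Prosit metric — is the Wasserstein distance between the two cycle-time samples. Here synthetic arrays stand in for the cycle times extracted from the two logs:

```python
# Compare simulated vs. real cycle-time distributions with the
# Wasserstein (earth mover's) distance: lower is better.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
sim_ct = rng.lognormal(mean=1.0, sigma=0.5, size=500)    # simulated cycle times (h)
real_ct = rng.lognormal(mean=1.0, sigma=0.5, size=500)   # held-out cycle times (h)

dist_same = wasserstein_distance(sim_ct, real_ct)
dist_shifted = wasserstein_distance(sim_ct, real_ct + 5.0)  # a clearly worse fit
```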
### No-rules mode (pure distributions)

For simple processes or small logs where decision trees might overfit:

```python
params.discover_from_eventlog(log, max_depth_tree=0)
engine = SimulatorEngine(params)
sim_log = engine.apply(n_traces=200)
```
### Reproducible simulation

```python
import random

random.seed(42)
params.discover_from_eventlog(log, random_state=42)
sim_log = engine.apply(n_traces=100)
```
### Inspecting discovered parameters

```python
# Resources discovered from the log
print(params.resources)

# Working calendar per resource (weekday -> hour -> bool)
print(params.calendars["Resource A"])

# Per-resource maximum concurrency (1 = no multitasking, >1 = multitasking)
print(params.max_concurrency)

# Which resources can perform each activity
print(params.act_to_resources)

# Whether rules mode (decision trees) is active
print(params.rules_mode)

# Arrival time model (DecisionRules in rules mode, distribution tuple in no-rules mode)
print(params.arrival_time_distribution)

# Execution time model per activity (DecisionRules or distribution tuple)
print(params.execution_time_distributions)

# Waiting time model per resource (DecisionRules or distribution tuple)
print(params.waiting_time_distributions)

# Resource selection: flat dict {resource: DecisionRules|float}. One binary
# classifier per resource, trained on the events where the resource was
# eligible (the activity's candidate pool). At simulation time, the engine
# first filters resources via `act_to_resources[activity]`, scores each
# enabled resource with its own tree, and samples proportionally.
print(params.resource_weights)

# Control flow model per transition (DecisionRules in rules mode, float frequency in no-rules)
print(params.transition_weights)

# Case-level data attribute distribution (None if no attributes in the log)
print(params.distribution_data_attributes)
# {'mode': 'empirical', 'data': {(val1, val2): frequency, ...}}
# or {'mode': 'distribution', 'data': {attr: {'type': 'categorical'|'continuous', ...}}}
```
## What the Models Learn

### Features used per model
| Model | Tree type | Conditional on |
|---|---|---|
| Arrival time | Regressor | Hour of day, weekday |
| Execution time | Regressor (per activity) | Resource identity (one-hot), hour, weekday, case attributes, activity history counts |
| Waiting time | Regressor (per resource) | Activity being waited for (one-hot), hour, weekday, case attributes, activity history counts; optionally workload and queue_length when use_workload_features=True |
| Control flow | Classifier (per transition) | Activity execution history (counts), case attributes |
| Resource selection | Classifier (per resource) | Per-resource history counts, activity being executed (one-hot), case attributes; optionally workload and queue_length when use_workload_features=True |
History features are expressed as raw counts (number of times each activity has been executed in the case so far), so that decision tree rules are directly interpretable (e.g. "Approve" <= 2 means "Approve has been executed at most 2 times").
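The history-count encoding can be sketched directly. This is an illustration of the assumed feature layout, with invented activity names:

```python
# Sketch (assumed encoding): history features are per-activity execution
# counts over the case prefix, so tree rules stay interpretable.
from collections import Counter

def history_features(prefix, activity_vocab):
    counts = Counter(prefix)
    return {f"# {a}": counts.get(a, 0) for a in activity_vocab}

vocab = ["Create PO", "Approve", "Pay"]
features = history_features(["Create PO", "Approve", "Approve"], vocab)
# A split such as `# Approve <= 2` then reads "Approve was executed at most twice"
```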
Before each classifier or regressor is fit, low-signal columns are pruned automatically: constant columns are dropped, and one-hot columns (resources, activities, categorical attribute values) with fewer than 20 positive observations in the current training slice are removed. This reduces noise from rare categories and keeps the CV grid compact.
For the time-regression models, cross-validation selects between every (max_depth, min_samples_leaf) combination and a no-tree baseline (global empirical distribution). If no candidate tree beats the baseline on per-leaf Wasserstein distance, the model collapses to a single unconditional distribution.
### Distribution fitting
For each leaf node of a regression tree, Prosit fits the best distribution among: fixed, normal, exponential, lognormal, gamma, uniform. The best fit is selected by minimising the deterministic Wasserstein distance between the empirical and theoretical quantiles. Outliers are removed using the Median Absolute Deviation method (threshold: 20 MAD) before fitting arrival and execution times. Waiting times are fitted on the raw leaf values (no outlier removal), because they are typically zero-inflated and heavy-tailed — filtering would distort both the zero mass and the long tail needed to reproduce real cycle times.
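The two fitting steps described above can be sketched with scipy. These are assumed mechanics for illustration — the candidate set here is deliberately smaller than the library's (which also tries fixed, lognormal, gamma), and the quantile distance is a simple stand-in for the Wasserstein criterion:

```python
# Sketch: MAD-based outlier removal, then pick the scipy distribution
# whose quantiles best match the empirical ones.
import numpy as np
from scipy import stats

def remove_outliers_mad(x, thr=20.0):
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        return x
    return x[np.abs(x - med) / mad <= thr]

def best_fit(x, candidates=("expon", "norm", "uniform")):
    q = np.linspace(0.01, 0.99, 99)
    emp = np.quantile(x, q)
    best, best_err = None, np.inf
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(x)
        theo = dist.ppf(q, *params)
        err = np.mean(np.abs(emp - theo))   # quantile (Wasserstein-style) distance
        if err < best_err:
            best, best_err = name, err
    return best

rng = np.random.default_rng(0)
data = rng.exponential(scale=5.0, size=2000)
clean = remove_outliers_mad(data)
chosen = best_fit(clean)
```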
### Data attribute modeling

Case-level data attributes (e.g. `case:type`, `case:priority`) are discovered automatically and sampled at case arrival time. Two modes are available via `attribute_mode`:

- `'distribution'` (default): fits each attribute independently (categorical → frequency table, continuous → best-fitting scipy distribution). Useful when the log is small or attributes are largely independent.
- `'empirical'`: samples complete attribute tuples from the observed joint distribution — preserves correlations between attributes.
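The difference between the two modes is easiest to see with correlated attributes. A toy sketch (attribute names and values invented) where case type and priority are strongly correlated:

```python
# Empirical mode samples joint tuples (keeps correlation); distribution
# mode samples each marginal independently (loses it).
import random

cases = [("standard", "low")] * 45 + [("express", "high")] * 45 \
      + [("standard", "high")] * 5 + [("express", "low")] * 5

rng = random.Random(7)
types = [c[0] for c in cases]
prios = [c[1] for c in cases]

def sample_empirical():
    return rng.choice(cases)                        # joint tuple

def sample_independent():
    return (rng.choice(types), rng.choice(prios))   # marginals only

matched = {("standard", "low"), ("express", "high")}
emp = [sample_empirical() for _ in range(1000)]
ind = [sample_independent() for _ in range(1000)]
emp_matched = sum(1 for c in emp if c in matched) / 1000   # near 0.9
ind_matched = sum(1 for c in ind if c in matched) / 1000   # near 0.5
```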
## Citation
Version v0.1.0 of Prosit corresponds to the implementation presented in the following paper. Please cite it if you use Prosit in academic work:
Vinci, F., Park, G., van der Aalst, W.M.P., de Leoni, M. (2026). Reliable and Configurable Process Simulations via Probabilistic White-Box Models. In: Aiello, M., Deng, S., Murillo, JM., Georgievski, I., Benatallah, B., Wang, Z. (eds) Service-Oriented Computing. ICSOC 2025. Lecture Notes in Computer Science, vol 16321. Springer, Singapore. https://doi.org/10.1007/978-981-95-5015-9_24
BibTeX:

```bibtex
@inproceedings{vinci2026prosit,
  author    = {Vinci, Francesco and Park, Gyunam and van der Aalst, Wil M. P. and de Leoni, Massimiliano},
  title     = {Reliable and Configurable Process Simulations via Probabilistic White-Box Models},
  booktitle = {Service-Oriented Computing -- ICSOC 2025},
  editor    = {Aiello, Marco and Deng, Shuiguang and Murillo, Juan M. and Georgievski, Ilche and Benatallah, Boualem and Wang, Zhongjie},
  series    = {Lecture Notes in Computer Science},
  volume    = {16321},
  publisher = {Springer, Singapore},
  year      = {2026},
  doi       = {10.1007/978-981-95-5015-9_24}
}
```