Event Log Distance Measures

Python package implementing several distance measures between two event logs, from the control-flow, temporal, and workforce perspectives:

- Control-flow
  - N-Gram Distribution Distance
  - Control-Flow Log Distance (CFLD)
- Temporal
  - Absolute Event Distribution Distance
  - Case Arrival Distribution Distance
  - Circadian Event Distribution Distance
  - Relative Event Distribution Distance
  - Cycle Time Distribution Distance
  - Remaining Time Distribution Distance
- Workforce
  - Circadian Workforce Distribution Distance

Installation

The package is available on PyPI: https://pypi.org/project/log-distance-measures/. Install it with:

pip install log-distance-measures

Example of input initialization

import pandas as pd

from log_distance_measures.config import EventLogIDs

# Set event log column ID mapping
event_log_ids = EventLogIDs(  # These values are stored in DEFAULT_CSV_IDS
    case="case_id",
    activity="Activity",
    start_time="start_time",
    end_time="end_time"
)
# Read and transform time attributes
event_log = pd.read_csv("/path/to/event_log.csv")
event_log[event_log_ids.start_time] = pd.to_datetime(event_log[event_log_ids.start_time], utc=True)
event_log[event_log_ids.end_time] = pd.to_datetime(event_log[event_log_ids.end_time], utc=True)

Control-flow Log Distance (CFLD)

Distance measure between two event logs with the same number of traces (L1 and L2) comparing the control-flow dimension (see "Camargo M, Dumas M, González-Rojas O. 2021. Discovering generative models from event logs: data-driven simulation vs deep learning. PeerJ Computer Science 7:e577 https://doi.org/10.7717/peerj-cs.577" for a detailed description of a similarity version of this measure).

Transform each process trace of L1 and L2 to their corresponding activity sequence.
Compute the Damerau-Levenshtein distance between each trace i from L1 and each trace j from L2, and normalize it by dividing by the length of the longer of the two traces.
Compute the matching between the traces of both logs (such that each i is matched to a different j, and vice versa) minimizing the sum of distances with linear programming.
Compute the CFLD as the average of the normalized distance values.
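
The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the package's implementation: the names `damerau_levenshtein` and `cfld_sketch` are hypothetical, the edit distance is the optimal-string-alignment variant of Damerau-Levenshtein, and the matching is found by brute force over all permutations (a stand-in for the linear-programming step that is only feasible for very small logs).

```python
from itertools import permutations


def damerau_levenshtein(a, b):
    """Optimal-string-alignment variant: insert, delete, substitute, adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def cfld_sketch(traces_1, traces_2):
    """Average normalized distance under the cheapest one-to-one trace matching."""
    if len(traces_1) != len(traces_2):
        raise ValueError("both logs need the same number of traces")
    # Step 2: pairwise distances, normalized by the longer trace of each pair
    cost = [
        [damerau_levenshtein(t1, t2) / max(len(t1), len(t2), 1) for t2 in traces_2]
        for t1 in traces_1
    ]
    # Step 3: brute-force search over all matchings (stand-in for the LP solver)
    best = min(
        sum(cost[i][j] for i, j in enumerate(matching))
        for matching in permutations(range(len(traces_2)))
    )
    # Step 4: average over the matched pairs
    return best / len(traces_1)


# Step 1 is assumed done: each trace is already a sequence of activity labels
log_1 = [["A", "B", "C"], ["A", "C"]]
log_2 = [["A", "C"], ["A", "C", "B"]]
print(cfld_sketch(log_1, log_2))
```

In this toy input the cheapest matching pairs the two identical `A, C` traces (distance 0) and pairs `A, B, C` with `A, C, B` (one swap over a length of 3).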

Example of use

from log_distance_measures.config import DEFAULT_CSV_IDS
from log_distance_measures.control_flow_log_distance import control_flow_log_distance

# Pass both event logs together with their column ID mappings
distance = control_flow_log_distance(
    original_log, DEFAULT_CSV_IDS,  # First event log and its column id mappings
    simulated_log, DEFAULT_CSV_IDS,  # Second event log and its column id mappings
)

N-Gram Distribution Distance

Distance measure between two event logs that computes the difference in the frequencies of the n-grams observed in them (the n-grams of an event log are all the groups of n consecutive activities observed in its traces).

Given a size n, get all sequences of n activities (n-grams) observed in each event log (adding artificial activities to the start and end of each trace to consider these as well, e.g., 0 - 0 - A for a trace starting with A and n = 3).
Compute the number of times that each n-gram is observed in each event log (its frequency).
Compute the sum of absolute differences between the frequencies of all computed n-grams (e.g. the frequency of A - B - C in the first event log w.r.t. its frequency in the second event log).
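
As a rough illustration of these steps, the sketch below counts padded n-grams with `collections.Counter` and sums the absolute frequency differences. The function names are hypothetical, and it returns the raw sum exactly as described above; whether the package applies any further normalization is not covered here.

```python
from collections import Counter


def n_gram_counts(traces, n):
    """Count every n-gram, padding each trace with n - 1 artificial 0 activities."""
    counts = Counter()
    padding = [0] * (n - 1)
    for trace in traces:
        padded = padding + list(trace) + padding
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts


def n_gram_distance_sketch(traces_1, traces_2, n=3):
    """Sum of absolute frequency differences over all observed n-grams."""
    counts_1, counts_2 = n_gram_counts(traces_1, n), n_gram_counts(traces_2, n)
    all_n_grams = counts_1.keys() | counts_2.keys()
    # A Counter returns 0 for n-grams that were never observed in that log
    return sum(abs(counts_1[g] - counts_2[g]) for g in all_n_grams)


log_1 = [["A", "B", "C"]]
log_2 = [["A", "C"]]
# Bigrams (0, A) and (C, 0) appear once in both logs, so only
# (A, B), (B, C) and (A, C) contribute to the distance
print(n_gram_distance_sketch(log_1, log_2, n=2))
```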

Example of use

from log_distance_measures.config import DEFAULT_CSV_IDS
from log_distance_measures.n_gram_distribution import n_gram_distribution_distance

# Pass both event logs together with their column ID mappings
distance = n_gram_distribution_distance(
    original_log, DEFAULT_CSV_IDS,  # First event log and its column id mappings
    simulated_log, DEFAULT_CSV_IDS,  # Second event log and its column id mappings
    n=3  # trigrams
)

Absolute Event Distribution Distance

Distance measure computing how different the histograms of the timestamps of two event logs are, discretizing the timestamps by absolute hour.

Take all the start timestamps, the end timestamps, or both.
Discretize the timestamps by absolute hour (those timestamps between 02/05/2022 10:00:00 and 02/05/2022 10:59:59 go to the same bin).
Compare the discretized histograms of the two event logs with the Wasserstein Distance (a.k.a. EMD).
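
To make the binning and comparison concrete, here is a minimal standard-library sketch. The helper names (`to_absolute_hour`, `emd_1d`) are hypothetical, bins are counted as whole hours since the Unix epoch (an assumption; any common origin gives the same distance), and the Wasserstein distance is computed as the area between the two cumulative distributions.

```python
from collections import Counter
from datetime import datetime, timezone


def to_absolute_hour(timestamp):
    """Bin index: whole hours elapsed since the Unix epoch."""
    return int(timestamp.timestamp() // 3600)


def emd_1d(bins_1, bins_2):
    """Wasserstein-1 distance between two samples of integer bins."""
    hist_1, hist_2 = Counter(bins_1), Counter(bins_2)
    points = sorted(hist_1.keys() | hist_2.keys())
    cdf_1 = cdf_2 = distance = 0.0
    # Accumulate |CDF_1 - CDF_2| times the gap between consecutive bins
    for current, following in zip(points, points[1:]):
        cdf_1 += hist_1[current] / len(bins_1)
        cdf_2 += hist_2[current] / len(bins_2)
        distance += abs(cdf_1 - cdf_2) * (following - current)
    return distance


utc = timezone.utc
timestamps_1 = [datetime(2022, 5, 2, 10, 5, tzinfo=utc),
                datetime(2022, 5, 2, 10, 40, tzinfo=utc),
                datetime(2022, 5, 2, 11, 15, tzinfo=utc)]
timestamps_2 = [datetime(2022, 5, 2, 10, 30, tzinfo=utc),
                datetime(2022, 5, 2, 12, 10, tzinfo=utc),
                datetime(2022, 5, 2, 12, 50, tzinfo=utc)]
bins_1 = [to_absolute_hour(ts) for ts in timestamps_1]
bins_2 = [to_absolute_hour(ts) for ts in timestamps_2]
print(emd_1d(bins_1, bins_2))
```

In this toy input, two of the three events of the first log have to move (by one and two hours respectively) to reproduce the second log, giving an average shift of one hour per event.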

Example of use

from log_distance_measures.absolute_event_distribution import absolute_event_distribution_distance
from log_distance_measures.config import AbsoluteTimestampType, DEFAULT_CSV_IDS, discretize_to_hour

# Pass both event logs, their column ID mappings, the timestamp type, and the discretize function
distance = absolute_event_distribution_distance(
    original_log, DEFAULT_CSV_IDS,  # First event log and its column id mappings
    simulated_log, DEFAULT_CSV_IDS,  # Second event log and its column id mappings
    discretize_type=AbsoluteTimestampType.BOTH,  # Which timestamps to consider (start times and/or end times)
    discretize_event=discretize_to_hour  # Function to discretize the time of each timestamp (default by hour)
)

This measure can also be used to compare the distribution of only the start timestamps (with AbsoluteTimestampType.START) or only the end timestamps (with AbsoluteTimestampType.END), instead of both.

Furthermore, timestamps are binned by hour by default, but the binning can be customized by passing another function that maps the total number of seconds to its bin.

import math

from log_distance_measures.absolute_event_distribution import absolute_event_distribution_distance
from log_distance_measures.config import AbsoluteTimestampType, DEFAULT_CSV_IDS, discretize_to_day

# EMD of the (END) timestamps distribution where each bin represents a day
distance = absolute_event_distribution_distance(
    original_log, DEFAULT_CSV_IDS,
    simulated_log, DEFAULT_CSV_IDS,
    discretize_type=AbsoluteTimestampType.END,
    discretize_event=discretize_to_day
)

# EMD of the timestamps distribution where each bin represents a week
distance = absolute_event_distribution_distance(
    original_log, DEFAULT_CSV_IDS,
    simulated_log, DEFAULT_CSV_IDS,
    discretize_event=lambda seconds: math.floor(seconds / 3600 / 24 / 7)
)

Case Arrival Distribution Distance

Distance measure computing how different the discretized histograms of the arrival events of two event logs are.

Compute the arrival timestamp for each process case (its first start time).
Discretize the timestamps by absolute hour (those timestamps between 02/05/2022 10:00:00 and 02/05/2022 10:59:59 go to the same bin).
Compare the discretized histograms of the two event logs with the Wasserstein Distance (a.k.a. EMD).
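
The first two steps can be illustrated as follows. This is a sketch with a hypothetical helper (`case_arrival_bins`) that takes plain `(case_id, start_time)` pairs instead of a DataFrame and, as an assumption, counts bins as whole hours since the Unix epoch. The resulting bins of each log would then be compared with the EMD.

```python
from datetime import datetime, timezone


def case_arrival_bins(events):
    """Map each case to its arrival (earliest start time), then to an absolute-hour bin.

    `events` is an iterable of (case_id, start_time) pairs.
    """
    arrivals = {}
    for case_id, start_time in events:
        if case_id not in arrivals or start_time < arrivals[case_id]:
            arrivals[case_id] = start_time
    return sorted(int(arrival.timestamp() // 3600) for arrival in arrivals.values())


utc = timezone.utc
events = [
    ("c1", datetime(2022, 5, 2, 10, 20, tzinfo=utc)),
    ("c1", datetime(2022, 5, 2, 9, 50, tzinfo=utc)),  # earliest start of c1
    ("c2", datetime(2022, 5, 2, 9, 10, tzinfo=utc)),
    ("c3", datetime(2022, 5, 2, 11, 30, tzinfo=utc)),
]
arrival_bins = case_arrival_bins(events)
# Shown relative to the first bin for readability
print([b - arrival_bins[0] for b in arrival_bins])
```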

Example of use

from log_distance_measures.case_arrival_distribution import case_arrival_distribution_distance
from log_distance_measures.config import DEFAULT_CSV_IDS, discretize_to_hour

distance = case_arrival_distribution_distance(
    original_log, DEFAULT_CSV_IDS,  # First event log and its column id mappings
    simulated_log, DEFAULT_CSV_IDS,  # Second event log and its column id mappings
    discretize_event=discretize_to_hour  # Function to discretize each timestamp (default by hour)
)

Circadian Event Distribution Distance

Distance measure computing how different the histograms of the timestamps of two event logs are, comparing all the instants recorded in the same weekday together, and discretizing them to the hour in the day.

Take all the start timestamps, the end timestamps, or both.
Group the timestamps by their weekday (e.g. all the timestamps recorded on Monday of one log are going to be compared with the timestamps recorded on Monday of the other event log).
Discretize the timestamps to their hour (those timestamps between '10:00:00' and '10:59:59' go to the same bin).
Compare the histograms of the two event logs for each weekday (with the Wasserstein Distance, a.k.a. EMD), and compute the average.

Extra 1: If neither log has recorded timestamps for a given weekday, no distance is measured for that day.

Extra 2: If only one of the logs has recorded timestamps for a given weekday, the distance for that day is set to 23 (the maximum distance between two histograms with values from 0 to 23).
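
The procedure, including both special cases, might look like the sketch below. The function names are hypothetical and the code is an illustration rather than the package's implementation; because the bins are the hours 0 to 23, the EMD reduces to summing the absolute difference between the two cumulative histograms.

```python
from collections import Counter, defaultdict
from datetime import datetime


def hourly_emd(hours_1, hours_2):
    """EMD between two hour-of-day samples (bins 0..23, one hour apart)."""
    count_1, count_2 = Counter(hours_1), Counter(hours_2)
    cdf_1 = cdf_2 = distance = 0.0
    for hour in range(24):
        cdf_1 += count_1[hour] / len(hours_1)
        cdf_2 += count_2[hour] / len(hours_2)
        distance += abs(cdf_1 - cdf_2)
    return distance


def circadian_distance_sketch(timestamps_1, timestamps_2):
    """Average per-weekday EMD between hour-of-day histograms."""
    hours_1, hours_2 = defaultdict(list), defaultdict(list)
    for timestamps, hours in ((timestamps_1, hours_1), (timestamps_2, hours_2)):
        for ts in timestamps:
            hours[ts.weekday()].append(ts.hour)
    distances = []
    for weekday in range(7):
        in_1, in_2 = hours_1.get(weekday), hours_2.get(weekday)
        if in_1 and in_2:
            distances.append(hourly_emd(in_1, in_2))
        elif in_1 or in_2:
            distances.append(23)  # weekday observed in only one of the logs
        # weekday absent from both logs: no distance is measured
    return sum(distances) / len(distances)


log_1 = [datetime(2022, 5, 2, 9, 15), datetime(2022, 5, 2, 10, 45),  # Monday
         datetime(2022, 5, 3, 14, 0)]                                # Tuesday
log_2 = [datetime(2022, 5, 2, 10, 5), datetime(2022, 5, 2, 12, 30)]  # Monday
print(circadian_distance_sketch(log_1, log_2))
```

Here Monday contributes an EMD of 1.5 and Tuesday, present only in the first log, contributes 23, so the average is taken over those two weekdays.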

Example of use

from log_distance_measures.circadian_event_distribution import circadian_event_distribution_distance
from log_distance_measures.config import AbsoluteTimestampType, DEFAULT_CSV_IDS

distance = circadian_event_distribution_distance(
    original_log, DEFAULT_CSV_IDS,  # First event log and its column id mappings
    simulated_log, DEFAULT_CSV_IDS,  # Second event log and its column id mappings
    discretize_type=AbsoluteTimestampType.BOTH  # Consider both start/end timestamps of each activity instance
)

As with the Absolute Event Distribution Distance, the Circadian Event Distribution Distance can also be used to compare the distribution of only the start timestamps (with AbsoluteTimestampType.START) or only the end timestamps (with AbsoluteTimestampType.END), instead of both.

Relative Event Distribution Distance

Distance measure computing how different the histograms of the relative timestamps (w.r.t. the start of each case) of two event logs are, discretizing the resulting durations by hour.

Take all the start timestamps, the end timestamps, or both.
Make them relative w.r.t. the start of their process case (e.g. the first timestamp in a case is 0, the second one is the time interval from the first one).
Discretize the durations by hour (e.g. those durations between 0 and 3599 go to the same bin).
Compare the discretized histograms of the two event logs with the Wasserstein Distance (a.k.a. EMD).
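
A small sketch of the relativization and binning steps, using a hypothetical helper (`relative_hour_bins`) over plain `(case_id, timestamp)` pairs. It assumes the start of a case is the earliest timestamp among the events passed in for that case.

```python
from datetime import datetime


def relative_hour_bins(events):
    """Offset each timestamp from its case's first timestamp and bin the offset by hour.

    `events` is an iterable of (case_id, timestamp) pairs.
    """
    events = list(events)
    case_start = {}
    for case_id, timestamp in events:
        case_start[case_id] = min(timestamp, case_start.get(case_id, timestamp))
    return [
        int((timestamp - case_start[case_id]).total_seconds() // 3600)
        for case_id, timestamp in events
    ]


events = [
    ("c1", datetime(2022, 5, 2, 10, 0)),
    ("c1", datetime(2022, 5, 2, 10, 30)),      # 1800 s after the case start
    ("c1", datetime(2022, 5, 2, 12, 15)),      # 8100 s after the case start
    ("c2", datetime(2022, 5, 2, 11, 0)),
    ("c2", datetime(2022, 5, 2, 14, 59, 59)),  # 14399 s after the case start
]
print(relative_hour_bins(events))
```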

Example of use

from log_distance_measures.config import AbsoluteTimestampType, DEFAULT_CSV_IDS, discretize_to_hour
from log_distance_measures.relative_event_distribution import relative_event_distribution_distance

# Pass both event logs, their column ID mappings, the timestamp type, and the discretize function
distance = relative_event_distribution_distance(
    original_log, DEFAULT_CSV_IDS,  # First event log and its column id mappings
    simulated_log, DEFAULT_CSV_IDS,  # Second event log and its column id mappings
    discretize_type=AbsoluteTimestampType.BOTH,  # Which timestamps to consider (start times and/or end times)
    discretize_event=discretize_to_hour  # Function to discretize the time of each timestamp (default by hour)
)

As with the Absolute Event Distribution Distance, the Relative Event Distribution Distance can also be used to compare the distribution of only the start timestamps (with AbsoluteTimestampType.START) or only the end timestamps (with AbsoluteTimestampType.END), instead of both.

Cycle Time Distribution Distance

Distance measure computing how different the discretized cycle time histograms of two event logs are.

Compute the cycle time of each process instance.
Group the cycle times in bins by a given bin size (time gap).
Compare the discretized histograms of the two event logs with the Wasserstein Distance (a.k.a. EMD).
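
The first two steps can be sketched with the standard library alone. The helper name `cycle_time_bins` is hypothetical, and the cycle time is assumed to be the span from the first start to the last end of a case.

```python
from datetime import datetime, timedelta


def cycle_time_bins(events, bin_size=timedelta(hours=1)):
    """Cycle time per case (last end minus first start), floored to `bin_size` units.

    `events` is an iterable of (case_id, start_time, end_time) triples.
    """
    first_start, last_end = {}, {}
    for case_id, start, end in events:
        first_start[case_id] = min(start, first_start.get(case_id, start))
        last_end[case_id] = max(end, last_end.get(case_id, end))
    # Floor division of two timedeltas yields an integer bin index
    return {case: (last_end[case] - first_start[case]) // bin_size for case in first_start}


events = [
    ("c1", datetime(2022, 5, 2, 9, 0), datetime(2022, 5, 2, 9, 30)),
    ("c1", datetime(2022, 5, 2, 9, 45), datetime(2022, 5, 2, 11, 10)),  # c1 lasts 2 h 10 min
    ("c2", datetime(2022, 5, 2, 10, 0), datetime(2022, 5, 2, 10, 20)),  # c2 lasts 20 min
]
print(cycle_time_bins(events))
```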

Example of use

import pandas as pd

from log_distance_measures.config import DEFAULT_CSV_IDS
from log_distance_measures.cycle_time_distribution import cycle_time_distribution_distance

distance = cycle_time_distribution_distance(
    original_log, DEFAULT_CSV_IDS,  # First event log and its column id mappings
    simulated_log, DEFAULT_CSV_IDS,  # Second event log and its column id mappings
    bin_size=pd.Timedelta(hours=1)  # Bins of 1 hour
)

Remaining Time Distribution Distance

When the start of a log is sliced at a specific timestamp (reference_point), some cases are only partially included because they were ongoing at reference_point. The duration from reference_point until the end of such a case is its "remaining cycle time". This distance measure computes how different the remaining cycle times of the cases ongoing at reference_point are between two event logs (as discretized histograms).

Compute the remaining cycle time of each ongoing case as the difference between its last activity instance end (i.e., case end) and the reference point.
Group the remaining times in bins by a given bin size (time gap).
Compare the discretized histograms of the two event logs with the Wasserstein Distance (a.k.a. EMD).
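
As an illustration of the first two steps, the sketch below starts from precomputed case intervals. The helper name `remaining_time_bins` is hypothetical, and treating a case as "ongoing" when it starts at or before the reference point and ends after it is an assumption of this example, not a statement about the package's exact filtering.

```python
from datetime import datetime, timedelta


def remaining_time_bins(case_intervals, reference_point, bin_size=timedelta(hours=1)):
    """Remaining time of the cases ongoing at `reference_point`, floored to `bin_size` units.

    `case_intervals` maps each case ID to a (case_start, case_end) pair.
    """
    return {
        case: (end - reference_point) // bin_size
        for case, (start, end) in case_intervals.items()
        if start <= reference_point < end  # assumed definition of "ongoing"
    }


reference_point = datetime(2025, 2, 20, 10, 0)
case_intervals = {
    "c1": (datetime(2025, 2, 20, 8, 0), datetime(2025, 2, 20, 12, 30)),   # 2 h 30 min left
    "c2": (datetime(2025, 2, 20, 9, 30), datetime(2025, 2, 20, 10, 20)),  # 20 min left
    "c3": (datetime(2025, 2, 20, 10, 15), datetime(2025, 2, 20, 11, 0)),  # starts later
    "c4": (datetime(2025, 2, 20, 7, 0), datetime(2025, 2, 20, 9, 0)),     # already finished
}
print(remaining_time_bins(case_intervals, reference_point))
```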

Example of use

import pandas as pd

from log_distance_measures.config import DEFAULT_CSV_IDS
from log_distance_measures.remaining_time_distribution import remaining_time_distribution_distance

distance = remaining_time_distribution_distance(
    original_log, DEFAULT_CSV_IDS,  # First event log and its column id mappings
    simulated_log, DEFAULT_CSV_IDS,  # Second event log and its column id mappings
    reference_point=pd.Timestamp("2025-02-20T10:00:00.000+02:00"),  # Timestamp considered as reference point
    bin_size=pd.Timedelta(hours=1)  # Bins of 1 hour
)

Circadian Workforce Distribution Distance

Distance measure computing how different the histograms of the number of active resources of two event logs are, comparing the number of active resources of each hour of each weekday.

For each hour in the timeline of the log, count the number of unique resources that recorded an event within it.
Group the number of active resources per hour by their weekday.
For each hour of each weekday, compute the average of the number of active resources.
Compare the histograms of the two event logs for each weekday (with the Wasserstein Distance, a.k.a. EMD), and compute the average.

Extra 1: If neither log has recorded active resources for a given weekday, no distance is measured for that day.

Extra 2: If only one of the logs has recorded active resources for a given weekday, the distance for that day is set to 23 (the maximum distance between two histograms with values from 0 to 23).
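
The counting and averaging steps (the first three above) can be sketched as follows. The helper name `average_active_resources` is hypothetical, and the sketch averages only over hours in which at least one event was recorded, which is an assumption; the package may treat empty hours differently. The per-weekday histograms it produces would then be compared with the EMD.

```python
from collections import defaultdict
from datetime import datetime


def average_active_resources(events):
    """Average number of distinct active resources per (weekday, hour) slot.

    `events` is an iterable of (resource, timestamp) pairs.
    """
    # Distinct resources seen in each absolute hour of the timeline
    active = defaultdict(set)
    for resource, timestamp in events:
        active[timestamp.replace(minute=0, second=0, microsecond=0)].add(resource)
    # Group the hourly counts by weekday and hour of day, then average them
    counts = defaultdict(list)
    for hour_start, resources in active.items():
        counts[(hour_start.weekday(), hour_start.hour)].append(len(resources))
    return {slot: sum(values) / len(values) for slot, values in counts.items()}


events = [
    ("r1", datetime(2022, 5, 2, 9, 10)),   # Monday, 09:00 hour
    ("r2", datetime(2022, 5, 2, 9, 40)),
    ("r1", datetime(2022, 5, 2, 9, 50)),   # r1 again, still counted once
    ("r3", datetime(2022, 5, 2, 10, 15)),  # Monday, 10:00 hour
    ("r1", datetime(2022, 5, 9, 9, 5)),    # the following Monday, 09:00 hour
]
print(average_active_resources(events))
```

Weekdays follow `datetime.weekday()`, so 0 is Monday; the slot `(0, 9)` averages two resources on the first Monday and one on the second.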

Example of use

from log_distance_measures.circadian_workforce_distribution import circadian_workforce_distribution_distance
from log_distance_measures.config import DEFAULT_CSV_IDS

distance = circadian_workforce_distribution_distance(
    original_log, DEFAULT_CSV_IDS,  # First event log and its column id mappings
    simulated_log, DEFAULT_CSV_IDS,  # Second event log and its column id mappings
)

Release history

- 2.1.0 (Feb 20, 2025), this version
- 2.0.2 (Jan 24, 2025)
- 2.0.1 (Jan 23, 2025)
- 2.0.0 (Jan 19, 2024)
- 1.1.0 (Nov 15, 2023)
- 1.0.3 (Nov 13, 2023)
- 1.0.2 (May 10, 2023)
- 1.0.1 (May 10, 2023)