A Python implementation of the NIST md-eval.pl script for evaluating rich transcription accuracy.

Project description

NIST md-eval in Python

A Python implementation of the NIST md-eval.pl script for evaluating rich transcription and speaker diarization accuracy. This tool mimics the core functionality and scoring logic of the standard Perl script used in NIST evaluations (e.g., RT-0x), focusing on Diarization Error Rate (DER).

Overview
Installation
Usage
- Command Line Interface
- Python API
Input Formats
- RTTM (Rich Transcription Time Marked)
- UEM (Un-partitioned Evaluation Map)
Core Algorithms
Testing
Citation

Overview

mdeval calculates the Diarization Error Rate (DER) by comparing a system hypothesis (SYS) against a ground truth reference (REF). It reports three error components and supports two scoring options:

Missed Speech: Speech present in REF but not in SYS.
False Alarm: Speech present in SYS but not in REF.
Speaker Error: Speech assigned to the wrong speaker (after optimal mapping).
Collars: Optional no-score zones around reference segment boundaries.
Overlap handling: Option to exclude regions where multiple reference speakers talk simultaneously.

The goal is to provide a pure Python, dependency-free (or minimal dependency) alternative to the legacy Perl script for modern pipelines.

Installation

You can install the package via pip:

pip install mdeval

Usage

Command Line Interface

The package provides a CLI entry point mdeval.

python3 -m mdeval.cli -r <ref_rttm> -s <sys_rttm> [options]

Arguments:

-r, --ref: Path to the Reference RTTM file (Required).
-s, --sys: Path to the System/Hypothesis RTTM file (Required).
-u, --uem: Path to the UEM file defining evaluation regions (Optional. If omitted, the valid region is inferred from the Reference RTTM).
-c, --collar: Collar size in seconds (Float, default: 0.0). A "no-score" zone of +/- collar seconds is applied around every reference segment boundary.
-1, --single-speaker: Limit scoring to single-speaker regions only (ignore overlaps in REF). This is equivalent to "Overlap Exclusion".

Example:

python3 -m mdeval.cli -r ref.rttm -s hyp.rttm -c 0.25

Python API

You can use the scoring logic programmatically:

from mdeval.io import load_rttm, load_uem
from mdeval.scoring import score_speaker_diarization
from mdeval.utils import Segment

# Load Data
ref_data = load_rttm('ref.rttm')
sys_data = load_rttm('sys.rttm')

# Define the Evaluation Map explicitly...
uem_eval = [Segment(0.0, 100.0)]
# ...or load it from a UEM file:
# uem_data = load_uem('test.uem')
# uem_eval = uem_data['file1']['1']

# Parse specific file/channel data
ref_spkrs = {} # ... extract from ref_data['file1']['1']['SPEAKER']
sys_spkrs = {} # ... extract from sys_data['file1']['1']['SPEAKER']

# Score
stats, mapping = score_speaker_diarization(
    'file1', '1',
    ref_spkrs, sys_spkrs,
    uem_eval,
    collar=0.25,
    ignore_overlap=False
)

# Sum of the three error components, i.e. the numerator of the DER formula.
# Divide by the total scored speaker time to obtain the DER itself.
error_time = stats['MISSED_SPEAKER'] + stats['FALARM_SPEAKER'] + stats['SPEAKER_ERROR']
print(f"Total error time: {error_time}")

Input Formats

RTTM (Rich Transcription Time Marked)

Format used for both Reference and System inputs. Space-delimited text file. Lines starting with ; or # are ignored.

Required Columns (indices 0-8):

TYPE: Segment type (must be SPEAKER to be scored).
FILE: File name / Recording ID.
CHNL: Channel ID (e.g., 1).
TBEG: Start time in seconds (float).
TDUR: Duration in seconds (float).
ORTHO: Orthography field (ignored/placeholder, e.g., <NA>).
STYPE: Subtype (ignored/placeholder, e.g., <NA>).
NAME: Speaker Name/ID.
CONF: Confidence score (ignored/placeholder, e.g., <NA>).

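Example (the tenth field on each line is the SLAT field of the standard RTTM layout; it is not among the required columns above):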

SPEAKER file1 1 0.00 5.00 <NA> <NA> spk1 <NA> <NA>
SPEAKER file1 1 5.00 3.00 <NA> <NA> spk2 <NA> <NA>
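
To make the column layout concrete, the following is a minimal sketch of reading RTTM text into per-speaker time spans. The helper name `parse_rttm_lines` and the structure it returns are assumptions made for this illustration; they are not the package's `load_rttm` API.

```python
from collections import defaultdict


def parse_rttm_lines(lines):
    """Hypothetical helper: group SPEAKER spans by (file, channel, speaker)."""
    segments = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines starting with ';' or '#'.
        if not line or line[0] in ";#":
            continue
        fields = line.split()
        # Only SPEAKER records with the nine required columns are kept.
        if len(fields) < 9 or fields[0] != "SPEAKER":
            continue
        file_id, channel = fields[1], fields[2]
        tbeg, tdur = float(fields[3]), float(fields[4])
        name = fields[7]
        # Store each span as (start, end) rather than (start, duration).
        segments[(file_id, channel, name)].append((tbeg, tbeg + tdur))
    return dict(segments)


example = [
    "; comment line",
    "SPEAKER file1 1 0.00 5.00 <NA> <NA> spk1 <NA> <NA>",
    "SPEAKER file1 1 5.00 3.00 <NA> <NA> spk2 <NA> <NA>",
]
parsed = parse_rttm_lines(example)
print(parsed)
```

Converting TBEG/TDUR to start/end pairs up front keeps later interval arithmetic (overlaps, collars) simpler.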

UEM (Un-partitioned Evaluation Map)

Defines the time regions that should be evaluated. Regions outside the UEM are ignored. Space-delimited text file.

Required Columns:

FILE: File name.
CHNL: Channel ID.
TBEG: Start time of valid region.
TEND: End time of valid region.

Example:

file1 1 0.00 100.00
file1 1 120.00 300.00
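
A UEM file can be read in much the same way. The sketch below is illustrative only: `parse_uem_lines` is a hypothetical name, it assumes the four-column layout described above with no comment lines, and it does not reflect how the package's `load_uem` is implemented.

```python
def parse_uem_lines(lines):
    """Hypothetical helper: map (file, channel) to a sorted list of (tbeg, tend)."""
    regions = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        file_id, channel = fields[0], fields[1]
        tbeg, tend = float(fields[2]), float(fields[3])
        regions.setdefault((file_id, channel), []).append((tbeg, tend))
    for key in regions:
        regions[key].sort()
    return regions


uem = parse_uem_lines(["file1 1 0.00 100.00", "file1 1 120.00 300.00"])
# Total evaluated time for file1, channel 1.
total = sum(end - beg for beg, end in uem[("file1", "1")])
print(uem, total)
```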

Core Algorithms

Scoring Logic

The scoring is segment-based (time-weighted).

Metric: Diarization Error Rate (DER). $$ DER = \frac{\text{Missed Speaker Time} + \text{False Alarm Speaker Time} + \text{Speaker Error Time}}{\text{Total Scored Speaker Time}} $$
Segmentation: The timeline is split into contiguous segments where the set of reference and system speakers remains constant.
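Intersection: For each segment, the number of reference speakers ($N_{ref}$) and system speakers ($N_{sys}$) is compared. Following md-eval.pl, reference speakers in excess of system speakers count as missed speech, system speakers in excess of reference speakers count as false alarm, and the remaining speakers that are not matched under the optimal mapping count as speaker error, each weighted by the segment duration.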
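
The three steps above can be sketched as a toy scorer. This is an illustration under stated assumptions, not the package's code: `score_segments`, its argument layout, and the stat key names are invented for the example, and the speaker mapping is supplied by hand rather than computed.

```python
def score_segments(ref, hyp, mapping, uem):
    """Toy scorer. ref/hyp: {speaker: [(start, end), ...]};
    mapping: {ref_speaker: hyp_speaker}; uem: [(start, end), ...]."""
    # Every span edge is a point where the set of active speakers may change.
    cuts = set()
    for spans in list(ref.values()) + list(hyp.values()) + [uem]:
        for beg, end in spans:
            cuts.update((beg, end))
    cuts = sorted(cuts)

    def active(speakers, t):
        return {name for name, spans in speakers.items()
                if any(b <= t < e for b, e in spans)}

    stats = {"scored": 0.0, "miss": 0.0, "fa": 0.0, "error": 0.0}
    for beg, end in zip(cuts, cuts[1:]):
        mid = (beg + end) / 2
        if not any(b <= mid < e for b, e in uem):
            continue  # outside the evaluation map
        dur = end - beg
        ref_on, hyp_on = active(ref, mid), active(hyp, mid)
        n_ref, n_hyp = len(ref_on), len(hyp_on)
        # A reference speaker is correct if its mapped hypothesis speaker is active.
        n_correct = sum(1 for r in ref_on if mapping.get(r) in hyp_on)
        stats["scored"] += n_ref * dur
        stats["miss"] += max(0, n_ref - n_hyp) * dur
        stats["fa"] += max(0, n_hyp - n_ref) * dur
        stats["error"] += (min(n_ref, n_hyp) - n_correct) * dur
    return stats


ref = {"spk1": [(0.0, 5.0)], "spk2": [(5.0, 8.0)]}
hyp = {"A": [(0.0, 4.0)], "B": [(4.0, 9.0)]}
stats = score_segments(ref, hyp, {"spk1": "A", "spk2": "B"}, [(0.0, 10.0)])
der = (stats["miss"] + stats["fa"] + stats["error"]) / stats["scored"]
print(stats, der)  # 1 s of speaker error + 1 s of false alarm over 8 s scored -> 0.25
```

In this example the hypothesis switches to speaker B one second early (speaker error on 4 to 5 s) and keeps B active one second too long (false alarm on 8 to 9 s).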

Optimal Speaker Mapping

Since System speaker labels (e.g., "sys01") do not match Reference labels (e.g., "spk01"), a global 1-to-1 mapping is computed to minimize error.

We compute an overlap matrix between every reference speaker and every system speaker over the entire valid UEM duration.
The Hungarian Algorithm (implemented purely in Python, no scipy dependency required) is used to find the optimal assignment that maximizes total overlap time.
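
As a small illustration of the mapping objective, the sketch below builds the pairwise overlaps and picks the 1-to-1 assignment with the largest total overlap. Note that it uses an exhaustive search over permutations in place of the Hungarian algorithm, so it is only practical for a handful of speakers. It also ignores the UEM and assumes that a speaker's own spans do not overlap each other. The function names are hypothetical.

```python
from itertools import permutations


def overlap(spans_a, spans_b):
    """Total time during which both span lists are active."""
    return sum(max(0.0, min(e1, e2) - max(b1, b2))
               for b1, e1 in spans_a for b2, e2 in spans_b)


def best_mapping(ref, hyp):
    """Brute-force stand-in for the Hungarian algorithm (small inputs only)."""
    ref_names, hyp_names = sorted(ref), sorted(hyp)
    # Pad with None so that any reference speaker may remain unmapped.
    padded = hyp_names + [None] * len(ref_names)
    best, best_score = {}, -1.0
    for perm in permutations(padded, len(ref_names)):
        score = sum(overlap(ref[r], hyp[h])
                    for r, h in zip(ref_names, perm) if h is not None)
        if score > best_score:
            best_score = score
            best = {r: h for r, h in zip(ref_names, perm) if h is not None}
    return best, best_score


ref = {"spk1": [(0.0, 5.0)], "spk2": [(5.0, 8.0)]}
hyp = {"A": [(0.0, 4.0)], "B": [(4.0, 9.0)]}
mapping, score = best_mapping(ref, hyp)
print(mapping, score)  # {'spk1': 'A', 'spk2': 'B'} 7.0
```

The Hungarian algorithm reaches the same optimum in polynomial time, which matters once recordings contain many speakers.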

Collars

When collar > 0, a "no-score" zone is applied.

For every segment boundary in the Reference RTTM, a region of $t \pm collar$ is removed from the UEM.
This accounts for uncertainty in human-annotated segment boundaries.
Note: The Python implementation follows the logic of md-eval.pl's add_collars_to_uem subroutine, using a counter-based approach to subtract the union of all collar regions from the scoring UEM.
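
The effect of a collar on the UEM can be illustrated with plain interval subtraction. This sketch is a simplification and does not reproduce the counter-based approach of add_collars_to_uem; `subtract_intervals` and `apply_collars` are hypothetical names used only here.

```python
def subtract_intervals(base, holes):
    """Remove the union of `holes` from `base`; both are lists of (start, end)."""
    result = []
    holes = sorted(holes)
    for beg, end in sorted(base):
        cursor = beg  # left edge of the part of this interval still kept
        for h_beg, h_end in holes:
            if h_end <= cursor or h_beg >= end:
                continue  # hole does not touch the remaining part
            if h_beg > cursor:
                result.append((cursor, h_beg))
            cursor = max(cursor, h_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


def apply_collars(uem, ref, collar):
    """Drop +/- collar seconds around every reference segment boundary."""
    if collar <= 0:
        return sorted(uem)
    holes = []
    for spans in ref.values():
        for beg, end in spans:
            holes.append((beg - collar, beg + collar))
            holes.append((end - collar, end + collar))
    return subtract_intervals(uem, holes)


ref = {"spk1": [(0.0, 5.0)], "spk2": [(5.0, 8.0)]}
scored_uem = apply_collars([(0.0, 10.0)], ref, 0.25)
print(scored_uem)  # [(0.25, 4.75), (5.25, 7.75), (8.25, 10.0)]
```

Because the two speakers share a boundary at 5.0 s, their collars coincide and only one 0.5 s gap is removed there.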

Overlap Exclusion

If enabled (via -1 / --single-speaker), regions where two or more Reference speakers are speaking simultaneously are removed from the UEM.

This allows evaluation of systems that only output single-speaker segments.
Note: Overlap exclusion is applied before collars in the Perl script logic, but in effect both simply subtract time from the valid UEM.
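
Finding the regions to exclude amounts to a sweep over segment boundaries with a running count of active reference speakers. The following is a minimal sketch under the assumption that a single speaker's spans never overlap each other; `overlap_regions` is a hypothetical helper, not a function of the package.

```python
def overlap_regions(ref):
    """Return (start, end) regions where two or more reference speakers are active."""
    events = []
    for spans in ref.values():
        for beg, end in spans:
            events.append((beg, 1))   # a speaker starts
            events.append((end, -1))  # a speaker stops
    # At equal times, process stops before starts so that segments which
    # merely touch are not treated as overlapping.
    events.sort(key=lambda ev: (ev[0], ev[1]))
    regions, active, start = [], 0, None
    for time, delta in events:
        active += delta
        if active >= 2 and start is None:
            start = time
        elif active < 2 and start is not None:
            if time > start:
                regions.append((start, time))
            start = None
    return regions


ref = {"spk1": [(0.0, 6.0)], "spk2": [(5.0, 8.0)], "spk3": [(8.0, 9.0)]}
excluded = overlap_regions(ref)
print(excluded)  # [(5.0, 6.0)]
```

The returned regions would then be subtracted from the UEM, so that only single-speaker time is scored.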

Testing

The package includes unit tests using Python's unittest framework.

Run tests via:

python3 -m unittest discover tests

Citation

We developed this package as part of the following work:

@inproceedings{wang2018speaker,
  title={{Speaker Diarization with LSTM}},
  author={Wang, Quan and Downey, Carlton and Wan, Li and Mansfield, Philip Andrew and Moreno, Ignacio Lopez},
  booktitle={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={5239--5243},
  year={2018},
  organization={IEEE}
}

@inproceedings{xia2022turn,
  title={{Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection}},
  author={Wei Xia and Han Lu and Quan Wang and Anshuman Tripathi and Yiling Huang and Ignacio Lopez Moreno and Hasim Sak},
  booktitle={2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={8077--8081},
  year={2022},
  organization={IEEE}
}

@article{wang2022highly,
  title={Highly Efficient Real-Time Streaming and Fully On-Device Speaker Diarization with Multi-Stage Clustering},
  author={Quan Wang and Yiling Huang and Han Lu and Guanlong Zhao and Ignacio Lopez Moreno},
  journal={arXiv:2210.13690},
  year={2022}
}

Project details

Release history

0.1.3

Jan 12, 2026

This version

0.1.2

Jan 12, 2026

0.1.1

Jan 12, 2026

0.1.0

Jan 12, 2026

Download files

Source Distribution

mdeval-0.1.2.tar.gz (13.8 kB view details)

Uploaded Jan 12, 2026 Source

Built Distribution

mdeval-0.1.2-py3-none-any.whl (14.3 kB view details)

Uploaded Jan 12, 2026 Python 3

File details

Details for the file mdeval-0.1.2.tar.gz.

File metadata

Download URL: mdeval-0.1.2.tar.gz
Upload date: Jan 12, 2026
Size: 13.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for mdeval-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`538c273c2af131482c10e76a60c124f735f3dde5104dfedebd3a01fb11f5b91e`
MD5	`31d9c1d1d3cf67a69ce7157fef7bfefb`
BLAKE2b-256	`759e4d64246fecd6469cef73cd7e3a10fd329108bc38142a95f8abdbb10a8973`

File details

Details for the file mdeval-0.1.2-py3-none-any.whl.

File metadata

Download URL: mdeval-0.1.2-py3-none-any.whl
Upload date: Jan 12, 2026
Size: 14.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for mdeval-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b247c72897e68a0ea4acae24e7fb287715beaaf6ab3b490035abb59265a9fa21`
MD5	`170c135225bfc8766ebebff8cff4492d`
BLAKE2b-256	`9059c9fa28118dfbcb46fa147117dcb7f7577080a4e166412ee34cb3d0d38ab8`
