align4d: Multi-sequence alignment tools for aligning ASR and Speaker Diarization results

User Instruction

Introduction

align4d is a Python package for aligning the text output of Speaker Diarization and Speech Recognition models to a gold-standard transcript, especially when speakers overlap. This user manual provides a step-by-step guide on how to install, use, and troubleshoot the package.

Mechanism

align4d uses a global alignment algorithm, a multi-sequence variant of the Needleman-Wunsch algorithm, to align the hypothesis (the results generated by Speaker Diarization and Speech Recognition models) to the reference (usually a gold-standard transcript, which is separated into multiple sequences if there are multiple speakers). The alignment happens at the token level. For long sequences, align4d automatically separates the input into smaller segments by finding the absolutely aligned parts (called barriers), aligns the segments separately, and finally assembles them together.
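To make the idea of global alignment concrete, here is a minimal sketch of the classic two-sequence Needleman-Wunsch algorithm on token lists. It is not align4d's implementation: the package aligns one hypothesis against several reference sequences at once, while this sketch handles a single pair. The function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration.

```python
GAP = ""  # gaps are written as empty strings, as in align4d's output


def needleman_wunsch(hyp, ref, match=2, mismatch=-1, gap=-1):
    """Globally align two token lists and return two equal-length lists."""

    def pair_score(a, b):
        return match if a == b else mismatch

    n, m = len(hyp), len(ref)
    # score[i][j] is the best score for aligning hyp[:i] with ref[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(hyp[i - 1], ref[j - 1]),
                score[i - 1][j] + gap,  # hypothesis token aligned to a gap
                score[i][j - 1] + gap,  # reference token aligned to a gap
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_hyp, aligned_ref = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j]
                == score[i - 1][j - 1] + pair_score(hyp[i - 1], ref[j - 1])):
            aligned_hyp.append(hyp[i - 1])
            aligned_ref.append(ref[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_hyp.append(hyp[i - 1])
            aligned_ref.append(GAP)
            i -= 1
        else:
            aligned_hyp.append(GAP)
            aligned_ref.append(ref[j - 1])
            j -= 1
    return aligned_hyp[::-1], aligned_ref[::-1]


aligned_hyp, aligned_ref = needleman_wunsch(
    "ok I am a fish".split(), "I am a fish".split()
)
print(aligned_hyp)  # ['ok', 'I', 'am', 'a', 'fish']
print(aligned_ref)  # ['', 'I', 'am', 'a', 'fish']
```

The score table has one dimension per sequence, so the multi-sequence variant grows quickly with the number of speakers, which is one reason long inputs are aligned segment by segment.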

align4d uses Levenshtein Distance to measure the similarity between tokens during alignment. Each aligned position falls into one of 4 cases:

  1. Fully match. The two tokens are exactly the same (Levenshtein Distance is 0).
  2. Partially match. The two tokens are not exactly the same, but the Levenshtein Distance between them is within a boundary.
  3. Mismatch. The two tokens are different and the Levenshtein Distance between them exceeds the boundary.
  4. Gap. Only one token is present because it is aligned to a gap (insertion or deletion of a token).
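The four cases above can be sketched in a few lines of Python. This is an illustration only, not the package's code: levenshtein() and classify() are hypothetical helper names, gaps are assumed to be empty strings (as in the align() output), and the boundary is assumed to be inclusive with a default of 2.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def classify(hyp_token: str, ref_token: str, partial_bound: int = 2) -> str:
    """Map a pair of aligned tokens to one of the four cases."""
    if hyp_token == "" or ref_token == "":
        return "gap"
    distance = levenshtein(hyp_token, ref_token)
    if distance == 0:
        return "fully match"
    if distance <= partial_bound:
        return "partially match"
    return "mismatch"


print(classify("fish", "fish"))  # fully match
print(classify("ok", "okay"))    # partially match (distance 2)
print(classify("ok", "okay."))   # mismatch (distance 3)
print(classify("ok", ""))        # gap
```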

Installation

To install align4d, you need to have Python version 3.10 or higher. Follow these steps:

  1. Open your terminal or command prompt.
  2. Type in the following command: pip install align4d
  3. Wait for the package to download and install.
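After installing, a short script can confirm that the interpreter meets the Python 3.10 requirement and report which align4d version, if any, is installed. This is a minimal sketch that uses only the standard library; check_environment() is a hypothetical helper name and is not part of align4d.

```python
import sys
from importlib import metadata


def check_environment(package="align4d", min_python=(3, 10)):
    """Return (python_ok, installed_version or None)."""
    python_ok = sys.version_info >= min_python
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        installed = None  # the package is not installed in this environment
    return python_ok, installed


python_ok, installed = check_environment()
print("Python 3.10+:", python_ok)
print("align4d version:", installed or "not installed")
```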

Usage

Importing Align4d

To use Align4d in your Python code, you need to import it. Here's how:

from align4d import align

Aligning Text Results

Align4d can align results from Speaker Diarization and Speech Recognition. For simple and straightforward usage, the function can be called like this:

aligned_result = align.align(hypothesis, reference)

Here's the overview of all parameters of the function:

aligned_result = align.align(hypothesis: str | list[str], reference: list[list[str]], partial_bound: int = 2, segment_length: int = None, barrier_length: int = None, strip_punctuation: bool = True)

The align() function takes 6 parameters. hypothesis and reference are required; the other 4 are optional:

  1. hypothesis: This is either a list of strings or a single string containing tokenized text. In the list form, each string is a word generated by the Speech Recognition model. It is suggested to remove all punctuation, escape sequences, and any other characters that are not part of natural language.

    hypothesis = ["ok", "I", "am", "a", "fish", "Are", "you", "Hello", "there", "How", "are", "you", "ok"]
    # or 
    hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
    
  2. reference: This is a nested list of strings containing the utterances and speaker labels from the gold-standard text. In each inner list, the first string is the speaker label and the second string is the utterance. It is suggested to remove all punctuation, escape sequences, and any other characters that are not part of natural language.

    reference = [
        ["A", "I am a fish."],
        ["B", "okay."],
        ["C", "Are you?"],
        ["D", "Hello there."],
        ["E", "How are you?"]
    ]
    
  3. partial_bound: This is an integer that specifies the boundary between partially match and mismatch in terms of the Levenshtein Distance between the two tokens in comparison. This is an optional parameter and the default value is 2.

  4. segment_length: This is an integer that specifies the minimum length of each segment, measured in hypothesis tokens. When both segment_length and barrier_length are provided, the program performs manual segmentation of long sequences with these values before the actual alignment.

    If segment_length and barrier_length are not provided and the hypothesis is longer than 100 tokens, the program automatically searches for the optimal segment_length between 30 and 120, and messages like the following appear during alignment:

    segment length: 30 max hypothesis length: 13 max reference length: 12
    segment length: 31 max hypothesis length: 13 max reference length: 12
    segment length: 32 max hypothesis length: 13 max reference length: 12
    ...
    ...
    segment length: 117 max hypothesis length: 13 max reference length: 12
    segment length: 118 max hypothesis length: 13 max reference length: 12
    segment length: 119 max hypothesis length: 13 max reference length: 12
    optimal length: 119 optimal barrier length: 6
    

    If segment_length and barrier_length are not provided and the hypothesis length in terms of tokens is lower than 100, no segmentation will be performed.

    If segment_length and barrier_length are provided and both are integers less than or equal to 0, no segmentation will be performed.

    It is strongly suggested to use automatic or manual segmentation when the input sequences are long; otherwise the alignment may fail because it runs out of RAM.

    segment_length and barrier_length must be provided together to perform manual segmentation; otherwise an Exception is raised.

    Exception: Segment length or barrier length parameter incorrect or missing.
    
  5. barrier_length: This is an integer that specifies the length, in tokens, of the parts used to detect the absolutely aligned parts. This is an optional parameter and the default value is 6 if the parameter is not specified. When both segment_length and barrier_length are provided, the program performs manual segmentation of long sequences with these values before the actual alignment.

    segment_length and barrier_length must be provided together to perform manual segmentation; otherwise an Exception is raised.

    Exception: Segment length or barrier length parameter incorrect or missing.
    
  6. strip_punctuation: This is a boolean that specifies whether align4d strips all punctuation from the hypothesis and reference during alignment to produce a more accurate result. The default is True, and the output still contains the original punctuation.
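The segmentation rules described for segment_length and barrier_length can be summarized as a small decision function. This is only a sketch of the rules as documented here, not the library's code. segmentation_mode() is a hypothetical helper, and three points are assumptions: a hypothesis of exactly 100 tokens is treated as short, a mix of one positive and one non-positive value is treated as an error, and ValueError stands in for the generic Exception mentioned above.

```python
ERROR_MESSAGE = "Segment length or barrier length parameter incorrect or missing."


def segmentation_mode(hypothesis_length, segment_length=None, barrier_length=None):
    """Return "none", "auto", or "manual" following the documented rules."""
    if segment_length is None and barrier_length is None:
        # Neither value given: automatic search only for long hypotheses.
        return "auto" if hypothesis_length > 100 else "none"
    if segment_length is None or barrier_length is None:
        # Only one of the two values given.
        raise ValueError(ERROR_MESSAGE)
    if segment_length <= 0 and barrier_length <= 0:
        # Both given and non-positive: segmentation is switched off.
        return "none"
    if segment_length <= 0 or barrier_length <= 0:
        # Assumption: the documentation does not cover this mixed case.
        raise ValueError(ERROR_MESSAGE)
    return "manual"


print(segmentation_mode(13))           # none
print(segmentation_mode(500))          # auto
print(segmentation_mode(500, 0, 0))    # none
print(segmentation_mode(500, 60, 6))   # manual
```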

At this stage, the alignment function also prints information about the alignment calculation, including the size of the matrix used to store alignment scores, the number of speakers, the maximum score in the matrix, and the computation time.

 matrix size: 14 5 2 3 3 4  total cell: 5040 speaker num: 5 cell max score: 21
time: 0
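The numbers in the matrix size line are consistent with a score matrix that has one dimension per sequence, each of length "token count + 1": 14 for the 13 hypothesis tokens, then 5, 2, 3, 3, 4 for speakers A to E. This reading is inferred from the sample numbers rather than from the library's source, and matrix_cells() is a hypothetical helper. The sketch below reproduces the total of 5040 cells and shows why long, unsegmented inputs can run out of memory.

```python
from math import prod


def matrix_cells(hypothesis_tokens, reference_token_counts):
    """Return the matrix dimensions and the total number of cells."""
    dims = [hypothesis_tokens + 1] + [n + 1 for n in reference_token_counts]
    return dims, prod(dims)


# 13 hypothesis tokens; speakers A..E have 4, 1, 2, 2 and 3 tokens.
dims, total = matrix_cells(13, [4, 1, 2, 2, 3])
print(dims, total)  # [14, 5, 2, 3, 3, 4] 5040

# Ten times as many tokens per sequence gives far more than ten times the cells.
_, bigger = matrix_cells(130, [40, 10, 20, 20, 30])
print(bigger)
```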

The align() function returns a dictionary containing the aligned results. The key “hypothesis” maps to the hypothesis as a list of strings (tokens). The key “reference” maps to a secondary dictionary in which the reference is separated into multiple sequences according to the provided speaker labels: each speaker label is a key whose value is that speaker's sequence as a list of strings (tokens). Tokens that share the same index across the lists are aligned to each other, and a gap is denoted by “” (empty string). If there is punctuation in the input, it is preserved in the output.

import json

from align4d import align

hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
reference = [
        ["A", "I am a fish. "],
        ["B", "okay. "],
        ["C", "Are you? "],
        ["D", "Hello there. "],
        ["E", "How are you? "]
]
align_result = align.align(hypothesis, reference)
# Pretty-print the returned dictionary
print(json.dumps(align_result, indent=4))

Sample output from align():

# content in align_result
{
    "hypothesis": ['ok', 'I', 'am', 'a', 'fish.', 'Are', 'you?', 'Hello', 'there.', 'How', 'are', 'you?', 'ok'],
    "reference": {
        "A": ['', 'I', 'am', 'a', 'fish.', '', '', '', '', '', '', '', ''],
        "B": ['okay.', '', '', '', '', '', '', '', '', '', '', '', ''],
        "C": ['', '', '', '', '', 'Are', 'you?', '', '', '', '', '', ''],
        "D": ['', '', '', '', '', '', '', 'Hello', 'there.', '', '', '', ''],
        "E": ['', '', '', '', '', '', '', '', '', 'How', 'are', 'you?', '']
    }
}
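Because every list in the result has the same length, the structure can be read column by column. The sketch below takes the sample output above as a literal dictionary and lists, for each hypothesis position, the speakers whose reference sequence has a token there. speakers_per_position() is a hypothetical helper written for this illustration and is not part of align4d.

```python
# Sample align() output copied from this manual.
align_result = {
    "hypothesis": ["ok", "I", "am", "a", "fish.", "Are", "you?",
                   "Hello", "there.", "How", "are", "you?", "ok"],
    "reference": {
        "A": ["", "I", "am", "a", "fish.", "", "", "", "", "", "", "", ""],
        "B": ["okay.", "", "", "", "", "", "", "", "", "", "", "", ""],
        "C": ["", "", "", "", "", "Are", "you?", "", "", "", "", "", ""],
        "D": ["", "", "", "", "", "", "", "Hello", "there.", "", "", "", ""],
        "E": ["", "", "", "", "", "", "", "", "", "How", "are", "you?", ""],
    },
}


def speakers_per_position(result):
    """For each aligned position, list the speakers with a non-gap token."""
    length = len(result["hypothesis"])
    return [
        [spk for spk, tokens in result["reference"].items() if tokens[i] != ""]
        for i in range(length)
    ]


for token, speakers in zip(align_result["hypothesis"],
                           speakers_per_position(align_result)):
    print(f"{token!r:10} {speakers}")
```

An empty list, as for the final 'ok' here, means that the hypothesis token is aligned to a gap in every reference sequence.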

Retrieve token match result

Based on the alignment result, this tool provides a function to retrieve the matching result (fully match, partially match, mismatch, or gap) for each token. Use get_token_match_result() to retrieve the token-level matching result.

The criteria for determining the matching result are the following (also described in the Mechanism section):

  1. fully match: Levenshtein Distance = 0
  2. partially match: 0 < Levenshtein Distance ≤ boundary (default 2)
  3. mismatch: Levenshtein Distance > boundary (default 2)
  4. gap: aligned to a gap

get_token_match_result() takes 2 parameters: align_result, which is the direct return value of the align() function, and an optional parameter partial_bound, which must be the same as the partial_bound used in the align() call.

hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
reference = [
        ["A", "I am a fish. "],
        ["B", "okay. "],
        ["C", "Are you? "],
        ["D", "Hello there. "],
        ["E", "How are you? "]
]
align_result = align.align(hypothesis, reference)
token_match_result = align.get_token_match_result(align_result)
print(token_match_result)

The return value is a list of strings describing the token matching result. Each entry is one of fully match, partially match, mismatch, or gap.

# possible output for get_token_match_result()
['mismatch', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'gap']
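The returned list is easy to summarize with the standard library. The sketch below rebuilds the sample output above as a literal list (one mismatch, eleven full matches, one gap) and counts each category. The variable names are illustrative only.

```python
from collections import Counter

# Sample get_token_match_result() output copied from this manual.
token_match_result = ["mismatch"] + ["fully match"] * 11 + ["gap"]

counts = Counter(token_match_result)
fully_matched_share = counts["fully match"] / len(token_match_result)

print(counts["fully match"], counts["partially match"],
      counts["mismatch"], counts["gap"])  # 11 0 1 1
print(f"fully matched: {fully_matched_share:.1%}")
```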

Retrieve mapping from reference to hypothesis

Based on the alignment result, this tool provides a function to retrieve the mapping from each token in the reference sequences to the hypothesis sequence. Each index gives the position in the hypothesis sequence of a non-gap token (fully match, partially match, or mismatch) from one of the separated reference sequences. An index of -1 means that the token is not aligned to any token in the hypothesis (it is aligned to a gap).

To retrieve this mapping, use the function get_align_indices(). This function requires one parameter, align_result, which is the direct return value of the align() function.

hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
reference = [
        ["A", "I am a fish. "],
        ["B", "okay. "],
        ["C", "Are you? "],
        ["D", "Hello there. "],
        ["E", "How are you? "]
]
align_result = align.align(hypothesis, reference)
align_indices = align.get_align_indices(align_result)
print(align_indices)

The return value is a dictionary that maps each speaker label to a list of integers. Each integer is the index in the hypothesis sequence that the corresponding token of that speaker's reference sequence is mapped to (for example, the first token in sequence “C” is mapped to the hypothesis token with index 5).

# possible output
{
    'A': [1, 2, 3, 4],
    'B': [0],
    'C': [5, 6],
    'D': [7, 8],
    'E': [9, 10, 11]
}
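One way to use this mapping is to look up which hypothesis token each reference token landed on. The sketch below works on the sample values above as literals. hypothesis_tokens_by_speaker() is a hypothetical helper, and it assumes that the indices refer to positions in the "hypothesis" list of the alignment result, which holds for this sample because that list contains no gaps.

```python
# Sample values copied from this manual.
hypothesis_tokens = ["ok", "I", "am", "a", "fish.", "Are", "you?",
                     "Hello", "there.", "How", "are", "you?", "ok"]
align_indices = {
    "A": [1, 2, 3, 4],
    "B": [0],
    "C": [5, 6],
    "D": [7, 8],
    "E": [9, 10, 11],
}


def hypothesis_tokens_by_speaker(tokens, indices_by_speaker):
    """Replace each index with the hypothesis token, or None for -1 (gap)."""
    return {
        speaker: [tokens[i] if i != -1 else None for i in indices]
        for speaker, indices in indices_by_speaker.items()
    }


for speaker, words in hypothesis_tokens_by_speaker(
        hypothesis_tokens, align_indices).items():
    print(speaker, words)
```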

Troubleshooting

If you encounter any issues while using Align4d, try the following:

  1. Make sure you have installed Python version 3.10 or higher.
  2. Make sure you have installed the latest version of Align4d.
  3. Check the input data to make sure it is in the correct format.
    1. The length of the reference and reference_speaker_label needs to be the same.
    2. All the input strings must be encoded in the utf-8 format.
  4. For short conversation (hypothesis length ≤ 100), please use align_without_segment().
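The input format check can be automated before calling align(). The sketch below tests the shapes described in the Usage section: hypothesis as a string or a list of strings, and reference as a list of [speaker label, utterance] pairs. validate_inputs() is a hypothetical helper and not part of align4d; it only reports problems and does not modify the data.

```python
def validate_inputs(hypothesis, reference):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []

    is_token_list = isinstance(hypothesis, list) and all(
        isinstance(token, str) for token in hypothesis
    )
    if not (isinstance(hypothesis, str) or is_token_list):
        problems.append("hypothesis must be a str or a list of str")

    if not isinstance(reference, list):
        problems.append("reference must be a list of [speaker, utterance] pairs")
        return problems

    for n, entry in enumerate(reference):
        is_pair = (
            isinstance(entry, list)
            and len(entry) == 2
            and all(isinstance(item, str) for item in entry)
        )
        if not is_pair:
            problems.append(f"reference[{n}] must be [speaker_label, utterance]")
    return problems


good = validate_inputs("ok I am a fish", [["A", "I am a fish."], ["B", "okay."]])
bad = validate_inputs(["ok", 3], [["A", "I am a fish."], ["B"]])
print(good)  # []
print(bad)
```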

