The AuChAnn (Automatic CHAT Annotation) package can generate CHAT annotations based on transcript-correction pairs of utterances.

AuChAnn

AuChAnn is a Python package that provides Automatic CHAT Annotation based on a transcript string and an interpretation (or 'corrected') string. For example, when given:

Transcript: 'Ik wilt nu eh na huis'
Correction: 'Ik wil nu naar huis'

AuChAnn produces:

CHAT-Annotation: 'ik wilt [: wil] nu &-eh na(ar) [* s:r:prep] huis'

CHAT is an annotation convention that was developed for the CHILDES corpus (MacWhinney, 2000) and is used by many linguists to annotate speech. For more information on CHAT, see the CHAT manual: https://talkbank.org/manuals/CHAT.html.

AuChAnn was specifically developed to enhance linguistic data, consisting of a transcript and a linguist's interpretation of it, for use with SASTA (https://github.com/CentreForDigitalHumanities/sasta).

Getting Started

You can install AuChAnn using pip:

pip install auchann

Optionally, you can also install Sastadev, which is used to detect inflection errors:

pip install auchann[NL]

Once installed, the program can be run interactively from the console using the command auchann.

Import as Library

To use AuChAnn in your own Python applications, import the align_words function from the auchann.align_words module, as shown below. This function provides the main functionality of the package.

from auchann.align_words import align_words

transcript = input("Transcript: ")
correction = input("Correction: ")
alignment = align_words(transcript, correction)
print(alignment)

Settings

Various settings can be adjusted. Default values are used for every unchanged property.

from auchann.align_words import align_words, AlignmentSettings
import Levenshtein

settings = AlignmentSettings()

# Return the edit distance between the original and correction
settings.calc_distance = lambda original, correction: Levenshtein.distance(original, correction)

# Return an override of the distance and the error type;
# if error type is None the distance returned will be ignored
# Default method detects inflection errors
settings.detect_error = lambda original, correction: (1, "m") if original == "geloopt" and correction == "liep" else (0, None)

### Sastadev contains a helper function for Dutch which detects inflection errors
from sastadev.deregularise import detect_error
settings.detect_error = detect_error

# How many words could be split from one?
# e.g. das -> da(t) (i)s requires a lookahead of 2
# hoest -> hoe (i)s (he)t requires a lookahead of 3
settings.lookahead = 5

# Allow detection of replacements within a group,
# e.g. swapping articles; such a replacement is then
# marked with the specified key

# EXAMPLE:
# Transcript: de huis
# Correction: het huis
# de [: het] [* s:r:gc:art] huis
settings.replacements = {
    's:r:gc:art': ['de', 'het', 'een'],
    's:r:gc:pro': ['dit', 'dat', 'deze'],
    's:r:prep': ['aan', 'uit']
}

# Other lists to adjust
settings.fillers = ['eh', 'hm', 'uh']
settings.fragments = ['ba', 'to', 'mu']

### Example usage
transcript = input("Transcript: ")
correction = input("Correction: ")
alignment = align_words(transcript, correction, settings)
print(alignment)
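
The detect_error setting shown above takes a callable that receives the original and the corrected word and returns a (distance, error type) pair. A minimal sketch of a lookup-based detector follows. The table, its second entry, and the function name are hypothetical and not part of AuChAnn; the "m" error code is simply reused from the lambda above.

```python
from typing import Optional, Tuple

# Hypothetical lookup table: overregularised form -> intended irregular form.
# "geloopt" -> "liep" is taken from the lambda above; "loopte" is illustrative.
OVERREGULARISED = {
    "geloopt": "liep",
    "loopte": "liep",
}


def detect_overregularisation(original: str, correction: str) -> Tuple[int, Optional[str]]:
    """Return (distance override, error type).

    An error type of None signals that the returned distance should be ignored.
    """
    if OVERREGULARISED.get(original.lower()) == correction.lower():
        return 1, "m"
    return 0, None


print(detect_overregularisation("geloopt", "liep"))  # (1, 'm')
print(detect_overregularisation("loopt", "liep"))    # (0, None)
```

A function of this shape could be assigned in the same way as the Sastadev helper, i.e. settings.detect_error = detect_overregularisation.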

How it Works

The align_words function scans the transcript and correction and determines for each token whether a correction token is copied exactly from the transcript, replaces a token from the transcript, is inserted, or whether a transcript token has been omitted. Based on which of these operations has occurred, the function adds the appropriate CHAT annotation to the output string.

The algorithm uses edit distance to establish which words are replacements of each other, i.e. it links a transcript token to a correction token. Words with the lowest available edit distance are matched together, and based on this match the operations COPY and REPLACE are determined. If two candidates have the same edit distance to a token, word position is used to determine the match. The operations REMOVE and INSERT are established if no suitable match can be found for a transcript and correction token respectively.
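
The matching described above can be illustrated with a small, self-contained sketch. It is not AuChAnn's implementation: the greedy strategy, the max_ratio cut-off for a "suitable" match, the position tie-break via abs(i - j), and all function names are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Operation = Tuple[str, Optional[str], Optional[str]]


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_tokens(transcript: List[str], correction: List[str],
                 max_ratio: float = 0.5) -> List[Operation]:
    """Greedily link tokens and return (operation, transcript token, correction token)."""
    candidates = []
    for i, t_token in enumerate(transcript):
        for j, c_token in enumerate(correction):
            distance = edit_distance(t_token, c_token)
            # Hypothetical cut-off: only sufficiently similar words may be linked.
            if distance <= max_ratio * max(len(t_token), len(c_token)):
                candidates.append((distance, abs(i - j), i, j))

    links = {}
    used_correction = set()
    # Lowest distance first; ties are broken by the closest word position.
    for _, _, i, j in sorted(candidates):
        if i not in links and j not in used_correction:
            links[i] = j
            used_correction.add(j)

    operations: List[Operation] = []
    next_c = 0
    for i, t_token in enumerate(transcript):
        if i not in links:
            operations.append(("REMOVE", t_token, None))
            continue
        j = links[i]
        # Unmatched correction tokens before the linked one count as insertions.
        while next_c < j:
            if next_c not in used_correction:
                operations.append(("INSERT", None, correction[next_c]))
            next_c += 1
        kind = "COPY" if t_token == correction[j] else "REPLACE"
        operations.append((kind, t_token, correction[j]))
    for j in range(next_c, len(correction)):
        if j not in used_correction:
            operations.append(("INSERT", None, correction[j]))
    return operations


for operation in match_tokens("ik wilt nu eh na huis".split(),
                              "ik wil nu naar huis".split()):
    print(operation)
```

In this sketch "na" is as close to "nu" as it is to any other word, but "nu" is already claimed by its exact copy, so "na" falls through to "naar" and "eh" is left without a partner.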

In addition to establishing these four operations, the function detects several other properties of the transcript and correction which can be expressed in CHAT. For example, it determines whether a word is a filler or fragment, whether a conjugation error has occurred, or if a pronoun, preposition, or article has been used incorrectly.
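
As a rough illustration of how such properties could be turned into CHAT markup, the sketch below renders a list of operations as an annotated string. It covers only the cases whose output appears in this README (copies, fillers marked with "&-", and replacements with an optional group error code); the function names, the operation triples, and the decision to raise an error for every other case are assumptions, not AuChAnn's actual behaviour.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Operation = Tuple[str, Optional[str], Optional[str]]

# Word lists in the style of the settings example above.
FILLERS = {"eh", "hm", "uh"}
REPLACEMENTS = {
    "s:r:gc:art": ["de", "het", "een"],
    "s:r:gc:pro": ["dit", "dat", "deze"],
}


def error_code(original: str, corrected: str,
               replacements: Dict[str, List[str]]) -> Optional[str]:
    """Return the key of the group containing both words, if there is one."""
    for code, group in replacements.items():
        if original in group and corrected in group:
            return code
    return None


def annotate(operations: Iterable[Operation],
             fillers: Set[str] = FILLERS,
             replacements: Dict[str, List[str]] = REPLACEMENTS) -> str:
    """Render (operation, transcript token, correction token) triples as CHAT."""
    parts = []
    for kind, original, corrected in operations:
        if kind == "COPY":
            parts.append(original)
        elif kind == "REPLACE":
            annotation = f"{original} [: {corrected}]"
            code = error_code(original, corrected, replacements)
            if code is not None:
                annotation += f" [* {code}]"
            parts.append(annotation)
        elif kind == "REMOVE" and original in fillers:
            parts.append(f"&-{original}")
        else:
            # Insertions, fragments and other removals are left out of this sketch.
            raise ValueError(f"case not covered by this sketch: {kind} {original!r}")
    return " ".join(parts)


print(annotate([("REPLACE", "de", "het"), ("COPY", "huis", "huis")]))
# de [: het] [* s:r:gc:art] huis
```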

Development

To install the requirements:

pip install -r requirements.txt

To run the AuChAnn command-line function from the console:

python -m auchann

Run Tests

pip install pytest
pytest

Upload to PyPI

pip install pip-tools twine
python setup.py sdist
twine upload dist/*.tar.gz

Acknowledgments

The research for this software was made possible by the CLARIAH-PLUS project financed by NWO (Grant 184.034.023).

References

MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. 3rd Edition. Mahwah, NJ: Lawrence Erlbaum Associates.
