Laboratory-specific discourse analysis tools in the DIAAD lineage

These details have not been verified by PyPI

Project description

RASCAL - Resources for Analyzing Speech in Clinical Aphasiology Labs

PyPI version Python License Zenodo DOI

Open-source Python infrastructure for reproducible discourse analysis in clinical aphasiology.

⚠️ Development Notice

Active development of the generalized discourse-analysis engine now continues under the DIAAD project.

This repository remains available for historical continuity and may later serve as a wrapper or specialized extension of DIAAD.

Description

RASCAL is an open-source Python package designed to enhance efficiency, rigor, and replicability in a variety of monologic discourse analyses common in clinical aphasiology. Initially, the program tabularizes CHAT-formatted (.cha) transcripts with unique identifiers for samples and utterances, facilitating hierarchical organization through filename-derived grouping variables. While structured around our laboratory's workflow, users can run any selection of RASCAL's modules, which include algorithmic transcription reliability evaluation, complete utterance (CU) coding summarization, and batched core lexicon (CoreLex) analysis. The program aims to help researchers and clinicians streamline discourse analysis pipelines, particularly when managing large datasets.

Installation

We recommend installing RASCAL into a dedicated virtual environment using Anaconda:

1. Create and activate your environment:

conda create --name rascal python=3.12
conda activate rascal

2. Download RASCAL:

# directly from PyPI
pip install rascal-speech

# or from GitHub
pip install git+https://github.com/nmccloskey/rascal.git@main

Analysis Pipeline

BU-TU Semi-Automated Monologic Narrative Analysis Overview

Stage 0 (ASR + Manual): Complete transcription for all samples.
Stage 1 (RASCAL):
- Input: Transcriptions (.cha)
- Output: Transcription reliability samples
Stage 2 (Manual): Transcribe reliability samples
Stage 3 (RASCAL):
- Input: Original & reliability transcriptions
- Output: Transcription reliability reports, reselected reliability samples
Stage 4 (RASCAL):
- Input: Transcriptions (.cha)
- Output: Transcript tables, CU coding & reliability files
Stage 5 (Manual): CU coding and reliability checks
Stage 6 (RASCAL):
- Input: Manually completed CU coding files
- Output: CU reliability analysis, reselected CU reliability samples
Stage 7 (RASCAL):
- Input: CU coding files
- Output: CU coding summaries, word count files
Stage 8 (Manual): Word counting and reliability checks, record speaking times
Stage 9 (RASCAL):
- Input: Manually completed word count files
- Output: Word count reliability analysis, reselected reliability samples
Stage 10 (RASCAL):
- Input: Transcript tables (with speaking times), CU Coding summaries & word count files
- Output: Blind & unblind summaries, CoreLex analysis

Links

GitHub Repository

Access the full RASCAL source code, issue tracker, and development history:

👉 RASCAL on GitHub

Web App

Run RASCAL directly in your browser — no local installation required:

👉 RASCAL Web App

PyPI Distribution

Install the stable RASCAL package from the Python Package Index:

👉 rascal-speech on PyPI

pip install rascal-speech

Zenodo Archival Release

A fully versioned, citable archive of RASCAL (v1.0.0) is available on Zenodo, including:

RASCAL Instruction Manual (PDF)
Synthetic, non-identifiable example data for all major functionalities
MIT license text
A release-specific README

The example data include:

toy_data/: minimal example inputs.
function_*: representative output directories with metadata.
Please ensure HIPAA compliance when analyzing real clinical data.

👉 Zenodo record
📌 DOI: 10.5281/zenodo.17624073

Setup

To prepare for running RASCAL, complete the following steps:

1. Create your working directory:

We recommend creating a fresh project directory where you'll run your analysis.

Example structure:

your_project/
├── config.yaml           # Configuration file (see below)
└── rascal_data/
    └── input/            # Place your CHAT (.cha) files and/or Excel data here
                          # (RASCAL will make an output directory)

2. Provide a `config.yaml` file

This file specifies the directories, coders, reliability settings, and tier structure.

You can download the example config file from the repo or create your own like this:

input_dir: rascal_data/input
output_dir: rascal_data/output
random_seed: 8
reliability_fraction: 0.2
coders:
- '1'
- '2'
- '3'
CU_paradigms:
- SAE
- AAE
exclude_participants:
- INV
strip_clan: true
prefer_correction: true
lowercase: true
tiers:
  site:
    values:
    - AC
    - BU
    - TU
    partition: true
    blind: true
  test:
    values:
    - Pre
    - Post
    - Maint
    blind: true
  study_id:
    values: (AC|BU|TU)\d+
  narrative:
    values:
    - CATGrandpa
    - BrokenWindow
    - RefusedUmbrella
    - CatRescue
    - BirthdayScene

Explanation:

General
- random_seed - ensures deterministic selections for replicability
- reliability_fraction - the proportion of data to subset for reliability (default 20%).
- coders - alphanumeric coder identifiers (2 required for function g and 3 for c, see below).
- CU_paradigms - allows users to accommodate multiple dialects if desired. If at least two paradigms are entered, parallel coding columns will be prepared and processed in all CU functions.
- exclude_participants - speakers appearing in .cha files to exclude from transcription reliability and CU coding (neutral utterances).
Transcription Reliability
- strip_clan - removes CLAN markup but preserve speech-like content, including filled pauses (e.g., '&um' -> 'um') and partial words.
- prefer_correction - toggles policy for accepted corrections '[: x] [*]': True keeps x, False keeps original.
- lowercase - toggles case regularization.

Specifying tiers: The tier system facilitates tabularization by associating a unit of analysis with its possible values and extracting this information from the file name of individual transcripts.

Multiple values: enter as a comma- or newline-separated list. These are treated as literal choices and combined into a regex internally. See below examples.
- narrative: BrokenWindow, RefusedUmbrella, CatRescue
- test: PreTx, PostTx
Single value: treated as a regular expression and validated immediately. Examples include:
- Digits only: \\d+
- Lab site + digits: (AC|BU|TU)\\d+
- Three uppercase letters + three digits: [A-Z]{3}\\d{3}
Tier attributes
- Partition: creates separate coding files and separate reliability subsets by that tier. In this example, separate CU coding files will be generated for each site (AC, BU, TU), but not for each narrative or test value.
- Blind: generates blind codes for CU summaries (function j below).

Example: Tier-Based Tabularization from Filenames (according to the above config).

Source files:

TU88PreTxBrokenWindow.cha
BU77Maintenance_CatRescue.cha

Tabularization:

Site	Test	ParticipantID	Narrative
TU	Pre	TU88	BrokenWindow
BU	Maint	BU77	CatRescue

Running the Program

Once installed, RASCAL can be run from any directory using the command-line interface. Commands can be entered in succinct, expanded, or omnibus form (see below tables).

rascal [-h] [--config CONFIG] command [command ...]

Examples:

# Reselect transcription reliability:
rascal 3b
# or
rascal "transcripts reselect"

# Prepare utterance tables and CU coding & reliability files from chat transcripts:
rascal 4
# or
rascal 4a,4b
# or
rascal "utterances make", "cus make"

# Summarize CUs and specify config (default 'config.yaml'):
rascal 10a --config other_config.yaml

Batched CoreLex:

# Minimal command for batched CoreLex analysis of .cha-formatted transcripts:
rascal 10b
# or
rascal "corelex analyze"

# To include speaking rate, run the below and fill in the speaking_time column (in seconds) before calling 10b.
rascal 4a
# or
rascal "transcripts make"

Command Modes

Mode	Example	Typical Use
Succinct	`rascal 4b`	Fast single-stage execution
Expanded	`rascal "cus make"`	Clearer readability; use quotes for multi-word commands
Omnibus	`rascal 4`	Runs all sub-steps in that stage (e.g., 4a–4b)

RASCAL Pipeline Commands

Stage (succinct command)	Expanded command	Description	Input	Output	Function name
1a	transcripts select	Select transcription reliability samples	Raw `.cha` files	Reliability & full sample lists + template `.cha` files	`select_transcription_reliability_samples`
3a	transcripts evaluate	Evaluate transcription reliability	Reliability `.cha` pairs	Agreement metrics + alignment text reports	`evaluate_transcription_reliability`
3b	transcripts reselect	Reselect transcription reliability samples	Original + reliability transcription tables (from 1a)	New reliability subset(s)	`reselect_transcription_reliability_samples`
4a	transcripts make	Prepare transcript tables	Raw `.cha` files	Sample & utterance-level spreadsheets	`make_transcript_tables`
4b	cus make	Make CU coding & reliability files	Utterance tables (from 4a)	CU coding + reliability spreadsheets	`make_cu_coding_files`
6a	cus evaluate	Analyze CU reliability	Manually completed CU coding (from 4b)	Reliability summary tables + reports	`evaluate_cu_reliability`
6b	cus reselect	Reselect CU reliability samples	Manually completed CU coding (from 4b)	New reliability subset(s)	`reselect_cu_wc_reliability`
7a	cus analyze	Analyze CU coding	Manually completed CU coding (from 4b)	Sample- and utterance-level CU analyses	`analyze_cu_coding`
7b	words make	Make word count & reliability files	CU coding tables (from 7a)	Word count + reliability spreadsheets	`make_word_count_files`
9a	words evaluate	Evaluate word count reliability	Manually completed word counts (from 7b)	Reliability summaries + agreement reports	`evaluate_word_count_reliability`
9b	words reselect	Reselect word count reliability samples	Manually completed word counts (from 7b)	New reliability subset(s)	`reselect_cu_wc_reliability`
10a	cus summarize	Summarize CU coding & word counts	CU and WC coding results	Blind + unblind utterance and sample summaries + blind codes	`summarize_cus`
10b	corelex analyze	Run CoreLex analysis	CU and WC sample summaries	CoreLex coverage and percentile metrics	`run_corelex`

Command Mappings

Omnibus command	Succinct command	Expanded command
1	1a	transcripts select
4	4a, 4b	utterances make, cus make
7	7a, 7b	cus analyze, words make
10	10a, 10b	cus summarize, corelex analyze

RASCAL Workflow

Below is the current RASCAL pipeline.

RASCAL Workflow Chart

Stages 2, 5, & 8 are entirely manual. Dashed arrows show the alternate inputs to function 10b: function 4a output is required, and recording speaking time in the samples sheet is optional but recommended.

Notes

Input Transcriptions

.cha files must be formatted correctly according to CHAT conventions.
Ensure filenames match tier values as specified in config.yaml.
RASCAL searches tier values using exact spelling and capitalization.

Transcript Tables (function 4a)

Function 4a prepares both utterance- and sample-level tabulations of CHAT-formatted transcripts in Excel files, assigning unique alphanumeric identifiers encoding level of analysis – ‘S’ for sample and ‘U’ for utterance – for example, S008 and U0246.

The transcript_tables.xlsx output contains two sheets:

samples for transcript metadata, including file name and tier values
utterances for transcript content, i.e., (CHAT-coded) utterances & comments (from %com lines)

This encoded tabularization:

establishes unique, human-readable identifiers that satisfy database logic
facilitates data management across RASCAL inputs and outputs, including joins between tables
promotes transparency and consistency in text processing
minimizes potential bias during manual coding

If not provided, these tables are automatically generated from .cha inputs for functions 4b, 7b, & 10b.

Transcription Reliability Input (function 3a)

In both the CLI and webapp versions, RASCAL function 3a matches original with reliability transcripts based on common tiers plus a reliability tag in the file name, e.g., TU88_PreTxBrokenWindow.cha & TU88PreTxBrokenWindow_reliability.cha. Function 1a generates empty .cha file templates with the reliabiilty tag for the randomly selected samples. In the CLI version, reliability samples can be collected into a /reliability subdirectory in the input folder. The tier values must match the originals, but this provides an alternative to tagging filenames.

Logs & Metadata

The logs subdirectory in the output folder contains two files describing the program run:

rascal_YYMMDD_HHMM.log - contains log messages (e.g, runtime, detected files, errors, etc.)
rascal_YYMMDD_HHMM_metadata.json - takes a snapshot of input & output directory content just before program terminates

🧪 Testing

This project uses pytest for its testing suite.
All tests are located under the tests/ directory, organized by module/function.

Running Tests

To run the full suite:

pytest

Run with verbose output:

pytest -v

Run a specific test file:

pytest tests/test_coding/test_corelex_analyze.py

Status and Contact

I warmly welcome feedback, feature suggestions, or bug reports. Feel free to reach out by:

Submitting an issue through the GitHub Issues tab
Emailing me directly at: nsm [at] temple.edu

Thanks for your interest and collaboration!

Citation

If using RASCAL in your research, please cite:

McCloskey, N., et al. (2025, April). The RASCAL pipeline: User-friendly and time-saving computational resources for coding and analyzing language samples. Poster presented at the Aphasia Access Leadership Summit, Pittsburgh, PA.

There is a formal writeup in preparation.

(in prep) McCloskey, N., Hoover, E., Vitale, S., Kohen, F., and DeDe, G. (2025) The RASCAL pipeline: User-friendly and time-saving computational resources for coding and analyzing monologic discourse samples.

Acknowledgments

RASCAL builds on and integrates functionality from two excellent open-source tools which I highly recommend to researchers and clinicians working with language data:

batchalign2 – Developed by the TalkBank team, Batchalign provides a robust backend for automatic speech recognition (ASR). RASCAL was designed to function downstream of this system, leveraging its debulletized .cha files as input. This integration allows researchers to significantly expedite batch transcription, which without an ASR springboard might bottleneck discourse analysis.

Liu H, MacWhinney B, Fromm D, Lanzi A. Automation of Language Sample Analysis. J Speech Lang Hear Res. 2023 Jul 12;66(7):2421-2433. doi: 10.1044/2023_JSLHR-22-00642. Epub 2023 Jun 22. PMID: 37348510; PMCID: PMC10555460.

coreLexicon – A web-based interface for Core Lexicon analysis developed by Rob Cavanaugh, et al (2021). RASCAL implements its own Core Lexicon analysis that has high reliability with this web app: ICC(2) values (two-way random, absolute agreement) on primary metrics were 0.9627 for accuracy (number of core words) and 0.9689 for efficiency (core words per minute) - measured on 402 narratives (Brokem Window, Cat Rescue, and Refused Umbrella) from our conversation treatment study. RASCAL does not use the webapp but accesses the normative data associated with this repository (using Google sheet IDs) to calculate percentiles.
- Inspiration & overlap: RASCAL’s output table design was directly inspired by the original web app, with many of the same fields (accuracy, efficiency, percentiles, CoreLex tokens produced).
- Enhancements: RASCAL extends this model by (a) supporting batch analysis of uploaded/input tabular data (rather than manual entry through a web interface), and (b) including some new metrics, particularly a normalized lexicon coverage, which enables aggregate comparisons across narratives.
- Recommended use cases: The original web app remains an excellent choice for users working with a small number of samples who want individualized reports, while RASCAL's CoreLex functionality fills the niche of higher-throughput analysis ready for downstream statistical workflows.

Cavanaugh, R., Dalton, S. G., & Richardson, J. (2021). coreLexicon: An open-source web-app for scoring core lexicon analysis. R package version 0.0.1.0000. https://github.com/aphasia-apps/coreLexicon

Project details

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

This version

1.0.8

Mar 8, 2026

1.0.7

Jan 27, 2026

1.0.6

Dec 10, 2025

1.0.5

Nov 19, 2025

1.0.4

Nov 19, 2025

1.0.3

Nov 19, 2025

1.0.2

Nov 19, 2025

1.0.1

Nov 19, 2025

1.0.0

Nov 16, 2025

0.2.0

Oct 1, 2025

0.1.1

Sep 16, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rascal_speech-1.0.8.tar.gz (64.1 kB view details)

Uploaded Mar 8, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

rascal_speech-1.0.8-py3-none-any.whl (63.0 kB view details)

Uploaded Mar 8, 2026 Python 3

File details

Details for the file rascal_speech-1.0.8.tar.gz.

File metadata

Download URL: rascal_speech-1.0.8.tar.gz
Upload date: Mar 8, 2026
Size: 64.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for rascal_speech-1.0.8.tar.gz
Algorithm	Hash digest
SHA256	`b0c9c39e893dbdc40ef0ba6745622e412651927ba0f9b25aa30b2c92d6ebfe04`
MD5	`034b73cf5098a4ce416f6c89c1ea7b07`
BLAKE2b-256	`2cbae039f89e9b3fd3905802986c779635a5c87efd03d29b3a6783f54bd4de14`

See more details on using hashes here.

File details

Details for the file rascal_speech-1.0.8-py3-none-any.whl.

File metadata

Download URL: rascal_speech-1.0.8-py3-none-any.whl
Upload date: Mar 8, 2026
Size: 63.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for rascal_speech-1.0.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aad0df71b42dc7c4cc31fee805043f0af9c05dd6f91a54dccbddb5e352a6a5f1`
MD5	`14ffaebd4df11b0efaeacd217145ea4e`
BLAKE2b-256	`769ff87d2edc8be7b28a76885a468e1c8be49e791203ba6ce31a6885243e67af`

See more details on using hashes here.

rascal-speech 1.0.8

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

RASCAL - Resources for Analyzing Speech in Clinical Aphasiology Labs

Description

Installation

1. Create and activate your environment:

2. Download RASCAL:

Analysis Pipeline

BU-TU Semi-Automated Monologic Narrative Analysis Overview

Links

GitHub Repository

Web App

PyPI Distribution

Zenodo Archival Release

Setup

1. Create your working directory:

2. Provide a config.yaml file

Explanation:

Running the Program

Command Modes

RASCAL Pipeline Commands

Command Mappings

RASCAL Workflow

Notes

Input Transcriptions

Transcript Tables (function 4a)

Transcription Reliability Input (function 3a)

Logs & Metadata

🧪 Testing

Running Tests

Status and Contact

Citation

Acknowledgments

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

2. Provide a `config.yaml` file