ALASTR – Aggregate Linguistic Analysis of Speech Transcripts for Research

Status: Active development (early-stage, version 0.0.1a1).
Stability: APIs, module layout, and CLI interfaces are subject to change.
Audience: Researchers and clinicians working with clinical aphasiology and SLP discourse data.

ALASTR is a Python toolkit for scalable, scriptable analysis of clinical speech and language transcripts, with an emphasis on aphasia-focused workflows. It is designed to complement existing CHAT/CLAN-based pipelines by adding reproducible batch processing, richer linguistic feature extraction, and integration with downstream statistical analyses.

While ALASTR draws on concepts and components piloted in earlier prototypes (e.g., CLATR), it is being developed as the lab-facing, aphasiology-specialized system, with a clearer focus on clinical narratives, paraphasias, disfluencies, and other discourse-level phenomena relevant to treatment and outcomes research.


Core Aims

  • Scalability: Process many transcripts in batch (across participants, timepoints, or conditions) with consistent configuration and logging.
  • Clinical relevance: Target metrics and summaries that are meaningful for aphasiology and speech–language pathology.
  • Interoperability with CHAT/CLAN: Automatically populate tiers (e.g., morphology) in CHAT-formatted (.cha) transcripts, enabling semi-automated workflows.
  • Integration with other tools: Provide hooks for metrics and outputs from systems such as RASCAL (monologic discourse analysis) and DIAAD (dialogue analysis).
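
To make the batch-processing aim concrete, here is a minimal sketch using only the Python standard library. It is not ALASTR's implementation: the names `find_transcripts`, `process_batch`, and the `analyze` callback are hypothetical and chosen only for illustration. The sketch walks a directory tree for .cha files in a reproducible order, applies one analysis function to each, and logs files it could not read instead of stopping.

```python
import logging
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical logger name; not part of ALASTR.
logger = logging.getLogger("batch_sketch")


def find_transcripts(root: Path) -> List[Path]:
    """Return every .cha file under root, sorted so runs are reproducible."""
    return sorted(p for p in Path(root).rglob("*.cha") if p.is_file())


def process_batch(root: Path, analyze: Callable[[str], dict]) -> Dict[str, dict]:
    """Apply `analyze` to each transcript; key results by path relative to root."""
    root = Path(root)
    results: Dict[str, dict] = {}
    for path in find_transcripts(root):
        key = path.relative_to(root).as_posix()
        try:
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as err:
            # Log and continue so one unreadable file does not stop the batch.
            logger.warning("skipped %s: %s", key, err)
            continue
        results[key] = analyze(text)
        logger.info("processed %s", key)
    return results


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Assumed layout: transcripts somewhere below the current directory.
    summary = process_batch(Path("."), lambda text: {"n_lines": len(text.splitlines())})
    print(f"{len(summary)} transcript(s) processed")
```

Keying results by relative path keeps participants apart when different sites or timepoints reuse the same file name.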

High-Level Functionality (Planned / Emerging)

  • Transcript ingestion and organization

    • Read, validate, and organize transcripts (e.g., by group, site, timepoint).
    • Support CHAT-formatted transcripts, with planned adapters for other formats.
  • Linguistic feature extraction

    • Token-level and utterance-level features using spaCy and related NLP libraries.
    • Tier-aware processing (e.g., mapping CHAT tiers into structured tables).
    • Preliminary support for paraphasia and disfluency-related annotations.
  • Batch summarization and export

    • Participant-level and group-level summary tables (e.g., lexical, syntactic, discourse measures).
    • Integration points for CoreLex counts (via RASCAL) and other domain metrics.
    • Consistent output schemas suitable for downstream statistics in R, Python, or other tools.
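
The tier-mapping and summary-table ideas above can be sketched as follows. This is a simplified illustration, not ALASTR's parser. It relies only on the basic CHAT line conventions: `@` starts a header, `*SPK:` starts a main tier, `%name:` starts a dependent tier, and a leading tab continues the previous line. The function names, the sample transcript, and the whitespace tokenization are all assumptions; real transcripts carry many CHAT codes (retracings, fillers, error codes) that this sketch does not interpret.

```python
from collections import defaultdict
from typing import Dict, List


def parse_chat(text: str) -> List[dict]:
    """Map main and dependent tiers into one row (dict) per utterance."""
    rows: List[dict] = []
    last_key = None  # which field a tab-indented continuation line extends
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("@"):
            last_key = None  # headers are skipped in this sketch
        elif line.startswith("*"):
            speaker, _, content = line[1:].partition(":")
            rows.append({"speaker": speaker, "utterance": content.strip(), "tiers": {}})
            last_key = "utterance"
        elif line.startswith("%") and rows:
            name, _, content = line[1:].partition(":")
            rows[-1]["tiers"][name] = content.strip()
            last_key = name
        elif line.startswith("\t") and rows and last_key:
            extra = line.strip()
            if last_key == "utterance":
                rows[-1]["utterance"] += " " + extra
            else:
                rows[-1]["tiers"][last_key] += " " + extra
    return rows


def summarize_by_speaker(rows: List[dict]) -> Dict[str, dict]:
    """Count utterances, word tokens, and word types per speaker (naive tokenization)."""
    tokens = defaultdict(list)
    utterances = defaultdict(int)
    for row in rows:
        utterances[row["speaker"]] += 1
        # Keep only items containing a letter, which drops bare punctuation.
        words = [w.lower() for w in row["utterance"].split() if any(c.isalpha() for c in w)]
        tokens[row["speaker"]].extend(words)
    return {
        spk: {"utterances": n, "tokens": len(tokens[spk]), "types": len(set(tokens[spk]))}
        for spk, n in utterances.items()
    }


# Made-up sample transcript, for illustration only.
SAMPLE = "\n".join([
    "@Begin",
    "@Participants:\tPAR Participant, INV Investigator",
    "*INV:\ttell me about the picture .",
    "*PAR:\tthe boy is on the the stool .",
    "%com:\tmade-up comment tier for illustration",
    "*PAR:\tcookie jar .",
    "@End",
])

if __name__ == "__main__":
    for speaker, stats in summarize_by_speaker(parse_chat(SAMPLE)).items():
        print(speaker, stats)
```

For the sample above, the PAR row comes out as 2 utterances, 9 tokens, and 7 types. The nested dictionary can be written out with `csv.DictWriter` or loaded into a data frame for downstream statistics.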

Installation (Early Preview)

From GitHub:

git clone https://github.com/nmccloskey/ALASTR.git
cd ALASTR
pip install -e .

From PyPI:

pip install alastr

You may wish to create and activate a dedicated virtual environment or conda environment before installing.


Usage (Very Early Sketch)

CLI and API interfaces are still evolving. A minimal example of the intended usage pattern might eventually look like:

alastr run \
  --config path/to/config.yaml \
  --input-transcripts path/to/cha/files \
  --output-dir path/to/output

or, in Python:

from alastr.pipeline import run_pipeline

run_pipeline(
    config_path="path/to/config.yaml",
    input_root="path/to/cha/files",
    output_root="path/to/output",
)

Exact function names and options are likely to change as the design stabilizes.


Project Status and Roadmap

ALASTR is under active development and not yet recommended for routine clinical or research deployment. Near-term goals include:

  • Stabilizing the package layout and configuration system.
  • Implementing an end-to-end demo pipeline on a small aphasia dataset.
  • Adding basic tests and continuous integration.
  • Documenting example workflows and key metrics for clinical researchers.

Citation and Contributions

A formal citation will be provided once an ALASTR methods paper is available. Until then, if you use concepts or code from this repository in academic work, please:

  • Cite the GitHub repository URL, and
  • Acknowledge ALASTR as an early-stage tool under development.

Issues, suggestions, and (well-scoped) pull requests are welcome, with the understanding that the codebase is still evolving.
