CLATR - Comprehensive Linguistic Analysis of Text for Research

CLATR Status Notice

CLATR is currently in a developmental and transitional state (version 0.0.1a1). This repository reflects an early-stage general linguistic analysis prototype that originated as an independent research project. While the architecture remains of interest, the active line of development has now shifted to ALASTR, a specialized speech transcript-focused fork designed to meet domain-specific needs.

Overview

CLATR is an integrative Python pipeline for linguistic analysis of text data in research settings. It handles preprocessing, runs multiple specialized linguistic analyses, and manages output, including aggregation, group comparison, clustering, and exploratory data analysis (EDA).

Features

  • Preprocessing: Tokenization and structuring of input text data
  • Sentence/Document Level: Granularity is controlled by the sentence_level setting
  • Output Options: Raw tables, aggregated tables, clustering, visualizations
  • Configurable Sections: Enable/disable individual analyses via the sections setting
    • Graphemes
    • Lexicon
    • Morphology
    • Syntax
    • Phonology
    • Semantics
    • Mechanics

How It Works

  1. Initialization
    • OutputManager reads settings and prepares output tables
    • PipelineManager sets up selected analysis modules

  2. Preprocessing
    • Input .cha files are parsed, speaker turns cleaned, and sentence/doc-level samples created

  3. Analysis Pipeline
    • For each selected section:
      • Raw tables are created per granularity (doc/sent)
      • Each sample is processed and results collected
      • Data is written to Excel, optionally clustered and aggregated
      • Visualizations are generated

  4. Output
    • Excel files saved under /output/<section>/<granularity>
    • Clustering, aggregation, and visualizations are optional
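
The preprocessing and output steps above can be illustrated with a minimal sketch. It is not CLATR's actual implementation: the function names (parse_cha_utterances, make_samples, output_dir) and the example transcript are hypothetical, and it assumes that CHAT main-tier lines start with "*SPEAKER:" and that the granularity folders are named "doc" and "sent". Continuation lines and CHAT annotation codes are ignored for brevity.

```python
import re
from pathlib import Path

# Hypothetical example transcript; real .cha files carry more headers and tiers.
EXAMPLE_CHA = (
    "@Begin\n"
    "*INV:\ttell me the story .\n"
    "*PAR:\tthe cat was in the tree .\n"
    "%com:\texample dependent tier\n"
    "*PAR:\tgrandpa got a ladder .\n"
    "@End"
)


def parse_cha_utterances(cha_text, exclude_speakers=("INV",)):
    """Return (speaker, utterance) pairs from main tiers, skipping excluded speakers."""
    utterances = []
    for line in cha_text.splitlines():
        match = re.match(r"^\*([A-Z0-9]+):\s*(.*)$", line)
        if match is None:
            continue  # header lines (@) and dependent tiers (%) are skipped
        speaker, text = match.groups()
        if speaker not in exclude_speakers:
            utterances.append((speaker, text.strip()))
    return utterances


def make_samples(utterances, sentence_level):
    """Return one sample per utterance, or a single document-level sample."""
    texts = [text for _, text in utterances]
    if sentence_level:
        return texts
    return [" ".join(texts)]


def output_dir(root, section, sentence_level):
    """Build <root>/<section>/<granularity>, mirroring the layout described above."""
    granularity = "sent" if sentence_level else "doc"
    return Path(root) / section / granularity
```

With this sketch, parse_cha_utterances(EXAMPLE_CHA) keeps only the two PAR utterances, and make_samples turns them into either two sentence-level samples or one document-level sample.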

Installation

We recommend installing CLATR into a dedicated virtual environment using Anaconda:

1. Create and activate your environment:

conda create --name clatr python=3.12
conda activate clatr

2. Install CLATR from GitHub:

pip install git+https://github.com/nmccloskey/clatr.git@main

or from PyPI:

pip install clatr
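
To confirm that the install landed in the active environment, a short check with the Python standard library may help. This is only a sketch: installed_version is a hypothetical helper, and it assumes the distribution is registered under the name "clatr", as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the installed CLATR version, or None if it is not installed here.
print(installed_version("clatr"))
```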

Setup

To prepare for running CLATR, complete the following steps:

1. Create your working directory:

We recommend creating a fresh project directory where you'll run your analysis.

Example structure:

your_project/
├── config.yaml           # Configuration file (see below)
└── data/
    └── input/            # Place your CHAT (.cha) files and/or Excel data here
                          # (CLATR will make output and sqlite database directories)
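
If you prefer to script this step, the layout above can be created with pathlib. This is a convenience sketch, not part of CLATR: create_project_skeleton is a hypothetical helper, and the empty config.yaml it creates is only a placeholder to be filled in as described in the next step.

```python
from pathlib import Path


def create_project_skeleton(root):
    """Create <root>/data/input and an empty config.yaml placeholder."""
    root = Path(root)
    input_dir = root / "data" / "input"
    input_dir.mkdir(parents=True, exist_ok=True)  # safe to re-run
    config_path = root / "config.yaml"
    config_path.touch(exist_ok=True)  # does not overwrite existing content
    return input_dir, config_path
```

Calling create_project_skeleton("your_project") from the parent directory produces the structure shown above; your .cha files and/or Excel data then go into the returned input directory.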

2. Provide a config.yaml file:

This file specifies the directories, selected analysis sections, and tier structure.

You can download the example config file from the repo or create your own like this (adjust the directory paths to match your project layout):

# Identify directories.
input_dir: "clatr_data/input"
output_dir: "clatr_data/output"
output_label: "test"
database_dir: "clatr_data/database"

# Control tabular output,
cluster: False
aggregate: False
compare_groups: False

# and visual output.
visualize: False
cohen_d_threshold: 0.8
max_feature_visuals: 5

# Designate groupings.
tiers: {
    site: {partition: False, regex: AC|BU|TU},
    test: {partition: False, regex: Pre|Post|Maint},
    participantID: {partition: False, regex: (AC|BU|TU)\d+},
    narrative: {partition: False, regex: CATGrandpa|BrokenWindow|RefusedUmbrella|CatRescue|BirthdayScene}
}

# Compare groups by each listed tier or tier combination.
comparison_combos: [
    [test],
    [narrative],
]

all_comparison_combos: False
compare_with_clusters: False

# Aggregate by each listed tier or tier combination.
aggregation_combos: [
    [site],
    [test],
    [narrative],
    [participantID],
    [test, narrative],
    [test, participantID]
]

all_aggregation_combos: False
aggregate_with_clusters: False

## CLATR-specific:
# Specify granularity.
sentence_level: False

# Select analyses.
sections: {
    graphemes: False,
    lexicon: True,
    morphology: False,
    syntax: False,
    phonology: False,
    semantics: False,
    mechanics: False
}

ngrams: 5

dep_trees: False

# .cha files
exclude_speakers: [INV]
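
To make the tiers and combination settings more concrete, here is a small sketch of how the tier regexes could label a transcript and how a full set of tier combinations could be enumerated. It rests on assumptions not confirmed by this README: that each tier's regex is searched against the file name, that the file name "AC12_Pre_BrokenWindow.cha" is a plausible input, and that all_aggregation_combos and all_comparison_combos expand to every non-empty combination of tiers. The helper names extract_tiers and all_combos are hypothetical.

```python
import re
from itertools import combinations

# Tier names and regexes copied from the example config above.
TIERS = {
    "site": r"AC|BU|TU",
    "test": r"Pre|Post|Maint",
    "participantID": r"(AC|BU|TU)\d+",
    "narrative": r"CATGrandpa|BrokenWindow|RefusedUmbrella|CatRescue|BirthdayScene",
}


def extract_tiers(file_name, tiers=TIERS):
    """Map each tier name to its first regex match in file_name (None if no match)."""
    values = {}
    for name, pattern in tiers.items():
        match = re.search(pattern, file_name)
        values[name] = match.group(0) if match else None
    return values


def all_combos(tier_names):
    """List every non-empty combination of tier names, smallest first."""
    names = list(tier_names)
    return [
        list(combo)
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


labels = extract_tiers("AC12_Pre_BrokenWindow.cha")  # hypothetical file name
combos = all_combos(["test", "narrative"])
```

Here labels maps site to "AC", test to "Pre", participantID to "AC12", and narrative to "BrokenWindow", while combos contains [test], [narrative], and [test, narrative]. With four tiers the full expansion has 15 combinations, so listing only the combinations you need, as in the example config, keeps the output manageable.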

Running the Program

Once installed, CLATR can be run from any directory using the command-line interface:

clatr

Status and Contact

This tool is released as a public beta version and is still under active development. While the core functionality is stable and has been used in research contexts, its robustness, error handling, and user-friendliness still need refinement.

I warmly welcome feedback, feature suggestions, or bug reports. Feel free to reach out by:

  • Submitting an issue through the GitHub Issues tab

  • Emailing me directly at: nsm [at] temple.edu

Thanks for your interest and collaboration!

Repository Notes

This repository reflects a clean reinitialization of the development history as of April 2025. Earlier commits were removed to:

  1. Respect data privacy for sensitive clinical transcript content, even though all .cha files used during development were de-identified
  2. Eliminate unnecessary storage of output, logs, and database files that were not properly excluded in the previous .gitignore

No core functionality or implementation history has been lost, and the full pipeline has been preserved in its final state. All future development will follow a transparent version-controlled workflow.

Citation

If using CLATR in your research, please cite:

McCloskey, N., et al. (2025, April). The RASCAL pipeline: User-friendly and time-saving computational resources for coding and analyzing language samples. Poster presented at the Aphasia Access Leadership Summit, Pittsburgh, PA.
