
Regex-driven extraction with negation for clinical text (SUD-focused).


sudregex

Version: 0.1.2

A lightweight, high-throughput pipeline for regex-driven extraction with negation and false-positive pruning—built for Substance Use Disorder (SUD) research, but flexible enough for general clinical text mining.


✨ Features

  • Negation detection – Filter matches when preceded by cues (e.g., “no”, “denies”, “not”).
  • False-positive pruning – Drop matches in noisy contexts (e.g., discharge instructions, family history).
  • Substance context window – Confirm that matches occur near a user-supplied vocabulary (e.g., opioid, alcohol terms).
  • Line-break normalization – Remove literal markers (default "$+$") and collapse whitespace.
  • Batteries included – A ready-to-use “ABC” checklist for common SUD signals.
  • CLI & Python API – Use from shell scripts or notebooks.
  • Deterministic previews – Sampling uses a fixed seed for reproducible tests.
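
The first few features can be illustrated with a minimal sketch using only the standard library. It is not sudregex's implementation: the cue list, the term list, the window sizes, and every function name below are assumptions chosen for illustration.

```python
import re

# Hypothetical vocabularies for illustration; not sudregex's packaged lists.
NEGATION_CUES = ("no", "denies", "not")
OPIOID_TERMS = ("opioid", "heroin", "oxycodone")


def normalize(text, marker="$+$"):
    """Remove the literal line-break marker and collapse runs of whitespace."""
    return " ".join(text.replace(marker, " ").split())


def is_negated(text, start, window=30):
    """True if a negation cue appears in the `window` characters before `start`."""
    preceding = text[max(0, start - window):start].lower()
    return any(re.search(rf"\b{re.escape(cue)}\b", preceding) for cue in NEGATION_CUES)


def near_term(text, start, end, terms, window=40):
    """True if any vocabulary term occurs within `window` characters of the match."""
    context = text[max(0, start - window):end + window].lower()
    return any(term in context for term in terms)


def extract(text, pattern, terms):
    """Return matched strings that are not negated and sit near a vocabulary term."""
    clean = normalize(text)
    return [
        m.group(0)
        for m in re.finditer(pattern, clean, flags=re.IGNORECASE)
        if not is_negated(clean, m.start()) and near_term(clean, m.start(), m.end(), terms)
    ]
```

With these assumptions, extract("Patient denies$+$opioid misuse.", r"misuse", OPIOID_TERMS) drops the hit because "denies" precedes it, while the same pattern in "Reports daily heroin misuse since 2019." is kept because no cue precedes it and "heroin" falls inside the context window.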

📦 Installation

# From PyPI
pip install sudregex


# From source (dev)
git clone https://github.com/quantitativenurse/sud-regex.git
cd sud-regex
python -m venv .venv && source .venv/bin/activate
pip install -U pip
pip install -e ".[dev]"   # installs sudregex + black, isort, flake8, pytest, etc. (quotes keep zsh from globbing the brackets)

Usage

Quick Start (CLI)

sudregex --help

Run extraction (CSV with commas) using the default pruning behavior:

sudregex --extract \
  --in_file path/to/notes.csv \
  --out_file path/to/results.csv \
  --checklist path/to/checklist.py \
  --termslist path/to/termslist.py \
  --terms_active alcohol_terms,opioid_terms \
  --separator , \
  --parallel --n-workers 2

Discharge-instruction pruning

By default, sudregex excludes matches that occur in discharge-instruction contexts.

  • Default: no flag needed, or explicit:
  sudregex --extract ... --exclude-discharge-mentions

To keep discharge-context hits:

sudregex --extract \
  --in_file path/to/notes.csv \
  --out_file path/to/results_raw.csv \
  --checklist path/to/checklist.py \
  --termslist path/to/termslist.py \
  --terms_active alcohol_terms \
  --include-discharge-mention
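
One way such pruning could work is sketched below. The README does not show how sudregex detects a discharge-instruction context, so this is an assumption: the header pattern and the function name are hypothetical, and only the keyword exclude_discharge_mentions is borrowed from the Python API.

```python
import re

# Hypothetical header cue; sudregex's actual discharge patterns are not shown in this README.
DISCHARGE_HEADER = re.compile(r"discharge instructions?\s*:", re.IGNORECASE)


def prune_discharge_hits(text, pattern, exclude_discharge_mentions=True):
    """Return matches, optionally dropping those located after a discharge header."""
    hits = list(re.finditer(pattern, text, flags=re.IGNORECASE))
    if not exclude_discharge_mentions:
        return [m.group(0) for m in hits]
    header = DISCHARGE_HEADER.search(text)
    # Without a header, nothing is pruned: the cutoff sits at the end of the note.
    cutoff = header.start() if header else len(text)
    return [m.group(0) for m in hits if m.start() < cutoff]


note = (
    "HPI: reports alcohol use daily. "
    "Discharge Instructions: avoid alcohol use while on medication."
)
kept = prune_discharge_hits(note, r"alcohol use")
raw = prune_discharge_hits(note, r"alcohol use", exclude_discharge_mentions=False)
```

Here kept holds only the mention from the history section, while raw also keeps the one that follows the discharge header.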

Custom separator

Clinical notes often contain commas, semicolons, tabs, and other punctuation as part of natural language. Using these characters as delimiters can cause unintended splits and parsing errors, especially when extracting structured information from note text fields. In our work, we use the custom marker |^| because:

  • It is highly unlikely to appear naturally in clinical documentation.
  • It provides a clear, unambiguous boundary between segments.
  • It avoids conflicts with commonly used punctuation, improving extraction accuracy.
  • It simplifies line-break normalization and downstream processing.

This choice keeps the pipeline robust across diverse note formats. To use a custom separator (any unique token unlikely to appear in notes):

sudregex --extract \
  --in_file path/to/notes.txt \
  --out_file path/to/results.csv \
  --checklist path/to/checklist.py \
  --termslist path/to/termslist.py \
  --terms_active opioid_terms \
  --separator '|^|'    # or any safe custom delimiter
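
To show why a multi-character marker sidesteps punctuation inside note text, here is a minimal sketch that splits |^|-delimited lines with plain string operations. It does not reflect how sudregex reads its input; parse_delimited and the sample rows are hypothetical.

```python
SEPARATOR = "|^|"


def parse_delimited(lines, separator=SEPARATOR):
    """Split a header line and data lines on a multi-character separator."""
    rows = [line.rstrip("\n").split(separator) for line in lines if line.strip()]
    header, records = rows[0], rows[1:]
    return [dict(zip(header, record)) for record in records]


sample = [
    "note_id|^|note_text\n",
    "1|^|Pt reports pain, nausea; denies alcohol use.\n",
]
parsed = parse_delimited(sample)
```

The comma and semicolon inside the note survive intact because only the full |^| token is treated as a boundary. If you load such a file with pandas instead, note that read_csv treats a separator longer than one character as a regular expression, so the marker would need escaping (for example with re.escape) and engine="python".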

Quick Start (Python API)

import sudregex as sud

# Use the packaged defaults if desired
checklist = sud.checklist_abc
terms = sud.default_termslist

# DataFrame API
df_results = sud.extract_df(
    df=my_notes_df,                  # columns: note_id, note_text (and optional grid)
    checklist=checklist,
    termslist=terms,
    terms_active="alcohol_terms,opioid_terms",
    parallel=True,                   # enable parallel apply (if pandarallel is installed)
    n_workers=2,
    include_note_text=False,
    exclude_discharge_mentions=True, # default True; set False to disable pruning
)

# File API (CSV/TSV/…)
result = sud.extract(
    in_file="notes.csv",
    out_file="results.csv",
    checklist="path/to/checklist.py",
    separator=",",
    termslist="path/to/termslist.py",
    terms_active="opioid_terms",
    parallel=True,
    n_workers=2,                      
    include_note_text=False,
    exclude_discharge_mentions=False, # keep raw matches even in discharge contexts
)

The packaged default checklist and termslist are exposed as module attributes. In a notebook, evaluating each name displays its contents:

checklist = sud.checklist_abc
checklist

termslist = sud.default_termslist
termslist


License

MIT – see LICENSE for details.

📣 Citation / Acknowledgements

If sudregex is useful in your work, please cite:

Quantitative Nurse Lab. (2025). sudregex (Version 0.1.2). GitHub. https://github.com/quantitativenurse/sud-regex

Acknowledgements: This work was supported, in part, by the National Institute on Drug Abuse under award number DP1DA056667. The content is solely the responsibility of the authors and does not necessarily represent the official views of the US government or the National Institutes of Health.

  • Thanks to all contributors and collaborators for feedback and testing.
