Regex-driven extraction with negation for clinical text (SUD-focused).

sudregex

Version: 0.1.6

A lightweight, high-throughput pipeline for regex-driven extraction with negation and false-positive pruning. It was developed for Substance Use Disorder (SUD) research, but the core extraction workflow is flexible enough for broader clinical text mining use cases.

✨ Features

Unified gating utilities for substance context, negation, common false-positive pruning, and discharge-context filtering
Configurable negation scope with left (default), right, or both
Substance-context gating to require matches near a user-supplied vocabulary
Deterministic, gated previews that only show rows passing all configured gates
Notebook-friendly preview output via previews_df
Line-break normalization with whitespace cleanup
Packaged defaults including an ABC checklist and default term lists
CLI and Python APIs for shell workflows and notebook use
Multiple parallel backends with support for pandarallel and loky
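
To make the negation-scope options concrete, here is a minimal sketch of a windowed negation check. It is not sudregex's implementation: the function name is_negated, the cue list NEGATION_CUES, and the 20-character window are all assumptions chosen for illustration.

```python
import re

# Hypothetical cue list and window size; sudregex's actual defaults may differ.
NEGATION_CUES = re.compile(
    r"\b(?:no|denies|denied|without|negative for)\b", re.IGNORECASE
)
WINDOW = 20  # characters inspected on each side of a match


def is_negated(text: str, start: int, end: int, scope: str = "left") -> bool:
    """Return True if a negation cue lies within WINDOW characters of text[start:end]."""
    if scope not in {"left", "right", "both"}:
        raise ValueError(f"unknown scope: {scope!r}")
    # Only look at the side(s) of the match that the scope asks for.
    left = text[max(0, start - WINDOW):start] if scope in {"left", "both"} else ""
    right = text[end:end + WINDOW] if scope in {"right", "both"} else ""
    return bool(NEGATION_CUES.search(left) or NEGATION_CUES.search(right))


note = "Patient denies alcohol use. Reports cocaine use, no withdrawal."
alcohol = re.search(r"alcohol", note)
cocaine = re.search(r"cocaine", note)

print(is_negated(note, *alcohol.span(), scope="left"))   # cue "denies" precedes the match
print(is_negated(note, *cocaine.span(), scope="left"))   # no cue to the left
print(is_negated(note, *cocaine.span(), scope="right"))  # "no withdrawal" follows the match
```

In this toy note, a right-hand scope marks the cocaine mention as negated because of the unrelated phrase "no withdrawal". That is one reason a left-only scope can be a sensible default for this kind of check.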

📦 Installation

From PyPI

pip install sud-regex

From source

git clone https://github.com/quantitativenurse/sud-regex.git
cd sud-regex
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e ".[dev]"

This installs sudregex along with development tools such as black, isort, flake8, and pytest.

Windows setup

git clone https://github.com/quantitativenurse/sud-regex.git
cd sud-regex
python -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install -U pip
pip install -e .[dev]

Identifier columns

Your input data does not need to follow OMOP naming conventions. You can map your own identifiers with:

--person-column
--note-id-column

You can also pass extra identifier columns through the pipeline when needed.

Usage

For interactive notebook usage, see the tutorial:

sudregex_tutorial_notebook.ipynb

Quick Start (CLI)

Show help:

sudregex --help

Run extraction on a comma-delimited file:

macOS / Linux

sudregex --extract \
  --in_file path/to/notes.csv \
  --out_file path/to/results.csv \
  --checklist path/to/checklist.py \
  --termslist path/to/termslist.py \
  --terms_active alcohol_terms,opioid_terms \
  --separator , \
  --parallel \
  --n-workers 2 \
  --negation-scope left \
  --exclude-discharge-mentions

Windows PowerShell

sudregex --extract `
  --in_file path/to/notes.csv `
  --out_file path/to/results.csv `
  --checklist path/to/checklist.py `
  --termslist path/to/termslist.py `
  --terms_active alcohol_terms,opioid_terms `
  --separator , `
  --parallel `
  --n-workers 2 `
  --negation-scope left `
  --exclude-discharge-mentions

Parallel backends

sudregex supports two parallel backends:

pandarallel
loky

Example with pandarallel:

sudregex --extract \
  --in_file path/to/notes.csv \
  --out_file path/to/results.csv \
  --checklist path/to/checklist.py \
  --termslist path/to/termslist.py \
  --terms_active opioid_terms \
  --parallel \
  --parallel-backend pandarallel \
  --n-workers 4

Example with loky:

sudregex --extract \
  --in_file path/to/notes.csv \
  --out_file path/to/results.csv \
  --checklist path/to/checklist.py \
  --termslist path/to/termslist.py \
  --terms_active opioid_terms \
  --parallel \
  --parallel-backend loky \
  --n-workers 4

Files without headers

If your input file does not contain a header row, use --no-header and provide column names in file order:

macOS / Linux

sudregex --extract \
  --in_file path/to/notes.txt \
  --out_file path/to/results.csv \
  --checklist path/to/checklist.py \
  --termslist path/to/termslist.py \
  --terms_active opioid_terms \
  --separator $'\t!\\^!\t' \
  --no-header \
  --columns patient_id,note_id,note_text

Windows PowerShell

Because PowerShell does not support Bash ANSI-C quoting, pass the escaped regex directly:

sudregex --extract `
  --in_file path/to/notes.txt `
  --out_file path/to/results.csv `
  --checklist path/to/checklist.py `
  --termslist path/to/termslist.py `
  --terms_active opioid_terms `
  --separator '\t!\^!\t' `
  --no-header `
  --columns patient_id,note_id,note_text
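
For orientation, --no-header and --columns resemble a familiar pandas pattern for headerless files. The sketch below shows that pattern with plain pandas. Whether sudregex calls read_csv in exactly this way is an assumption, and the two sample rows are made up.

```python
import io

import pandas as pd

# Two made-up rows using the tab-delimited marker \t!^!\t, with no header row.
raw = (
    "p1\t!^!\tn1\t!^!\tDenies alcohol use, drinks coffee.\n"
    "p2\t!^!\tn2\t!^!\tOxycodone 5 mg, taken as prescribed.\n"
)

notes = pd.read_csv(
    io.StringIO(raw),
    sep=r"\t!\^!\t",   # multi-character separators are parsed as regular expressions
    engine="python",   # the regex-capable parser
    header=None,       # the file has no header row
    names=["patient_id", "note_id", "note_text"],  # column names in file order
)

print(notes)
```

The commas inside the note text stay within a single field because only the full marker is treated as a delimiter.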

Discharge-instruction pruning

By default, sudregex excludes matches found in discharge-instruction contexts.

To request the default behavior explicitly:

sudregex --extract ... --exclude-discharge-mentions

To keep discharge-context hits:

sudregex --extract \
  --in_file path/to/notes.csv \
  --out_file path/to/results_raw.csv \
  --checklist path/to/checklist.py \
  --termslist path/to/termslist.py \
  --terms_active alcohol_terms \
  --include-discharge-mentions
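
As a rough illustration of why discharge-context hits are pruned by default, the sketch below separates regex hits that occur before a discharge-instructions header from those that occur after it. The header pattern and the function split_hits_by_discharge are hypothetical; sudregex's actual discharge-context rules may differ.

```python
import re

# Hypothetical header pattern; not taken from sudregex.
DISCHARGE_HEADER = re.compile(r"discharge instructions?\s*:", re.IGNORECASE)


def split_hits_by_discharge(text: str, pattern: str) -> tuple[list[str], list[str]]:
    """Split regex hits into (kept, pruned) by position relative to the header."""
    header = DISCHARGE_HEADER.search(text)
    # Without a header, nothing is pruned.
    cutoff = header.start() if header else len(text)
    kept, pruned = [], []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        (kept if m.start() < cutoff else pruned).append(m.group(0))
    return kept, pruned


note = (
    "HPI: drinks alcohol nightly. "
    "Discharge instructions: avoid alcohol while taking this medication."
)
kept, pruned = split_hits_by_discharge(note, r"alcohol")
print("kept:", kept)
print("pruned:", pruned)
```

The second mention is generic advice rather than a statement about the patient's use, which is the kind of hit that --exclude-discharge-mentions is meant to drop.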

Custom separators

Clinical notes often contain commas, tabs, semicolons, and other punctuation as part of normal text. For text-based input files, using a custom delimiter can make parsing more reliable.

A custom separator is useful when:

the note text contains commas or tabs
standard delimiters create parsing ambiguity
you want a delimiter that is unlikely to appear in clinical text

Example using a custom token:

sudregex --extract \
  --in_file path/to/notes.txt \
  --out_file path/to/results.csv \
  --checklist path/to/checklist.py \
  --termslist path/to/termslist.py \
  --terms_active opioid_terms \
  --separator '\|\^\|'

If your separator contains regex-special characters, remember that pandas.read_csv treats any separator longer than one character as a regular expression and parses it with the Python engine. Escape those characters accordingly.

For tab-delimited custom markers such as \t!^!\t, use:

macOS / Linux

--separator $'\t!\\^!\t'

Windows PowerShell

--separator '\t!\^!\t'
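
If you are unsure how to escape a marker by hand, Python's re.escape can generate the escaped pattern. This is a standalone sketch using only the standard library, not part of sudregex; the marker |^| and the sample line are illustrative.

```python
import re

marker = "|^|"               # literal delimiter as it appears in the file
pattern = re.escape(marker)  # regex that matches the marker literally

print(pattern)

# Splitting a sample line confirms the pattern treats the marker as plain text.
line = "p1|^|n1|^|Pt reports 2-3 drinks/night, denies opioid use"
fields = re.split(pattern, line)
print(fields)
```

The printed pattern is the value to pass as the separator. Inside single quotes, both Bash and PowerShell pass its backslashes through unchanged.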

Quick Start (Python API)

import sudregex as sud

# Packaged defaults
checklist = sud.checklist_abc
terms = sud.default_termslist

# In-memory DataFrame API
result_df, previews_df = sud.extract_df(
    df=my_notes_df,                          # requires note text and note identifier columns
    checklist=checklist,                    # dict or path to checklist.py
    termslist=terms,                        # dict, module, or path to termslist.py
    terms_active="alcohol_terms,opioid_terms",
    person_column="patient_sid",            # optional person identifier
    id_column="doc_oid",                    # optional note/document identifier
    include_note_text=True,
    exclude_discharge_mentions=True,
    preview_count=10,
    preview_span=120,
    negation_scope="left",
    parallel=True,
    parallel_backend="loky",
    n_workers=4,
    debug=False,
    return_previews_df=True,
)

# Preview columns:
# item_key, note_id, span_start, span_end, snippet, snippet_marked
print(previews_df.head())
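
The extract_df call above expects an existing DataFrame named my_notes_df. A minimal sketch of such an input follows. The identifier column names match the call above, but note_text as the name of the text column is an assumption, and the rows are invented.

```python
import pandas as pd

# Hypothetical input frame. "patient_sid" and "doc_oid" mirror the call above;
# "note_text" as the text column name is an assumption.
my_notes_df = pd.DataFrame(
    {
        "patient_sid": [101, 101, 202],
        "doc_oid": ["a1", "a2", "b1"],
        "note_text": [
            "Pt denies alcohol use.",
            "Hx of opioid use disorder, on buprenorphine.",
            "No complaints today.",
        ],
    }
)

# Basic sanity checks before extraction: unique note identifiers, no missing text.
assert my_notes_df["doc_oid"].is_unique
assert my_notes_df["note_text"].notna().all()
print(my_notes_df.dtypes)
```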

Example of filtering previews for a single checklist item:

previews_df.query("item_key == 'cocaine_mention'")[["note_id", "snippet_marked"]].head(10)

Example of joining one preview per note back to the main result:

one_preview = (
    previews_df.groupby("note_id").first().reset_index()[["note_id", "snippet_marked"]]
)
result_with_preview = result_df.merge(one_preview, on="note_id", how="left")

File-based API

import sudregex as sud

sud.extract(
    in_file="notes.csv",
    out_file="results.csv",
    checklist="path/to/checklist.py",
    separator=",",
    termslist="path/to/termslist.py",
    terms_active="opioid_terms",
    include_note_text=False,
    exclude_discharge_mentions=False,
    preview_count=10,
    preview_file="note_previews.txt",
    preview_csv="previews.csv",
    negation_scope="both",
    parallel=True,
    parallel_backend="pandarallel",
    n_workers=4,
)

Packaged defaults

The package includes a default checklist and default term lists:

import sudregex as sud

checklist = sud.checklist_abc
termslist = sud.default_termslist

Output naming behavior

When using extract() with chunked input:

if exactly one result batch is produced, output is written to the requested out_file
if multiple batches are produced, output is written as numbered part files such as:

results_part_0.csv
results_part_1.csv
results_part_2.csv
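
Because the output may be one file or several numbered parts, downstream code may need to handle both cases. The helper below is a sketch using only the standard library. The name collect_outputs is hypothetical, and the part-file pattern is inferred from the example names listed above.

```python
import re
from pathlib import Path


def collect_outputs(out_file: str) -> list[Path]:
    """Return the result files for a run, whether single or split into parts."""
    out_path = Path(out_file)
    if out_path.exists():
        return [out_path]
    if not out_path.parent.is_dir():
        return []
    # Part files are assumed to be named <stem>_part_<n><suffix>.
    part_re = re.compile(
        rf"^{re.escape(out_path.stem)}_part_(\d+){re.escape(out_path.suffix)}$"
    )
    parts = []
    for candidate in out_path.parent.iterdir():
        m = part_re.match(candidate.name)
        if m:
            parts.append((int(m.group(1)), candidate))
    # Sort numerically so that part 10 follows part 9 rather than part 1.
    return [path for _, path in sorted(parts, key=lambda item: item[0])]


print(collect_outputs("path/to/results.csv"))
```

The returned paths can then be read one by one and concatenated, for example with pandas.concat.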

Changelog highlights

0.1.6

Added support for multiple parallel backends
Added loky backend for cross-platform parallel execution
Preserved identical output across serial, Pandarallel, and Loky workflows
Improved input handling for headerless files and custom separators

0.1.5

Unified gating utilities for substance, negation, common false positives, and discharge filtering
Added negation_scope with left, right, and both
Added in-memory preview support with extract_df(..., return_previews_df=True)
Added highlighted preview output via snippet_marked
Improved dtype normalization and error handling

License

MIT — see LICENSE for details.

📣 Citation / Acknowledgements

If sudregex is useful in your work, please cite:

Quantitative Nurse Lab. (2025). sudregex (Version 0.1.6). GitHub. https://github.com/quantitativenurse/sud-regex

Acknowledgements

This work was supported, in part, by the National Institute on Drug Abuse under award number DP1DA056667. The content is solely the responsibility of the authors and does not necessarily represent the official views of the U.S. Government or the National Institutes of Health.

Thanks to all contributors and collaborators for feedback and testing.

Release history

0.1.6 (this version): Mar 12, 2026
0.1.5: Nov 25, 2025
0.1.4: Oct 27, 2025
0.1.3: Oct 27, 2025
0.1.2: Sep 25, 2025
0.1.1: Sep 25, 2025
0.1.0: Sep 25, 2025

Download files

Source distribution: sudregex-0.1.6.tar.gz (36.9 kB), uploaded Mar 12, 2026
Built distribution: sudregex-0.1.6-py3-none-any.whl (34.7 kB), uploaded Mar 12, 2026, Python 3

File details

Details for the file sudregex-0.1.6.tar.gz.

File metadata

Download URL: sudregex-0.1.6.tar.gz
Upload date: Mar 12, 2026
Size: 36.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.4

File hashes

Hashes for sudregex-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`6768e5e0d85f845c62018ecba979649b900625e92d775b82a93ac2600cce53d8`
MD5	`b676f1c7770b8210493fdb58dd5b2150`
BLAKE2b-256	`66402f1622281fe6f20544309459d2d779705eb487f421a935a67505cb850240`

File details

Details for the file sudregex-0.1.6-py3-none-any.whl.

File metadata

Download URL: sudregex-0.1.6-py3-none-any.whl
Upload date: Mar 12, 2026
Size: 34.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.4

File hashes

Hashes for sudregex-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d6734dbdaff2f8fc5b1561175630645817f2aae9ee48d2d19ea322dda491ed9a`
MD5	`54b042ec5accb32b69b222f31c6b0d61`
BLAKE2b-256	`742389ffd8435b45e35478ff2862757ff472a14da2b148ae7193d7d88ced8905`
