A Python package for extracting electronic health transcripts and classifying them based on human-annotated data.


Project description

pytranscripts

An open-source 👨‍🔧 Python library for automated classification of electronic medical records

Installation

To install the latest version, run:

pip install -U pytranscripts

Pipeline Summary

pipeline image

Stages

Data Extraction
Target Identification
Fine-tuning pretrained models (BERT and ELECTRA) on the annotated data
Extracting interviewer/interviewee records from the specified .docx file storage
Metrics evaluation (accuracy and Cohen's kappa score)
Reordering records as a neatly arranged and flagged spreadsheet, alongside metrics and reports from pretrained models.
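
The two metrics named in the evaluation stage can be computed by hand. The following is a minimal pure-Python sketch for single-label predictions. It is an illustration only, not the package's own evaluation code, and the function names `accuracy` and `cohen_kappa` are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the annotation."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability both sides pick the same label at random.
    p_expected = sum(
        true_counts[label] * pred_counts[label] for label in true_counts
    ) / (n * n)
    if p_expected == 1.0:
        # Both sides used one identical label everywhere; kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example: 1 = label present, 0 = label absent (made-up values).
annotated = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(annotated, predicted))     # 0.75
print(cohen_kappa(annotated, predicted))  # 0.5
```

Kappa is lower than accuracy here because half of the observed agreement would be expected by chance with balanced labels.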

Example Usage

Mount Google Drive (Optional)

If using Google Drive as the data source:

from google.colab import drive
drive.mount('/content/drive')

Automate Data Export

To export and combine all .docx files from a folder into a single file:

from pytranscripts import export_docx_from_folder

# Define input and output paths
INPUT_FOLDER = "/content/drive/MyDrive/Your/Path/To/Dataset/"
OUTPUT_FILE = "output.csv"

# Define labels for structured data
LABELS = [
    'Value equation',
    'Credentialing / Quality Assurance Infrastructure',
    'Financial Impact',
    'Health System Characteristics',
    'Clinical utility & efficiency - Provider perspective',
    'Workflow related problems',
    'Provider Characteristics',
    'Training',
    'Patient/Physician interaction in LUS',
    'Imaging modalities in general',
]

# Export data
export_docx_from_folder(
    input_directory=INPUT_FOLDER,
    output_file=OUTPUT_FILE,
    labels=LABELS
)

This will:

Read all .docx files from INPUT_FOLDER.
Combine their content into a single file.
Apply the defined labels to create a structured dataset.
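
To make the idea of a structured, label-ready dataset concrete, here is a rough standard-library sketch. It assumes transcript lines are prefixed with `Interviewer:` or `Interviewee:`, and the column names (`source_file`, `speaker`, `text`) and helper names are hypothetical. It does not reproduce what `export_docx_from_folder` does internally, and it works on plain text rather than real .docx files.

```python
import csv
import io
import re

# Assumed speaker-prefix format; real transcripts may differ.
TURN_PATTERN = re.compile(r"^(Interviewer|Interviewee)\s*:\s*(.*)$")


def split_turns(source_file, lines):
    """Yield (source_file, speaker, text) for each line with a speaker prefix."""
    for line in lines:
        match = TURN_PATTERN.match(line.strip())
        if match:
            yield source_file, match.group(1), match.group(2)


def build_rows(records, labels):
    """Attach one empty (0) column per label to every record."""
    rows = []
    for source_file, speaker, text in records:
        row = {"source_file": source_file, "speaker": speaker, "text": text}
        row.update({label: 0 for label in labels})
        rows.append(row)
    return rows


def write_csv(rows, labels, handle):
    """Write rows with a fixed column order: metadata first, then labels."""
    fieldnames = ["source_file", "speaker", "text", *labels]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


demo_labels = ["Training", "Financial Impact"]  # shortened label list
demo_lines = [
    "Interviewer: How was the training?",
    "Interviewee: It took about two weeks.",
    "(inaudible)",  # no speaker prefix, so it is skipped
]
buffer = io.StringIO()
rows = build_rows(split_turns("a.docx", demo_lines), demo_labels)
write_csv(rows, demo_labels, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Each label column starts at 0 so that annotators, or a trained model, can later flag the rows where the label applies.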

Requirements

Python 3.6 or later
GPU access recommended for optimal performance (if using Jupyter Notebook)
pytranscripts version 1.2.4 or higher

Model Training

The annotated example below shows how to configure TranscriptTrainer so that training and inference on your own documents are straightforward:

from pytranscripts import TranscriptTrainer


trainer = TranscriptTrainer(
    input_file='/content/drive/MyDrive/Kalu+Deola/OLD NLP/CompletedMerged.xlsx',  # Path to the CSV / XLSX file containing the tagged documents. This is the main data source for training and evaluation.

    destination_path='/content/',  # Directory where all training results, models, and logs will be saved. A Colab path is used here for convenience.

    text_column='full_quote',  # Specifies the column name in the CSV file that contains the text data to be used for training.

    test_size=0.2,  # Determines the fraction of the data that will be used for testing the model, instead of training it. Here, 20% of data will be used for testing.

    max_length=512,  # Maximum number of tokens per input sequence. Longer sequences are truncated, which helps manage memory and compute.

    num_train_epochs=1, # The number of times the model will iterate over the entire training dataset during training. More epochs will mean more training.

    learning_rate_distilbert=2e-5, # Learning rate for the DistilBERT model. This controls the step size during model optimization, lower values mean smaller updates to the model.

    learning_rate_electra=3e-5,  # Learning rate for the Electra model.  This controls the step size during model optimization, lower values mean smaller updates to the model.

    labels=[ # A list of labels used for the multi-label classification task. Each label corresponds to a category the model will try to identify in the text.
            'Value equation',
            'Credentialing / Quality Assurance Infrastructure',
            'Financial Impact',
            'Health System Characteristics',
            'Clinical utility & efficiency-Provider perspective',
            'Workflow related problems',
            'Provider Characteristics',
            'Training',
            'Patient/Physician interaction in LUS',
            'Imaging modalities in general'
    ],  # Make sure this list exactly matches the labels used in your input file.


    upper_lower_mapping = { # Dictionary for mapping high level categories to lower level categories
        "multi_level_org_char": [ #High level category name
            "Provider Characteristics", #lower level category names
            "Health System Characteristics" #lower level category names
        ],
        "multi_level_org_perspect": [ #High level category name
            "Imaging modalities in general", #lower level category names
            'Value equation', #lower level category names
            "Clinical utility & efficiency-Provider perspective", #lower level category names
            "Patient/Physician interaction in LUS", #lower level category names
            'Workflow related problems' #lower level category names
        ],
        "impl_sust_infra": [ #High level category name
            "Training",  #lower level category names
            'Credentialing / Quality Assurance Infrastructure', #lower level category names
            "Financial Impact"  #lower level category names
        ]
    }
)
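
Because the labels must match exactly between the label list, the mapping, and the input file, a quick consistency check before training can save a failed run. The sketch below is a hypothetical helper, not part of pytranscripts. It inverts a mapping shaped like `upper_lower_mapping` and reports labels that appear on only one side. The shortened example data deliberately includes a spacing difference around the hyphen to show what such a mismatch looks like.

```python
def invert_mapping(upper_lower_mapping):
    """Return {lower_label: upper_category}; reject labels listed twice."""
    lower_to_upper = {}
    for upper, lowers in upper_lower_mapping.items():
        for lower in lowers:
            if lower in lower_to_upper:
                raise ValueError(f"{lower!r} is mapped more than once")
            lower_to_upper[lower] = upper
    return lower_to_upper


def compare_labels(labels, upper_lower_mapping):
    """Return (labels missing from the mapping, mapped labels not in labels)."""
    mapped = set(invert_mapping(upper_lower_mapping))
    missing = sorted(set(labels) - mapped)
    unknown = sorted(mapped - set(labels))
    return missing, unknown


# Shortened example data; note the different spacing around the hyphen.
labels = [
    "Training",
    "Financial Impact",
    "Clinical utility & efficiency - Provider perspective",
]
mapping = {
    "impl_sust_infra": ["Training", "Financial Impact"],
    "multi_level_org_perspect": [
        "Clinical utility & efficiency-Provider perspective",
    ],
}

missing, unknown = compare_labels(labels, mapping)
print("In labels but not mapped:", missing)
print("Mapped but not in labels:", unknown)
```

Both lists should be empty before the same `labels` and mapping are passed to the trainer.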

Contributing

We welcome contributions! Please follow the contributing guidelines.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Project details


License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history

1.5.1

May 12, 2025

1.5.0

May 12, 2025

1.4.2

Mar 7, 2025

1.4.1

Feb 12, 2025

1.4.0

Feb 12, 2025

1.3.9

Feb 12, 2025

1.3.8

Feb 11, 2025

1.3.7

Feb 11, 2025

1.3.6

Feb 11, 2025

1.3.5

Feb 11, 2025

1.3.4

Feb 11, 2025

1.3.3

Feb 11, 2025

1.3.1

Feb 11, 2025

1.3.0 (this version)

Feb 10, 2025

1.2.10

Feb 9, 2025

1.2.9

Feb 8, 2025

1.2.8

Feb 8, 2025

1.2.7

Feb 8, 2025

1.2.6

Jan 20, 2025

1.2.5

Jan 20, 2025

1.2.4

Jan 10, 2025

1.2.3

Jan 10, 2025

1.2.1

Jan 9, 2025

1.2.0

Jan 3, 2025

1.1.0

Jan 3, 2025

1.0.0

Dec 1, 2024

0.2.5

Dec 1, 2024

0.2.4

Dec 1, 2024

0.2.3

Nov 19, 2024

0.2.2

Oct 6, 2024

0.2.1

Oct 6, 2024

0.2.0

Oct 6, 2024

0.1.0

Oct 5, 2024

Download files


Source Distribution

pytranscripts-1.3.0.tar.gz (15.4 kB)

Uploaded Feb 10, 2025 Source

Built Distribution

pytranscripts-1.3.0-py3-none-any.whl (16.9 kB)

Uploaded Feb 10, 2025 Python 3

File details

Details for the file pytranscripts-1.3.0.tar.gz.

File metadata

Download URL: pytranscripts-1.3.0.tar.gz
Upload date: Feb 10, 2025
Size: 15.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.3

File hashes

Hashes for pytranscripts-1.3.0.tar.gz
Algorithm	Hash digest
SHA256	`e8eea69c753838b6be49892a95f2a79069fd877764e74e06163da859607117a0`
MD5	`2705b4267526517aa732a18d04b0d055`
BLAKE2b-256	`147cce24951dd1cb35081769923d127910df58ca5a5fe28d3abbaeab0f48062b`


File details

Details for the file pytranscripts-1.3.0-py3-none-any.whl.

File metadata

Download URL: pytranscripts-1.3.0-py3-none-any.whl
Upload date: Feb 10, 2025
Size: 16.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.3

File hashes

Hashes for pytranscripts-1.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`108c1417c55b10ccd2cb466d152d397628a4aaec3d5e74feccb4250252814843`
MD5	`445a9a7f30448f71076f38260d18082d`
BLAKE2b-256	`063ca10f40648d7dac90d718c7702bf22cb895536248f3383ce9b2d42c992051`

