Bloatectomy: a method for the identification and removal of duplicate text in the bloated notes of electronic health records and other documents.
Bloatectomy
Takes in a list of notes, a single file (.docx, .txt, .rtf, etc.), or a single string, and marks duplicate text. It outputs the marked text and, optionally, the tokens.
Requirements
- Python>=3.7.x (in order for the regular expressions to work correctly)
- re
- sys
- pandas (optional, only necessary if using MIMIC III data)
- docx (optional, only necessary if input or output is a word/docx file)
Installation
using anaconda or miniconda
conda install -c summerkrankin bloatectomy
using pip via PyPI
make sure to install it to python3 if your default is python2
python3 -m pip install bloatectomy
using pip via github
python3 -m pip install git+https://github.com/MIT-LCP/bloatectomy
manual install by cloning the repository
git clone https://github.com/MIT-LCP/bloatectomy
cd bloatectomy
python3 setup.py install
Examples
To run bloatectomy on a sample string with the following options:
- highlighting duplicates
- display raw results
- output file as html
- output file of numbered tokens:
from bloatectomy import bloatectomy
text = '''Assessment and Plan
61 yo male Hep C cirrhosis
Abd pain:
-other labs: PT / PTT / INR:16.6// 1.5, CK / CKMB /
ICU Care
-other labs: PT / PTT / INR:16.6// 1.5, CK / CKMB /
Assessment and Plan
'''
bloatectomy(text, style='highlight', display=True, filename='sample_txt_highlight_output', output='html', output_numbered_tokens=True)
To use with example text or load ipynb examples, download the repository or just the bloatectomy_examples folder
cd bloatectomy_examples
from bloatectomy import bloatectomy
bloatectomy('./input/sample_text.txt',
            style='highlight', display=False,
            filename='./output/sample_txt_highlight_output',
            output='html',
            output_numbered_tokens=True,
            output_original_tokens=True)
Documentation
The paper is located at TBA
class bloatectomy(input_text,
path = '',
filename='bloatectomized_file',
display=False,
style='highlight',
output='html',
output_numbered_tokens=False,
output_original_tokens=False,
regex1=r"(.+?\.[\s\n]+)",
regex2=r"(?=\n\s*[A-Z1-9#-]+.*)",
postgres_engine=None,
postgres_table=None)
Parameters
input_text: file, str, list
An input document (.txt, .rtf, .docx), a string of text, or a list containing either hadm_ids for the postgres MIMIC III database or the raw text of notes.
style: str, optional, default=highlight
Method for denoting a duplicate. The following are allowed: highlight, bold, remov.
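To illustrate what the three style values mean, here is a minimal sketch of marking repeated tokens. It is not the package's implementation: the helper name mark_duplicates_sketch, the exact-match comparison, and the HTML tags used for highlight and bold are all assumptions made for illustration.

```python
def mark_duplicates_sketch(tokens, style="highlight"):
    """Keep the first occurrence of each token; mark or drop later repeats.

    Hypothetical helper for illustration only (not part of bloatectomy).
    """
    if style not in ("highlight", "bold", "remov"):
        raise ValueError("style must be 'highlight', 'bold', or 'remov'")

    seen = set()
    marked = []
    for token in tokens:
        key = token.strip()  # compare tokens ignoring surrounding whitespace
        if key not in seen:
            seen.add(key)
            marked.append(token)  # first occurrence is kept unchanged
        elif style == "highlight":
            marked.append("<mark>" + token + "</mark>")  # assumed markup
        elif style == "bold":
            marked.append("<b>" + token + "</b>")  # assumed markup
        # style == "remov": the repeated token is simply dropped
    return marked


tokens = ["Assessment and Plan", "61 yo male Hep C cirrhosis", "Assessment and Plan"]
print(mark_duplicates_sketch(tokens, style="highlight"))
print(mark_duplicates_sketch(tokens, style="remov"))
```

With highlight or bold the duplicate stays in the text but is visually flagged; with remov it is removed from the output.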
filename: str, optional, default=bloatectomized_file
A string to name output file of the bloat-ectomized document.
path: str, optional, default=''
The directory for output files.
output_numbered_tokens: bool, optional, default=False
If set to True, a .txt file with each token enumerated and marked for duplication is output as [filename]_token_numbers.txt. This is useful when diagnosing your own regular expression for tokenization or testing the remov option for style.
output_original_tokens: bool, optional, default=False
If set to True, a .txt file with each original (non-marked) token enumerated but not marked for duplication is output as [filename]_original_token_numbers.txt.
display: bool, optional, default=False
If set to True, the bloatectomized text will display in the console on completion.
regex1: str, optional, default=r"(.+?\.[\s\n]+)"
The regular expression for the first tokenization. Split on a period (.) followed by one or more white space characters (space, tab, line breaks) or a line feed character (\n). This can be replaced with any valid regular expression to change the way tokens are created.
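As a rough illustration of what the default regex1 does, the sketch below applies it with Python's re.split. The helper name first_pass_tokens is hypothetical, and how the package itself calls the regular expression (including any flags) is an assumption; only the pattern is taken from the signature above.

```python
import re

# Default value of regex1, copied from the class signature.
REGEX1 = r"(.+?\.[\s\n]+)"


def first_pass_tokens(text, pattern=REGEX1):
    """Split text into sentence-like tokens (hypothetical helper).

    Because the pattern has a capture group, re.split returns the matched
    sentences themselves, interleaved with empty strings that we drop.
    """
    return [piece for piece in re.split(pattern, text) if piece]


sentences = first_pass_tokens("Pt is stable. No fever. Continue meds")
print(sentences)  # ['Pt is stable. ', 'No fever. ', 'Continue meds']
```

Each token keeps its trailing period and whitespace, and text after the last period is returned as a final token.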
regex2: str, optional, default=r"(?=\n\s*[A-Z1-9#-]+.*)"
The regular expression for the second tokenization. Split on any newline character (\n) followed, after optional white space, by an uppercase letter, a digit from 1 to 9, a hash (#), or a dash. This can be replaced with any valid regular expression to change how sub-tokens are created.
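The effect of the default regex2 can be seen by applying it directly with re.split, as in the sketch below. The sample note is made up for illustration, and whether the package applies the pattern in exactly this way is an assumption; only the pattern comes from the signature above.

```python
import re

# Default value of regex2, copied from the class signature.
REGEX2 = r"(?=\n\s*[A-Z1-9#-]+.*)"

note = (
    "Assessment and Plan\n"
    "61 yo male\n"
    "-other labs: PT / PTT\n"
    "continued on next line\n"
    "ICU Care"
)

# The pattern is a zero-width lookahead, so nothing is consumed:
# each sub-token after the first starts with its own newline.
sub_tokens = re.split(REGEX2, note)
for number, token in enumerate(sub_tokens):
    print(number, repr(token))
```

The line starting with a lowercase letter is not a split point, so it stays attached to the preceding sub-token. Splitting on a zero-width match like this relies on re.split handling empty matches, which Python does from version 3.7 onward.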
postgres_engine: str, optional
The postgres connection. Only relevant for use with the MIMIC III dataset. When data is pulled from postgres, the hadm_id of the file will be appended to the filename if set, or to the default bloatectomized_file. See the Jupyter notebook mimic_bloatectomy_example for the example code.
postgres_table: str, optional
The name of the postgres table containing the concatenated notes. Only relevant for use with the MIMIC III dataset. When data is pulled from postgres, the hadm_id of the file will be appended to the filename if set, or to the default bloatectomized_file. See the Jupyter notebook mimic_bloatectomy_example for the example code.