CRATE: clinical records anonymisation and text extraction
# CRATE
**Clinical Records Anonymisation and Text Extraction (CRATE)**
## Purpose
- Anonymises relational databases.
- Operates a GATE natural language processing (NLP) pipeline.
- Includes a tool to audit all MySQL queries (with user details) via a TCP
proxy.
- Provides a web app for
    - querying the anonymised database
    - managing a consent-to-contact process
## Directory structure with key files
- `anonymise/`
    - **`anonymise.py`** – core program
    - `launch_makedata.sh` – launcher for `make_demo_database.py`
    - `launch_multiprocess_anonymiser.sh` – parallel-processing
      (multiprocess) launcher for `anonymise.py`
    - `make_demo_database.py` – creates a demonstration database
    - `test_anonymisation.py` – generates a comparison of records between
      source and destination databases, to check anonymisation
- `bug_reports/` – relating to bugs in others' code
- `built_packages/` – workspace to store new Debian package files
- **`crateweb/`** – Django web application, as above
- `ditched/` – ignored
- **`docs/`** – documentation
- `mysql_auditor/` – auditing tool for MySQL
    - `mysql_auditor.conf` – sample configuration file; edit for your own
      needs
    - `mysql_auditor.sh` – launcher for mysql-proxy with the auditing script.
      It starts mysql-proxy, which communicates with MySQL on port A and
      makes what appears to be another MySQL instance available on port B,
      with the script inserted in between. If requested, it also stores the
      script's stdout/stderr output in a log file on disk.
    - `query_auditor_mysqlproxy.lua` – Lua script that implements the
      auditor; it is run by the external mysql-proxy tool and writes its
      output to stdout/stderr
- `nlp_manager/` – NLP interface tool
    - `buildjava.sh` – script to compile the necessary Java source on your
      machine
    - `CamAnonGatePipeline.java` – Java code to interface between
      `nlp_manager.py` (via stdin/stdout) and the Java-based external GATE
      tools (via code); must be compiled before use
    - `launch_multiprocess_nlp.sh` – parallel-processing (multiprocess)
      launcher for `nlp_manager.py`
    - `nlp_manager.py` – core program to pipe parts of a database to a GATE
      program and insert the output back into a database; uses
      `CamAnonGatePipeline.java` to communicate with the NLP app
    - `runjavademo.sh` – directly executes CamAnonGatePipeline using the
      ANNIE demo GATE app, for testing
- `pythonlib/` – common RNC Python libraries (a Git subtree)
- `tools/`
    - **`install_virtualenv.sh`** – creates a suitable virtualenv for CRATE
    - ...
- `working/` – ignored
- `changelog.Debian` – Debian package changelog and general version history
- `LICENCE` – Apache license applicable to CRATE
- `README.md` – this file
- `requirements.txt` – Python pip requirements
- `requirements-ubuntu.txt` – Ubuntu/Debian package requirements
- `VERSION.txt` – package version number, read by package build script
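
The stdin/stdout link between `nlp_manager.py` and the compiled Java pipeline, described above, follows a common parent/child process pattern. The sketch below illustrates that pattern only; it is not CRATE's actual code or protocol. The child here is a stand-in Python one-liner that upper-cases each line, and the names `CHILD_CODE` and `run_through_child` are hypothetical. It also assumes one line of input yields exactly one line of output, which the real pipeline may not do.

```python
import subprocess
import sys

# Stand-in for the external NLP program (hypothetical): it reads lines from
# stdin and writes one transformed line to stdout for each input line.
CHILD_CODE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n"
)


def run_through_child(texts):
    """Send each single-line text to the child process; collect its replies."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    results = []
    try:
        for text in texts:
            # Write one request line, then flush so the child sees it now.
            proc.stdin.write(text + "\n")
            proc.stdin.flush()
            # Block until the child has written its one-line reply.
            results.append(proc.stdout.readline().rstrip("\n"))
    finally:
        # Closing stdin signals end-of-input, so the child can exit.
        proc.stdin.close()
        proc.stdout.close()
        proc.wait()
    return results


if __name__ == "__main__":
    print(run_through_child(["first note", "second note"]))
```

Keeping the child process alive across many inputs, as here, avoids paying its start-up cost for every record; a real protocol would also need a way to delimit multi-line text and multi-line results.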
## Copyright/licensing
- CRATE: copyright © 2015 Rudolf Cardinal (rudolf@pobox.com).
- Licensed under the Apache License, version 2.0: see the `LICENCE` file.
- Third-party code/libraries included:
    - aspects of `CamAnonGatePipeline.java` are based on demonstration GATE
      code, copyright © University of Sheffield, and licensed under the GNU
      LGPL (which license is therefore used for
      `nlp_manager/CamAnonGatePipeline.java`; q.v.).