Secure Infrastructure for Research with Administrative Data (SIRAD)
sirad is an integration framework for data from administrative systems. It
deidentifies administrative data by removing and replacing personally
identifiable information (PII) with a global anonymized identifier, allowing
researchers to securely join data on an individual from multiple tables without
knowing the individual's identity. It is developed by
Research Improving People's Lives (RIPL).
For a worked example and further details, please see sirad-example.
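To make the idea of a global anonymized identifier concrete, here is a minimal sketch in Python. It is not sirad's implementation: the function names, the HMAC-SHA256 keyed hash, the normalization rules, and the sample records are all assumptions chosen for illustration. It shows how two tables that share PII fields can still be joined on a derived identifier after the PII columns are dropped.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would be kept out of source control.
SECRET_SALT = b"example-secret-salt"

# Hypothetical PII column names used by the sample records below.
PII_FIELDS = ("first", "last", "dob")


def anonymized_id(row, salt=SECRET_SALT):
    """Derive a stable pseudonymous identifier from a row's PII fields."""
    # Normalize so trivial formatting differences map to the same key.
    key = "|".join(row[field].strip().lower() for field in PII_FIELDS)
    return hmac.new(salt, key.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def deidentify(rows):
    """Replace the PII fields of each row with a single anonymized identifier."""
    out = []
    for row in rows:
        clean = {k: v for k, v in row.items() if k not in PII_FIELDS}
        clean["anon_id"] = anonymized_id(row)
        out.append(clean)
    return out


# Made-up sample records standing in for two administrative tables.
tax = [{"first": "Ada", "last": "Lovelace", "dob": "1815-12-10", "agi": 52000}]
credit = [{"first": "ada ", "last": "LOVELACE", "dob": "1815-12-10", "score": 710}]

tax_data = deidentify(tax)
credit_data = deidentify(credit)

# Join on the anonymized identifier without ever seeing names or birth dates.
scores = {row["anon_id"]: row["score"] for row in credit_data}
joined = [dict(row, score=scores.get(row["anon_id"])) for row in tax_data]
```

In this sketch the researcher only ever handles `tax_data`, `credit_data`, and `joined`, none of which contain the original PII columns.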
To learn more about the motivation for creating this package and its potential uses, please see our article in Communications of the ACM:
J.S. Hastings, M. Howison, T. Lawless, J. Ucles, P. White. (2019). Unlocking Data to Improve Public Policy. Communications of the ACM 62(10): 48-53. doi:10.1145/3335150
Installation
Requires Python 3.7 or later.
To install from PyPI using pip:
pip install sirad
To install using Anaconda Python:
conda install -c ripl-org sirad
To install a development version from the current directory:
pip install -e .
Running
There is a single command line script included, sirad.
sirad supports the following arguments:
- process - split raw data files into data and PII files
- research - create a versioned set of research files with a unique anonymous identifier
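A minimal sketch of driving both steps from a Python script follows. It relies only on the two arguments listed above; the helper names `sirad_command` and `run_pipeline` are hypothetical, and the `dry_run` flag belongs to this sketch, not to sirad. It also assumes process runs before research, which the descriptions above suggest but do not state.

```python
import shutil
import subprocess

STEPS = ("process", "research")  # the two arguments listed above


def sirad_command(step):
    """Build the argument list for one sirad step."""
    if step not in STEPS:
        raise ValueError(f"unknown sirad step: {step!r}")
    return ["sirad", step]


def run_pipeline(dry_run=True):
    """Run each step in order; with dry_run=True, only return the commands."""
    commands = [sirad_command(step) for step in STEPS]
    if not dry_run:
        if shutil.which("sirad") is None:
            raise RuntimeError("sirad is not installed or not on PATH")
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failing step
    return commands
```

Calling `run_pipeline()` returns the commands without executing anything; `run_pipeline(dry_run=False)` executes them in the current working directory.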
Configuration
To set configuration options, create a file called sirad_config.py and place
it either in the directory where you are executing the sirad command or
somewhere else on your Python path. See _options in config.py for a
complete list of possible options and default values.
The following options are available:
- DATA_SALT: secret salt used for hashing data values. This shouldn't be shared. A warning will be issued if it is not set. Defaults to None.
- PII_SALT: secret salt used for hashing PII values. This shouldn't be shared. A warning will be issued if it is not set. Defaults to None.
- LAYOUTS: directory that contains layout files. Defaults to layouts/.
- RAW_DIR, DATA_DIR, PII_DIR, LINK_DIR, RESEARCH_DIR: paths to where the original data, the processed files, and the research files will be saved.
- VERSION: the current version number of the processed and research files.
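As an illustration, a sirad_config.py might look like the sketch below. The option names come from the list above; every value is an assumption. In particular, the directory names are placeholders rather than sirad's defaults, VERSION is assumed to be an integer, and the environment variable names SIRAD_DATA_SALT and SIRAD_PII_SALT are hypothetical.

```python
# sirad_config.py: example values only; adjust for your project.
import os

# Read the salts from the environment so they are never committed with the code.
# The variable names below are hypothetical, not something sirad looks for itself.
DATA_SALT = os.environ.get("SIRAD_DATA_SALT")  # None if unset
PII_SALT = os.environ.get("SIRAD_PII_SALT")    # None if unset

LAYOUTS = "layouts/"  # matches the documented default

# Illustrative directory names, not necessarily sirad's defaults.
RAW_DIR = "raw/"
DATA_DIR = "data/"
PII_DIR = "pii/"
LINK_DIR = "link/"
RESEARCH_DIR = "research/"

VERSION = 1  # assumed to be an integer
```

Keeping the salts in environment variables is one way to follow the advice that they should not be shared; any other secret store would serve the same purpose.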
Layout files
sirad uses YAML files to define the layout, or structure, of raw data files.
These YAML files define each column in the incoming data and how it should be
processed. More documentation to come on this YAML format.
The following file formats are supported:
- csv - change delimiter with delimiter option
- fixed width
- xlsx (xls not currently supported)
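To illustrate the difference between the two text formats, the sketch below parses the same made-up record from a pipe-delimited file and from a fixed-width file using only the Python standard library. It does not use sirad's readers or its YAML layout format; the column names and widths are hypothetical.

```python
import csv
import io

COLUMNS = ["first", "last", "dob"]  # hypothetical column names

# Delimited text: the delimiter is configurable, here a pipe instead of a comma.
delimited = io.StringIO("Ada|Lovelace|1815-12-10\n")
delimited_rows = [
    dict(zip(COLUMNS, record)) for record in csv.reader(delimited, delimiter="|")
]

# Fixed width: each column occupies a fixed number of characters.
WIDTHS = [10, 12, 10]  # hypothetical widths, one per column


def parse_fixed_width(line, widths=WIDTHS):
    """Slice a line into fields by width and strip the padding."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return dict(zip(COLUMNS, fields))


fixed_line = "Ada".ljust(10) + "Lovelace".ljust(12) + "1815-12-10"
fixed_row = parse_fixed_width(fixed_line)
```

Both paths produce the same dictionary, which is why a layout only needs to say how each column is located, by delimiter or by width.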
Development
Sample test data is randomly generated using Faker; none of the information identifies real individuals.
- tax.txt - sample tax return data. Includes first name, last name, date of birth (DOB), and SSN.
- credit_scores.txt - sample credit score information. Includes first name, last name, and DOB, but no SSN.
Run unit tests as:
python -m unittest discover
sirad can also be used as an API from custom Python code. Documentation to come.
Authors
- Mark Howison
- Ted Lawless
- John Ucles
- Preston White