Secure Infrastructure for Research with Administrative Data

Secure Infrastructure for Research with Administrative Data (SIRAD)

sirad is an integration framework for data from administrative systems. It deidentifies administrative data by removing and replacing personally identifiable information (PII) with a global anonymized identifier, allowing researchers to securely join data on an individual from multiple tables without knowing the individual's identity. It is developed by Research Improving People's Lives (RIPL).
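The idea of joining tables on an anonymized identifier can be illustrated with a small, self-contained sketch. This is not sirad's implementation: the exact-match rule on (first, last, dob), the sequential integer identifier, the `anon_id` field name, and the toy records are all assumptions made for illustration.

```python
import itertools

# Toy records loosely modeled on the sample files in the Development section.
# All values are made up for this illustration.
tax = [
    {"first": "Ana", "last": "Diaz", "dob": "1980-01-02", "ssn": "000-00-0001", "income": 52000},
    {"first": "Bo", "last": "Lee", "dob": "1975-06-30", "ssn": "000-00-0002", "income": 61000},
]
credit_scores = [
    {"first": "Ana", "last": "Diaz", "dob": "1980-01-02", "score": 710},
]

PII_FIELDS = ("first", "last", "dob", "ssn")  # columns to strip from research files
MATCH_KEY = ("first", "last", "dob")          # hypothetical exact-match linkage rule


def deidentify(tables):
    """Replace PII columns with an identifier shared across all tables."""
    counter = itertools.count(1)
    ids = {}  # PII match key -> anonymized identifier
    result = []
    for table in tables:
        rows = []
        for row in table:
            key = tuple(row[f] for f in MATCH_KEY)
            if key not in ids:
                ids[key] = next(counter)
            # Keep only non-PII columns, then attach the identifier.
            clean = {k: v for k, v in row.items() if k not in PII_FIELDS}
            clean["anon_id"] = ids[key]
            rows.append(clean)
        result.append(rows)
    return result


research_tax, research_credit = deidentify([tax, credit_scores])

# A researcher can now join on anon_id without seeing any PII.
scores = {r["anon_id"]: r["score"] for r in research_credit}
joined = [
    dict(r, score=scores[r["anon_id"]])
    for r in research_tax
    if r["anon_id"] in scores
]
print(joined)
```

The joined rows carry only the non-PII columns and the shared identifier, which is the property the description above relies on.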

For a worked example and further details, please see sirad-example.

To learn more about the motivation for creating this package and its potential uses, please see our article in Communications of the ACM:

J.S. Hastings, M. Howison, T. Lawless, J. Ucles, P. White. (2019). Unlocking Data to Improve Public Policy. Communications of the ACM 62(10): 48-53. doi:10.1145/3335150

Installation

Requires Python 3.7 or later.

To install from PyPI using pip:
pip install sirad

To install using Anaconda Python:
conda install -c ripl-org sirad

To install a development version from the current directory:
pip install -e .

Running

The package includes a single command-line script, sirad.

sirad accepts the following arguments:

  • process - split raw data files into data and PII files
  • research - create a versioned set of research files with a unique anonymous identifier

Configuration

To set configuration options, create a file called sirad_config.py and place it either in the directory where you run the sirad command or elsewhere on your Python path. See _options in config.py for the complete list of options and their default values.
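As a rough sketch, a sirad_config.py using the options described in this section might look like the following. Only the option names and the layouts/ default come from this README; the directory paths, the integer VERSION, and the environment variable names are placeholders chosen for illustration.

```python
# sirad_config.py: illustrative values only.
import os

# Secret salts. Reading them from the environment keeps them out of
# version control. The variable names SIRAD_DATA_SALT and SIRAD_PII_SALT
# are hypothetical; os.environ.get returns None when a variable is unset.
DATA_SALT = os.environ.get("SIRAD_DATA_SALT")
PII_SALT = os.environ.get("SIRAD_PII_SALT")

# Directory containing the YAML layout files (documented default).
LAYOUTS = "layouts/"

# Placeholder locations for raw, processed, and research files.
RAW_DIR = "raw/"
DATA_DIR = "build/data/"
PII_DIR = "build/pii/"
LINK_DIR = "build/link/"
RESEARCH_DIR = "build/research/"

# Version of the processed and research files (assumed here to be an integer).
VERSION = 1
```

Because the file is plain Python, values can be computed at import time, which is what makes the environment lookup for the salts possible.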

The following options are available:

  • DATA_SALT: secret salt used for hashing data values. Keep it private. A warning is issued if it is not set. Defaults to None.

  • PII_SALT: secret salt used for hashing PII values. Keep it private. A warning is issued if it is not set. Defaults to None.

  • LAYOUTS: directory that contains layout files. Defaults to layouts/.

  • RAW_DIR, DATA_DIR, PII_DIR, LINK_DIR, RESEARCH_DIR: paths to the directories where the original data, the processed files, and the research files are stored.

  • VERSION: the current version number of the processed and research files.
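To see why the salts must stay secret, consider a generic salted (keyed) hash. This README does not state which hash function sirad uses, so the sketch below substitutes HMAC-SHA256 from the Python standard library; the function name `salted_hash` and the sample values are hypothetical.

```python
import hashlib
import hmac


def salted_hash(value: str, salt: str) -> str:
    """Return a hex digest of value keyed by a secret salt (HMAC-SHA256)."""
    return hmac.new(
        salt.encode("utf-8"), value.encode("utf-8"), hashlib.sha256
    ).hexdigest()


# The same value and salt always give the same digest, so hashed
# values remain comparable across files.
a = salted_hash("000-00-0001", "salt-one")
b = salted_hash("000-00-0001", "salt-one")

# A different salt gives an unrelated digest, so someone without the
# salt cannot reproduce the mapping by hashing guessed values.
c = salted_hash("000-00-0001", "salt-two")

print(a == b, a == c)
```

Anyone who learns the salt could hash candidate values (for example, every possible SSN) and match them against the stored digests, which is why the salts should not be shared.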

Layout files

sirad uses YAML files to define the layout, or structure, of raw data files. These YAML files define each column in the incoming data and how it should be processed. More documentation to come on this YAML format.

The following file formats are supported:

  • csv - set the delimiter with the delimiter option
  • fixed width
  • xlsx (xls not currently supported)

Development

Sample test data is randomly generated using Faker; none of the information identifies real individuals.

  • tax.txt - sample tax return data. Includes first name, last name, date of birth (DOB), and SSN.
  • credit_scores.txt - sample credit score information. Includes first name, last name, and DOB, but no SSN.

Run the unit tests with:

python -m unittest discover

sirad can also be used as an API from custom Python code. Documentation to come.

Authors

  • Mark Howison
  • Ted Lawless
  • John Ucles
  • Preston White
