
Python library implementing the Special Uniques Detection Algorithm (SUDA) for measuring disclosure control risk in synthetic data


suda

Sample uniqueness scoring in Python

This is a Python library for scoring sample uniques using the Special Uniques Detection Algorithm (SUDA).

The algorithm looks for rows in a dataset that are unique with respect to one or more categorical fields and scores them according to risk.

The smaller the number of fields for which a row is unique, the higher the score. So a row which has a unique value for a single field will score highly.

The more field combinations for which a row is unique, the higher the score. So a row which is unique in multiple ways will score highly.
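The idea of a row being unique for a set of fields can be sketched in plain Python. This is only an illustration of the concept, not the library's implementation: the helper name `unique_rows` and the sample rows are hypothetical, and a list of dictionaries stands in for a dataframe.

```python
from collections import Counter

def unique_rows(rows, fields):
    """Return indices of rows whose values for `fields` occur exactly once.

    Hypothetical helper for illustration; not part of the suda package.
    """
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] == 1]

# Made-up sample data
rows = [
    {"gender": "female", "region": "urban", "employment": "employed"},
    {"gender": "female", "region": "urban", "employment": "employed"},
    {"gender": "female", "region": "rural", "employment": "unemployed"},
    {"gender": "male", "region": "urban", "employment": "employed"},
    {"gender": "male", "region": "urban", "employment": "employed"},
]

print(unique_rows(rows, ["region"]))            # [2]
print(unique_rows(rows, ["gender", "region"]))  # [2]
print(unique_rows(rows, ["gender"]))            # []
```

Row 2 is unique on a single field ('region'), which is the kind of row the algorithm treats as high risk.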

Usage

Python

Call the suda() function with the dataframe to score, the maximum MSU size to test for, the DIS score for the file (defaults to 0.1) and the columns to use for scoring (defaults to all columns).

For example:

results = suda(data, max_msu=2)

This scores the 'data' dataframe and finds MSUs of up to two fields. If the dataframe contains the fields 'gender', 'age', 'education' and 'employment', the algorithm looks for rows that are unique for every combination of one or two fields: gender, age, education, employment, gender & age, gender & education, gender & employment, age & education, age & employment, and education & employment.
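The set of field combinations tested for a given maximum MSU size can be enumerated with itertools.combinations. This is a minimal sketch of the enumeration only; the helper name `field_combinations` is hypothetical and the library may generate its combinations differently.

```python
from itertools import combinations

def field_combinations(fields, max_msu):
    """List every combination of 1..max_msu fields (illustrative helper)."""
    combos = []
    for size in range(1, max_msu + 1):
        combos.extend(combinations(fields, size))
    return combos

fields = ["gender", "age", "education", "employment"]
combos = field_combinations(fields, max_msu=2)

for combo in combos:
    print(" & ".join(combo))

print(len(combos))  # 10: four single fields plus six pairs
```

The number of combinations grows quickly with the number of fields and the maximum MSU size, so a small maximum keeps the search manageable.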

The output may look like:

id  msu  suda  fK   fM   gender  region  education             employment  dis-suda
0   0.0  0.0   2.0  0.0  female  urban   secondary incomplete  employed    0.000000
1   0.0  0.0   2.0  0.0  female  urban   secondary incomplete  employed    0.000000
2   1.0  12.0  1.0  4.0  female  urban   primary incomplete    non-LF      0.020690
3   0.0  0.0   2.0  0.0  male    urban   secondary complete    employed    0.000000
4   1.0  16.0  1.0  6.0  female  rural   secondary complete    unemployed  0.027586
5   0.0  0.0   2.0  0.0  male    urban   secondary complete    employed    0.000000

fK is the minimum frequency of the row. If this is greater than 1, the row has no sample unique values.

fM is the number of MSUs found for the row.

msu is the size of the smallest Minimal Sample Unique (MSU) for the row, that is, the smallest number of fields for which the row is unique.

suda is the SUDA score for the row, calculated by adding together the scores of its individual MSUs. Each MSU scores (M - k)!, where M is the number of attributes in the dataset and k is the number of fields in the MSU.

dis-suda is the file-level risk score (DIS) divided by the sum of the SUDA scores for all rows, multiplied by the SUDA score for the row. In other words, it is the total file-level risk distributed across the rows in proportion to their SUDA scores.
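As a worked example of the score described above, the sketch below sums the factorial terms for a row. The function names are hypothetical, and the MSU sizes passed in are an assumed breakdown chosen for illustration, not values reported by the library.

```python
from math import factorial

def msu_score(num_attributes, msu_size):
    """Score of one MSU: factorial of (attributes minus MSU size)."""
    return factorial(num_attributes - msu_size)

def suda_score(num_attributes, msu_sizes):
    """Sum the scores of all MSUs found for a row (illustrative helper)."""
    return sum(msu_score(num_attributes, k) for k in msu_sizes)

# Assumed example: 4 attributes, one MSU of size 1 and three of size 2
print(suda_score(4, [1, 2, 2, 2]))  # 3! + 2! + 2! + 2! = 12

# A row with no MSUs scores zero
print(suda_score(4, []))  # 0
```

Because the factorial grows as the MSU gets smaller, a row that is unique on a single field contributes far more than one that is unique only on a larger combination.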
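The distribution of the file-level risk over rows can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the function name `dis_suda` is hypothetical, and the scores are made-up values for a six-row file, so the results differ from the sample output above, which appears to show only part of a larger file.

```python
def dis_suda(suda_scores, dis=0.1):
    """Share the file-level risk `dis` across rows in proportion to SUDA score."""
    total = sum(suda_scores)
    if total == 0:
        # No sample uniques anywhere: no risk to distribute
        return [0.0 for _ in suda_scores]
    return [dis * score / total for score in suda_scores]

# Assumed SUDA scores for a six-row file
scores = [0, 0, 12, 0, 16, 0]
result = dis_suda(scores, dis=0.1)

print([round(value, 6) for value in result])
# [0.0, 0.0, 0.042857, 0.0, 0.057143, 0.0]
```

The per-row values sum to the file-level DIS, so rows with higher SUDA scores carry a larger share of the total risk.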

Command line

Use the command line function to supply a CSV file for the input, a path to output the resulting CSV, the maximum MSU, the columns to include, and the file-level risk (DIS).

References

Elliot, M. J., Manning, A. M., & Ford, R. W. (2002). A Computational Algorithm for Handling the Special Uniques Problem. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 493-509.

Elliot, M. J., Manning, A., Mayes, K., Gurd, J., & Bane, M. (2005). SUDA: A Program for Detecting Special Uniques. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality. Geneva.
