Anonymous linkage using cryptographic hashes and bloom filters
Project description
A Python (and optimised C++) implementation of anonymous linkage using cryptographic linkage keys as described by Rainer Schnell, Tobias Bachteler, and Jörg Reiher in A Novel Error-Tolerant Anonymous Linking Code.
anonlink computes similarity scores, and/or best guess matches between sets of cryptographic linkage keys (hashed entity records).
Use clkhash to create cryptographic linkage keys from personally identifiable data.
Installation
Install directly from PyPi:
pip install anonlink
Or to install from source:
pip install -r requirements.txt pip install -e .
Alternative - Manually compile the C++ library
For mac with:
g++ -std=c++11 -mssse3 -mpopcnt -O2 -Wall -pedantic -Wextra -dynamiclib -fpic -o _entitymatcher.dll dice_one_against_many.cpp
For linux with:
g++ -std=c++11 -mssse3 -mpopcnt -O2 -Wall -pedantic -Wextra -shared -fpic -o _entitymatcher.so dice_one_against_many.cpp
Benchmark
You can run the benchmark with:
$ python -m anonlink.benchmark Anonlink benchmark -- see README for explanation ------------------------------------------------ 100000 x 1024 bit popcounts Implementation | Time (ms) | Bandwidth (MiB/s) | Throughput (1e6 popc/s) Python (bitarray.count()): | 17.78 | 686.54 | 5.62 Native code (no copy): | 1.00 | 12243.76 | 100.30 Native code (w/ copy): | 344.17 | 35.47 | 0.29 (99.7% copying) Threshold: 0.5 Size 1 | Size 2 | Comparisons | Total Time (s) | Throughput | | (match %) | (comparisons / matching)| (1e6 cmp/s) -------+--------+------------------+-------------------------+------------- 1000 | 1000 | 1e6 (50.20%) | 0.249 (88.6% / 11.4%) | 4.525 2000 | 2000 | 4e6 (50.51%) | 1.069 (88.5% / 11.5%) | 4.227 3000 | 3000 | 9e6 (50.51%) | 2.412 (85.3% / 14.7%) | 4.375 4000 | 4000 | 16e6 (50.56%) | 4.316 (83.6% / 16.4%) | 4.434 Threshold: 0.7 Size 1 | Size 2 | Comparisons | Total Time (s) | Throughput | | (match %) | (comparisons / matching)| (1e6 cmp/s) -------+--------+------------------+-------------------------+------------- 1000 | 1000 | 1e6 ( 0.01%) | 0.017 (99.8% / 0.2%) | 59.605 2000 | 2000 | 4e6 ( 0.01%) | 0.056 (99.8% / 0.2%) | 71.484 3000 | 3000 | 9e6 ( 0.01%) | 0.118 (99.9% / 0.1%) | 76.500 4000 | 4000 | 16e6 ( 0.01%) | 0.202 (99.9% / 0.1%) | 79.256 5000 | 5000 | 25e6 ( 0.01%) | 0.309 (99.9% / 0.1%) | 81.093 6000 | 6000 | 36e6 ( 0.01%) | 0.435 (99.9% / 0.1%) | 82.841 7000 | 7000 | 49e6 ( 0.01%) | 0.590 (99.9% / 0.1%) | 83.164 8000 | 8000 | 64e6 ( 0.01%) | 0.757 (99.9% / 0.1%) | 84.619 9000 | 9000 | 81e6 ( 0.01%) | 0.962 (99.8% / 0.2%) | 84.358 10000 | 10000 | 100e6 ( 0.01%) | 1.166 (99.8% / 0.2%) | 85.895 20000 | 20000 | 400e6 ( 0.01%) | 4.586 (99.9% / 0.1%) | 87.334
The tables are interpreted as follows. The first section compares the bandwidth doing popcounts through (i) the Python bitarray library and (ii) a native code implementation in assembler. The latter implementation is measured in two ways: the first measures just the time taken to compute the popcounts, while the second includes the time taken to copy the data out of the running Python instance as well as copying the result back into Python. The “% copying” measure is the proportion of time spent doing this copying.
The second section includes two tables that measure the throughput of the Dice coefficient comparison function. The two tables correspond to two different choices of “matching threshold”, 0.5 and 0.7, which were chosen to characterise two different performance scenarios. Since the data used for comparisons is randomly generated, the first threshold value will cause about 50% of the candidates to “match”, while the second threshold value will cause <0.01% of the candidates to match (these values are reported in the “match %” column). In both cases, all matches above the threshold are returned and passed to the solver. In the first case, the large number of matches means that much of the time is spent keeping the candidates in order so that the top k matches can be returned. In the latter case, the tiny number of candidate matches means that the throughput is determined primarily by the comparison code itself.
Finally, the Total Time column includes indications as to the proportion of time spent calculating the (sparse) similarity matrix comparisons and the proportion of time spent matching in the greedy solver. This latter is determined by the size of the similarity matrix, which will be approximately #comparisons * match% / 100.
Tests
Run unit tests with pytest:
$ pytest ====================================== test session starts ====================================== platform linux -- Python 3.6.4, pytest-3.2.5, py-1.4.34, pluggy-0.4.0 rootdir: /home/hlaw/src/n1-anonlink, inifile: collected 71 items tests/test_benchmark.py ... tests/test_bloommatcher.py .............. tests/test_e2e.py .............ss.... tests/test_matcher.py ..x.....x......x....x.. tests/test_similarity.py ......... tests/test_util.py ... ======================== 65 passed, 2 skipped, 4 xfailed in 4.01 seconds ========================
To enable slightly larger tests add the following environment variables:
INCLUDE_10K
INCLUDE_100K
Limitations
The linkage process has order n^2 time complexity - although algorithms exist to significantly speed this up. Several possible speedups are described in Privacy Preserving Record Linkage with PPJoin.
Discussion
If you run into bugs, you can file them in our issue tracker on GitHub.
There is also an anonlink mailing list for development discussion and release announcements.
Wherever we interact, we strive to follow the Python Community Code of Conduct.
License
Copyright 2017 CSIRO (Data61)
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
File details
Details for the file anonlink-0.12.5a1.tar.gz
.
File metadata
- Download URL: anonlink-0.12.5a1.tar.gz
- Upload date:
- Size: 152.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.34.0 CPython/3.6.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 05aa1b4eb3e499895841f405be8f36eed41d399584d6b0a4b2dd342bca0e5deb |
|
MD5 | 0705380b087660a5fa95fe1272c6c104 |
|
BLAKE2b-256 | b42fd80e8371ee105eb5bdf0317c79093a767644e4bdfec15a6929bb96e1f318 |
File details
Details for the file anonlink-0.12.5a1-cp37-cp37m-manylinux1_x86_64.whl
.
File metadata
- Download URL: anonlink-0.12.5a1-cp37-cp37m-manylinux1_x86_64.whl
- Upload date:
- Size: 576.2 kB
- Tags: CPython 3.7m
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.34.0 CPython/3.6.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 28a405e0282775a60e161b04c8ddf42566907877d713a9cab41ff90142587785 |
|
MD5 | 73ce4ee728fab0e127a3a5a893682027 |
|
BLAKE2b-256 | ef3b48a20e2c08f28d296ef8623ef03dccf5d6575c4d11c44eaa9bad62683835 |
File details
Details for the file anonlink-0.12.5a1-cp36-cp36m-manylinux1_x86_64.whl
.
File metadata
- Download URL: anonlink-0.12.5a1-cp36-cp36m-manylinux1_x86_64.whl
- Upload date:
- Size: 578.0 kB
- Tags: CPython 3.6m
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.34.0 CPython/3.6.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 67f514d9c184815432c6e49516f6d7172fa7a0b4f263e931393d8b6a0ed48274 |
|
MD5 | 4e534ede4ed51d15cf5f75e45191abc8 |
|
BLAKE2b-256 | 5de957c9eaa65db5d564b5c365f5ed7d54394a354378e20ee01835d8c18f9634 |
File details
Details for the file anonlink-0.12.5a1-cp35-cp35m-manylinux1_x86_64.whl
.
File metadata
- Download URL: anonlink-0.12.5a1-cp35-cp35m-manylinux1_x86_64.whl
- Upload date:
- Size: 575.7 kB
- Tags: CPython 3.5m
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.34.0 CPython/3.6.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | f22a07c2ed9f6f9ddde972cd9ba52cee5f0f16eb7eb5c52573601977bb5c0057 |
|
MD5 | 096a3c1f3f5130118d667c53133b8e27 |
|
BLAKE2b-256 | 0df77b6f7c5bd558bd72ea2f43797f25f5189f136f89dfd01d00eae5b79ac070 |