Wrapper around PPRL services provided by MDS Group Leipzig
PPRL library
The pprl library provides wrappers around the PPRL REST services provided by the Medical Data Science Group Leipzig.
The main entrypoints are pprl.encoder, pprl.match and pprl.broker, which are all submodules for consuming the APIs of the respective services.
Documentation
The documentation of the latest commit on the master branch can be seen on GitLab.
Running tests
Run the linter in the root directory using poetry run flake8.
Navigate to the tests directory on the command line and execute docker compose up -d.
This will start a number of services that are required to run the integration tests.
Once they're up and running (this might take a couple of minutes), run the following command in the root directory of this repository.
$ PYTEST_BROKER_BASE_URL="http://localhost:8080/broker" \
    PYTEST_ENCODER_BASE_URL="http://localhost:8080/encoder" \
    PYTEST_MATCH_BASE_URL="http://localhost:8080/matcher" \
    poetry run pytest
Installation
Run pip install pprl.
You can then import the pprl module in your project.
Usage
The following snippet shows how to encode an entity with specific Bloom filter encoding definitions and attribute schemas using the encoder submodule.
Depending on which parameters you choose, some options may be mandatory even though they are type hinted as optional.
from pprl import AttributeSchema, BloomFilterConfiguration, Entity
from pprl.encoder import EncoderClient

encoder = EncoderClient("http://localhost:8080/encoder")

entities = encoder.encode(
    config=BloomFilterConfiguration(
        filter_type="RBF",
        hash_strategy="RANDOM_SHA256",
        key="s3cr3t"
    ),
    schema_list=[
        AttributeSchema(
            attribute_name="name",
            data_type="string",
            average_token_count=10,
            weight=2
        ),
        AttributeSchema(
            attribute_name="age",
            data_type="integer",
            average_token_count=3,
            weight=1
        )
    ],
    entity_list=[
        Entity(id="1", attributes={
            "name": "foobar",
            "age": 42
        })
    ]
)

for entity in entities:
    print(f"{entity.id} = {entity.value}")
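For readers unfamiliar with Bloom filter encodings, the following is a minimal, self-contained sketch of the general idea: an attribute value is split into tokens, each token is hashed several times with a secret key, and the resulting positions are set in a fixed-size bit vector. This is not the encoder service's algorithm. The function names bigrams and bloom_encode, the HMAC-based hashing scheme, the filter size and the hash count are all assumptions chosen for illustration.

```python
import base64
import hashlib
import hmac


def bigrams(text: str) -> list[str]:
    """Split a padded string into overlapping two-character tokens."""
    padded = f"_{text}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def bloom_encode(text: str, key: str, size_bits: int = 256, hash_count: int = 4) -> str:
    """Hash every bigram into a bit vector and return it Base64-encoded.

    Illustrative only: size_bits (a multiple of 8) and hash_count are
    arbitrary defaults, not values used by the PPRL encoder service.
    """
    bits = bytearray(size_bits // 8)
    for token in bigrams(text):
        for i in range(hash_count):
            # keyed hash, salted with the index of the hash function
            digest = hmac.new(key.encode(), f"{i}:{token}".encode(), hashlib.sha256).digest()
            position = int.from_bytes(digest[:8], "big") % size_bits
            bits[position // 8] |= 1 << (position % 8)
    return base64.b64encode(bytes(bits)).decode("ascii")


print(bloom_encode("foobar", key="s3cr3t"))
```

Because the hash is keyed, the same value always maps to the same bit vector for a given key, which is what makes the encoded values comparable without revealing the original attribute.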
You can use the generated Base64-encoded bit vectors to compute their similarities to one another.
To do so, use the match submodule.
from pprl import MatchConfiguration
from pprl.match import MatchClient

matcher = MatchClient("http://localhost:8080/matcher")

matches = matcher.match(
    config=MatchConfiguration(
        match_function="JACCARD",
        match_mode="CROSSWISE",
        threshold=0.8
    ),
    domain_list=["Zm9vYmFyCg=="],
    range_list=["Zm9vYmF6Cg=="]
)

for match in matches:
    print(f"{match.domain} => {match.range} ({round(match.similarity, 3)})")
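To make the similarity value more tangible, here is a small local sketch of the Jaccard coefficient on two Base64-encoded bit vectors: the number of bits set in both vectors divided by the number of bits set in either one. The helper name jaccard is hypothetical, and the match service may differ in details such as the handling of empty vectors or vectors of unequal length.

```python
import base64


def jaccard(a_b64: str, b_b64: str) -> float:
    """Jaccard coefficient of two Base64-encoded bit vectors of equal length."""
    a_raw = base64.b64decode(a_b64)
    b_raw = base64.b64decode(b_b64)
    if len(a_raw) != len(b_raw):
        raise ValueError("bit vectors must have the same length")
    a = int.from_bytes(a_raw, "big")
    b = int.from_bytes(b_raw, "big")
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty vectors
    return bin(a & b).count("1") / union


# same vectors as in the example above
print(round(jaccard("Zm9vYmFyCg==", "Zm9vYmF6Cg=="), 3))  # 0.966
```

The two example vectors share 28 set bits out of 29 set in total, so they clear the threshold of 0.8 used in the configuration above.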
The broker submodule is for consuming the broker service API.
It is designed for massively parallel distributed record linkage.
As such, the following example is a bit more complicated, but not by much.
First, a new session is created.
Two clients then join the session, submit their bit vectors and eventually receive their results.
import time

from pprl import BitVector, BitVectorMetadata, BitVectorMetadataSpecification, MatchConfiguration
from pprl.broker import BrokerClient

broker = BrokerClient("http://localhost:8080/broker")

# we can discard the second return value since we won't receive any cancellation arguments
# from the "simple" cancellation strategy
session_secret, _ = broker.create_session(
    config=MatchConfiguration(
        match_function="JACCARD",
        threshold=0.8
    ),
    session_cancellation="SIMPLE",
    metadata_specifications=[
        BitVectorMetadataSpecification(
            name="createdAt",
            data_type="datetime",
            decision_rule="keepLatest"
        )
    ]
)

# we create two clients identified by different secrets
client_1_secret = broker.create_client(session_secret)
client_2_secret = broker.create_client(session_secret)

broker.submit_bit_vectors(client_1_secret, [
    BitVector(
        id="1",
        value="Zm9vYmFyCg==",
        metadata=[
            BitVectorMetadata(
                name="createdAt",
                value="2022-06-21T10:24:36+02:00"
            )
        ]
    )
])

broker.submit_bit_vectors(client_2_secret, [
    BitVector(
        id="2",
        value="Zm9vYmF6Cg==",
        metadata=[
            BitVectorMetadata(
                name="createdAt",
                value="2022-06-21T10:25:25+02:00"
            )
        ]
    )
])

# wait for matching to finish and check back every second
while broker.get_session_progress(session_secret) < 1:
    time.sleep(1)

# now print out the results for every client
for client_secret in (client_1_secret, client_2_secret):
    print(f"matches for client {client_secret}")

    for match in broker.get_results(client_secret):
        print(f"  {match.vector.id} ({round(match.similarity, 3)})")

# finally, cancel the session
broker.cancel_session(session_secret)
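The polling loop in the example above waits indefinitely. If the matching might stall, a bounded wait can be useful. The following is a minimal sketch of such a helper. The name wait_for_progress and its parameters are hypothetical and not part of the pprl library. It only assumes a callable that returns the session progress as a number that reaches 1 when matching is done, as get_session_progress does in the example.

```python
import time
from typing import Callable


def wait_for_progress(
    get_progress: Callable[[], float],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> None:
    """Poll get_progress until it reaches 1, or raise after timeout seconds."""
    deadline = time.monotonic() + timeout
    while get_progress() < 1:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"progress did not reach 1 within {timeout} seconds")
        time.sleep(interval)


# stand-in for a real progress source, used here so the sketch runs on its own
fake_progress = iter([0.0, 0.5, 1.0])
wait_for_progress(lambda: next(fake_progress), timeout=5, interval=0)
print("matching finished")
```

With a broker client, the call would look like wait_for_progress(lambda: broker.get_session_progress(session_secret)), replacing the while loop from the example.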