Wrapper around PPRL services provided by MDS Group Leipzig

PPRL library

The pprl library provides wrappers around the PPRL REST services provided by the Medical Data Science Group Leipzig. The main entry points are the submodules pprl.encoder, pprl.match and pprl.broker, each of which consumes the API of the corresponding service.

Documentation

The documentation for the latest commit on the master branch is available on GitLab.

Running tests

Run the linter in the root directory using poetry run flake8.

Navigate to the tests directory on the command line and execute docker compose up -d. This starts the services that the integration tests require. Once they are up and running (this may take a couple of minutes), run the following command in the root directory of this repository.

$ PYTEST_BROKER_BASE_URL="http://localhost:8080/broker" \
    PYTEST_ENCODER_BASE_URL="http://localhost:8080/encoder" \
    PYTEST_MATCH_BASE_URL="http://localhost:8080/matcher" \
    poetry run pytest
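
The three environment variables above tell the integration tests where each service lives. How the test suite actually reads them is not shown here, so the following is only a minimal sketch of such a lookup. The helper name service_urls and the ENV_VARS mapping are hypothetical and not part of pprl.

```python
import os
from typing import Mapping

# Variable names taken from the command above; the mapping itself is an assumption.
ENV_VARS = {
    "broker": "PYTEST_BROKER_BASE_URL",
    "encoder": "PYTEST_ENCODER_BASE_URL",
    "match": "PYTEST_MATCH_BASE_URL",
}


def service_urls(environ: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the base URL of every service, failing early if one is unset."""
    missing = [var for var in ENV_VARS.values() if not environ.get(var)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")

    # Strip trailing slashes so that paths can be appended consistently.
    return {name: environ[var].rstrip("/") for name, var in ENV_VARS.items()}
```

Failing early with the list of unset variables makes a misconfigured test run easier to diagnose than a connection error further down the line.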

Installation

Run pip install pprl. You can then import the pprl module in your project.

Usage

The following snippet shows how to use the encoder submodule to encode an entity with specific Bloom filter encoding definitions and attribute schemas. Depending on which parameters you choose, some options may be mandatory even though they are type-hinted as optional.

from pprl import AttributeSchema, BloomFilterConfiguration, Entity
from pprl.encoder import EncoderClient

encoder = EncoderClient("http://localhost:8080/encoder")
entities = encoder.encode(
    config=BloomFilterConfiguration(
        filter_type="RBF",
        hash_strategy="RANDOM_SHA256",
        key="s3cr3t"
    ),
    schema_list=[
        AttributeSchema(
            attribute_name="name",
            data_type="string",
            average_token_count=10,
            weight=2
        ),
        AttributeSchema(
            attribute_name="age",
            data_type="integer",
            average_token_count=3,
            weight=1
        )
    ],
    entity_list=[
        Entity(id="1", attributes={
            "name": "foobar",
            "age": 42
        })
    ]
)

for entity in entities:
    print(f"{entity.id} = {entity.value}")
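
To give a rough idea of what a Bloom filter encoding does, the sketch below maps the character bigrams of a string onto a small bit vector with keyed hashes and returns the result as Base64. It is a simplified illustration only: it does not reproduce the encoder service's RBF filter type or RANDOM_SHA256 hash strategy, and the names bigrams and bloom_encode as well as the parameters size_bits and hashes are hypothetical.

```python
import base64
import hashlib
import hmac


def bigrams(text: str) -> list[str]:
    """Split a string into overlapping character pairs, padded at both ends."""
    padded = f"_{text}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def bloom_encode(text: str, key: str, size_bits: int = 64, hashes: int = 2) -> str:
    """Set `hashes` keyed-hash positions per bigram and return the bits as Base64."""
    if size_bits <= 0 or size_bits % 8 != 0:
        raise ValueError("size_bits must be a positive multiple of 8")

    bits = bytearray(size_bits // 8)

    for token in bigrams(text):
        for seed in range(hashes):
            # The secret key makes bit positions unpredictable without the key.
            digest = hmac.new(
                key.encode("utf-8"),
                f"{seed}:{token}".encode("utf-8"),
                hashlib.sha256,
            ).digest()
            position = int.from_bytes(digest[:8], "big") % size_bits
            bits[position // 8] |= 1 << (position % 8)

    return base64.b64encode(bytes(bits)).decode("ascii")


print(bloom_encode("foobar", key="s3cr3t"))
```

Similar strings share most of their bigrams and therefore most of their set bits, which is what makes the resulting bit vectors comparable without revealing the original values.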

You can use the generated Base64-encoded bit vectors to compute their similarities to one another. You will need to make use of the match submodule.

from pprl import MatchConfiguration
from pprl.match import MatchClient

matcher = MatchClient("http://localhost:8080/matcher")
matches = matcher.match(
    config=MatchConfiguration(
        match_function="JACCARD",
        match_mode="CROSSWISE",
        threshold=0.8
    ),
    domain_list=["Zm9vYmFyCg=="],
    range_list=["Zm9vYmF6Cg=="]
)

for match in matches:
    print(f"{match.domain} => {match.range} ({round(match.similarity, 3)})")
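
The match service computes similarities on the server. To make the JACCARD match function more concrete, here is a local sketch of a Jaccard similarity over two Base64-encoded bit vectors: the number of bits set in both vectors divided by the number of bits set in either. The function jaccard_similarity is hypothetical and not part of pprl, and the service's implementation may differ in details such as bit order or the handling of empty vectors.

```python
import base64


def jaccard_similarity(first_b64: str, second_b64: str) -> float:
    """Jaccard similarity of two equally long Base64-encoded bit vectors."""
    first = base64.b64decode(first_b64)
    second = base64.b64decode(second_b64)

    if len(first) != len(second):
        raise ValueError("bit vectors must have the same length")

    # Interpret the bytes as one large integer so bitwise operators apply.
    a = int.from_bytes(first, "big")
    b = int.from_bytes(second, "big")

    union = bin(a | b).count("1")
    if union == 0:
        # Assumption: two empty vectors are treated as having no similarity.
        return 0.0

    return bin(a & b).count("1") / union


print(round(jaccard_similarity("Zm9vYmFyCg==", "Zm9vYmF6Cg=="), 3))
```

For the two example vectors used above, this sketch evaluates to 28/29, or about 0.966, which lies above the threshold of 0.8.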

The broker submodule consumes the broker service API, which is designed for massively parallel distributed record linkage. As such, the following example is a bit more involved, but not by much. In short, a new session is created, two clients join it, submit their bit vectors and eventually receive their results.

import time

from pprl import BitVector, BitVectorMetadata, BitVectorMetadataSpecification, MatchConfiguration
from pprl.broker import BrokerClient

broker = BrokerClient("http://localhost:8080/broker")

# we can discard the second return value since the "simple" cancellation
# strategy doesn't yield any cancellation arguments
session_secret, _ = broker.create_session(
    config=MatchConfiguration(
        match_function="JACCARD",
        threshold=0.8
    ),
    session_cancellation="SIMPLE",
    metadata_specifications=[
        BitVectorMetadataSpecification(
            name="createdAt",
            data_type="datetime",
            decision_rule="keepLatest"
        )
    ]
)

# we create two clients identified by different secrets
client_1_secret = broker.create_client(session_secret)
client_2_secret = broker.create_client(session_secret)

broker.submit_bit_vectors(client_1_secret, [
    BitVector(
        id="1",
        value="Zm9vYmFyCg==",
        metadata=[
            BitVectorMetadata(
                name="createdAt",
                value="2022-06-21T10:24:36+02:00"
            )
        ]
    )
])

broker.submit_bit_vectors(client_2_secret, [
    BitVector(
        id="2",
        value="Zm9vYmF6Cg==",
        metadata=[
            BitVectorMetadata(
                name="createdAt",
                value="2022-06-21T10:25:25+02:00"
            )
        ]
    )
])

# wait for matching to finish and check back every second
while broker.get_session_progress(session_secret) < 1:
    time.sleep(1)

# now print out the results for every client
for client_secret in (client_1_secret, client_2_secret):
    print(f"matches for client {client_secret}")

    for match in broker.get_results(client_secret):
        print(f"  {match.vector.id} ({round(match.similarity, 3)})")

# finally, cancel the session
broker.cancel_session(session_secret)
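
The polling loop in the example above runs until the session reports full progress, which means it never returns if matching stalls. A minimal sketch of a bounded alternative is shown below. The helper wait_for_completion and its timeout and interval parameters are hypothetical and not part of pprl; it accepts any callable that returns a progress value between 0 and 1.

```python
import time
from typing import Callable


def wait_for_completion(
    get_progress: Callable[[], float],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> float:
    """Poll get_progress until it reports at least 1 or the timeout expires."""
    deadline = time.monotonic() + timeout

    while True:
        progress = get_progress()
        if progress >= 1:
            return progress

        if time.monotonic() >= deadline:
            raise TimeoutError(f"progress stuck at {progress} after {timeout} seconds")

        time.sleep(interval)


# Stand-in for a real progress source, used here only for demonstration.
fake_progress = iter([0.0, 0.5, 1.0])
print(wait_for_completion(lambda: next(fake_progress), interval=0))
```

With the broker client from the example, the call could look like wait_for_completion(lambda: broker.get_session_progress(session_secret)).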
