oagdedupe

oagdedupe is a Python library for scalable entity resolution. It uses active learning to learn blocking configurations, generate comparison pairs, and then classify matches.

Documentation

The documentation for oagdedupe is available at https://deduper.readthedocs.io/en/latest/. It includes the API reference, a guide to the methodology, and examples.

Installation

[tbd pip install instructions]

start label-studio

Start label-studio with the Docker command below, replacing [LS_PORT] with the port to expose on your host machine:

docker run -it -p [LS_PORT]:8080 -v `pwd`/cache/mydata:/label-studio/data \
	--env LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true \
	--env LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/label-studio/files \
	-v `pwd`/cache/myfiles:/label-studio/files \
	heartexlabs/label-studio:latest label-studio
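
Once the container is up, it can be useful to confirm that something is accepting connections on the chosen host port. The sketch below is a generic TCP check built on the Python standard library. `port_is_open` is a hypothetical helper (it is not part of oagdedupe or label-studio), and the host and port shown are placeholders for your own values.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Replace 8089 with the value you used for [LS_PORT].
    print(port_is_open("localhost", 8089))
```

A successful connection only shows that the port is reachable, not that label-studio has finished starting, so treat it as a quick sanity check.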

postgres

[insert instructions here about initializing postgres]

Most importantly, you need to create the functions defined in dedupe/postgres/funcs.py.

project settings

Create an oagdedupe.settings.Settings object. For example:

from oagdedupe.settings import (
    Settings,
    SettingsOther,
)

settings = Settings(
    name="test",  # the name of the project, a unique identifier
    folder="./.dedupe",  # path to folder where settings and data will be saved
    other=SettingsOther(
        n=5000, # active-learning samples per learning loop
        k=3, # max_len of block conjunctions
        cpus=20,  # parallelize distance computations
        attributes=["givenname", "surname", "suburb", "postcode"],  # list of entity attribute names
        path_database="postgresql+psycopg2://username:password@172.22.39.26:8000/db",  # connection URL of the postgres database holding intermediate data
        db_schema="dedupe",
        path_model="./.dedupe/test_model",  # where to save the model
        label_studio={
            "port": 8089,  # label studio port
            "api_key": "83e2bc3da92741aa41c272829558c596faefa745",  # label studio API key (access token)
            "description": "chansoo test project",  # label studio description of project
        },
        fast_api={"port": 8090},  # fast api port
    ),
)
settings.save()
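
The `path_database` value in the example looks like a SQLAlchemy-style connection URL. Assuming that is how it is consumed, credentials containing characters such as `@` or `/` need to be percent-encoded before they go into the URL. The sketch below shows one way to build such a string. `postgres_url` is a hypothetical helper, not part of oagdedupe, and the credentials are placeholders.

```python
from urllib.parse import quote_plus


def postgres_url(user: str, password: str, host: str, port: int, database: str) -> str:
    """Build a postgresql+psycopg2 URL, percent-encoding the credentials."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder credentials; the password deliberately contains "@" and "/".
url = postgres_url("username", "p@ss/word", "172.22.39.26", 8000, "db")
print(url)  # postgresql+psycopg2://username:p%40ss%2Fword@172.22.39.26:8000/db
```

The resulting string could then be passed as `path_database` in `SettingsOther`.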

To get label studio api_key:

  1. Log in (you can sign up with any username and password).
  2. Go to "Account & Settings" using the icon at the top right.
  3. Copy the Access Token and paste it into settings at settings.other.label_studio["api_key"].

See dedupe/settings.py for the full settings code.

dedupe

The example below deduplicates a pandas DataFrame, df, on the attribute columns specified in settings.

train dedupe

import pandas as pd
from oagdedupe.api import Dedupe

# df is a pandas DataFrame containing the attribute columns listed in settings
d = Dedupe(settings=settings)
d.initialize(df=df, reset=True)

# %%
# pre-processes the data and stores the pre-processed data, comparisons, and ID matrices in the database
d.fit_blocks()
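
The training snippet takes `df` as given. As a minimal sketch, a dataframe with the four attribute columns from the settings example could be built as follows. The records are made up for illustration, and whether oagdedupe expects additional columns (such as an identifier) is not stated here, so check the documentation before relying on this shape.

```python
import pandas as pd

# Attribute names copied from the settings example.
attributes = ["givenname", "surname", "suburb", "postcode"]

# Made-up records; the first two rows are a plausible duplicate pair.
df = pd.DataFrame(
    [
        ["john", "smith", "richmond", "3121"],
        ["jon", "smith", "richmond", "3121"],
        ["mary", "jones", "carlton", "3053"],
    ],
    columns=attributes,
)

# Fail early if any configured attribute column is absent.
missing = [col for col in attributes if col not in df.columns]
if missing:
    raise ValueError(f"df is missing attribute columns: {missing}")
```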

record-linkage

The example below links df to df2 on the attribute columns specified in settings. Both dataframes must contain these columns.

train record-linkage

import pandas as pd
from oagdedupe.api import RecordLinkage

# df and df2 are pandas DataFrames that both contain the attribute columns listed in settings
d = RecordLinkage(settings=settings)
d.initialize(df=df, df2=df2, reset=True)

# %%
# pre-processes the data and stores the pre-processed data, comparisons, and ID matrices in the database
d.fit_blocks()
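
Because both dataframes have to share the attribute columns, a quick check before calling `initialize` can surface a mismatch early. The sketch below uses only pandas. `missing_attributes` is a hypothetical helper, not part of oagdedupe, and the sample frames are made up.

```python
import pandas as pd


def missing_attributes(
    df: pd.DataFrame, df2: pd.DataFrame, attributes: list[str]
) -> dict[str, list[str]]:
    """Map each dataframe name to the attribute columns it lacks."""
    return {
        "df": [c for c in attributes if c not in df.columns],
        "df2": [c for c in attributes if c not in df2.columns],
    }


attributes = ["givenname", "surname", "suburb", "postcode"]
left = pd.DataFrame(columns=attributes)
right = pd.DataFrame(columns=["givenname", "surname", "suburb"])  # no postcode

report = missing_attributes(left, right, attributes)
print(report)  # {'df': [], 'df2': ['postcode']}
```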

active learn

For either dedupe or record-linkage, run:

   export DEDUPER_NAME="<project name>"
   export DEDUPER_FOLDER="<project folder>"
   python -m dedupe.fastapi.main

Replace <project name> and <project folder> with the values from your project settings (for the example above, test and ./.dedupe).
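
The same launch can be prepared from Python, which makes it explicit that the two variables must be present in the environment of the service process. This sketch assumes the service reads DEDUPER_NAME and DEDUPER_FOLDER from its environment, as the shell snippet suggests. `fastapi_command` is a hypothetical helper, and the module path is copied from the command above.

```python
import os
import sys


def fastapi_command(name: str, folder: str) -> tuple[list[str], dict[str, str]]:
    """Return the command and environment for starting the active-learning service."""
    # Copy the current environment and add the two project variables.
    env = dict(os.environ, DEDUPER_NAME=name, DEDUPER_FOLDER=folder)
    cmd = [sys.executable, "-m", "dedupe.fastapi.main"]
    return cmd, env


cmd, env = fastapi_command("test", "./.dedupe")
# To actually launch the service:
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```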

Then return to label-studio and start labelling. When the queue falls below 5 tasks, the FastAPI service updates the model with the labelled samples and then sends more tasks for review.
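
The refill behaviour described above can be illustrated with a toy simulation. This is not oagdedupe's implementation: the batch size, the function name, and the retraining stand-in are all assumptions, and only the threshold of 5 tasks comes from the text.

```python
from collections import deque

QUEUE_THRESHOLD = 5  # refill when fewer than this many tasks remain (from the text)
BATCH_SIZE = 10      # hypothetical number of tasks sent per refill


def run_labelling(n_labels: int) -> tuple[int, int]:
    """Simulate labelling n_labels tasks; return (retrain_count, tasks_left)."""
    queue = deque(range(BATCH_SIZE))
    next_task_id = BATCH_SIZE
    retrains = 0
    for _ in range(n_labels):
        queue.popleft()  # the annotator labels one task
        if len(queue) < QUEUE_THRESHOLD:
            retrains += 1  # stand-in for updating the model with the new labels
            queue.extend(range(next_task_id, next_task_id + BATCH_SIZE))
            next_task_id += BATCH_SIZE
    return retrains, len(queue)


print(run_labelling(20))  # (2, 10)
```

With these assumed numbers, labelling 20 tasks triggers two model updates, and the annotator never runs out of tasks.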

predictions

To get predictions, simply run the predict() method.

Dedupe:

d = Dedupe(settings=Settings(name="test", folder="./.dedupe"))
d.predict()

Record-linkage:

d = RecordLinkage(settings=Settings(name="test", folder="./.dedupe"))
d.predict()
