oagdedupe is a Python library for scalable entity resolution, using active learning to learn blocking configurations, generate comparison pairs, then classify matches.
Documentation
The documentation for oagdedupe is at https://deduper.readthedocs.io/en/latest/ and includes the API reference, a guide to the methodology, and examples.
Installation
[tbd pip install instructions]
start label-studio
Start label-studio using the docker command below, replacing [LS_PORT] with the port on your host machine:
docker run -it -p [LS_PORT]:8080 -v `pwd`/cache/mydata:/label-studio/data \
--env LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true \
--env LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/label-studio/files \
-v `pwd`/cache/myfiles:/label-studio/files \
heartexlabs/label-studio:latest label-studio
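Once the container is running, it can help to confirm that something is listening on the host port you mapped. The following is a minimal sketch using only the Python standard library; the helper name is_port_open is illustrative and is not part of label-studio or oagdedupe.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError if nothing accepts the connection
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, is_port_open("localhost", 8089) checks the port when [LS_PORT] was set to 8089. A True result only shows that the port accepts connections, not that label-studio has finished starting.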
postgres
[insert instructions here about initializing postgres]
Most importantly, the functions in dedupe/postgres/funcs.py need to be created in the database.
project settings
Create an oagdedupe.settings.Settings object. For example:
from oagdedupe.settings import (
    Settings,
    SettingsOther,
)

settings = Settings(
    name="default",  # the name of the project, a unique identifier
    folder="./.dedupe",  # path to folder where settings and data will be saved
    other=SettingsOther(
        n=5000,  # active-learning samples per learning loop
        k=3,  # max_len of block conjunctions
        cpus=20,  # parallelize distance computations
        attributes=["givenname", "surname", "suburb", "postcode"],  # list of entity attribute names
        path_database="postgresql+psycopg2://username:password@172.22.39.26:8000/db",  # connection URL of the postgres database holding intermediate data
        db_schema="dedupe",
        path_model="./.dedupe/test_model",  # where to save the model
        label_studio={
            "port": 8089,  # label studio port
            "api_key": "83e2bc3da92741aa41c272829558c596faefa745",  # label studio API key (access token)
            "description": "chansoo test project",  # label studio description of project
        },
        fast_api={"port": 8090},  # fast api port
    ),
)
settings.save()
To get the label studio api_key:
- Log in (you can make up any username and password).
- Go to "Account & Settings" using the icon on the top-right.
- Get the Access Token and copy/paste it into the settings at settings.other.label_studio["api_key"].
See dedupe/settings.py for the full settings code.
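The path_database value in the example follows the SQLAlchemy-style URL layout dialect+driver://username:password@host:port/database. Below is a small sketch of assembling such a URL from its parts; the helper name postgres_url is hypothetical and not part of oagdedupe. Percent-encoding the credentials avoids problems when a password contains characters such as @ or /.

```python
from urllib.parse import quote_plus


def postgres_url(user: str, password: str, host: str, port: int, db: str) -> str:
    """Build a postgresql+psycopg2 connection URL, escaping the credentials."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db}"
    )


# The password here is made up to show the escaping.
url = postgres_url("username", "p@ss/word", "172.22.39.26", 8000, "db")
print(url)
```

The resulting string could then be passed as path_database in SettingsOther.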
dedupe
Below is an example that dedupes df on the attribute columns specified in settings.
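For illustration, df could look like the following. This is a sketch with made-up records; the only requirement taken from the text above is that df contains the columns listed in attributes.

```python
import pandas as pd

# Column names match the attributes list in the settings example.
attributes = ["givenname", "surname", "suburb", "postcode"]

# Made-up records: rows 0 and 1 are intended to look like the same person.
df = pd.DataFrame(
    [
        ["anna", "smith", "newtown", "2042"],
        ["ana", "smith", "newtown", "2042"],
        ["john", "brown", "carlton", "3053"],
    ],
    columns=attributes,
)
print(df)
```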
train dedupe
import pandas as pd
from oagdedupe.api import Dedupe

# df is a pandas DataFrame containing the attribute columns listed in settings
d = Dedupe(settings=settings)
d.initialize(df=df, reset=True)

# pre-processes data and stores pre-processed data, comparisons, and ID matrices in the database
d.fit_blocks()
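To make the idea of blocking concrete, here is a generic sketch in plain Python. It is not oagdedupe's implementation: the records, the block_key function, and the chosen predicates are all hypothetical. The key is a conjunction of two simple predicates, and only records that share a key become comparison pairs.

```python
from collections import defaultdict
from itertools import combinations

# Made-up records keyed by record id.
records = {
    0: {"givenname": "anna", "surname": "smith", "postcode": "2000"},
    1: {"givenname": "ana", "surname": "smith", "postcode": "2000"},
    2: {"givenname": "john", "surname": "brown", "postcode": "3000"},
    3: {"givenname": "jon", "surname": "brown", "postcode": "3001"},
}


def block_key(record):
    """Conjunction of two predicates: surname prefix AND postcode prefix."""
    return (record["surname"][:3], record["postcode"][:2])


def candidate_pairs(records):
    """Group record ids by block key, then pair up ids within each block."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[block_key(record)].append(record_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


pairs = candidate_pairs(records)
print(sorted(pairs))
```

With four records there are six possible pairs, but blocking keeps only the two pairs that share a key, which is what makes comparison tractable on large datasets.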
record-linkage
Below is an example that links df to df2 on the attribute columns specified in settings (both dataframes should share these columns).
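Since both dataframes need the same attribute columns, a quick check before initializing can catch a mismatch early. The helper below is a hypothetical sketch, not part of oagdedupe, and the dataframes are made up.

```python
from typing import List

import pandas as pd


def missing_attributes(frame: pd.DataFrame, attributes: List[str]) -> List[str]:
    """Return the attribute names that are not columns of frame."""
    return [name for name in attributes if name not in frame.columns]


attributes = ["givenname", "surname", "suburb", "postcode"]

# Made-up dataframes: df2 lacks the postcode column.
df = pd.DataFrame(columns=["givenname", "surname", "suburb", "postcode"])
df2 = pd.DataFrame(columns=["givenname", "surname", "suburb"])

print(missing_attributes(df, attributes))
print(missing_attributes(df2, attributes))
```

An empty list for both dataframes means they share every column named in attributes.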
train record-linkage
import pandas as pd
from oagdedupe.api import RecordLinkage

# df and df2 are pandas DataFrames that both contain the attribute columns listed in settings
d = RecordLinkage(settings=settings)
d.initialize(df=df, df2=df2, reset=True)

# pre-processes data and stores pre-processed data, comparisons, and ID matrices in the database
d.fit_blocks()
active learn
For either dedupe or record-linkage, run:
export DEDUPER_NAME="<project name>"
export DEDUPER_FOLDER="<project folder>"
python -m dedupe.fastapi.main
replacing <project name> and <project folder> with your project settings (for the example above, default and ./.dedupe).
Then return to label-studio and start labelling. When the queue falls under 5 tasks, the FastAPI service updates the model with the labelled samples and then sends more tasks to review.
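The refill behaviour described above can be pictured with a small simulation. This is a sketch only and not the service's code: the class, the batch size of 10, and the counter standing in for a model update are assumptions; only the threshold of 5 comes from the text.

```python
from collections import deque

QUEUE_THRESHOLD = 5  # refill when fewer than this many tasks remain
BATCH_SIZE = 10  # hypothetical number of tasks sent per refill


class LabellingLoop:
    """Toy model of the label, retrain, refill cycle."""

    def __init__(self):
        self.queue = deque()
        self.labelled = []
        self.model_updates = 0
        self._next_task = 0

    def _send_tasks(self, n):
        """Append n new task ids to the review queue."""
        for _ in range(n):
            self.queue.append(self._next_task)
            self._next_task += 1

    def refill_if_needed(self):
        """Update the model and send more tasks when the queue is low."""
        if len(self.queue) < QUEUE_THRESHOLD:
            self.model_updates += 1  # stand-in for retraining on labelled samples
            self._send_tasks(BATCH_SIZE)

    def label_next(self, label):
        """Label the oldest task, then check whether a refill is due."""
        task = self.queue.popleft()
        self.labelled.append((task, label))
        self.refill_if_needed()


loop = LabellingLoop()
loop.refill_if_needed()  # the queue starts empty, so a first batch is sent
for _ in range(12):
    loop.label_next(label="match")
print(loop.model_updates, len(loop.queue))
```

In this toy run the model is updated once at the start and once more after the sixth label, when the queue drops to four tasks.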
predictions
To get predictions, run the predict() method.
Dedupe:
d = Dedupe(settings=Settings(name="test", folder="./.dedupe"))
d.predict()
Record-linkage:
d = RecordLinkage(settings=Settings(name="test", folder="./.dedupe"))
d.predict()