# splinkclickhouse

Basic Clickhouse support for use as a backend with the data-linkage and deduplication package Splink.

Supports the in-process chDB version of Clickhouse, or a Clickhouse server connected via `clickhouse-connect`.
## Installation

You can install the package from GitHub:

```sh
# for v0.2.3 - replace with any version you want, or specify a branch after '@'
pip install git+https://github.com/ADBond/splinkclickhouse.git@v0.2.3
```
## Use

### chDB

Import `ChDBAPI`, which accepts a connection from `chdb.dbapi`:
```python
import splink.comparison_library as cl
from chdb import dbapi
from splink import Linker, SettingsCreator, block_on, splink_datasets

from splinkclickhouse import ChDBAPI

con = dbapi.connect()
db_api = ChDBAPI(con)

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.DamerauLevenshteinAtThresholds("city").configure(
            term_frequency_adjustments=True
        ),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api=db_api)
```
See the Splink documentation for use of the `Linker`.
### Clickhouse server

Import `ClickhouseAPI`, which accepts a `clickhouse_connect` client, configured with the attributes relevant to your connection:
```python
import clickhouse_connect
import splink.comparison_library as cl
from splink import Linker, SettingsCreator, block_on, splink_datasets

from splinkclickhouse import ClickhouseAPI

df = splink_datasets.fake_1000

conn_atts = {
    "host": "localhost",
    "port": 8123,
    "username": "splinkognito",
    "password": "splink123!",
}

db_name = "__temp_splink_db"

default_client = clickhouse_connect.get_client(**conn_atts)
default_client.command(f"CREATE DATABASE IF NOT EXISTS {db_name}")
client = clickhouse_connect.get_client(
    **conn_atts,
    database=db_name,
)

db_api = ClickhouseAPI(client)

# can have at most one tf-adjusted comparison, see caveats below
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.DamerauLevenshteinAtThresholds("city").configure(
            term_frequency_adjustments=True
        ),
        cl.JaccardAtThresholds("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api=db_api)
```
See the Splink documentation for use of the `Linker`.
## Known issues / caveats

### Datetime parsing

Clickhouse offers several different date formats. The basic `Date` type cannot handle dates before the Unix epoch (1970-01-01), which makes it unsuitable for holding many dates of birth. The parsing function `parseDateTime` (and its variants), which supports custom formats, returns a `DateTime`, which has the same limited range.

In `splinkclickhouse` we use the function `parseDateTime64BestEffortOrNull` so that we can use the extended-range `DateTime64` data type, which supports dates back to 1900-01-01, but does not allow custom date formats. Currently no `DateTime64` equivalent of `parseDateTime` exists.

If you require different behaviour (for instance, if you have an unusual date format and know that you do not need dates outside of the `DateTime` range), you will need to either derive a new column in your source data or construct the relevant SQL expression manually.

There is currently no way in Clickhouse to deal directly with date values before 1900 - if you require such values, you will have to process them into a different type manually and construct the relevant SQL logic.
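As a sketch of the manual-expression route (the helper and format string here are illustrative assumptions, not part of the package): Clickhouse's `parseDateTimeOrNull` accepts a MySQL-style format string but is limited to the post-1970 `DateTime` range, so it is only suitable when all your dates fall in that range.

```python
# Sketch: wrap a column in a custom-format parse expression, for use
# where Splink expects a column name. parseDateTimeOrNull uses
# MySQL-style format specifiers and only covers the DateTime range.
def parse_dob_expression(column: str, fmt: str = "%d/%m/%Y") -> str:
    return f"parseDateTimeOrNull({column}, '{fmt}')"

expr = parse_dob_expression("dob")
print(expr)  # parseDateTimeOrNull(dob, '%d/%m/%Y')
```

Splink generally accepts SQL expressions in place of plain column names, so the resulting string can be passed to a comparison directly.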
### `NULL` values in chDB

When passing data into chDB from pandas or pyarrow tables, `NULL` values in `String` columns are converted to empty strings instead of remaining `NULL`.

For now this is not handled within the package. You can work around the issue by wrapping column names in `NULLIF`:

```python
import splink.comparison_library as cl

first_name_comparison = cl.DamerauLevenshteinAtThresholds("NULLIF(first_name, '')")
```
### Term-frequency adjustments

Currently, at most one term-frequency-adjusted comparison can be used with `ClickhouseAPI`.

This also applies to `ChDBAPI`, but only in `debug_mode`. With `debug_mode` off, there is no limit on term-frequency adjustments.
### `ClickhouseAPI` pandas registration

`ClickhouseAPI` allows registration of pandas dataframes by inferring the types of columns. It currently only does this for string and integer columns, and will always make them `Nullable`.

If you require other data types, or more fine-grained control, it is recommended to import the data into Clickhouse yourself, and then pass the table name (as a string) to the `Linker` instead.
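For instance, you might build the `CREATE TABLE` statement with explicit Clickhouse types yourself. A sketch (the helper, table name, and column types are illustrative assumptions, not part of the package):

```python
# Hypothetical helper: build a CREATE TABLE statement with explicit
# Clickhouse types, instead of relying on ClickhouseAPI's inference
def create_table_sql(table: str, columns: dict[str, str]) -> str:
    cols = ", ".join(f"{name} {ch_type}" for name, ch_type in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
        "ENGINE = MergeTree ORDER BY tuple()"
    )

sql = create_table_sql(
    "people",
    {
        "unique_id": "UInt64",
        "first_name": "Nullable(String)",
        "dob": "Nullable(Date32)",
    },
)
print(sql)
```

You could then run this via your `clickhouse_connect` client (`client.command(sql)`), load the data, and pass `"people"` to the `Linker` in place of a dataframe.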