Clickhouse backend support for Splink
splinkclickhouse
Basic Clickhouse support for use as a backend with the data-linkage and deduplication package Splink.
Supports a Clickhouse server connected via clickhouse-connect.
Also supports the in-process chDB version if installed with the chdb extras.
Installation
Install from PyPI using pip:
# just installs the Clickhouse server dependencies
pip install splinkclickhouse
# or to install with support for chdb:
pip install splinkclickhouse[chdb]
Or you can install the package directly from GitHub:
# Replace with any version you want, or specify a branch after '@'
pip install git+https://github.com/ADBond/splinkclickhouse.git@v0.3.2
If instead you are using conda, splinkclickhouse is available on conda-forge:
conda install conda-forge::splinkclickhouse
Note that the conda version will only be able to use the Clickhouse server functionality, as chdb is not currently available within conda.
While the package is in early development there may be breaking changes in new versions without warning, although these should only occur in new minor versions. If you depend on this package, it is recommended to pin a version to avoid any disruption this may cause.
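A pin in a requirements file would look like splinkclickhouse==0.3.2 (0.3.2 being the version referenced in the GitHub install command above). As a minimal sketch, a project could also check at start-up that the pinned version is the one actually installed. The helper name installed_version_matches is hypothetical and not part of the package:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version_matches(package: str, pinned: str) -> bool:
    """Return True only if `package` is installed at exactly the `pinned` version."""
    try:
        return version(package) == pinned
    except PackageNotFoundError:
        # The package is not installed in this environment at all.
        return False


if __name__ == "__main__":
    # "0.3.2" is only an example pin; use whichever version you depend on.
    print(installed_version_matches("splinkclickhouse", "0.3.2"))
```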
Use
Clickhouse server
Import ClickhouseAPI, which accepts a clickhouse_connect client, configured with attributes relevant for your connection:
import clickhouse_connect
import splink.comparison_library as cl
from splink import Linker, SettingsCreator, block_on, splink_datasets

from splinkclickhouse import ClickhouseAPI

df = splink_datasets.fake_1000

conn_atts = {
    "host": "localhost",
    "port": 8123,
    "username": "splinkognito",
    "password": "splink123!",
}

db_name = "__temp_splink_db"

# connect without a database first, so that we can create it if needed
default_client = clickhouse_connect.get_client(**conn_atts)
default_client.command(f"CREATE DATABASE IF NOT EXISTS {db_name}")
client = clickhouse_connect.get_client(
    **conn_atts,
    database=db_name,
)

db_api = ClickhouseAPI(client)

# can have at most one tf-adjusted comparison, see caveats below
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.DamerauLevenshteinAtThresholds("city").configure(
            term_frequency_adjustments=True
        ),
        cl.JaccardAtThresholds("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api=db_api)
See Splink documentation for use of the Linker.
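For orientation, a typical next step might look like the sketch below. It assumes the Splink 4 API (the linker.training and linker.inference namespaces); the function name train_and_predict, the recall value, and the default threshold are illustrative placeholders rather than recommendations from this package:

```python
def train_and_predict(linker, deterministic_rules, em_blocking_rule, threshold=0.9):
    """Estimate model parameters, then score candidate pairs.

    `linker` is expected to be a Splink 4 Linker; the rules would normally be
    built with splink.block_on (for example block_on("first_name", "surname")).
    """
    # Prior probability that two random records match (recall is a guess here).
    linker.training.estimate_probability_two_random_records_match(
        deterministic_rules, recall=0.7
    )
    # u-probabilities from random record pairs.
    linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
    # m-probabilities via expectation maximisation within one blocking rule.
    linker.training.estimate_parameters_using_expectation_maximisation(
        em_blocking_rule
    )
    # Pairwise predictions above the chosen match-probability threshold.
    return linker.inference.predict(threshold_match_probability=threshold)
```

The returned object can then be inspected with, for example, as_pandas_dataframe(limit=5). The Splink documentation remains the authoritative guide for training strategy.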
chDB
To use chdb as a Splink backend you must install the chdb package.
This is automatically installed if you install with the chdb extras (pip install splinkclickhouse[chdb]).
Import ChDBAPI, which accepts a connection from chdb.dbapi:
import splink.comparison_library as cl
from chdb import dbapi
from splink import Linker, SettingsCreator, block_on, splink_datasets

from splinkclickhouse import ChDBAPI

con = dbapi.connect()
db_api = ChDBAPI(con)

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.DamerauLevenshteinAtThresholds("city").configure(
            term_frequency_adjustments=True
        ),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api=db_api)
See Splink documentation for use of the Linker.
Comparisons
splinkclickhouse is compatible with all of Splink's built-in comparisons and comparison levels in splink.comparison_library and splink.comparison_level_library.
In addition, splinkclickhouse provides a few pre-made extras to leverage Clickhouse-specific functionality.
These can be used in exactly the same way as the native Splink libraries, for example:
import splink.comparison_library as cl
from splink import SettingsCreator

import splinkclickhouse.comparison_library as cl_ch

...

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("name"),
        cl_ch.DistanceInKMAtThresholds(
            "latitude",
            "longitude",
            [10, 50, 100, 200, 500],
        ),
    ],
)
or with individual comparison-levels:
import splink.comparison_level_library as cll
import splink.comparison_library as cl
from splink import SettingsCreator

import splinkclickhouse.comparison_level_library as cll_ch

...

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("name"),
        cl.CustomComparison(
            comparison_levels=[
                cll.And(
                    cll.NullLevel("city"),
                    cll.NullLevel("postcode"),
                    cll.Or(cll.NullLevel("latitude"), cll.NullLevel("longitude")),
                ),
                cll.ExactMatch("postcode"),
                cll_ch.DistanceInKMLevel("latitude", "longitude", 5),
                cll_ch.DistanceInKMLevel("latitude", "longitude", 10),
                cll.ExactMatch("city"),
                cll_ch.DistanceInKMLevel("latitude", "longitude", 50),
                cll.ElseLevel(),
            ],
            output_column_name="location",
        ),
    ],
)
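To give a feel for what a distance-in-km threshold means, here is a pure-Python sketch of a great-circle (haversine) distance and of how a distance falls into the first threshold it satisfies. This is only an illustration: the package evaluates distances in Clickhouse SQL, and the exact function it uses, the Earth radius, and whether the boundary is inclusive are all assumptions here. The names haversine_km and distance_level are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_level(distance_km, thresholds_km):
    """Return the smallest threshold the distance falls within, or None for 'else'."""
    for threshold in sorted(thresholds_km):
        if distance_km <= threshold:
            return threshold
    return None


if __name__ == "__main__":
    # One degree of longitude along the equator.
    d = haversine_km(0.0, 0.0, 0.0, 1.0)
    print(round(d, 1), distance_level(d, [10, 50, 100, 200, 500]))
```

Ordering the levels from the tightest threshold to the loosest, as in the settings above, means each pair is assigned to the most specific level it qualifies for.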
Support
If you have difficulties with the package you can open an issue. You may also suggest changes by opening a PR, although it may be best to discuss in an issue beforehand.
This package is 'unofficial', in that it is not directly supported by the Splink team. Maintenance / improvements will be done on a 'best effort' basis where resources allow.
Known issues / caveats
Datetime parsing
Clickhouse offers several different date formats.
The basic Date format cannot handle dates before the Unix epoch (1970-01-01), which makes it unsuitable for many use cases involving dates of birth.
The parsing function parseDateTime (and its variants), which supports custom formats, returns a DateTime, which has the same limited range.
In splinkclickhouse we use the function parseDateTime64BestEffortOrNull so that we can use the extended-range DateTime64 data type, which supports dates back to 1900-01-01 but does not allow custom date formats. Currently no DateTime64 equivalent of parseDateTime exists.
If you require different behaviour (for instance if you have an unusual date format and know that you do not need dates outside of the DateTime range) you will either need to derive a new column in your source data, or construct the relevant SQL expression manually.
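When deciding whether the default parsing is sufficient for a dataset, it can help to check which range the earliest dates fall into. The sketch below uses only the lower bounds stated above (1970-01-01 for Date and DateTime, 1900-01-01 for DateTime64) and ignores upper bounds; the function name narrowest_supported_type is hypothetical.

```python
from datetime import date

# Lower bounds as described in the text above; upper bounds are ignored here.
DATE_MIN = date(1970, 1, 1)
DATETIME64_MIN = date(1900, 1, 1)


def narrowest_supported_type(iso_date: str) -> str:
    """Classify a YYYY-MM-DD string by the narrowest type whose range covers it."""
    parsed = date.fromisoformat(iso_date)
    if parsed >= DATE_MIN:
        return "Date"
    if parsed >= DATETIME64_MIN:
        return "DateTime64"
    return "out of range"


if __name__ == "__main__":
    for example in ["1985-06-15", "1934-02-01", "1899-12-31"]:
        print(example, narrowest_supported_type(example))
```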
Extended Dates
There is not currently a way in Clickhouse to deal directly with date values before 1900. However, splinkclickhouse offers some tools to help with this.
It creates a SQL UDF (which can be opted out of), days_since_epoch, to convert a date string (in YYYY-MM-DD format) into an integer representing the number of days since 1970-01-01. This handles dates well outside the range of DateTime64, and is based on the proleptic Gregorian calendar.
This can be used with the column expression extension splinkclickhouse.column_expression.ColumnExpression via the transform .parse_date_to_int(), or using custom versions of the Splink library functions cll.AbsoluteDateDifferenceLevel, cl.AbsoluteDateDifferenceAtThresholds, and cl.DateOfBirthComparison.
These functions can be used with string columns (which will be wrapped in the above parsing function), or with integer columns if the conversion via days_since_epoch is already done in the data-preparation stage.
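If you prefer to do this conversion during data preparation, the same quantity can be computed in Python, whose datetime.date also uses the proleptic Gregorian calendar. This is a sketch of the described semantics (days since 1970-01-01 for a YYYY-MM-DD string), not the SQL UDF that the package creates; handling of invalid or missing inputs is an assumption here.

```python
from datetime import date

EPOCH_ORDINAL = date(1970, 1, 1).toordinal()


def days_since_epoch(iso_date):
    """Days between 1970-01-01 and a YYYY-MM-DD date (negative before the epoch)."""
    if iso_date is None:
        # Keep missing values missing rather than inventing a date.
        return None
    return date.fromisoformat(iso_date).toordinal() - EPOCH_ORDINAL


if __name__ == "__main__":
    print(days_since_epoch("1970-01-02"))  # 1
    print(days_since_epoch("1900-01-01"))  # -25567
    print(days_since_epoch("1800-01-01"))  # -62091
```

The resulting integer column can then be passed to the date comparisons as described above.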
NULL values in chdb
When passing data into chdb from pandas or pyarrow tables, NULL values in String columns are converted into empty strings, instead of remaining NULL.
For now this is not handled within the package. You can work around the issue by wrapping column names in NULLIF:
import splink.comparison_library as cl

first_name_comparison = cl.DamerauLevenshteinAtThresholds("NULLIF(first_name, '')")
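If several String columns need the same treatment, a small helper can build the expression consistently. The helper name nullif_empty is hypothetical, and it assumes plain column names that need no quoting:

```python
def nullif_empty(column_name: str) -> str:
    """Build a SQL expression that turns empty strings back into NULL."""
    return f"NULLIF({column_name}, '')"


if __name__ == "__main__":
    for column in ["first_name", "surname", "city"]:
        print(nullif_empty(column))
```

Each resulting string can be passed wherever a column name is expected, in the same way as the literal expression in the snippet above.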
Term-frequency adjustments
Currently at most one term frequency adjustment can be used with ClickhouseAPI.
This also applies to ChDBAPI, but only in debug_mode. With debug_mode off there is no limit on term frequency adjustments.
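A quick pre-flight check can catch settings that exceed this limit before a model is run. The sketch below works on a settings dictionary and assumes the usual Splink layout, in which each comparison has a comparison_levels list and term-frequency-adjusted levels carry a tf_adjustment_column key; those key names and the helper name are assumptions, so verify them against your own settings.

```python
def comparisons_with_tf_adjustments(settings_dict):
    """List the output column names of comparisons that use tf adjustments."""
    names = []
    for comparison in settings_dict.get("comparisons", []):
        levels = comparison.get("comparison_levels", [])
        # A comparison counts once, however many of its levels are adjusted.
        if any(level.get("tf_adjustment_column") for level in levels):
            names.append(comparison.get("output_column_name"))
    return names


if __name__ == "__main__":
    example = {
        "comparisons": [
            {
                "output_column_name": "city",
                "comparison_levels": [{"tf_adjustment_column": "city"}, {}],
            },
            {"output_column_name": "email", "comparison_levels": [{}]},
        ]
    }
    adjusted = comparisons_with_tf_adjustments(example)
    if len(adjusted) > 1:
        print("More than one tf-adjusted comparison:", adjusted)
    else:
        print("OK:", adjusted)
```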
ClickhouseAPI pandas registration
ClickhouseAPI will allow registration of pandas dataframes, by inferring the types of columns. It currently only does this for string, integer, and float columns, and will always make them Nullable.
If you require other data types, or more fine-grained control, it is recommended to import the data into Clickhouse yourself, and then pass the table name (as a string) to the Linker instead.
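One way to take control of the column types is to create the table explicitly before loading data. The following sketch only builds the DDL string; the helper name build_create_table_sql, the table name people, and the choice of the MergeTree engine are illustrative assumptions, not something the package prescribes.

```python
def build_create_table_sql(table_name, column_types, order_by):
    """Build a CREATE TABLE statement from a mapping of column name to Clickhouse type."""
    columns_sql = ",\n".join(
        f"    {name} {ch_type}" for name, ch_type in column_types.items()
    )
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} (\n"
        f"{columns_sql}\n"
        f") ENGINE = MergeTree ORDER BY {order_by}"
    )


if __name__ == "__main__":
    sql = build_create_table_sql(
        "people",
        {
            "unique_id": "UInt32",
            "first_name": "Nullable(String)",
            "dob": "Nullable(Date32)",
        },
        order_by="unique_id",
    )
    print(sql)
```

With a clickhouse_connect client, the statement could then be run with client.command(sql), the data loaded with client.insert_df("people", df), and the table name passed to the Linker as Linker("people", settings, db_api=db_api).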