Column store implementation for ftm data based on clickhouse
Project description
ftm-columnstore
This library provides methods to store, fetch and list entities formatted as
followthemoney
data as
datasets stored in a column store backend using
clickhouse
This roughly follows the functionality and features from followthemoney-store but with a huge performance benefit on writing and querying data.
FtM
data is stored in one table in statements format.
Minimum Python version: 3.10
Usage
Set up a running clickhouse instance (pointed to via DATABASE_URI
env var,
default: localhost
), for developing purposes this could work:
make clickhouse
Then initialize the required table schema:
ftmcs init
Or drop existing data and recreate:
ftmcs init --recreate
To test if it's working, run a raw query:
ftmcs query "SHOW TABLES"
When using the make clickhouse
command, you can play around with SQL queries
in your browser: http://127.0.0.1:8123/play
Command-line usage
# Insert a bunch of FtM entities into a store:
cat ftm-entities.ijson | ftmcs write -d my_dataset
# Re-create the entities in aggregated form:
ftmcs iterate -d my_dataset | alephclient write-entities -f my_dataset
Usage: ftmcs [OPTIONS] COMMAND [ARGS]...
Store FollowTheMoney object data in a column store (Clickhouse)
Options:
--log-level TEXT Set logging level [default: info]
--uri TEXT Database connection URI [default: localhost]
--table TEXT Database table [default: ftm]
--help Show this message and exit.
Commands:
delete Delete dataset or complete store
fingerprints Generate fingerprint statements as csv from json entities...
flatten Turn json entities from `infile` into statements in csv...
init Initialize database and table.
iterate Iterate entities
list List datasets in a store
query Execute raw query and print result (csv format) to outfile
statements Dump all statements as csv
write Write json entities from `infile` to store.
Python Library
from ftm_columnstore import get_dataset
dataset = get_dataset("us-ofac")
dataset.store.put(entity)
entity = dataset.store.get("entity-id")
Bulk writer behaves the same like in followthemoney-store
:
from ftm_columnstore import get_dataset
dataset = get_dataset("us-ofac")
bulk = dataset.store.bulk()
for entity in many_entities:
bulk.put(entity)
bulk.flush()
Querying entities
There is some weird and unintuitive stuff going on building these queries as
turning the statements back into FtM
entities is a bit hacky here, but from
top-level, it feels quite nice:
from ftm_columnstore.query import EntityQuery
q = EntityQuery().where(schema="Person")
# queries are always streaming result iterator
entities = [e for e in q]
# querying for properties:
q = EntityQuery().where(schema="Payment", amount__gte=1000)
benchmarks
OpenSanctions dataset
tl;dr For a small dataset, the original followthemoney-store
implementation is faster, as the columnstore implementation has the statements
overhead. Benefits come from the explicit querying features.
curl -s https://data.opensanctions.org/datasets/latest/all/entities.ftm.json\?`date '+%s'` > opensanctions.ftm.ijson
All tests run on the same Dell XPS Laptop, 16 GB, 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
write
ftm-columnstore
: 99s
date +"%H:%M:%S" ; ftmcs write -d opensanctions -i opensanctions.ftm.ijson ; date +"%H:%M:%S"
15:44:51
ftm_columnstore.dataset [INFO] [opensanctions] Write: 100000 entities with 532100 statements.
# ...
ftm_columnstore.dataset [INFO] [opensanctions] Write: 536131 entities with 5565069 statements.
15:46:10
followthemoney-store
(postgres): 50s
date +"%H:%M:%S" ; ftm store write -d opensanctions -i opensanctions.ftm.ijson ; date +"%H:%M:%S"
15:46:55
ftmstore [INFO] Write [opensanctions]: 10000 entities
# ...
ftmstore [INFO] Write [opensanctions]: 530000 entities
15:47:45
iterate
followthemoney-store
(postgres): 46s
date +"%H:%M:%S" ; ftm store iterate -d opensanctions > /dev/null ; date +"%H:%M:%S"
15:48:34
15:49:20
ftm-columnstore
: 87s
date +"%H:%M:%S" ; ftmcs iterate -d opensanctions > /dev/null ; date +"%H:%M:%S"
15:49:27
15:50:54
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for ftm_columnstore-0.1.1-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | a035b271d2317079cf412db91e1fc00158cd886296e464846e43aee01a450976 |
|
MD5 | f9fdd0528a6be114f20e66985c02ac11 |
|
BLAKE2b-256 | e78692c2faa041f58f83a0037c86654683d57fb03931133e7f2e9a671b170d9d |