A low-privilege implementation of hyperloglog for django and postgres
Project description
django_pg_simple_hll
Postgres-only HyperLogLog cardinality aggregation for the Django ORM.
Instead of counting the number of distinct users,
%%time
Session.objects.aggregate(unique_users=Count("user_uuid", distinct=True))
CPU times: user 864 µs, sys: 2.46 ms, total: 3.32 ms
Wall time: 461 ms
{'unique_users': 140000}
you can approximate it,
%%time
Session.objects.aggregate(approx_unique_users=HLLCardinality("user_uuid"))
CPU times: user 480 µs, sys: 1.54 ms, total: 2.02 ms
Wall time: 845 ms
{'approx_unique_users': 144594}
Or, instead of counting the number of distinct users per day,
%%time
list(
Session.objects
.annotate(date_of_session=TruncDate("created"))
.values("date_of_session")
.annotate(unique_users=Count("user_uuid", distinct=True))
.values("unique_users", "date_of_session")
.order_by("date_of_session")
)
CPU times: user 1.47 ms, sys: 3.16 ms, total: 4.64 ms
Wall time: 758 ms
[{'date_of_session': datetime.date(2023, 4, 28), 'unique_users': 20000},
{'date_of_session': datetime.date(2023, 4, 29), 'unique_users': 40000},
{'date_of_session': datetime.date(2023, 4, 30), 'unique_users': 60000},
{'date_of_session': datetime.date(2023, 5, 1), 'unique_users': 80000},
{'date_of_session': datetime.date(2023, 5, 2), 'unique_users': 100000},
{'date_of_session': datetime.date(2023, 5, 3), 'unique_users': 120000},
{'date_of_session': datetime.date(2023, 5, 4), 'unique_users': 140000}]
you can approximate it,
%%time
list(
Session.objects
.annotate(date_of_session=TruncDate("created"))
.values("date_of_session")
.annotate(approx_unique_users=HLLCardinality("user_uuid"))
.values("approx_unique_users", "date_of_session")
.order_by("date_of_session")
)
CPU times: user 1.53 ms, sys: 3.57 ms, total: 5.1 ms
Wall time: 838 ms
[{'date_of_session': datetime.date(2023, 4, 28), 'approx_unique_users': 19322},
{'date_of_session': datetime.date(2023, 4, 29), 'approx_unique_users': 39356},
{'date_of_session': datetime.date(2023, 4, 30), 'approx_unique_users': 61202},
{'date_of_session': datetime.date(2023, 5, 1), 'approx_unique_users': 80917},
{'date_of_session': datetime.date(2023, 5, 2), 'approx_unique_users': 102914},
{'date_of_session': datetime.date(2023, 5, 3), 'approx_unique_users': 125637},
{'date_of_session': datetime.date(2023, 5, 4), 'approx_unique_users': 144594}]
The approximation is sometimes faster and uses less memory, particularly when faceting by other variables.
By default, it uses a precision of 9, which tends to have an error of about 5%. You can set a higher precision (up to 26) through a second parameter:
%%time
list(
Session.objects
.annotate(date_of_session=TruncDate("created"))
.values("date_of_session")
.annotate(approx_unique_users=HLLCardinality("user_uuid", 11))
.values("approx_unique_users", "date_of_session")
.order_by("date_of_session")
)
CPU times: user 1.71 ms, sys: 3.82 ms, total: 5.53 ms
Wall time: 1.98 s
[{'date_of_session': datetime.date(2023, 4, 28), 'approx_unique_users': 20013},
{'date_of_session': datetime.date(2023, 4, 29), 'approx_unique_users': 39616},
{'date_of_session': datetime.date(2023, 4, 30), 'approx_unique_users': 59312},
{'date_of_session': datetime.date(2023, 5, 1), 'approx_unique_users': 81278},
{'date_of_session': datetime.date(2023, 5, 2), 'approx_unique_users': 101880},
{'date_of_session': datetime.date(2023, 5, 3), 'approx_unique_users': 122343},
{'date_of_session': datetime.date(2023, 5, 4), 'approx_unique_users': 141375}]
You can also set a lower precision (down to 4) for a faster, less accurate estimate.
%%time
list(
Session.objects
.annotate(date_of_session=TruncDate("created"))
.values("date_of_session")
.annotate(approx_unique_users=HLLCardinality("user_uuid", 7))
.values("approx_unique_users", "date_of_session")
.order_by("date_of_session")
)
CPU times: user 1.3 ms, sys: 2.85 ms, total: 4.15 ms
Wall time: 547 ms
[{'date_of_session': datetime.date(2023, 4, 28), 'approx_unique_users': 19363},
{'date_of_session': datetime.date(2023, 4, 29), 'approx_unique_users': 40627},
{'date_of_session': datetime.date(2023, 4, 30), 'approx_unique_users': 63159},
{'date_of_session': datetime.date(2023, 5, 1), 'approx_unique_users': 83371},
{'date_of_session': datetime.date(2023, 5, 2), 'approx_unique_users': 96185},
{'date_of_session': datetime.date(2023, 5, 3), 'approx_unique_users': 116565},
{'date_of_session': datetime.date(2023, 5, 4), 'approx_unique_users': 143588}]
The aggregation is also available in SQL for analytics:
select
date_trunc('day', created) as date_of_session,
hll_cardinality(user_uuid, 11) as approx_unique_users
from
testapp_session
group by
date_trunc('day', created)
order by date_of_session;
date_of_session | approx_unique_users
------------------------+---------------------
2023-04-28 00:00:00+00 | 20013
2023-04-29 00:00:00+00 | 39616
2023-04-30 00:00:00+00 | 59312
2023-05-01 00:00:00+00 | 81278
2023-05-02 00:00:00+00 | 101880
2023-05-03 00:00:00+00 | 122343
2023-05-04 00:00:00+00 | 141375
(7 rows)
Time: 2121.662 ms (00:02.122)
SQL implementation
This is a low-privilege implementation of Hyperloglog approximate cardinality aggregation written in SQL and some PL/pgSQL. Read about it here,
Should I use this?
If you can use an optimised version, you should use that. It will be faster - although I haven't done any benchmarks.
Use this if you can't install extensions on your database, such as on Amazon RDS.
Performance
This is not a scientific test, but here is an example of performance running on an M1 Macbook Pro, running Postgres 14 in Docker.
For a dataset of 560k rows and 140k unique values a precision of 6 matches the speed of COUNT(DISTINCT ...)
:
select count(user_uuid) from testapp_session;
count
--------
560000
(1 row)
Time: 115.919 ms
select count(distinct user_uuid) as unique_users from testapp_session;
unique_users
--------------
140000
(1 row)
Time: 434.234 ms
select hll_cardinality(user_uuid, 6) as approx_unique_users from testapp_session;
approx_unique_users
---------------------
136260
(1 row)
Time: 409.562 ms
select hll_cardinality(user_uuid, 11) as approx_unique_users from testapp_session;
approx_unique_users
---------------------
141375
(1 row)
Time: 1309.422 ms (00:01.309)
For a dataset of 5.6M rows and 1.4M unique values:
- with a precision of 7,
hll_aggregate
runs in about 0.79x the speed ofCOUNT(DISTINCT ...)
- with a precision of 8, it roughly matches (0.94x)
select count(user_uuid) from testapp_session;
count
---------
5600000
(1 row)
Time: 6021.708 ms (00:06.022)
select count(distinct user_uuid) as unique_users from testapp_session;
unique_users
--------------
1400000
(1 row)
Time: 5414.085 ms (00:05.414)
select hll_cardinality(user_uuid, 7) as approx_unique_users from
approx_unique_users
---------------------
1308235
(1 row)
Time: 4308.065 ms (00:04.308)
select hll_cardinality(user_uuid, 8) as approx_unique_users from testapp_session;
approx_unique_users
---------------------
1407331
(1 row)
Time: 5083.205 ms (00:05.083)
select hll_cardinality(user_uuid, 11) as approx_unique_users from testapp_session;
approx_unique_users
---------------------
1390983
(1 row)
Time: 12468.054 ms (00:12.468)
For a dataset of 56M rows and 14M unique values:
- with a precision of 8,
hll_aggregate
runs in about 0.68x the speed ofCOUNT(DISTINCT ...)
- with a precision of 9, it runs in about 0.76x
- with a precision of 10, it runs in about 0.87x
- with a precision of 11, it runs in about 1.10x
select count(user_uuid) from testapp_session;
count
----------
56000000
(1 row)
Time: 86902.105 ms (01:26.902)93)
select count(distinct user_uuid) as unique_users from testapp_session;
unique_users
--------------
14000000
(1 row)
Time: 145953.595 ms (02:25.954)
select hll_cardinality(user_uuid, 8) as approx_unique_users from testapp_session;
approx_unique_users
---------------------
13469591
(1 row)
Time: 99688.330 ms (01:39.688)
select hll_cardinality(user_uuid, 9) as approx_unique_users from testapp_session;
approx_unique_users
---------------------
13028181
(1 row)
Time: 111210.513 ms (01:51.211)
select hll_cardinality(user_uuid, 10) as approx_unique_users from testapp_session;
approx_unique_users
---------------------
13147406
(1 row)
Time: 126465.312 ms (02:06.465)
hll_cardinality(user_uuid, 11) as approx_unique_users from testapp_session;
approx_unique_users
---------------------
13363203
(1 row)
Time: 160958.580 ms (02:40.959)
The real advantage though is in combination with other properties:
E.g. this query runs 0.37x quicker if using hll_aggregate
than with COUNT(DISTINCT ...)
:
select
date_trunc('day', created) as date_of_session,
count(distinct user_uuid) as unique_users
from testapp_session
group by
date_trunc('day', created)
order by
date_of_session;
date_of_session | unique_users
------------------------+--------------
2023-04-28 00:00:00+00 | 2000000
2023-04-29 00:00:00+00 | 4000000
2023-04-30 00:00:00+00 | 6000000
2023-05-01 00:00:00+00 | 8000000
2023-05-02 00:00:00+00 | 10000000
2023-05-03 00:00:00+00 | 12000000
2023-05-04 00:00:00+00 | 14000000
(7 rows)
Time: 296766.361 ms (04:56.766)
select
date_trunc('day', created) as date_of_session,
hll_cardinality(user_uuid, 9) as unique_users
from
testapp_session
group by
date_trunc('day', created)
order by
date_of_session;
date_of_session | unique_users
------------------------+--------------
2023-04-28 00:00:00+00 | 1963972
2023-04-29 00:00:00+00 | 4096876
2023-04-30 00:00:00+00 | 5869339
2023-05-01 00:00:00+00 | 7412843
2023-05-02 00:00:00+00 | 9401010
2023-05-03 00:00:00+00 | 11443836
2023-05-04 00:00:00+00 | 13028181
(7 rows)
Time: 110217.665 ms (01:50.218)
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file django_pg_simple_hll-0.0.1.tar.gz
.
File metadata
- Download URL: django_pg_simple_hll-0.0.1.tar.gz
- Upload date:
- Size: 9.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.2 CPython/3.9.16 Linux/5.15.0-1036-azure
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | dd0cd67d316c4204bbc1c1b96aea7dd5d441310af01132f3914f91af37d8bdf4 |
|
MD5 | 4cfd00ab679d1bddbc13e7632be8756f |
|
BLAKE2b-256 | 2df8b7308a3ea8dbb72e059a2852a5d37a9660bd4a97c577091f55a29981579b |
Provenance
File details
Details for the file django_pg_simple_hll-0.0.1-py3-none-any.whl
.
File metadata
- Download URL: django_pg_simple_hll-0.0.1-py3-none-any.whl
- Upload date:
- Size: 8.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.2 CPython/3.9.16 Linux/5.15.0-1036-azure
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8a5b68c46174e5ef3bb40b8a1b8f13ae7a0eb6de5eecb8c52680473c5e18011c |
|
MD5 | 0dd2d035571f6d33f246cb08f257be7d |
|
BLAKE2b-256 | 406894b01a953e49edae68c60594045bcf40220ac2c6c2031c203d4780dc6788 |