Fast and Efficient PostgreSQL data into numpy/pandas
Project description
flaco
Perhaps the fastest* and most memory efficient way to pull data from PostgreSQL into pandas and numpy. 🚀
Have a gander at the initial benchmarks 🏋
flaco tends to use nearly ~3-6x less memory than standard pandas.read_sql
and about ~2-3x faster. However, it's probably 50x less stable at the moment. 😜
To whet your appetite, here's a memory profile between flaco, connectorx
and pandas.read_sql
on a table with 1M rows with columns of various types.
(see test_benchmarks.py) *If the data,
specifically integer types, has null values, you can expect a bit lower savings
what you see here; therefore (hot tip 🔥), supply fill values in your queries
where possible via coalesce
.
Line # Mem usage Increment Occurences Line Contents
============================================================
118 97.9 MiB 97.9 MiB 1 @profile
119 def memory_profile():
120 97.9 MiB 0.0 MiB 1 stmt = "select * from test_table"
121
122 354.9 MiB 257.0 MiB 1 _cx_df = cx.read_sql(DB_URI, stmt, return_type="pandas")
123
124 354.9 MiB 0.0 MiB 1 with Database(DB_URI) as con:
125 533.9 MiB 178.9 MiB 1 data = read_sql(stmt, con)
126 541.2 MiB 7.3 MiB 1 _flaco_df = pd.DataFrame(data, copy=False)
127
128 545.3 MiB 4.1 MiB 1 engine = create_engine(DB_URI)
129 1680.9 MiB 1135.5 MiB 1 _pandas_df = pd.read_sql(stmt, engine)
Example
from flaco.io import read_sql, Database
uri = "postgresql://postgres:postgres@localhost:5432/postgres"
stmt = "select * from my_big_table"
with Database(uri) as con:
data = read_sql(stmt, con) # dict of column name to numpy array
# If you have pandas installed, you can create a DataFrame
# with zero copying like this:
import pandas as pd
df = pd.DataFrame(data, copy=False)
# If you know the _exact_ rows which will be returned
# you can supply n_rows to perform a single array
# allocation without needing to resize during query reading.
with Database(uri) as con:
data = read_sql(stmt, con, 1_000)
Notes
Is this a drop in replacement for
pandas.read_sql
?
No. It varies in a few ways:
- It will return a
dict
ofstr
➡numpy.ndarray
objects. But this can be passed with zero copies topandas.DataFrame
- When querying integer columns, if a null is encountered, the array will be
converted to
dtype=object
and nulls from PostgreSQL will beNone
. Whereas pandas will convert the underlying array to a float type; where nulls from postgres are basicallynumpy.nan
types. - It lacks basically all of the options
pandas.read_sql
has.
How does this compare with connectorx?
Connectorx is an exceptionally impressive library, and more mature than flaco. They have much wider support for a range of data sources, while flaco only supports postgres for now.
Performance wise, benchmarking seems to indicate flaco is generally more performant in terms of memory, but connectorx is faster when temporal data types (time, timestamp, etc) are used. If it's pure numeric dtypes, flaco is faster and more memory efficient.
Connectorx will make precheck queries
to the source before starting to download data. Depending on your application,
this can be a significant bottleneck. Specially, some data warehousing queries
are very expensive and even doing a LIMIT 1
, will cause significant load on
the source database.
Flaco will not run any precheck queries. However, you can supply either
n_rows
or size_hint
to flaco.io.read_sql
to give either exact, or a
hint to reduce the number of allocations/resizing of arrays during data loading.
When in doubt, it's likely you should choose connectorx as it's more mature and offers great performance.
Words of caution
While it's pretty neat this lib can allow faster and less resource intensive use of numpy/pandas against PostgreSQL, it's in early stages of development and you're likely to encounter some sharp edges which include, but not limited to:
- 📝 Poor/non-existant error messages
- 💩 Core dumps
- 🚰 Memory leaks (although I think most are handled now)
- 🦖 Almost complete lack of exception handling from underlying Rust/C interface
- 📍 PostgreSQL
numeric
type should ideally be converted todecimal.Decimal
but usesf64
for now; potentially loosing precision. Note, this is exactly whatpandas.read_sql
does. - ❗ Might not handle all or custom arbitrary PostgreSQL types. If you encounter
such types, either convert them to a supported type like text/json/jsonb
(ie
select my_field::text ...
), or open an issue if a standard type is not supported.
License
Why did you choose such lax licensing? Could you change to a copy left license, please?
...just kidding, no one would ask that. This is dual licensed under Unlicense and MIT.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distributions
File details
Details for the file flaco-0.5.1-pp37-pypy37_pp73-win_amd64.whl
.
File metadata
- Download URL: flaco-0.5.1-pp37-pypy37_pp73-win_amd64.whl
- Upload date:
- Size: 1.1 MB
- Tags: PyPy, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | b45fc797adc024a3ba3764750d0bcad2dfd263b285cac9de693f75b8a1aa976a |
|
MD5 | bf919e57eaa6576932e8bc516dc2033b |
|
BLAKE2b-256 | 29852413b5dd96961a21ef6295566f456c2e223d9ab9a84a7de19164c6080135 |
File details
Details for the file flaco-0.5.1-pp37-pypy37_pp73-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
.
File metadata
- Download URL: flaco-0.5.1-pp37-pypy37_pp73-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
- Upload date:
- Size: 1.0 MB
- Tags: PyPy, manylinux: glibc 2.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 331b2427b81447a14569ac750613701078836931ede082e9a9fcec4c5bd53d8c |
|
MD5 | af8e8f28430533402e73fb218a66ed81 |
|
BLAKE2b-256 | d2683714e14bd2b5da60ced64db6deaed03d74748ce63b33a6c83008ba29f70c |
File details
Details for the file flaco-0.5.1-cp310-cp310-win_amd64.whl
.
File metadata
- Download URL: flaco-0.5.1-cp310-cp310-win_amd64.whl
- Upload date:
- Size: 573.9 kB
- Tags: CPython 3.10, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | ca5e94d572251f1ee789ac3801d6c93583d8ac7ab3a7ecf329cb03f020f8defe |
|
MD5 | 67e0b03d5d1131bc2da79b7dde5ea219 |
|
BLAKE2b-256 | 40fe1655b63bcb99924268dd6763372d2df303bcbb417526c3e579fe8675fbd7 |
File details
Details for the file flaco-0.5.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
.
File metadata
- Download URL: flaco-0.5.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.10, manylinux: glibc 2.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | efcdb2e4d6ef4c47a9a642b1242aace8c76c3e257ae96fa316fc3d4c9bd473e4 |
|
MD5 | c153fb9482d04db931b85587480d3e04 |
|
BLAKE2b-256 | e64b1388310d2a44ca9b179896e33caa981a65b13fb8de3ff8d70611ac8afe38 |
File details
Details for the file flaco-0.5.1-cp39-cp39-win_amd64.whl
.
File metadata
- Download URL: flaco-0.5.1-cp39-cp39-win_amd64.whl
- Upload date:
- Size: 573.8 kB
- Tags: CPython 3.9, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0b976423ff8bcae6e528f6e25fe8bce30cd030033050815ff275e3f542f23055 |
|
MD5 | a9e360d4c102829f8f63d4de1856d1bc |
|
BLAKE2b-256 | acfc0e6e47aa60cbe764a6a2273f6fde6910bd9a69396d387ac7eda38ab47a51 |
File details
Details for the file flaco-0.5.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
.
File metadata
- Download URL: flaco-0.5.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.9, manylinux: glibc 2.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | cb5a4882536bd0e864df9e9a91986177ee49e95cc88feabf91719a0ed4b4cd63 |
|
MD5 | 7d8c97296ff98b0dded9aa5dbc71ff9c |
|
BLAKE2b-256 | 1a23a4a28ad86b8f9949b6cecbbd5dc68fef6074849347310337f68123c9804d |
File details
Details for the file flaco-0.5.1-cp38-cp38-win_amd64.whl
.
File metadata
- Download URL: flaco-0.5.1-cp38-cp38-win_amd64.whl
- Upload date:
- Size: 573.7 kB
- Tags: CPython 3.8, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 087a7336050ade71a0e7bee2abfa3c8ccb2b1ab173994e49dbddc24a80f3693c |
|
MD5 | 3834f4d7e0df4aff179ccb4bbe12f6fa |
|
BLAKE2b-256 | b34dad6506d815a09edbcf6610bde26f2762bcb46585baee62f33d274f2ba088 |
File details
Details for the file flaco-0.5.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
.
File metadata
- Download URL: flaco-0.5.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.8, manylinux: glibc 2.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 177ea7397115581202a48d141ea25f3a3b6064901f1dd8a0dd4b80c5b103ee63 |
|
MD5 | 5467e7a29be22a585a3d94a93039282d |
|
BLAKE2b-256 | bd347580e6050d524651ad37b68d30b81a6d6bd3ca2dca61c8fa0eb21157704b |
File details
Details for the file flaco-0.5.1-cp37-cp37m-win_amd64.whl
.
File metadata
- Download URL: flaco-0.5.1-cp37-cp37m-win_amd64.whl
- Upload date:
- Size: 573.2 kB
- Tags: CPython 3.7m, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0280f0af7e8184cf5a3dbadcf92086e824fdd8cf02dbfd02b7a1a58d2c1b3ace |
|
MD5 | 271dc5ba8b3388bb0fa318fca75871fd |
|
BLAKE2b-256 | eb7db58b05d0a89925d3a385b42d94adb65f8594524a5c628fe8185b437d85f3 |
File details
Details for the file flaco-0.5.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
.
File metadata
- Download URL: flaco-0.5.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.7m, manylinux: glibc 2.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 1dc4145b0a25a6f77480c89c3e021dc75f993fcb12a6643705885e2acdda1e67 |
|
MD5 | bfeb09ec73452025a1690a0adac16ade |
|
BLAKE2b-256 | 6c0770caa1dc59f2763375003dfc3a4e18cadd1475b909f7b0bdcd1f41b639c0 |
File details
Details for the file flaco-0.5.1-cp36-cp36m-win_amd64.whl
.
File metadata
- Download URL: flaco-0.5.1-cp36-cp36m-win_amd64.whl
- Upload date:
- Size: 578.0 kB
- Tags: CPython 3.6m, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2e7a5161118433ab2067d3470a060377f0f0c3a2b6f81aa4d34ee9a1d2aba8cf |
|
MD5 | 5850d5e6e1b6dfe2d067730e70310ca6 |
|
BLAKE2b-256 | 8e316df633a11203b96fc2d473dae82b5c4159a9fa5ddffdcd06f7f091cc0f5c |
File details
Details for the file flaco-0.5.1-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
.
File metadata
- Download URL: flaco-0.5.1-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.6m, manylinux: glibc 2.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 04499ba0c84b2a91fa6198650ffbaa3b5ecf320bb4bb488b4f29bb4f0553041f |
|
MD5 | 0109a6f950108e7e36e64b198a10ac53 |
|
BLAKE2b-256 | 2d527793215e89c3037f2d5821462cc7ee139ca25558cd4a8696dd8ce43ce64c |