Fast and Efficient PostgreSQL data into numpy/pandas
Project description
flaco
Perhaps the fastest and most memory efficient way to pull data from PostgreSQL into pandas and numpy. 🚀
Have a gander at the initial benchmarks 🏋
flaco tends to use nearly ~3-5x less memory than standard pandas.read_sql
and about ~3x faster. However, it's probably 50x less stable at the moment. 😜
To whet your appetite, here's a memory profile between flaco and pandas.read_sql
on a table with 2M rows with columns of various types. (see test_benchmarks.py)
*If the test data has null values, you can expect a ~3x saving, instead of the ~5x
you see here; therefore (hot tip 🔥), supply fill values in your queries where possible via coalesce
.
Line # Mem usage Increment Occurences Line Contents
============================================================
99 140.5 MiB 140.5 MiB 1 @profile
100 def memory_profile():
101 140.5 MiB 0.0 MiB 1 stmt = "select * from test_table"
102
103
104 140.9 MiB 0.4 MiB 1 with Database(DB_URI) as con:
105 441.8 MiB 300.9 MiB 1 data = read_sql(stmt, con)
106 441.8 MiB 0.0 MiB 1 _flaco_df = pd.DataFrame(data, copy=False)
107
108
109 441.8 MiB 0.0 MiB 1 engine = create_engine(DB_URI)
110 2091.5 MiB 1649.7 MiB 1 _pandas_df = pd.read_sql(stmt, engine)
Example
from flaco.io import read_sql, Database
uri = "postgresql://postgres:postgres@localhost:5432/postgres"
stmt = "select * from my_big_table"
with Database(uri) as con:
data = read_sql(stmt, con) # dict of column name to numpy array
# If you have pandas installed, you can create a DataFrame
# with zero copying like this:
import pandas as pd
df = pd.DataFrame(data, copy=False)
# If you know the _exact_ rows which will be returned
# you can supply n_rows to perform a single array
# allocation without needing to resize during query reading.
with Database(uri) as con:
data = read_sql(stmt, con, 1_000)
Notes
Is this a drop in replacement for
pandas.read_sql
?
No. It varies in a few ways:
- It will return a
dict
ofstr
➡numpy.ndarray
objects. But this can be passed with zero copies topandas.DataFrame
- When querying integer columns, if a null is encountered, the array will be
converted to
dtype=object
and nulls from PostgreSQL will beNone
. Whereas pandas will convert the underlying array to a float type; where nulls from postgres are basicallynumpy.nan
types. - It lacks basically all of the options
pandas.read_sql
has.
Furthermore, while it's pretty neat this lib can allow faster and less resource intensive use of numpy/pandas against PostgreSQL, it's in early stages of development and you're likely to encounter some sharp edges which include, but not limited to:
- 📝 Poor/non-existant error messages
- 💩 Core dumps
- 🚰 Memory leaks (although I think most are handled now)
- 🦖 Almost complete lack of exception handling from underlying Rust/C interface
- 📍 PostgreSQL
numeric
type should ideally be converted todecimal.Decimal
but usesf64
for now; potentially loosing precision. Note, this is exactly whatpandas.read_sql
does. - ❗ Might not handle all or custom arbitrary PostgreSQL types. If you encounter
such types, either convert them to a supported type like text/json/jsonb
(ie
select my_field::text ...
), or open an issue if a standard type is not supported.
License
Why did you choose such lax licensing? Could you change to a copy left license, please?
...just kidding, no one would ask that. This is dual licensed under Unlicense and MIT.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distributions
Hashes for flaco-0.4.1-pp37-pypy37_pp73-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | dafb35db113a4db6799fdf81cf127d3dd86a827aaa66571382e60d8673c6caf4 |
|
MD5 | 92e3a7ba062bc9c39ca5ec53fcf66575 |
|
BLAKE2b-256 | 84531f58fc7a76b7db6e20e65c127facbfa35a7ba26df389997f533bdc498566 |
Hashes for flaco-0.4.1-pp37-pypy37_pp73-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 3bb65a4d9a93d2af5bcf67e8c9f432d0c9e71b224c00789715e6d59b7d199e5b |
|
MD5 | 3d6d399606ab400d41aaf78aa92a33f2 |
|
BLAKE2b-256 | 1d67b03546d907219e17b9392a8fd34e177ce4d2d11e1921494035f25b1ba49e |
Hashes for flaco-0.4.1-cp310-cp310-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 50e23943457b8ef18a1a7712695a8f3df17061ee76c0c4f6a07192816d2170d7 |
|
MD5 | bce90cbdaf9fade78a12742c9444905f |
|
BLAKE2b-256 | 6eb7495c78365da682aebd7fb29c5af08363c703622f9acaa6898a85ca34ed8d |
Hashes for flaco-0.4.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | e5f0a226f401f3f6c0547a859ab64b5f3f1b04ea0c1461e6d8fb2e470804a029 |
|
MD5 | da4069ca89b600e1762478c963ac0681 |
|
BLAKE2b-256 | 0e01c3d3cc05e1c077577b032d7872d26bf02998204f5d2591efbe18f21e15bf |
Hashes for flaco-0.4.1-cp39-cp39-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7bd326376f19260370eceadd8f35761d82477e08cae890de573ff65ee9661a8c |
|
MD5 | 1f1ba254f353527aefdf0df77410922b |
|
BLAKE2b-256 | 2eb0877199c6924452f1e6abe1235fc92c7ea2255748a4f6ea1e1c34cc1fea60 |
Hashes for flaco-0.4.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 777a1870965093a2b00cfe040963a1f833acd55b9e1e4756184f120910c8e2ec |
|
MD5 | 8a308879e24e55ab1e7b3e53ade1757b |
|
BLAKE2b-256 | 2bf94afa950b23393acf744a2cfc101993b5023bd45acbb04336d3b69b41e62e |
Hashes for flaco-0.4.1-cp38-cp38-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 94ad5a399dcaeed377c87426ec98b4a4b99cb4fbfda0d4c470b18df83a0ac8d8 |
|
MD5 | 34166ccadc51a9a72c63821f729afa3d |
|
BLAKE2b-256 | 8d8f82078b38f762439f8697fb0f26012d63cd36c4b45d4299fe2e2c75d281f4 |
Hashes for flaco-0.4.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2669d20d0e16901992f029aba45019047aa7d8d89560492afa1437ea47e4c60e |
|
MD5 | 3e85aa4125798e27b199ed7b5d44d116 |
|
BLAKE2b-256 | 4f0fcaf90ee1af97c45207e619358d10b77f156e40d614117fc19d4e40bb862c |
Hashes for flaco-0.4.1-cp37-cp37m-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | d75b1ee447c36f15309495d05b5fb70b41d3f3c6ddbb921d819b36e8bc1b0b51 |
|
MD5 | 747ffe534420601798f754d945775b31 |
|
BLAKE2b-256 | 544ad87a7d84148e1fb8ea0561df5d4ee49246a75441dbbb3282d6a4d1941af6 |
Hashes for flaco-0.4.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2e997b72ef968eebd55c687b7148696d112c5d6a6936925257173c7277ea50a3 |
|
MD5 | ba115c4e545b6909feb4bd5103809c5c |
|
BLAKE2b-256 | 8b9c1f2ab0b32fa4711ff1ef390ed92f25518875b353a4b1008d906d6191b8d5 |
Hashes for flaco-0.4.1-cp36-cp36m-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 922b7d3ece29667340790b475d65fea7e31f2eebfb525427d68c5d97f15447bd |
|
MD5 | b4ce491e46a0ff9fee34e36a3eb93dca |
|
BLAKE2b-256 | b7990342552c50c6822d8f737d88c7080209a8082c4ef825ae82d93963fce9d2 |
Hashes for flaco-0.4.1-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 3ab8ea451c6e8ee5f90b95c59c92740a40b2750ac9e4416173c2660c92c56f79 |
|
MD5 | 636dc3d5ef0b968cb2a2f89c2b8a912a |
|
BLAKE2b-256 | 1f2a4fab219d49aeebae3bb8db6dfcf1c3925a90488476418abdaf21b19dfe27 |