Fast and Efficient PostgreSQL data into numpy/pandas
Project description
flaco
Perhaps the fastest and most memory efficient way to pull data from PostgreSQL into pandas and numpy. 🚀
Have a gander at the initial benchmarks 🏋
flaco tends to use nearly ~3-5x less memory than standard pandas.read_sql
and about ~3x faster. However, it's probably 50x less stable at the moment. 😜
To whet your appetite, here's a memory profile between flaco and pandas.read_sql
on a table with 2M rows with columns of various types. (see test_benchmarks.py)
*If the test data has null values, you can expect a ~3x saving, instead of the ~5x
you see here; therefore (hot tip 🔥), supply fill values in your queries where possible via coalesce
.
Line # Mem usage Increment Occurences Line Contents
============================================================
99 140.5 MiB 140.5 MiB 1 @profile
100 def memory_profile():
101 140.5 MiB 0.0 MiB 1 stmt = "select * from test_table"
102
103
104 140.9 MiB 0.4 MiB 1 with Database(DB_URI) as con:
105 441.8 MiB 300.9 MiB 1 data = read_sql(stmt, con)
106 441.8 MiB 0.0 MiB 1 _flaco_df = pd.DataFrame(data, copy=False)
107
108
109 441.8 MiB 0.0 MiB 1 engine = create_engine(DB_URI)
110 2091.5 MiB 1649.7 MiB 1 _pandas_df = pd.read_sql(stmt, engine)
Example
from flaco.io import read_sql, Database
uri = "postgresql://postgres:postgres@localhost:5432/postgres"
stmt = "select * from my_big_table"
with Database(uri) as con:
data = read_sql(stmt, con) # dict of column name to numpy array
# If you have pandas installed, you can create a DataFrame
# with zero copying like this:
import pandas as pd
df = pd.DataFrame(data, copy=False)
# If you know the _exact_ rows which will be returned
# you can supply n_rows to perform a single array
# allocation without needing to resize during query reading.
with Database(uri) as con:
data = read_sql(stmt, con, 1_000)
Notes
Is this a drop in replacement for
pandas.read_sql
?
No. It varies in a few ways:
- It will return a
dict
ofstr
➡numpy.ndarray
objects. But this can be passed with zero copies topandas.DataFrame
- When querying integer columns, if a null is encountered, the array will be
converted to
dtype=object
and nulls from PostgreSQL will beNone
. Whereas pandas will convert the underlying array to a float type; where nulls from postgres are basicallynumpy.nan
types. - It lacks basically all of the options
pandas.read_sql
has.
Furthermore, while it's pretty neat this lib can allow faster and less resource intensive use of numpy/pandas against PostgreSQL, it's in early stages of development and you're likely to encounter some sharp edges which include, but not limited to:
- 📝 Poor/non-existant error messages
- 💩 Core dumps
- 🚰 Memory leaks (although I think most are handled now)
- 🦖 Almost complete lack of exception handling from underlying Rust/C interface
- 📍 PostgreSQL
numeric
type should ideally be converted todecimal.Decimal
but usesf64
for now; potentially loosing precision. Note, this is exactly whatpandas.read_sql
does. - ❗ Might not handle all or custom arbitrary PostgreSQL types. If you encounter
such types, either convert them to a supported type like text/json/jsonb
(ie
select my_field::text ...
), or open an issue if a standard type is not supported.
License
Why did you choose such lax licensing? Could you change to a copy left license, please?
...just kidding, no one would ask that. This is dual licensed under Unlicense and MIT.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distributions
Hashes for flaco-0.4.0-pp37-pypy37_pp73-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | f4087fd8e87eee2c8457fe69d5693cd2c0ae481dd34ebc28839a8e9a6c0bb7ae |
|
MD5 | 179f89d35a2f9294a70512abb9ce5469 |
|
BLAKE2b-256 | 4a1ca82ded14edcd7eddf9382ab797d5fff2ad4f811f8a6eb007f9926b9ba5d1 |
Hashes for flaco-0.4.0-pp37-pypy37_pp73-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 3eb05c3a0f7af98c6cfcf2acb1c0a6c6a24f20685819bd3cf3eefc57f9a79633 |
|
MD5 | 402ac2676c261bd15736a75e79fe1eec |
|
BLAKE2b-256 | 660d4d67e5c697da01e21bfb6a76c1e84f3aea3bcc7938031bd593c7052814b4 |
Hashes for flaco-0.4.0-cp310-cp310-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 272833b92a89a31ec42fbef63ec2633f625180daf066630824d26573c5c6914e |
|
MD5 | 649b4debeb98b59c9b2981805a8d9598 |
|
BLAKE2b-256 | 27ac7247689242c14c06580c55468112ede3836d4a5a1345e2d5f895163b7501 |
Hashes for flaco-0.4.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | d7e854fa5bba5c0b887123fcd1a0c711bbfe57fe3a5b8bbb41012097597ba97f |
|
MD5 | b85eb29a142d97d1984c54ed890add2a |
|
BLAKE2b-256 | ad6f8fd5be830b1c90d0726a90e7d9ed99197d1da374e6071f088e77b03ca996 |
Hashes for flaco-0.4.0-cp39-cp39-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | ac14a5e2564d2f6d8d678058ad46cee2088b010368f17f5999e990e4d2f1dd34 |
|
MD5 | 6f10980098385469dc40ced3c12f7088 |
|
BLAKE2b-256 | 8002f594bc19271c12e90df876d391feadd6ee74715fda226e3d4d041ff2399a |
Hashes for flaco-0.4.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2d87acec0332e12c95b1937efd4d0f97cfa086e573672b29af8e23831c7bbf74 |
|
MD5 | 1232f7675baec5e50d274fee6bf715b5 |
|
BLAKE2b-256 | 03741b422c5fbca79e333d1275d601dfae49f11e71d8367aa7296573a114594f |
Hashes for flaco-0.4.0-cp38-cp38-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 52b75d1485240de874c233fb73491891646112f5bb5af71f9bbe9d138a7189cf |
|
MD5 | df38754b29103974595031c8169643c8 |
|
BLAKE2b-256 | ad735804e4c81ed2f8159e641cd7eff215f46d09d7538dfd14927bcb59c2ada2 |
Hashes for flaco-0.4.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 66ac7e102a1dd517631695527929a4dfb6ce5ad2b27fbe424c070cf88c049629 |
|
MD5 | 331ace588955f79d4722759d3004151f |
|
BLAKE2b-256 | c7aee609f86808794d08106cc60106a57425b546adeb1970cacc9f1dcf26bc5c |
Hashes for flaco-0.4.0-cp37-cp37m-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 54d3e98b228d3b33d35c3d256d762c2e79bd67c4f8b7fd67dfd11a27ab16573a |
|
MD5 | 3032c313aa05318ee305c3e6f3792445 |
|
BLAKE2b-256 | 48b5347217e8b5e4eb4567c5db6affacb6315137e51cf1c4f486a5ca7811b4ab |
Hashes for flaco-0.4.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6ff01c764d471abb06f736334630fbf231e8739b48531c1622903b5a24e3f38e |
|
MD5 | 59d4b7081802e70128b35502d85c8e70 |
|
BLAKE2b-256 | e05e5954a2157910990ef4eb0a13be5f68de89697184953cddefeb065503e539 |
Hashes for flaco-0.4.0-cp36-cp36m-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | a0c027b45d8c34b8e3f1876494ff66bb91385c2b2b2094e57b0656788d4ee8d9 |
|
MD5 | 4d73d5533c0936b9ce5449fb7f940e3b |
|
BLAKE2b-256 | 758dca635c0ed0593fbea344094bbcfead0ce8d979563b875466f4bce21d8153 |
Hashes for flaco-0.4.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | f8bd8323eef6edfd2591071fac438f0e046af58bcf2741f9695a27482274c52d |
|
MD5 | 048be740185fc77063716bf6c034a5c1 |
|
BLAKE2b-256 | f49f8092e5687862ed9a5f113e960663ebd47d56237a651d8d71953d46f54e07 |