Fast and Efficient PostgreSQL data into numpy/pandas
Project description
flaco
Perhaps the fastest and most memory efficient way to pull data from PostgreSQL into pandas and numpy. 🚀
Have a gander at the initial benchmarks 🏋
flaco tends to use nearly ~3-5x less memory than standard pandas.read_sql
and about ~3x faster. However, it's probably 50x less stable at the moment. 😜
To whet your appetite, here's a memory profile between flaco and pandas.read_sql
on a table with 2M rows with columns of various types. (see test_benchmarks.py)
*If the test data has null values, you can expect a ~3x saving, instead of the ~5x
you see here; therefore (hot tip 🔥), supply fill values in your queries where possible via coalesce
.
Line # Mem usage Increment Occurences Line Contents
============================================================
99 140.5 MiB 140.5 MiB 1 @profile
100 def memory_profile():
101 140.5 MiB 0.0 MiB 1 stmt = "select * from test_table"
102
103
104 140.9 MiB 0.4 MiB 1 with Database(DB_URI) as con:
105 441.8 MiB 300.9 MiB 1 data = read_sql(stmt, con)
106 441.8 MiB 0.0 MiB 1 _flaco_df = pd.DataFrame(data, copy=False)
107
108
109 441.8 MiB 0.0 MiB 1 engine = create_engine(DB_URI)
110 2091.5 MiB 1649.7 MiB 1 _pandas_df = pd.read_sql(stmt, engine)
Example
from flaco.io import read_sql, Database
uri = "postgresql://postgres:postgres@localhost:5432/postgres"
stmt = "select * from my_big_table"
with Database(uri) as con:
data = read_sql(stmt, con) # dict of column name to numpy array
# If you have pandas installed, you can create a DataFrame
# with zero copying like this:
import pandas as pd
df = pd.DataFrame(data, copy=False)
# If you know the _exact_ rows which will be returned
# you can supply n_rows to perform a single array
# allocation without needing to resize during query reading.
with Database(uri) as con:
data = read_sql(stmt, con, 1_000)
Notes
Is this a drop in replacement for
pandas.read_sql
?
No. It varies in a few ways:
- It will return a
dict
ofstr
➡numpy.ndarray
objects. But this can be passed with zero copies topandas.DataFrame
- When querying integer columns, if a null is encountered, the array will be
converted to
dtype=object
and nulls from PostgreSQL will beNone
. Whereas pandas will convert the underlying array to a float type; where nulls from postgres are basicallynumpy.nan
types. - It lacks basically all of the options
pandas.read_sql
has.
Furthermore, while it's pretty neat this lib can allow faster and less resource intensive use of numpy/pandas against PostgreSQL, it's in early stages of development and you're likely to encounter some sharp edges which include, but not limited to:
- 📝 Poor/non-existant error messages
- 💩 Core dumps
- 🚰 Memory leaks (although I think most are handled now)
- 🦖 Almost complete lack of exception handling from underlying Rust/C interface
- 📍 PostgreSQL
numeric
type should ideally be converted todecimal.Decimal
but usesf64
for now; potentially loosing precision. Note, this is exactly whatpandas.read_sql
does. - ❗ Might not handle all or custom arbitrary PostgreSQL types. If you encounter
such types, either convert them to a supported type like text/json/jsonb
(ie
select my_field::text ...
), or open an issue if a standard type is not supported.
License
Why did you choose such lax licensing? Could you change to a copy left license, please?
...just kidding, no one would ask that. This is dual licensed under Unlicense and MIT.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distributions
Hashes for flaco-0.3.0-pp37-pypy37_pp73-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | cfa23161f69f552fc8aae7015917784c95c05800b05016c2d4efaa1d8eef3c5c |
|
MD5 | 844272a0d2209486b26cbe7efedc91e5 |
|
BLAKE2b-256 | b1891c6d67b772a311b8ccee0e4080239e79b641094fb968586252bc93c99696 |
Hashes for flaco-0.3.0-pp37-pypy37_pp73-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | f7587d5264515c36a5a7322a1864e970d103b2f1948c76e14027c74113e1142d |
|
MD5 | 4ad1a7e4213d8b52f13ed3e8866ba6ad |
|
BLAKE2b-256 | 3956a56b0560450bca3eb3b2c8d481d6f588c667a3c03bc81910840a7ff5292c |
Hashes for flaco-0.3.0-cp310-cp310-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 24f92b8c2dc332be896862045132414fbf64d24f25b5750f84e65d0c654d8f63 |
|
MD5 | 798b5fef521e5e985bfc51a9f5722961 |
|
BLAKE2b-256 | 76c93bb76f3c2c7405dd5029271aa47155bada8420ab48e1162a70969b920612 |
Hashes for flaco-0.3.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 250c79d84adc4b0731304eb83e706221649765302823c4f35ac6f161407d6deb |
|
MD5 | 57923719e22b36717965871d529a9f00 |
|
BLAKE2b-256 | 372d7c12b67c5496bfa8e9c26fd605da0be854725c23c3f16991e312a4da8fe5 |
Hashes for flaco-0.3.0-cp39-cp39-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 349606b46c0caf5b02dcbc1a6f82fa1453c8433562fad7acbb2b24525d7904c1 |
|
MD5 | 1112367d6cce4007d3fec7041c263b33 |
|
BLAKE2b-256 | e4581f4a7b472149a02ba90e3d55bb2995758ca624f6160859d7c2732c8fa158 |
Hashes for flaco-0.3.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | a26d3c4ab5d10395cbebee1729080b94a24841a8a057c5ef5db3551a5f46249e |
|
MD5 | af26a8f86d496cd4116608ce3adb67c5 |
|
BLAKE2b-256 | fcc29d4fd496a8a9596fbc424a899bdc1b03341b2c017ea1ef372e7e794f1c3c |
Hashes for flaco-0.3.0-cp38-cp38-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7e93600261e939a2c0c2931cbfda4f7f8f2c72250f1dad9feb34560692f3d690 |
|
MD5 | 8c39eefc1efa480f71ab7d89a591582f |
|
BLAKE2b-256 | 1f55eb9bb522e52717ef53a20dad32b03b8a61c70cd69d35d1cd5f9425072dd2 |
Hashes for flaco-0.3.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | a71e78b2d61c280adb05d91263c15b1d5b3483d309a17819bed3994db1da03a6 |
|
MD5 | aab28f3032fda4eeef5cd48e7b3ab35f |
|
BLAKE2b-256 | 68679db01c5e77125d9673e5cf02bf1ff33881a2b3664c853509193e4c320a64 |
Hashes for flaco-0.3.0-cp37-cp37m-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0be6e7b0abd7a195284a8283fdf4ea31c6f342262e945bd97d4dda6ce8558a50 |
|
MD5 | 701ea6242d5b67b956bc4f2da3bb6d40 |
|
BLAKE2b-256 | 74dd761bd11a2844c33488567414376826869d68d589763fba90505adf7db6b5 |
Hashes for flaco-0.3.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 14d06ff1238d6145798e07118d6a808be55e1cea1cbc48c5bb27dd7f35ca55aa |
|
MD5 | a0122d76b4d66e316e88dd3fd9f235c0 |
|
BLAKE2b-256 | 923163defec08e9dce8985e0e1bdadc248d56f072f53341280cf82e3a7503ff5 |
Hashes for flaco-0.3.0-cp36-cp36m-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | c654ebceaadcafe5f9f2c98ca8dfc4f0a3d19f16ddd93516eb782c933c04d16e |
|
MD5 | 7f6de0e2f6a88969821ba1a3006f8e0d |
|
BLAKE2b-256 | 4b4e8e2599a279aa17329b2aef1874f8e912a40a8c284227ab41fa371ad19d1e |
Hashes for flaco-0.3.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 12fcf9f9e2d40bfceeb18482c7814b02f7bb72531fdae375a26b42c8ea259edd |
|
MD5 | fb92ec53108f29f4634f56f2cbb3bf67 |
|
BLAKE2b-256 | f34d0b6a404bd65d7ab286b94def549cbe07ec5a16dc7a33be518e11bc7ca48c |