
Fast and efficient loading of PostgreSQL data into numpy/pandas

Project description

flaco


Perhaps the fastest and most memory-efficient way to pull data from PostgreSQL into pandas and numpy. 🚀

Have a gander at the initial benchmarks 🏋

flaco tends to use ~3-5x less memory than standard pandas.read_sql, and is about ~3x faster. However, it's probably 50x less stable at the moment. 😜

To whet your appetite, here's a memory profile comparing flaco and pandas.read_sql on a table of 2M rows with columns of various types (see test_benchmarks.py). *If the test data has null values, you can expect a ~3x saving instead of the ~5x you see here; therefore (hot tip 🔥), supply fill values in your queries where possible via coalesce.

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    99    140.5 MiB    140.5 MiB           1   @profile
   100                                         def memory_profile():
   101    140.5 MiB      0.0 MiB           1       stmt = "select * from test_table"
   102                                         
   103                                             
   104    140.9 MiB      0.4 MiB           1       with Database(DB_URI) as con:
   105    441.8 MiB    300.9 MiB           1           data = read_sql(stmt, con)
   106    441.8 MiB      0.0 MiB           1           _flaco_df = pd.DataFrame(data, copy=False)
   107                                         
   108                                             
   109    441.8 MiB      0.0 MiB           1       engine = create_engine(DB_URI)
   110   2091.5 MiB   1649.7 MiB           1       _pandas_df = pd.read_sql(stmt, engine)
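
Following the hot tip above, here's a minimal sketch of supplying fill values via coalesce (the column names and fill values are hypothetical; adapt them to your schema):

from flaco.io import read_sql, Database

DB_URI = "postgresql://postgres:postgres@localhost:5432/postgres"

# Coalesce nullable columns to concrete fill values so numeric arrays
# keep their native dtypes instead of falling back to dtype=object.
stmt = """
    select
        coalesce(int_col, 0)     as int_col,
        coalesce(float_col, 0.0) as float_col,
        coalesce(text_col, '')   as text_col
    from test_table
"""

with Database(DB_URI) as con:
    data = read_sql(stmt, con)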

Example

from flaco.io import read_sql, Database


uri = "postgresql://postgres:postgres@localhost:5432/postgres"
stmt = "select * from my_big_table"

with Database(uri) as con:
    data = read_sql(stmt, con)  # dict of column name to numpy array

# If you have pandas installed, you can create a DataFrame
# with zero copying like this:
import pandas as pd
df = pd.DataFrame(data, copy=False)

# If you know the _exact_ number of rows which will be returned,
# you can supply n_rows to perform a single array allocation,
# avoiding resizing during query reading.
with Database(uri) as con:
    data = read_sql(stmt, con, 1_000)

Notes

Is this a drop-in replacement for pandas.read_sql?

No. It differs in a few ways:

  • It returns a dict of str -> numpy.ndarray objects, but this can be passed with zero copies to pandas.DataFrame
  • When querying integer columns, if a null is encountered, the array is converted to dtype=object and PostgreSQL nulls become None; pandas instead converts the underlying array to a float dtype, where PostgreSQL nulls become numpy.nan (see the sketch after this list).
  • It lacks basically all of the options pandas.read_sql has.
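
To illustrate the integer-nulls point above, here's a rough sketch of the two resulting arrays (illustrative only, not flaco's actual output):

import numpy as np
import pandas as pd

# flaco: an integer column containing a null comes back as dtype=object,
# with Python None for the PostgreSQL nulls.
flaco_style = np.array([1, None, 3], dtype=object)

# pandas.read_sql: the same column is promoted to a float dtype,
# with PostgreSQL nulls becoming numpy.nan.
pandas_style = pd.Series([1, None, 3], dtype="float64").to_numpy()

print(flaco_style.dtype)   # object
print(pandas_style.dtype)  # float64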

Furthermore, while it's pretty neat that this lib allows faster and less resource-intensive use of numpy/pandas against PostgreSQL, it's in the early stages of development and you're likely to encounter some sharp edges, including but not limited to:

  • 📝 Poor/non-existent error messages
  • 💩 Core dumps
  • 🚰 Memory leaks (although I think most are handled now)
  • 🦖 Almost complete lack of exception handling from underlying Rust/C interface
  • 📍 The PostgreSQL numeric type should ideally be converted to decimal.Decimal, but uses f64 for now, potentially losing precision. Note, this is exactly what pandas.read_sql does.
  • ❗ Might not handle all or arbitrary custom PostgreSQL types. If you encounter such types, either convert them to a supported type like text/json/jsonb (i.e. select my_field::text ...; see the sketch below), or open an issue if a standard type is not supported.
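
For the last two points, a hedged sketch of the text-cast workaround (my_numeric_col and my_enum_col are hypothetical column names):

from decimal import Decimal

from flaco.io import read_sql, Database

uri = "postgresql://postgres:postgres@localhost:5432/postgres"

# Cast problematic columns to text:
#   - my_numeric_col: sidestep the lossy numeric -> f64 conversion
#   - my_enum_col:    a custom type which may not be supported natively
stmt = """
    select
        my_numeric_col::text as my_numeric_col,
        my_enum_col::text    as my_enum_col
    from my_big_table
"""

with Database(uri) as con:
    data = read_sql(stmt, con)

# Rebuild exact decimals from the text column yourself.
decimals = [None if v is None else Decimal(v) for v in data["my_numeric_col"]]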

License

Why did you choose such lax licensing? Could you change to a copyleft license, please?

...just kidding, no one would ask that. This is dual licensed under Unlicense and MIT.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See the tutorial on generating distribution archives.

Built Distributions

flaco-0.4.1-pp37-pypy37_pp73-win_amd64.whl (1.1 MB) - PyPy, Windows x86-64
flaco-0.4.1-pp37-pypy37_pp73-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.0 MB) - PyPy, manylinux: glibc 2.12+ x86-64
flaco-0.4.1-cp310-cp310-win_amd64.whl (580.6 kB) - CPython 3.10, Windows x86-64
flaco-0.4.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB) - CPython 3.10, manylinux: glibc 2.12+ x86-64
flaco-0.4.1-cp39-cp39-win_amd64.whl (580.5 kB) - CPython 3.9, Windows x86-64
flaco-0.4.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB) - CPython 3.9, manylinux: glibc 2.12+ x86-64
flaco-0.4.1-cp38-cp38-win_amd64.whl (580.4 kB) - CPython 3.8, Windows x86-64
flaco-0.4.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB) - CPython 3.8, manylinux: glibc 2.12+ x86-64
flaco-0.4.1-cp37-cp37m-win_amd64.whl (579.5 kB) - CPython 3.7m, Windows x86-64
flaco-0.4.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB) - CPython 3.7m, manylinux: glibc 2.12+ x86-64
flaco-0.4.1-cp36-cp36m-win_amd64.whl (579.3 kB) - CPython 3.6m, Windows x86-64
flaco-0.4.1-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB) - CPython 3.6m, manylinux: glibc 2.12+ x86-64
