Skip to main content

Fast and Efficient PostgreSQL data into numpy/pandas

Project description

flaco

Code Style CI PyPI PyPI - Wheel Downloads

Perhaps the fastest* and most memory efficient way to pull data from PostgreSQL into pandas and numpy. 🚀

Have a gander at the initial benchmarks 🏋

flaco tends to use nearly ~3-6x less memory than standard pandas.read_sql and about ~2-3x faster. However, it's probably 50x less stable at the moment. 😜

To whet your appetite, here's a memory profile between flaco, connectorx and pandas.read_sql on a table with 1M rows with columns of various types. (see test_benchmarks.py) *If the data, specifically integer types, has null values, you can expect a bit lower savings what you see here; therefore (hot tip 🔥), supply fill values in your queries where possible via coalesce.

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   118     97.9 MiB     97.9 MiB           1   @profile
   119                                         def memory_profile():
   120     97.9 MiB      0.0 MiB           1       stmt = "select * from test_table"
   121                                         
   122    354.9 MiB    257.0 MiB           1       _cx_df = cx.read_sql(DB_URI, stmt, return_type="pandas")
   123                                         
   124    354.9 MiB      0.0 MiB           1       with Database(DB_URI) as con:
   125    533.9 MiB    178.9 MiB           1           data = read_sql(stmt, con)
   126    541.2 MiB      7.3 MiB           1           _flaco_df = pd.DataFrame(data, copy=False)
   127                                         
   128    545.3 MiB      4.1 MiB           1       engine = create_engine(DB_URI)
   129   1680.9 MiB   1135.5 MiB           1       _pandas_df = pd.read_sql(stmt, engine)

Example

from flaco.io import read_sql, Database


uri = "postgresql://postgres:postgres@localhost:5432/postgres"
stmt = "select * from my_big_table"

with Database(uri) as con:
    data = read_sql(stmt, con)  # dict of column name to numpy array

# If you have pandas installed, you can create a DataFrame
# with zero copying like this:
import pandas as pd
df = pd.DataFrame(data, copy=False)

# If you know the _exact_ rows which will be returned
# you can supply n_rows to perform a single array 
# allocation without needing to resize during query reading.
with Database(uri) as con:
    data = read_sql(stmt, con, 1_000)

Notes

Is this a drop in replacement for pandas.read_sql?

No. It varies in a few ways:

  • It will return a dict of strnumpy.ndarray objects. But this can be passed with zero copies to pandas.DataFrame
  • When querying integer columns, if a null is encountered, the array will be converted to dtype=object and nulls from PostgreSQL will be None. Whereas pandas will convert the underlying array to a float type; where nulls from postgres are basically numpy.nan types.
  • It lacks basically all of the options pandas.read_sql has.

How does this compare with connectorx?

Connectorx is an exceptionally impressive library, and more mature than flaco. They have much wider support for a range of data sources, while flaco only supports postgres for now.

Performance wise, benchmarking seems to indicate flaco is generally more performant in terms of memory, but connectorx is faster when temporal data types (time, timestamp, etc) are used. If it's pure numeric dtypes, flaco is faster and more memory efficient.

Connectorx will make precheck queries to the source before starting to download data. Depending on your application, this can be a significant bottleneck. Specially, some data warehousing queries are very expensive and even doing a LIMIT 1, will cause significant load on the source database.

Flaco will not run any precheck queries. However, you can supply either n_rows or size_hint to flaco.io.read_sql to give either exact, or a hint to reduce the number of allocations/resizing of arrays during data loading.

When in doubt, it's likely you should choose connectorx as it's more mature and offers great performance.

Words of caution

While it's pretty neat this lib can allow faster and less resource intensive use of numpy/pandas against PostgreSQL, it's in early stages of development and you're likely to encounter some sharp edges which include, but not limited to:

  • 📝 Poor/non-existant error messages
  • 💩 Core dumps
  • 🚰 Memory leaks (although I think most are handled now)
  • 🦖 Almost complete lack of exception handling from underlying Rust/C interface
  • 📍 PostgreSQL numeric type should ideally be converted to decimal.Decimal but uses f64 for now; potentially loosing precision. Note, this is exactly what pandas.read_sql does.
  • ❗ Might not handle all or custom arbitrary PostgreSQL types. If you encounter such types, either convert them to a supported type like text/json/jsonb (ie select my_field::text ...), or open an issue if a standard type is not supported.

License

Why did you choose such lax licensing? Could you change to a copy left license, please?

...just kidding, no one would ask that. This is dual licensed under Unlicense and MIT.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

flaco-0.5.0-pp37-pypy37_pp73-win_amd64.whl (1.1 MB view details)

Uploaded PyPy Windows x86-64

flaco-0.5.0-pp37-pypy37_pp73-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.0 MB view details)

Uploaded PyPy manylinux: glibc 2.12+ x86-64

flaco-0.5.0-cp310-cp310-win_amd64.whl (576.9 kB view details)

Uploaded CPython 3.10 Windows x86-64

flaco-0.5.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.12+ x86-64

flaco-0.5.0-cp39-cp39-win_amd64.whl (576.8 kB view details)

Uploaded CPython 3.9 Windows x86-64

flaco-0.5.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.12+ x86-64

flaco-0.5.0-cp38-cp38-win_amd64.whl (576.7 kB view details)

Uploaded CPython 3.8 Windows x86-64

flaco-0.5.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

flaco-0.5.0-cp37-cp37m-win_amd64.whl (575.8 kB view details)

Uploaded CPython 3.7m Windows x86-64

flaco-0.5.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

flaco-0.5.0-cp36-cp36m-win_amd64.whl (575.6 kB view details)

Uploaded CPython 3.6m Windows x86-64

flaco-0.5.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB view details)

Uploaded CPython 3.6m manylinux: glibc 2.12+ x86-64

File details

Details for the file flaco-0.5.0-pp37-pypy37_pp73-win_amd64.whl.

File metadata

  • Download URL: flaco-0.5.0-pp37-pypy37_pp73-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: PyPy, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.8

File hashes

Hashes for flaco-0.5.0-pp37-pypy37_pp73-win_amd64.whl
Algorithm Hash digest
SHA256 f5bc78914530a99d7d9bbaa1033fe65bf42a6b0ee98240b08acdd809186e34e7
MD5 718772345bfcb7d7cc55e76ea2256eb4
BLAKE2b-256 28deed5aba9626626f48a5f719003ff7a4aaf5d5ea8d984e6813597fc7fd1ef7

See more details on using hashes here.

File details

Details for the file flaco-0.5.0-pp37-pypy37_pp73-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for flaco-0.5.0-pp37-pypy37_pp73-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 f58b5361fa00ec9dfe28eff2acf519534302e2860439fcf7d593e13d3d7fef6b
MD5 8d5878b5686df8d3c045b888f236520f
BLAKE2b-256 9f81c7e2a63ee349df36d7f72d76cdc7ed80208e40683b8dbb4a5784a4f3d1c0

See more details on using hashes here.

File details

Details for the file flaco-0.5.0-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: flaco-0.5.0-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 576.9 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.8

File hashes

Hashes for flaco-0.5.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 4e29d5acfa9b7b3ec2dedba347cc1a477f79dabbcd3e917551b4ebff9b0fc7db
MD5 b2ef1e4c5c87ac2d172a5f6c7b0a0c9e
BLAKE2b-256 53e3ffbe3166fc971544b4c6143b7af064c45e0065df8f44e482212d7d521fa0

See more details on using hashes here.

File details

Details for the file flaco-0.5.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for flaco-0.5.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 d040e2be12412ab202ce424a44fa1ce8517ba6b739b143dcc8d938993679065f
MD5 fb0eac7b73c86e37b2b998726dd8f504
BLAKE2b-256 dc02065dbfd05aebdccfa66511d8f82b1469934da6300ab4bd2fd09b1920c3c9

See more details on using hashes here.

File details

Details for the file flaco-0.5.0-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: flaco-0.5.0-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 576.8 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.8

File hashes

Hashes for flaco-0.5.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 ad6ef4e72b7be6ad68118c1cef35f8d8c6e0de3d17191d2dac7e08b0f241ecd2
MD5 48da51a55b69b51cb23b2f3206b8006c
BLAKE2b-256 e24a69a80c551257c7042e87c4243404309558b01e3ece550b0b784630e27139

See more details on using hashes here.

File details

Details for the file flaco-0.5.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for flaco-0.5.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 5d63cf85bc9a2b50a0064d2ef13def482fe789cb7a57950f945b9354db77df81
MD5 ed65dba983f68acdc9595a7b8ff70b38
BLAKE2b-256 4ba4116684d31b51a0a9e4485a8e66c2b0dc168e345bc068336c3fb6567a7cb8

See more details on using hashes here.

File details

Details for the file flaco-0.5.0-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: flaco-0.5.0-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 576.7 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.8

File hashes

Hashes for flaco-0.5.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 9f149edf03ceba019421d4367142a19dbbba8a56aae51c781d520f13f20a2a8c
MD5 8f9730bacec785acc7b072826b26f834
BLAKE2b-256 96e05034bf45badb82ade740f82e1edad0190eed3d7d04f386257ebefc5281bc

See more details on using hashes here.

File details

Details for the file flaco-0.5.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for flaco-0.5.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 356ad9c06ff533f287ba374f46874915218079f7909a9ca597a40b39544935f5
MD5 c0e917cc1e91652dcb8f3a8e769d0ab8
BLAKE2b-256 1ebfe652676b7917ebd0e2bac142305d0148569a1d178dbb6d1dcfbcd315d44b

See more details on using hashes here.

File details

Details for the file flaco-0.5.0-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: flaco-0.5.0-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 575.8 kB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.8

File hashes

Hashes for flaco-0.5.0-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 99aa02da96c2a0e8d12d47dbf7aee854ea63a779d997c390342494460991e92a
MD5 35c056af4352776555a94ec62b6cb601
BLAKE2b-256 e19fc6d63b9fc72f1190f59323be9968910c508de1018de29fd806a25bc1040d

See more details on using hashes here.

File details

Details for the file flaco-0.5.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for flaco-0.5.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 1a5d726b08e502c551664770a5ccf6da6cecc04312bf24a0c23007e12aec55d4
MD5 d44363622fda20d5d76ad134efa15d9e
BLAKE2b-256 1d6df9866c3d5a97687363004708d2f03be48c33be340f725ccde3f3ec916a1c

See more details on using hashes here.

File details

Details for the file flaco-0.5.0-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: flaco-0.5.0-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 575.6 kB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.8

File hashes

Hashes for flaco-0.5.0-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 c95c6e8c87ca2c408730efd739702c86f55dd860743521a06eac2978a7abf929
MD5 d8f1453b8f8848031cd9214423d6c0aa
BLAKE2b-256 69d39c65ee2f739d28e4b4283530e455caef83e62660a1a80341a877053e85d8

See more details on using hashes here.

File details

Details for the file flaco-0.5.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for flaco-0.5.0-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 240c09c729f85f2c5b16fdfaf5447d92526ef81d0876e32d5c6b4cc8e516b3e1
MD5 e41e112c8b35c24639126fe6faf5293f
BLAKE2b-256 e9060c0949ee31cb264c6f1c2dfb2395ff3a4d8ba6e0a70b7b3f1f827a24fc64

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page