Skip to main content

Fast and Efficient PostgreSQL data into numpy/pandas

Project description

flaco

Code Style CI PyPI PyPI - Wheel Downloads

Perhaps the fastest* and most memory efficient way to pull data from PostgreSQL into pandas and numpy. 🚀

Have a gander at the initial benchmarks 🏋

flaco tends to use nearly ~3-6x less memory than standard pandas.read_sql and about ~2-3x faster. However, it's probably 50x less stable at the moment. 😜

To whet your appetite, here's a memory profile between flaco, connectorx and pandas.read_sql on a table with 1M rows with columns of various types. (see test_benchmarks.py) *If the data, specifically integer types, has null values, you can expect a bit lower savings what you see here; therefore (hot tip 🔥), supply fill values in your queries where possible via coalesce.

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   118     97.9 MiB     97.9 MiB           1   @profile
   119                                         def memory_profile():
   120     97.9 MiB      0.0 MiB           1       stmt = "select * from test_table"
   121                                         
   122    354.9 MiB    257.0 MiB           1       _cx_df = cx.read_sql(DB_URI, stmt, return_type="pandas")
   123                                         
   124    354.9 MiB      0.0 MiB           1       with Database(DB_URI) as con:
   125    533.9 MiB    178.9 MiB           1           data = read_sql(stmt, con)
   126    541.2 MiB      7.3 MiB           1           _flaco_df = pd.DataFrame(data, copy=False)
   127                                         
   128    545.3 MiB      4.1 MiB           1       engine = create_engine(DB_URI)
   129   1680.9 MiB   1135.5 MiB           1       _pandas_df = pd.read_sql(stmt, engine)

Example

from flaco.io import read_sql, Database


uri = "postgresql://postgres:postgres@localhost:5432/postgres"
stmt = "select * from my_big_table"

with Database(uri) as con:
    data = read_sql(stmt, con)  # dict of column name to numpy array

# If you have pandas installed, you can create a DataFrame
# with zero copying like this:
import pandas as pd
df = pd.DataFrame(data, copy=False)

# If you know the _exact_ rows which will be returned
# you can supply n_rows to perform a single array 
# allocation without needing to resize during query reading.
with Database(uri) as con:
    data = read_sql(stmt, con, 1_000)

Notes

Is this a drop in replacement for pandas.read_sql?

No. It varies in a few ways:

  • It will return a dict of strnumpy.ndarray objects. But this can be passed with zero copies to pandas.DataFrame
  • When querying integer columns, if a null is encountered, the array will be converted to dtype=object and nulls from PostgreSQL will be None. Whereas pandas will convert the underlying array to a float type; where nulls from postgres are basically numpy.nan types.
  • It lacks basically all of the options pandas.read_sql has.

How does this compare with connectorx?

Connectorx is an exceptionally impressive library, and more mature than flaco. They have much wider support for a range of data sources, while flaco only supports postgres for now.

Performance wise, benchmarking seems to indicate flaco is generally more performant in terms of memory, but connectorx is faster when temporal data types (time, timestamp, etc) are used. If it's pure numeric dtypes, flaco is faster and more memory efficient.

Connectorx will make precheck queries to the source before starting to download data. Depending on your application, this can be a significant bottleneck. Specially, some data warehousing queries are very expensive and even doing a LIMIT 1, will cause significant load on the source database.

Flaco will not run any precheck queries. However, you can supply either n_rows or size_hint to flaco.io.read_sql to give either exact, or a hint to reduce the number of allocations/resizing of arrays during data loading.

When in doubt, it's likely you should choose connectorx as it's more mature and offers great performance.

Words of caution

While it's pretty neat this lib can allow faster and less resource intensive use of numpy/pandas against PostgreSQL, it's in early stages of development and you're likely to encounter some sharp edges which include, but not limited to:

  • 📝 Poor/non-existant error messages
  • 💩 Core dumps
  • 🚰 Memory leaks (although I think most are handled now)
  • 🦖 Almost complete lack of exception handling from underlying Rust/C interface
  • 📍 PostgreSQL numeric type should ideally be converted to decimal.Decimal but uses f64 for now; potentially loosing precision. Note, this is exactly what pandas.read_sql does.
  • ❗ Might not handle all or custom arbitrary PostgreSQL types. If you encounter such types, either convert them to a supported type like text/json/jsonb (ie select my_field::text ...), or open an issue if a standard type is not supported.

License

Why did you choose such lax licensing? Could you change to a copy left license, please?

...just kidding, no one would ask that. This is dual licensed under Unlicense and MIT.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

flaco-0.5.1-pp37-pypy37_pp73-win_amd64.whl (1.1 MB view details)

Uploaded PyPy Windows x86-64

flaco-0.5.1-pp37-pypy37_pp73-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.0 MB view details)

Uploaded PyPy manylinux: glibc 2.12+ x86-64

flaco-0.5.1-cp310-cp310-win_amd64.whl (573.9 kB view details)

Uploaded CPython 3.10 Windows x86-64

flaco-0.5.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.12+ x86-64

flaco-0.5.1-cp39-cp39-win_amd64.whl (573.8 kB view details)

Uploaded CPython 3.9 Windows x86-64

flaco-0.5.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.12+ x86-64

flaco-0.5.1-cp38-cp38-win_amd64.whl (573.7 kB view details)

Uploaded CPython 3.8 Windows x86-64

flaco-0.5.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

flaco-0.5.1-cp37-cp37m-win_amd64.whl (573.2 kB view details)

Uploaded CPython 3.7m Windows x86-64

flaco-0.5.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB view details)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

flaco-0.5.1-cp36-cp36m-win_amd64.whl (578.0 kB view details)

Uploaded CPython 3.6m Windows x86-64

flaco-0.5.1-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.2 MB view details)

Uploaded CPython 3.6m manylinux: glibc 2.12+ x86-64

File details

Details for the file flaco-0.5.1-pp37-pypy37_pp73-win_amd64.whl.

File metadata

  • Download URL: flaco-0.5.1-pp37-pypy37_pp73-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: PyPy, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for flaco-0.5.1-pp37-pypy37_pp73-win_amd64.whl
Algorithm Hash digest
SHA256 b45fc797adc024a3ba3764750d0bcad2dfd263b285cac9de693f75b8a1aa976a
MD5 bf919e57eaa6576932e8bc516dc2033b
BLAKE2b-256 29852413b5dd96961a21ef6295566f456c2e223d9ab9a84a7de19164c6080135

See more details on using hashes here.

File details

Details for the file flaco-0.5.1-pp37-pypy37_pp73-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for flaco-0.5.1-pp37-pypy37_pp73-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 331b2427b81447a14569ac750613701078836931ede082e9a9fcec4c5bd53d8c
MD5 af8e8f28430533402e73fb218a66ed81
BLAKE2b-256 d2683714e14bd2b5da60ced64db6deaed03d74748ce63b33a6c83008ba29f70c

See more details on using hashes here.

File details

Details for the file flaco-0.5.1-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: flaco-0.5.1-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 573.9 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for flaco-0.5.1-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 ca5e94d572251f1ee789ac3801d6c93583d8ac7ab3a7ecf329cb03f020f8defe
MD5 67e0b03d5d1131bc2da79b7dde5ea219
BLAKE2b-256 40fe1655b63bcb99924268dd6763372d2df303bcbb417526c3e579fe8675fbd7

See more details on using hashes here.

File details

Details for the file flaco-0.5.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for flaco-0.5.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 efcdb2e4d6ef4c47a9a642b1242aace8c76c3e257ae96fa316fc3d4c9bd473e4
MD5 c153fb9482d04db931b85587480d3e04
BLAKE2b-256 e64b1388310d2a44ca9b179896e33caa981a65b13fb8de3ff8d70611ac8afe38

See more details on using hashes here.

File details

Details for the file flaco-0.5.1-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: flaco-0.5.1-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 573.8 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for flaco-0.5.1-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 0b976423ff8bcae6e528f6e25fe8bce30cd030033050815ff275e3f542f23055
MD5 a9e360d4c102829f8f63d4de1856d1bc
BLAKE2b-256 acfc0e6e47aa60cbe764a6a2273f6fde6910bd9a69396d387ac7eda38ab47a51

See more details on using hashes here.

File details

Details for the file flaco-0.5.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for flaco-0.5.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 cb5a4882536bd0e864df9e9a91986177ee49e95cc88feabf91719a0ed4b4cd63
MD5 7d8c97296ff98b0dded9aa5dbc71ff9c
BLAKE2b-256 1a23a4a28ad86b8f9949b6cecbbd5dc68fef6074849347310337f68123c9804d

See more details on using hashes here.

File details

Details for the file flaco-0.5.1-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: flaco-0.5.1-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 573.7 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for flaco-0.5.1-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 087a7336050ade71a0e7bee2abfa3c8ccb2b1ab173994e49dbddc24a80f3693c
MD5 3834f4d7e0df4aff179ccb4bbe12f6fa
BLAKE2b-256 b34dad6506d815a09edbcf6610bde26f2762bcb46585baee62f33d274f2ba088

See more details on using hashes here.

File details

Details for the file flaco-0.5.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for flaco-0.5.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 177ea7397115581202a48d141ea25f3a3b6064901f1dd8a0dd4b80c5b103ee63
MD5 5467e7a29be22a585a3d94a93039282d
BLAKE2b-256 bd347580e6050d524651ad37b68d30b81a6d6bd3ca2dca61c8fa0eb21157704b

See more details on using hashes here.

File details

Details for the file flaco-0.5.1-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: flaco-0.5.1-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 573.2 kB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for flaco-0.5.1-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 0280f0af7e8184cf5a3dbadcf92086e824fdd8cf02dbfd02b7a1a58d2c1b3ace
MD5 271dc5ba8b3388bb0fa318fca75871fd
BLAKE2b-256 eb7db58b05d0a89925d3a385b42d94adb65f8594524a5c628fe8185b437d85f3

See more details on using hashes here.

File details

Details for the file flaco-0.5.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for flaco-0.5.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 1dc4145b0a25a6f77480c89c3e021dc75f993fcb12a6643705885e2acdda1e67
MD5 bfeb09ec73452025a1690a0adac16ade
BLAKE2b-256 6c0770caa1dc59f2763375003dfc3a4e18cadd1475b909f7b0bdcd1f41b639c0

See more details on using hashes here.

File details

Details for the file flaco-0.5.1-cp36-cp36m-win_amd64.whl.

File metadata

  • Download URL: flaco-0.5.1-cp36-cp36m-win_amd64.whl
  • Upload date:
  • Size: 578.0 kB
  • Tags: CPython 3.6m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for flaco-0.5.1-cp36-cp36m-win_amd64.whl
Algorithm Hash digest
SHA256 2e7a5161118433ab2067d3470a060377f0f0c3a2b6f81aa4d34ee9a1d2aba8cf
MD5 5850d5e6e1b6dfe2d067730e70310ca6
BLAKE2b-256 8e316df633a11203b96fc2d473dae82b5c4159a9fa5ddffdcd06f7f091cc0f5c

See more details on using hashes here.

File details

Details for the file flaco-0.5.1-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for flaco-0.5.1-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 04499ba0c84b2a91fa6198650ffbaa3b5ecf320bb4bb488b4f29bb4f0553041f
MD5 0109a6f950108e7e36e64b198a10ac53
BLAKE2b-256 2d527793215e89c3037f2d5821462cc7ee139ca25558cd4a8696dd8ce43ce64c

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page