Skip to main content

Lightning fast OLAP-style point queries on Pandas DataFrames.

Project description

NanoCube

Lightning fast OLAP-style point queries on Pandas DataFrames.

GitHub license PyPI version PyPI Downloads GitHub last commit unit tests


NanoCube is a minimalistic, in-memory OLAP engine for lightning fast point queries on Pandas DataFrames. As of now, just 27 lines of code are required to transform a Pandas DataFrame into a multi-dimensional OLAP cube. NanoCube shines when point queries need to be executed on a DataFrame, e.g. for financial data analysis, business intelligence or fast web services.

If you believe it would be valuable to extend NanoCube with additional OLAP features and improve its speed - yes, that’s possible - please let me know. You can reach out by opening an issue or contacting me on LinkedIn.

pip install nanocube
import pandas as pd
from nanocube import NanoCube

# create a DataFrame
df = pd.read_csv('sale_data.csv')
value = df.loc[(df['make'].isin(['Audi', 'BMW']) & (df['engine'] == 'hybrid')]['revenue'].sum().item()

# create a NanoCube and run sum aggregated point queries
nc = NanoCube(df)
for i in range(1000):
    value = nc.get('revenue', make=['Audi', 'BMW'], engine='hybrid')

Tip: Only include those columns in the NanoCube setup, that you actually want to query! The more columns you include, the more memory and time is needed for initialization.

df = pd.read_csv('dataframe_with_100_columns.csv')
nc = NanoCube(df, dimensions=['col1', 'col2'], mesaures=['col100'])

Tip: Use dimensions with highest cardinality first. This yields much faster response time when more than 2 dimensions need to be filtered.

nc.get(promo=True, discount=True, customer='4711')  # bad=slower, non-selevtive columns first
nc.get(customer='4711', promo=True, discount=True)  # good=faster, most selective column first 

Lightning fast - really?

For aggregated point queries NanoCube are up to 100x or even 1,000x times faster than Pandas. For this special purpose, NanoCube is even faster than other DataFrame oriented libraries, like Spark, Polars, Modin, Dask or Vaex. If such libraries are a drop-in replacements for Pandas, then you should be able to accelerate them with NanoCube too. Try it and let me know.

NanoCube is beneficial only if multiple point queries (> 10) need to be executed, as the initialization time for the NanoCube needs to be taken into consideration. The more point query you run, the more you benefit from NanoCube.

How is this possible?

NanoCube creates an in-memory multi-dimensional index over all relevant entities/columns in a dataframe. Internally, Roaring Bitmaps (https://roaringbitmap.org) are used. Initialization takes some time, but yields very fast filtering and point queries.

Approach: For each unique value in all relevant dimension columns, a bitmap is created that represents the rows in the DataFrame where this value occurs. The bitmaps can then be combined or intersected to determine the rows relevant for a specific filter or point query. Once the relevant rows are determined, Numpy is used then for to aggregate the requested measures.

NanoCube is a by-product of the CubedPandas project (https://github.com/Zeutschler/cubedpandas) and will be integrated into CubedPandas in the future. But for now, NanoCube is a standalone library that can be used with any Pandas DataFrame for the special purpose of point queries.

What price do I have to pay?

NanoCube is free and MIT licensed. The prices to pay are additional memory consumption, depending on the use case typically 25% on top of the original DataFrame and the time needed for initializing the multidimensional index, typically 50k - 250k rows/sec depending on the number of columns to be indexed and your hardware. The initialization time is proportional to the number of rows in the DataFrame (see below).

You may want to try and adapt the included samples sample.py and benchmarks benchmark.py and benchmark.ipynb to test the behavior of NanoCube on your data.

NanoCube Benchmarks

Using the Python script benchmark.py, the following comparison charts can be created. The data set contains 7 dimension columns and 2 measure columns.

Point query for single row

A highly selective query, fully qualified and filtering on all 7 dimensions. The query will return and aggregates 1 single row. NanoCube is 100x or more times faster than Pandas.

Point query for single row

Point query on high cardinality column

A highly selective, filtering on a single high cardinality dimension, where each member represents ±0.01% of rows. NanoCube is 100x or more times faster than Pandas.

Query on single high cardinality column

Point query aggregating 0.1% of rows

A highly selective, filtering on 1 dimension that affects and aggregates 0.1% of rows. NanoCube is 100x or more times faster than Pandas.

Point query aggregating 0.1% of rows

Point query aggregating 5% of rows

A barely selective, filtering on 2 dimensions that affects and aggregates 5% of rows. NanoCube is consistently 10x faster than Pandas. But you can already see, that the aggregation in Numpy become slightly more dominant.

Point query aggregating 5% of rows

Point query aggregating 50% of rows

A non-selective query, filtering on 1 dimension that affects and aggregates 50% of rows. Here, most of the time is spent in Numpy, aggregating the rows. The more rows, the closer Pandas and NanoCube get as both rely on Numpy for aggregation.

Point query aggregating 50% of rows

NanoCube initialization time

The time required to initialize a NanoCube instance is linear. Depending on the number of dimensions and the cardinality a throughput of 20k to 200k rows/sec can be expected. Almost all time is spent requesting data from Pandas. The initialization of the Roaring Bitmaps taken no time. Likely, a custom file format for NanoCube data would be highly beneficial.

NanoCube initialization time

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nanocube-0.1.3.tar.gz (8.3 kB view details)

Uploaded Source

Built Distribution

nanocube-0.1.3-py3-none-any.whl (8.3 kB view details)

Uploaded Python 3

File details

Details for the file nanocube-0.1.3.tar.gz.

File metadata

  • Download URL: nanocube-0.1.3.tar.gz
  • Upload date:
  • Size: 8.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.9

File hashes

Hashes for nanocube-0.1.3.tar.gz
Algorithm Hash digest
SHA256 ab47e9c0e425f1adfe32a6914126ec7cf8ec443ebb0a59246bd6ab1e58fad073
MD5 9e2f9dd84de96363e8ec28f04334e96d
BLAKE2b-256 b404926d14818de99c157b1744911b68b612b42f3d9bc37cc9df5a600780e2ef

See more details on using hashes here.

File details

Details for the file nanocube-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: nanocube-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 8.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.9

File hashes

Hashes for nanocube-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 1a6eb0afec707b6991b261eedd63e1352d3ac6202f4c2771f7c0aefa22dd15ba
MD5 6467ed904c696cbbf1967871bd81409c
BLAKE2b-256 cb78b870f4f3600a5bb359269dfd7cd25383b068ba1e7bf8c7487c62d2dd5e06

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page