A simple library for performing Square Root Decomposition and Aggregation
Project description
Overview
This is pysquagg, a library containing a data structure intended for expediant computation of aggregations on a collection using Square Root Decomposition. The data structure is an extension of Mo's Algorithm. Please see more details in the associated blog post.
Motivation
The principles behind Mo's Algorithm is interesting and useful, but the implementation is a bit cumbersome. This library is intended to make it easier to use the algorithm in Python, plus introduce dynamic behavior, such that a collection can be modified after the data structure is created, and the corresponding blocks + aggregates are updated accordingly.
Details
The list supplied is split into $\left\lfloor \frac{n}{\left\lfloor \sqrt{n} \right\rfloor} \right\rfloor$ blocks of size $\sqrt{n}$, and the aggregate (based on the supplied aggregation function) is computed for each block. When thePySquagg object is queried with a valid start and end index:
- Pre-computed aggregations for all blocks that are entirely within the range are collected
- Elements not fully contained within a block are iterated over and the aggregation function is applied to them
- The pre-computed aggregations and newly computed aggregations are combined and returned
The end result is a query method that is O($\sqrt{n}$), compared to O(n) for a naive implementation.
Additionally, the PySquagg object can be modified after creation (e.g; append, extend, pop), and the blocks are updated accordingly to always be of size $\sqrt{n}$. The aggregates are also computed on the updated blocks.
API & Usage
The API for using pysquagg is simple, as we're only providing a single class PySquagg:
from pysquagg.pysquagg import PySquagg
pysquagg_instance = PySquagg([1, 2, 3, 4, 5, 6], aggregator_function=sum)
pysquagg_instance.blocks # will print [[1, 2], [3, 4], [5, 6]]
pysquagg_instance.aggregated_values # will print [3, 7, 11]
pysquagg_instance.query(0, 5) # will print 21
pysquagg_instance += [7, 8]
pysquagg_instance.blocks # will print [[1, 2], [3, 4], [5, 6], [7, 8]]
pysquagg_instance.append(9)
pysquagg_instance.blocks # will print [[1, 2, 3], [4, 5, 6], [7, 8, 9]] - the block size has been recomputed from 2 -> 3
pysquagg_instance.aggregated_values # will print [6, 15, 24]
pysquagg_instance.query(0, 8) # will print 45
pysquagg_instance.pop()
pysquagg_instance[2] = -1
pysquagg_instance.blocks # will print [[1, 2], [-1, 4], [5, 6], [7, 8]] - block_size has dropped down from 3 -> 2
Parallel Option
If the aggregation function is computationally heavy, there is a parallel flag (default: False) which will use the ProcessPoolExecutor for Python versions below 3.13, and ThreadPoolExecutor for Python versions which are 3.13 and above and has free threading enabled.
from pysquagg.pysquagg import PySquagg
pysquagg_instance = PySquagg([1, 2, 3, 4, 5, 6], aggregator_function=sum, parallel=True)
# do all the same things as above, except now using the corresponding executor
Note: Extensive benchmarking has not been conducted.
Performance Characteristics
Complexity
| Operation | Average Case Time Complexity | Worst Case Time Complexity |
|---|---|---|
query |
O($\sqrt{n}$) | O($\sqrt{n}$) |
append |
O(1) | O(n) |
insert |
O(n) | O(n) |
pop |
O(n) | O(n) |
extend |
O(m) | O(n + m) |
The main reason for other operations being linear in the worst case is the fact that when the collection is modified, the blocks and aggregates need to be recomputed when the square root of the size of the collection changes. Furthermore, as PySquagg is a subclass of list, some of these performance characteristics are inherent.
Benchmarks
Some preliminary benchmarking can be conducted from scripts in the benchmarks directory. One highlight from comparing query to performing computations on the arbitrary slices (using sum as the aggregator function) is:
| Operation | PySquagg (s) | Naive (s) |
|---|---|---|
query |
0.032 | 1.48 |
As derived from a 2023 Macbook Pro M2, 16GB RAM.
Constraints
The aggregator functions need to be associative and commutative, and the data structure is not thread-safe.
TODO
- Identify if we can reduce the runtime of some operations to be sublinear
- Perform more extensive benchmarking
- Incorporate a mechanism for combining aggregator functions, if someone adds two
PySquaggobjects - Add a
LoosePySquaggclass that does not strictly enforce the sqrt(n), which may have some performance benefits for certain operations such asinsertandpopwhich currently require recomputation of blocks and aggregates
💡 Interested in contributing? Check out the Local Development & Contributions Guide.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pysquagg-1.2.0.tar.gz.
File metadata
- Download URL: pysquagg-1.2.0.tar.gz
- Upload date:
- Size: 17.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.0 {"installer":{"name":"uv","version":"0.10.0","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
363072d5c146dd976d6c340182e72d6b30f452eec5e7c143aaf6f429dd153414
|
|
| MD5 |
1d2a23d2d8e7af5130572084d4c514ff
|
|
| BLAKE2b-256 |
990bf2015b7e7ea21215777ef5b5ac333bec7275c61f6e8ad3196bb8067453b3
|
File details
Details for the file pysquagg-1.2.0-py3-none-any.whl.
File metadata
- Download URL: pysquagg-1.2.0-py3-none-any.whl
- Upload date:
- Size: 6.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.0 {"installer":{"name":"uv","version":"0.10.0","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
507847d158b65e34b607a2b5b8beae1161da3219dba03d1e08bd0319b91508ba
|
|
| MD5 |
0ec7b4039e4f98bdd372e60f571bd527
|
|
| BLAKE2b-256 |
cdda3d9a1288995117c31373d787405274ec81a45298bd31302c035349ecd343
|