A library for creating even partitions of ordered items.
Project description
Histoptimizer
Overview
Histoptimizer is a Python library and CLI that accepts a DataFrame or ordered list of item sizes, and produces a list of "divider locations" that partition the items as evenly as possible into a given number of buckets, minimizing the variance and standard deviation between the bucket sizes.
JIT compilation and GPU support through Numba provide great speed improvements on supported hardware.
The problem that motivated its creation was: given a list of the ~3117 counties in the U.S., ordered by some attribute (voting averages, population density, median age, etc.), distribute them into a number of buckets of approximately equal population, as evenly as possible.
That job being done, it is of questionable further use. It is fun to work on, though. So.
Usage
Histoptimizer provides two APIs and two command-line tools:
NumPy array partitioner
Several implementations of the partitioning algorithm can be called directly with a list or array of item sizes and a number of buckets. They return an array of divider locations (dividers come after the given item in 1-based indexing, or before the given item in 0-based indexing) and the variance of the given partition.
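To make the divider convention concrete, here is a minimal pure-Python sketch of one way to solve this kind of problem with dynamic programming. It is not Histoptimizer's own code: the function name `partition_min_variance` is hypothetical, and it assumes every bucket must hold at least one item and that "variance" means the population variance of the bucket totals.

```python
from itertools import accumulate


def partition_min_variance(sizes, num_buckets):
    """Return (dividers, variance) for an ordered list of item sizes.

    Hypothetical illustration only. A divider value d means the first d
    items fall in the buckets before the divider (after item d in 1-based
    indexing, before item d in 0-based indexing).
    """
    n = len(sizes)
    if not 1 <= num_buckets <= n:
        raise ValueError("need between 1 and len(sizes) buckets")

    # prefix[i] is the total size of the first i items.
    prefix = [0.0] + list(accumulate(float(s) for s in sizes))
    mean = prefix[n] / num_buckets
    inf = float("inf")

    # cost[j][i]: smallest sum of squared deviations from the mean when the
    # first i items are placed in j buckets. split[j][i] remembers where the
    # last of those j buckets starts.
    cost = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    split = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    cost[0][0] = 0.0
    for j in range(1, num_buckets + 1):
        for i in range(j, n + 1):
            for x in range(j - 1, i):
                candidate = cost[j - 1][x] + (prefix[i] - prefix[x] - mean) ** 2
                if candidate < cost[j][i]:
                    cost[j][i] = candidate
                    split[j][i] = x

    # Walk the split table backwards to recover the divider locations.
    dividers = []
    i = n
    for j in range(num_buckets, 1, -1):
        i = split[j][i]
        dividers.append(i)
    dividers.reverse()
    return dividers, cost[num_buckets][n] / num_buckets
```

For example, `partition_min_variance([1, 2, 3, 4, 5], 2)` yields the single divider `3`, which splits the items into totals of 6 and 9. This sketch takes time proportional to the number of buckets times the square of the number of items, which is why compiled implementations matter for larger inputs.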
Pandas DataFrame Partitioner
You can supply a Pandas DataFrame, the name of a size column, a list of bucket counts, and a column prefix. The result is a version of the DataFrame with one added column per bucket count. Each value in an added column is the 1-based bucket number of the corresponding item when the items are partitioned into the number of buckets reflected in the column name.
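The mapping from divider locations to per-row bucket numbers can be sketched without Pandas. The helpers below are hypothetical stand-ins, not part of the Histoptimizer API: rows are plain dictionaries rather than a DataFrame, and the `partition_` prefix is only an illustrative choice.

```python
from bisect import bisect_right


def bucket_numbers(num_items, dividers):
    """1-based bucket number for each item, given sorted divider locations.

    A divider value d sits before the item at 0-based index d, so an item's
    bucket is one more than the number of dividers at or before its index.
    """
    return [bisect_right(dividers, item) + 1 for item in range(num_items)]


def add_partition_columns(rows, dividers_by_count, column_prefix="partition_"):
    """Return copies of the rows with one added column per bucket count.

    dividers_by_count maps a number of buckets to its divider locations
    (hypothetical input shape, chosen for this illustration).
    """
    result = [dict(row) for row in rows]
    for count, dividers in dividers_by_count.items():
        numbers = bucket_numbers(len(result), dividers)
        for row, number in zip(result, numbers):
            row[f"{column_prefix}{count}"] = number
    return result
```

With five rows and `dividers_by_count={2: [3]}`, the added `partition_2` column holds `1, 1, 1, 2, 2`.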
CLI
The CLI is a wrapper around the DataFrame functionality that can accept and produce either CSV or Pandas JSON files.
Usage: histoptimizer [OPTIONS] FILE ID_COLUMN SIZE_COLUMN PARTITIONS
Given a CSV, a row name column, a size column, sort key, and a number of
buckets, optionally sort the CSV by the given key, then distribute the
ordered keys as evenly as possible to the given number of buckets.
Example:
> histoptimizer states.csv state_name population 10
Output:
state_name, population, partition_10
Wyoming, xxxxxx, 1
California, xxxxxxxx, 10
Options:
-l, --limit INTEGER
    Take the first {limit} records from the input, rather than the whole file.
-a, --ascending, --asc / -d, --descending, --desc
    If a sort column is provided,
--print-all, --all / --no-print-all, --brief
    Output all columns in input, or with --brief, only output the ID, size, and buckets columns.
-c, --column-prefix TEXT
    Partition column name prefix. The number of buckets will be appended. Defaults to partition_{number of buckets}.
-s, --sort-key TEXT
    Optionally sort records by this column name before partitioning.
-t, --timing / --no-timing
    Print partitioner timing information to stderr.
-i, --implementation TEXT
    Use the named partitioner implementation. Defaults to "dynamic_numba". If you have an NVIDIA GPU, use "cuda" for better performance.
-o, --output FILENAME
    Send output to the given file. Defaults to stdout.
-f, --output-format [csv|json]
    Specify output format. Pandas JSON or CSV. Defaults to CSV.
--help
    Show this message and exit.
Benchmarking CLI
The Benchmarking CLI can be used to produce comparative performance metrics for various implementations of the algorithm.
Usage: histobench [OPTIONS] PARTITIONER_TYPES [ITEM_SPEC] [BUCKET_SPEC]
[ITERATIONS] [SIZE_SPEC]
Histobench is a benchmarking harness for testing Histoptimizer partitioner
performance.
By default it uses random data, and so may not be an accurate benchmark for
algorithms whose performance depends upon the data set.
The PARTITIONER_TYPES parameter is a comma-separated list of partitioners to
benchmark, which can be specified as either:
1. A standard optimizer name, or
2. filepath:classname
To specify the standard cuda module and also a custom variant, for example,
Options:
--debug-info / --no-debug-info
--force-jit / --no-force-jit
--report PATH
--sizes-from PATH
--tables / --no-tables
--verbose / --no-verbose
--help Show this message and exit.
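The kind of comparison described above, timing several partitioners on random data, can be sketched with the standard library alone. This is not histobench's implementation: `benchmark` and `equal_count_partition` are hypothetical names, and the toy baseline simply splits by item count rather than minimizing variance.

```python
import random
import time


def equal_count_partition(sizes, num_buckets):
    """Toy baseline: dividers that put about the same item count in each bucket."""
    n = len(sizes)
    return [round(k * n / num_buckets) for k in range(1, num_buckets)]


def benchmark(partitioners, num_items, num_buckets, iterations=3, seed=0):
    """Time each partitioner on the same random item sizes.

    partitioners maps a display name to a callable taking
    (sizes, num_buckets). Returns one record per (iteration, partitioner).
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    results = []
    for iteration in range(iterations):
        sizes = [rng.uniform(1.0, 100.0) for _ in range(num_items)]
        for name, func in partitioners.items():
            start = time.perf_counter()
            func(sizes, num_buckets)
            elapsed = time.perf_counter() - start
            results.append(
                {
                    "partitioner": name,
                    "iteration": iteration,
                    "num_items": num_items,
                    "num_buckets": num_buckets,
                    "seconds": elapsed,
                }
            )
    return results


if __name__ == "__main__":
    for record in benchmark({"equal_count": equal_count_partition}, 1000, 10):
        print(record)
```

Because the sizes are uniformly random, timings from a harness like this say little about algorithms whose running time depends on the shape of the data.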
JIT SIMD Compilation and CUDA acceleration
Histoptimizer supports just-in-time compilation for both CPUs and NVIDIA CUDA GPUs using Numba. For larger problems these implementations can be hundreds or thousands of times faster than the pure Python implementation.
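As a rough illustration of the Numba approach, the sketch below compiles a small loop-heavy function with `numba.njit`. It is not taken from Histoptimizer: `bucket_variance` and `to_numeric` are hypothetical helpers, and the code falls back to plain Python when Numba or NumPy is not installed so that it still runs.

```python
try:
    import numpy as np
    from numba import njit  # compiles the decorated function on first call
except ImportError:
    np = None

    def njit(func):
        """Fallback: leave the function as ordinary Python."""
        return func


@njit
def bucket_variance(sizes, dividers):
    """Population variance of the bucket totals for the given dividers.

    dividers[k] is the number of items that fall before divider k.
    """
    num_buckets = len(dividers) + 1
    total = 0.0
    for i in range(len(sizes)):
        total += sizes[i]
    mean = total / num_buckets

    variance = 0.0
    start = 0
    for b in range(num_buckets):
        if b < num_buckets - 1:
            end = dividers[b]
        else:
            end = len(sizes)
        bucket_total = 0.0
        for i in range(start, end):
            bucket_total += sizes[i]
        variance += (bucket_total - mean) ** 2
        start = end
    return variance / num_buckets


def to_numeric(values, integer=False):
    """Use typed NumPy arrays when compiling, plain lists otherwise."""
    if np is None:
        return list(values)
    return np.asarray(values, dtype=np.int64 if integer else np.float64)


if __name__ == "__main__":
    sizes = to_numeric([1, 2, 3, 4, 5])
    dividers = to_numeric([3], integer=True)
    print(bucket_variance(sizes, dividers))
```

Explicit loops over typed arrays, as in this sketch, are the style that Numba compiles well. The first call includes compilation time, so any timing comparison should measure later calls.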
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file histoptimizer-0.9.5.tar.gz.
File metadata
- Download URL: histoptimizer-0.9.5.tar.gz
- Upload date:
- Size: 10.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 68dbd5314604f96d38a4d7cdf7c78673954ae04db3dd21e50c0ba15c4fded171
MD5 | e6fd4954f8aae81a6e2c494d955f7a1f
BLAKE2b-256 | 2f852fd44e1207b509f00991caaa3d82b6af96573f79aa22842cde9809bcfe7d
File details
Details for the file histoptimizer-0.9.5-py3-none-any.whl.
File metadata
- Download URL: histoptimizer-0.9.5-py3-none-any.whl
- Upload date:
- Size: 29.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | fcaff140d71b370960fc1bcb42fb6c72069c5eb0260961f95e58ef4aa624906b
MD5 | bfd3b901931fd4c133cddba530afb872
BLAKE2b-256 | 8c200587ad49db9a9d837764aaf9ff04e35e02ad74575d304d986866a746e751