A package to perform optimal univariate microaggregation for various cost functions

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

microagg1d

A Python library which implements different techniques for optimal univariate microaggregation. The two main parameters that determine the runtime are the length n of the input array and minimal class size k. This package offers both O(n) (fast for large k) and O(kn) (fast for small k) algorithms.

The code is written in Python and relies on the numba compiler for speed.

Requirements

microagg1d relies on numpy and numba which currently support python 3.8-3.11.

Installation

microagg1d is available on PyPI, the Python Package Index.

$ pip3 install microagg1d

Example Usage

from microagg1d import univariate_microaggregation

x = [5, 1, 1, 1.1, 5, 1, 5.1]

clusters = univariate_microaggregation(x, k=3)

print(clusters)   # [1 0 0 0 1 0 1]

# explicitly choose method / algorithm
clusters2 = univariate_microaggregation(x, k=3, method="wilber")

print(clusters2)   # [1 0 0 0 1 0 1]

# choose a different cost (sae / sse / roundup / rounddown / maxdist)
clusters3 = univariate_microaggregation(x, k=3, cost="sae")

print(clusters3)   # [1 0 0 0 1 0 1]

Important notice: On first import the the code is compiled once which may take about 30s. On subsequent imports this is no longer necessary and imports are almost instant.

Tests

Tests are in tests/.

# Run tests
$ python3 -m pytest .

Method Details

Currently the package implements the following methods:

"simple" [O(nk), faster for small k]
"wilber" [O(n), faster for larger k]
"galil_park" [O(n), fewer calls to SMAWK]
"staggered" [fastest O(n)]

By default, the package switches between the simple and wilber method depending on the size of k.

Both methods rely on a prefix sum approach to compute the cluster cost. As the prefix sums tend to become very large quite quickly, a slightly slower but numerically more robust method is chosen by default. If your data is small, or you don't need the numeric stability then you may choose to also opt out of stable.

License

The code in this repository has an BSD 2-Clause "Simplified" License.

See LICENSE.

References

Hansen, S.L. and Mukherjee, S., 2003. A polynomial algorithm for optimal univariate microaggregation. IEEE Transactions on Knowledge and Data Engineering
Wilber, R., 1988. The concave least-weight subsequence problem revisited. Journal of Algorithms, 9(3), pp.418-425.
Galil, Z. and Park, K., 1989. A linear-time algorithm for concave one-dimensional dynamic programming.

Project details

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

0.3.2

Jan 4, 2024

0.3.1

Oct 25, 2023

0.2.0

Mar 28, 2023

0.1.2

Mar 21, 2023

0.1.0

Mar 21, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

microagg1d-0.3.2.tar.gz (363.0 kB view hashes)

Uploaded Jan 4, 2024 Source

Built Distribution

microagg1d-0.3.2-py3-none-any.whl (24.5 kB view hashes)

Uploaded Jan 4, 2024 Python 3

Hashes for microagg1d-0.3.2.tar.gz

Hashes for microagg1d-0.3.2.tar.gz
Algorithm	Hash digest
SHA256	`0c8112c9b06bae9e3807428f4c10bfad90b398ae373e2eddc9858faa97f2b3cc`
MD5	`f076c13583348e2d5379fc7cbf783e3e`
BLAKE2b-256	`54d6869351975e6487b5c2fc12a062ab6c4142413044d0b3a6ba9c7288ef104c`

Hashes for microagg1d-0.3.2-py3-none-any.whl

Hashes for microagg1d-0.3.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a3f058b2eb1567f82e2f7fd54706448cf4a4e18ef779b207449cccb9619a04ad`
MD5	`d9f633197e5f8fca80c9953b2ce5bcf7`
BLAKE2b-256	`1c1a67629f3bce44234634797b386f74f7408c9b0f3c6d1ac1bf6960f45fabdf`