A command-line tool to estimate the number of distinct lines in a file/stream using Chakraborty/Vinodchandran/Meel’s approximation algorithm.

These details have not been verified by PyPI

Project links

Project description

aprxc

A command-line tool (and Python class) to approximately count the number of distinct elements in files (or a stream/pipe) using the “simple, intuitive, sampling-based space-efficient algorithm” by S. Chakraborty, N. V. Vinodchandran and K. S. Meel, as described in their 2023 paper Distinct Elements in Streams: An Algorithm for the (Text) Book.

Motivation: Easier to remember, always faster and (much) less memory-intensive than sort | uniq | wc -l or awk '!a[$0]++' | wc -l. In this implementation’s default configuration results are precise until ~83k unique values (on 64-bit CPUs), with deviations of commonly 0.4-1% afterwards.

Installation

Choose your preferred way to install it from PyPI:

pip install aprxc
uv tool install aprxc

Alternatively, run it in an isolated environment, using pipx run or uvx:

pipx run aprxc --help
uvx aprxc --help

Lastly, as aprxc.py has no dependencies besides Python 3.11+, you can simply download the script, run it, put it your PATH, vendor it, etc.

Features and shortcomings

Easier to remember than the pipe constructs.
20–60% faster than sort/uniq.
30–99% less memory-intensive than sort/uniq (for mid/high-cardinality data)
(Roughly double these numbers when compared against awk.)
Memory usage has a (configurable) upper bound.

Now let's address the elephant in the room: These advantages are bought with an inaccuracy in the reported results. But how inaccurate?

About inaccuracy

In its default configuration the algorithm has a mean inaccuracy of about 0.4%, with outliers around 1%. For example, if the script encounters 10M (10_000_000) actual unique values, the reported count is typically ~40k off (e.g. 10_038_680), sometimes ~100k (e.g. 9_897_071).

However: If the number of encountered actual unique elements is smaller than the size of the internally used set data structure, then the reported counts are exact; only once this limit is reached, the approximation algorithm 'kicks in' and the result will become an approximation.

Here's an overview (highly unscientific!) of how the algorithm parameters 𝜀 and 𝛿 (--epsilon and --delta on the command line) affect the inaccuracy. The default of 0.1 for both values seems to strike a good balance (and a memorable inaccuracy of ~1%). Epsilon is the 'main manipulation knob', and you can see quite good how its value affects especially the maximum inaccuracy.

For this first table I counted 10 million unique 32-character strings, and for each iteration checked the reported count and compared to the actual number of unique items. Mean inacc. is the mean inaccuracy across all 10M steps; max inacc. is the highest deviation encountered; memory usage is the linux tool time's reported maxresident; time usage is wall time.

tool (w/ options)	memory [MiB]	time [s]	mean ina.	max ina.	set size
`sort \| uniq \| wc -l`	1541 (100%)	5.5 (100%)	0%	0%	—
`sort --parallel=16`	1848 (120%)	5.2 ( 95%)	0%	0%	—
`sort --parallel=1`	780 ( 51%)	18.1 (328%)	0%	0%	—
`aprxc --epsilon=0.001`	1044 ( 68%)	4.1 ( 75%)	0%	0%	831_863_138
`aprxc --epsilon=0.01`	1137 ( 74%)	5.5 (100%)	0.001%	0.02%	8_318_632
`aprxc --epsilon=0.05`	78 ( 5%)	2.2 ( 40%)	0.080%	0.42%	332_746
`aprxc` (Python 3.13)	35 ( 2%)	1.8 ( 32%)	0.400%	1.00%	83_187
`aprxc` (Python 3.12)	28 ( 2%)	2.0 ( 36%)	0.400%	1.00%	83_187
`aprxc --epsilon=0.2`	26 ( 2%)	1.8 ( 32%)	0.700%	2.10%	20_797
`aprxc --epsilon=0.5`	23 ( 1%)	1.7 ( 31%)	1.700%	5.40%	3_216
`awk '!a[$0]++'\|wc -l`	3094 (201%)	9.3 (169%)	0%	0%	—

Other time/memory consumption benchmarks

For linux-6.11.6.tar, a medium-cardinality (total: 39_361_138, 43.3% unique) input file:

tool (w/ options)	memory [MiB]	time [s]
`sort \| uniq \| wc -l`	6277 (100%)	41.4 (100%)
`sort --parallel=16`	7477 (119%)	36.7 ( 89%)
`sort --parallel=1`	3275 ( 52%)	158.6 (383%)
`aprxc --epsilon=0.001`	2081 ( 33%)	13.1 ( 32%)
`aprxc --epsilon=0.01`	1364 ( 22%)	15.3 ( 37%)
`aprxc --epsilon=0.05`	105 ( 2%)	8.2 ( 20%)
`aprxc` (Python 3.13)	39 ( 1%)	7.2 ( 17%)
`aprxc` (Python 3.12)	35 ( 1%)	8.1 ( 20%)
`aprxc --epsilon=0.2`	27 ( 0%)	7.2 ( 17%)
`aprxc --epsilon=0.5`	23 ( 0%)	7.2 ( 17%)
`awk '!a[$0]++'\|wc -l`	5638 ( 90%)	24.8 ( 60%)

For cut -f 1 clickstream-enwiki-2024-04.tsv, a low-cardinality (total: 34_399_603, unique: 6.4%), once via temporary file¹, once via pipe²:

tool (w/ options)	¹mem [MiB]	¹time [s]	²mem [MiB]	²time [s]
`sort \| uniq \| wc -l`	4823 (100%)	11.6 (100%)	14 (100%)	50.3 (100%)
`sort --parallel=16`	5871 (122%)	11.9 (103%)	14 (100%)	48.7 ( 97%)
`sort --parallel=1`	2198 ( 46%)	50.5 (436%)	10 ( 71%)	48.7 ( 97%)
`aprxc --epsilon=0.001`	214 ( 4%)	10.8 ( 93%)	215 (1532%)	10.7 ( 21%)
`aprxc --epsilon=0.01`	215 ( 4%)	10.5 ( 91%)	215 (1534%)	10.6 ( 21%)
`aprxc --epsilon=0.05`	73 ( 2%)	8.2 ( 71%)	73 (524%)	7.7 ( 15%)
`aprxc` (Python 3.13)	35 ( 1%)	6.4 ( 55%)	36 (254%)	6.5 ( 13%)
`aprxc` (Python 3.12)	29 ( 1%)	7.4 ( 64%)	29 (204%)	7.2 ( 14%)
`aprxc --epsilon=0.2`	27 ( 1%)	5.8 ( 50%)	27 (189%)	5.8 ( 11%)
`aprxc --epsilon=0.5`	23 ( 0%)	6.0 ( 52%)	23 (164%)	6.1 ( 12%)
`awk '!a[$0]++'\|wc -l`	666 ( 14%)	15.7 (136%)	666 (4748%)	14.4 ( 29%)

Is it useful?

You have to accept the inaccuracies, obviously. But if you are working exploratory and don't care about exact number or plan to round or throw them away anyway; or if or you are in a memory-constrained situation and need to deal with large input files or streaming data; or if you just cannot remember the multi-command pipe alternatives, then this might be a tool for you.

The experimental 'top most common' feature

I've added a couple of lines of code to support a 'top most common' items feature. An alternative to the sort | uniq -c | sort -rn | head-pipeline or Tim Bray's nice topfew tool (written in Go).

It kinda works, but…

The counts are good, even surprisingly good, as for the whole base algorithm, but definitely worse and not as reliable as the nice 1%-mean-inaccuracy for the total count case.
I lack the mathematical expertise to prove or disprove anything about that feature.
If you ask for a top 10 (-t10 or --top 10), you mostly get what you expect, but if the counts are close the lower ranks become 'unstable'; even rank 1 and 2 sometimes switch places etc.
Compared with topfew (I wondered if this approximation algorithm could be an optional flag for topfew), this Python code is impressively close to the Go performance, especially if reading a lot of data from a pipe. Unfortunately, I fear that this algorithm is not parallelizable. But I leave that, and the re-implementation in Go or Rust, as an exercise for the reader :)
Just try it!

Command-line interface

usage: aprxc [-h] [--top [X]] [--size SIZE] [--epsilon EPSILON]
             [--delta DELTA] [--cheat] [--count-total] [--verbose] [--version]
             [--debug]
             [path ...]

Aproximately count the number of distinct lines in a file or pipe.

positional arguments:
  path                  Input file path(s) and/or '-' for stdin (default:
                        stdin)

options:
  -h, --help            show this help message and exit
  --top [X], -t [X]     EXPERIMENTAL: Show X most common values. Off by
                        default. If enabled, X defaults to 10.
  --size SIZE, -s SIZE  Expected (estimated) total number of items. Reduces
                        memory usages, increases inaccuracy.
  --epsilon EPSILON, -E EPSILON
  --delta DELTA, -D DELTA
  --cheat               Improve accuracy by tracking 'total seen' and use it
                        as upper bound for result.
  --count-total, -T     Count number of total seen values.
  --verbose, -v
  --version, -V         show program's version number and exit
  --debug               Track, calculate, and display various internal
                        statistics.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

2.0.2

Apr 26, 2025

2.0.0

Nov 2, 2024

1.1.3

Aug 26, 2024

1.1.2

Aug 25, 2024

1.1.1

Aug 24, 2024

1.1.0

Aug 23, 2024

1.0.0

Aug 23, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aprxc-2.0.2.tar.gz (12.4 kB view details)

Uploaded Apr 26, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

aprxc-2.0.2-py3-none-any.whl (13.3 kB view details)

Uploaded Apr 26, 2025 Python 3

File details

Details for the file aprxc-2.0.2.tar.gz.

File metadata

Download URL: aprxc-2.0.2.tar.gz
Upload date: Apr 26, 2025
Size: 12.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.0

File hashes

Hashes for aprxc-2.0.2.tar.gz
Algorithm	Hash digest
SHA256	`a40a9bad4d9a46c966e7646c64946b6ad72a43fd5a14729047e989be129ef019`
MD5	`5678be1b4fa338fb78810ee1c34107f7`
BLAKE2b-256	`82f0c0716bafb3a301bda1415e3e7b0f4f0ceb29f95a64d8e9d2691b3f6b5618`

See more details on using hashes here.

File details

Details for the file aprxc-2.0.2-py3-none-any.whl.

File metadata

Download URL: aprxc-2.0.2-py3-none-any.whl
Upload date: Apr 26, 2025
Size: 13.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.0

File hashes

Hashes for aprxc-2.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`65938fe1622cb6ab15ba1c61f9b3ead123e9c89e62545e3f9477c476dde36f61`
MD5	`6915d4df1f42913213babaddd640c03d`
BLAKE2b-256	`8e3a50af808c9614964d9dd2e161a801876341dece18c711e52545995f23f919`

See more details on using hashes here.

aprxc 2.0.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

aprxc

Installation

Features and shortcomings

About inaccuracy

Other time/memory consumption benchmarks

Is it useful?

The experimental 'top most common' feature

Command-line interface

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes