Skip to main content

Adaptive ultra-sort: auto-dispatching, correctness-first sorting that beats np.sort by a mile on the array fast path

Project description

quill-sort

Adaptive Ultra-Sort — Quill auto-dispatches to the fastest correct sort for your data and your machine, and never loses to the baseline.

import numpy as np
import quill

# The list API: a drop-in for sorted() / list.sort(), ~3-4x faster on numeric.
quill.quill_sort([3, 1, 4, 1, 5, 9])              # → [1, 1, 3, 4, 5, 9]
quill.quill_sort(records, key=lambda r: r['age']) # sort objects by a key

# The array API: the "by a mile" path — dispatches to Rust radix / GPU / polars.
a = np.random.randint(0, 2**40, 20_000_000)
quill.sort_array(a)                                # ~5x faster than np.sort

The one thing to know: list API vs. array API

Quill has two doors, and which one you use determines how big the win is.

API Input / output Speed (numeric, this machine) Why
quill_sort(list) list in, list out matches np.sort, ~3.8x vs sorted() / list.sort() It does np.asarray(...) then .tolist() around a fast kernel. Those two conversions dominate the runtime (Amdahl's law), so the kernel speedup is mostly hidden — you get np.sort-class throughput, not the "by a mile" number.
quill.sort_array(ndarray) ndarray in, ndarray out ~5.3x int64, ~3.3x float64, ~4x GPU vs np.sort Zero conversions. It dispatches the raw buffer straight to the fastest installed backend. This is where Quill actually pulls away from numpy.

Plain version: if your data starts and ends as a Python list, the conversion wall caps the win — quill_sort is np.sort-fast and ~4x faster than the built-in sorted(), which is already a great deal, but it is not 5x. If you keep your data in numpy arrays and call sort_array, you get the full multi-threaded / GPU speedup. Don't believe any library that claims a 5x list-in / list-out sort — the conversion alone forbids it.

All numbers below are measured on this development box (Windows 11, 28-core CPU, RTX 4060 Ti, numpy 2.4.6) with a freshly-shuffled array on every timed run, so they reflect real cold-sort throughput, not warm/already-sorted artifacts. Your mileage will vary with CPU, GPU, array size, and dtype.


Installation

pip install quill-sort              # core — ZERO required deps, always correct
pip install quill-sort[fast]        # + numpy + psutil (the baseline fast path)
pip install quill-sort[polars]      # + polars  (no-compile multi-threaded sort)
pip install quill-sort[gpu]         # + cupy    (NVIDIA GPU sort)
pip install quill-sort[all]         # numpy + pandas + psutil + polars

The compiled Rust backend (the ~5x path) ships as a pre-built binary wheel for common platforms — pip install quill-sort just gets it, no Rust toolchain needed. If no wheel matches your platform, pip falls back to the pure-Python sdist and Quill transparently uses the next-fastest backend (GPU → polars → numpy → np.sort). The install never fails for lack of a compiler.


How it picks (the dispatch chain)

sort_array() profiles the array and walks a priority chain, taking the first backend that is installed and supports the array. Every step has a measured crossover (min_n) below which it isn't worth it, and any backend error falls straight back to np.sort — correctness is never at risk.

sort_array(ndarray)
        │
        ▼
  eligible?  (1-D, dtype kind i/u/f, itemsize ≤ 8, value-only, C-contiguous)
        │ no ──────────────────────────────────────────────► np.sort
        │ yes
        ▼
  dense bounded int64/uint64?  ──► counting sort (np.bincount)   ~1.7-2.8x, single-thread
        │ no
        ▼
  ┌──────────────────────────────────────────────────────────┐
  │  try, in priority order, the first available + supporting │
  │    rust_voracious   (compiled radix, n ≥ 1M)              │
  │    cupy_gpu         (GPU, n ≥ 2M, fits in free VRAM)      │
  │    polars           (no-compile MT sort, n ≥ 200k)        │
  │    numpy_parallel   (int-only partition sort, n ≥ 5M)     │
  └──────────────────────────────────────────────────────────┘
        │ any error / nothing eligible
        ▼
  np.sort   ← the floor; always correct, never slower than numpy

NaN is stripped before any backend runs and re-appended at the end (numpy convention), so a backend that can't order NaN never sees one. Descending is a post-sort reverse. quill.available_backends() shows what your machine will actually use, in this order.


Backends

Backend How to get it Measured (int64, vs np.sort) Notes
rust_voracious pip install quill-sort (bundled wheel) ~5.3x (float64 ~3.3x) Compiled PyO3 + voracious_radix_sort, multi-threaded radix. Fastest measured path; leads the chain.
cupy_gpu pip install quill-sort[gpu] ~4x (float64 ~2.5x) GPU radix via CuPy. Counts the host→device→host round trip and still wins for large arrays that fit in VRAM.
polars pip install quill-sort[polars] ~2.3x (float64 ~1.7x) Delegates to polars' multi-threaded Rust sort. No compiler needed — the "fast for everyone" fallback.
numpy_parallel pip install quill-sort[fast] ~1.1x (int only) Thread-parallel np.partition sample-sort. Small but reliable int win; deliberately uses few workers (the gain saturates at memory bandwidth).
counting sort pip install quill-sort[fast] ~1.7-2.8x np.bincount for dense bounded int64/uint64. O(n+k), single-threaded, beats every comparison backend in its niche.
np.sort always (with numpy) 1.0x (the floor) The never-lose baseline. Reached on any error or ineligible dtype.

Without numpy at all, Quill still sorts correctly via the stdlib Timsort — it is simply slower. Correctness has zero required dependencies.


The array API — sort_array()

import numpy as np
import quill

a = np.random.randint(0, 2**40, 20_000_000)

s = quill.sort_array(a)                      # sorted COPY (a untouched), fastest backend
quill.sort_array(a, inplace=True)            # sort a IN PLACE, returns a
d = quill.sort_array(a, descending=True)     # reverse order

quill.available_backends()
# → ['rust_voracious', 'cupy_gpu', 'polars', 'numpy_parallel']  (priority order)

sort_array matches np.sort exactly — including negatives, mixed int/float promotion, and NaN-to-end — and is never slower than np.sort on any input.

Top-k — quill_topk()

When you only need the k smallest (or largest), don't sort the whole array. quill_topk uses numpy's argpartition (introselect, O(n)) instead of a full O(n log n) sort:

quill.quill_topk(scores, 10)                  # 10 smallest, sorted ascending
quill.quill_topk(scores, 10, largest=True)    # 10 largest,  sorted descending
quill.quill_topk(rows, 5, key=lambda r: r.size)

Measured ~8x faster than sorted(data, reverse=True)[:10] for k=10 over 10M elements on this machine.

analyze() — inspect the profile

quill.analyze([3, 1, 4, 1, 5, 9])
# {'n': 6, 'dtype': 'int_pos', 'presorted': False, 'dense': True, ...}

The list API — quill_sort() / quill_sorted()

A drop-in replacement for sorted() / list.sort() that routes numeric lists through the fast numeric kernel and everything else through Timsort. It returns a list, so it pays the conversion tax described above — expect np.sort-class speed and ~3-4x over the built-ins on numeric data, not the array-API number.

quill.quill_sort([3, 1, 4, 1, 5, 9])              # in-place, returns the list
quill.quill_sort(data, key=lambda x: x['score'])  # objects via key
quill.quill_sort(data, reverse=True)              # descending
quill.quill_sort(data, inplace=False)             # return a new list, leave input alone
quill.quill_sort(data, parallel=True)             # force multi-core
quill.quill_sort(data, stable=False)              # unstable, max speed on numeric
result = quill.quill_sorted(iterable)             # non-mutating, mirrors sorted()
quill.quill_sort(
    data,                        # list, generator, range, ndarray, Series, DataFrame
    key=None,                    # sort key function (like sorted())
    reverse=False,               # descending order
    inplace=True,                # mutate in place (False → return a new list)
    parallel=False,              # use multiple cores (auto on big numeric data)
    high_performance_mode=False, # skip the interactive prompt on the external-sort path
    silent=False,                # suppress status output
    stable=True,                 # True = matches sorted() exactly; False = max speed
    stats=False,                 # return (sorted_list, stats_dict)
)

stable parameter

stable=True (default) guarantees equal elements keep their original relative order — identical to Python's sorted(). stable=False allows the benchmark-optimal unstable kernels on numeric data for extra speed when order among equal elements doesn't matter.

quill.quill_sort(data)                 # stable, safe for all use cases
quill.quill_sort(data, stable=False)   # unstable, faster for large int32/int64

stats — timing

result, stats = quill.quill_sort(data, stats=True)
# stats = {'time_ms': 12.3, 'n': 1000000}

The never-lose guarantee

Every fast path in Quill is wrapped so that any failure — a missing backend, a GPU OOM, a compiled-extension panic, an unsupported dtype — falls back to np.sort (or stdlib Timsort if numpy is absent). The Rust extension is built with panic = "unwind", so even a native panic becomes a catchable Python exception and the process never aborts. You can never get a result that is wrong or a sort that is slower than the numpy baseline.

# To debug a backend instead of silently falling back, surface the error:
QUILL_BACKEND_DEBUG=1 python your_script.py

Correctness contract

Quill's output matches sorted() / np.sort on every supported input:

None values sort to the end (start with reverse=True):

quill.quill_sort([3, None, 1, None, 2])
# → [1, 2, 3, None, None]

NaN values in float data go to the end (start with reverse=True), matching numpy's convention:

quill.quill_sort([3.0, float('nan'), 1.0, 2.0])
# → [1.0, 2.0, 3.0, nan]

Mixed int/float lists are promoted to float64 and take the fast float path (not the slow object fallback):

quill.quill_sort([1, 2.5, 3, 4.0])
# → [1, 2.5, 3, 4.0]

Negative integers are handled natively (two's-complement radix; no object fallback), and keyed sorts are stable by default.


Supported types

  • int, float, str, bytes — native fast paths
  • Mixed int + float — promoted to float64
  • Negative integers — handled natively
  • None values — sorted to the end (start with reverse=True)
  • float('nan') — sorted to the end (start with reverse=True)
  • numpy.ndarray — sorted via the backend chain (use sort_array for the full win)
  • pandas.Series — sorted, returned as a new Series
  • pandas.DataFrame — sorted by column(s) via key='column_name'
  • Any generator / iterator — materialised to a list, sorted, returned

Plugin system

Teach Quill to sort your own types. A plugin declares which types it handles and a prepare() that converts them to something Quill sorts natively, plus an optional postprocess to rebuild your objects.

from quill import register_plugin, QuillPlugin

class MyPlugin(QuillPlugin):
    handles = (MyCustomClass,)
    name    = "my_custom_class"

    @staticmethod
    def prepare(data, key, reverse):
        items       = [x.value for x in data]
        postprocess = lambda sorted_vals: [MyCustomClass(v) for v in sorted_vals]
        return items, key, postprocess   # (list_to_sort, key_fn, postprocess_fn)

register_plugin(MyPlugin)
quill.quill_sort(list_of_my_objects)

Built-in plugins cover numpy.ndarray, pandas.Series, pandas.DataFrame, range, and any generator/iterator. You can also drop in a custom backend (not just a type plugin) with quill._backends.register_backend(...) if you have a faster kernel for some dtype.


CLI / demo / setup

python -m quill       # animated benchmark demo across data types
quill                 # same, if installed via pip
quill setup           # calibration wizard: measures the parallel/GPU crossovers
                      #   on YOUR box and persists them to ~/.quill/config.json

Quill also ships a small visualizer/wizard and add-on helpers layered on the core engine (the calibration wizard above, profile inspection via analyze(), and the quill_topk add-on). They are optional conveniences — the core sort API needs none of them.


How it works (the list path, in detail)

For a list, Quill profiles the data at intake and routes to the best strategy:

Data shape Strategy Complexity
Dense bounded integer range np.bincount counting sort O(n + k)
int64 / float64 (value-only) fastest backend via the array chain ~O(n) radix / O(n log n)
Mixed int/float promote to float64, then numeric path O(n log n)
Strings / bytes Python Timsort O(n log n)
Objects with a key extract keys → numpy argsort (stable) O(n log n)
Nearly sorted (asc_ratio > 0.90) Timsort bypass ~O(n)
Larger than RAM external radix-partition + parallel bucket sort disk-bound

The conversion to/from numpy is the fixed cost that caps the list-path speedup — see the very first table. For maximum throughput, keep your data in arrays and use sort_array.


Performance tuning

  • Install the right extra. [fast] (numpy + psutil) is the baseline; add [polars] for a no-compile multi-threaded sort, or [gpu] if you have an NVIDIA card. The bundled Rust wheel already gives you the top path.
  • Use sort_array on arrays. It skips the conversion wall entirely.
  • stable=False on the list path for extra speed on numeric data when you don't need stable ordering.
  • quill setup measures your machine's crossovers so auto-parallel and GPU engage exactly where they win.
  • Install psutil for accurate RAM sensing (avoids spurious external-sort triggers); without it, Quill assumes 2 GB available.

Troubleshooting

  • "quill_sort(list) isn't 5x faster." It can't be — the asarray/tolist round trip dominates (Amdahl). It is np.sort-fast and ~3-4x over sorted(). For the big win, keep arrays and call sort_array.
  • "sort_array isn't using the GPU / Rust path." Check quill.available_backends(). The backend may not be installed, the array may be below its crossover (min_n), or the dtype/itemsize may be ineligible. Set QUILL_BACKEND_DEBUG=1 to surface the real reason instead of falling back silently.
  • "GPU path errors on a huge array." It only engages when the array fits in free VRAM with headroom; otherwise it falls back to the CPU radix or np.sort.
  • "Equal elements reordered." Use stable=True (the default). stable=False is explicitly allowed to reorder equal elements.
  • "External sort triggered unexpectedly." Install psutil for accurate RAM sensing; pass high_performance_mode=True to skip the interactive prompt.

Development

git clone https://github.com/invariant-games/quill.git
cd quill
pip install -e ".[all,dev]"
pytest                          # full suite
pytest -m slow                  # parallel / large-data tests
python -m quill                 # animated benchmark demo

# Build the compiled Rust backend locally (optional; CI ships wheels):
cd rustext && maturin develop --release

The CI workflow .github/workflows/wheels.yml builds the rustext crate into abi3 binary wheels (one wheel per platform covers Python ≥ 3.8) for Windows, manylinux x86_64/aarch64, and macOS x86_64/arm64, plus a pure-Python sdist as the never-fail-install fallback.


Requirements

  • Python 3.8+
  • numpy — optional but strongly recommended (pip install quill-sort[fast])
  • psutil — optional, for accurate RAM sensing
  • polars — optional backend (pip install quill-sort[polars])
  • cupy — optional GPU backend (pip install quill-sort[gpu], NVIDIA only)
  • pandas — optional (Series/DataFrame support)

A C/Rust toolchain is not required to install — the compiled backend is shipped as a pre-built wheel, with a pure-Python sdist fallback.


License

MIT — by Isaiah Tucker

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

quill_sort-6.0.6.tar.gz (89.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

quill_sort-6.0.6-py3-none-any.whl (88.2 kB view details)

Uploaded Python 3

File details

Details for the file quill_sort-6.0.6.tar.gz.

File metadata

  • Download URL: quill_sort-6.0.6.tar.gz
  • Upload date:
  • Size: 89.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for quill_sort-6.0.6.tar.gz
Algorithm Hash digest
SHA256 237687a92edc2e2a331fe48131e4a4f63db20f7fbacd1cfe2a727549c49ae1fd
MD5 ea8f8307616db6b4442b76624cd38df6
BLAKE2b-256 8a4b1880e6df643cbc70c058a475167892c5ee64091ea360421067c058b4e5fc

See more details on using hashes here.

File details

Details for the file quill_sort-6.0.6-py3-none-any.whl.

File metadata

  • Download URL: quill_sort-6.0.6-py3-none-any.whl
  • Upload date:
  • Size: 88.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for quill_sort-6.0.6-py3-none-any.whl
Algorithm Hash digest
SHA256 f8040d60aaefcb0e5d42065cda54b52032f8998207f00d059a3d1db60a881621
MD5 00f7e48e30c2e67d7ed0348e5f7da6a7
BLAKE2b-256 a324d9e46bdfe7f49c5ba37d76482dc4a3af3b1ee6ba3b84d7e14a04e863c1a9

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page