Adaptive ultra-sort: auto-dispatching, correctness-first sorting that beats np.sort by a mile on the array fast path
Project description
quill-sort
Adaptive Ultra-Sort — Quill auto-dispatches to the fastest correct sort for your data and your machine, and never loses to the baseline.
import numpy as np
import quill
# The list API: a drop-in for sorted() / list.sort(), ~3-4x faster on numeric.
quill.quill_sort([3, 1, 4, 1, 5, 9]) # → [1, 1, 3, 4, 5, 9]
quill.quill_sort(records, key=lambda r: r['age']) # sort objects by a key
# The array API: the "by a mile" path — dispatches to Rust radix / GPU / polars.
a = np.random.randint(0, 2**40, 20_000_000)
quill.sort_array(a) # ~5x faster than np.sort
The one thing to know: list API vs. array API
Quill has two doors, and which one you use determines how big the win is.
| API | Input / output | Speed (numeric, this machine) | Why |
|---|---|---|---|
quill_sort(list) |
list in, list out |
matches np.sort, ~3.8x vs sorted() / list.sort() |
It does np.asarray(...) then .tolist() around a fast kernel. Those two conversions dominate the runtime (Amdahl's law), so the kernel speedup is mostly hidden — you get np.sort-class throughput, not the "by a mile" number. |
quill.sort_array(ndarray) |
ndarray in, ndarray out |
~5.3x int64, ~3.3x float64, ~4x GPU vs np.sort |
Zero conversions. It dispatches the raw buffer straight to the fastest installed backend. This is where Quill actually pulls away from numpy. |
Plain version: if your data starts and ends as a Python list, the
conversion wall caps the win — quill_sort is np.sort-fast and ~4x faster than
the built-in sorted(), which is already a great deal, but it is not 5x. If
you keep your data in numpy arrays and call sort_array, you get the full
multi-threaded / GPU speedup. Don't believe any library that claims a 5x
list-in / list-out sort — the conversion alone forbids it.
All numbers below are measured on this development box (Windows 11, 28-core CPU, RTX 4060 Ti, numpy 2.4.6) with a freshly-shuffled array on every timed run, so they reflect real cold-sort throughput, not warm/already-sorted artifacts. Your mileage will vary with CPU, GPU, array size, and dtype.
Installation
pip install quill-sort # core — ZERO required deps, always correct
pip install quill-sort[fast] # + numpy + psutil (the baseline fast path)
pip install quill-sort[polars] # + polars (no-compile multi-threaded sort)
pip install quill-sort[gpu] # + cupy (NVIDIA GPU sort)
pip install quill-sort[all] # numpy + pandas + psutil + polars
The compiled Rust backend (the ~5x path) ships as a pre-built binary wheel
for common platforms — pip install quill-sort just gets it, no Rust toolchain
needed. If no wheel matches your platform, pip falls back to the pure-Python
sdist and Quill transparently uses the next-fastest backend (GPU → polars →
numpy → np.sort). The install never fails for lack of a compiler.
How it picks (the dispatch chain)
sort_array() profiles the array and walks a priority chain, taking the first
backend that is installed and supports the array. Every step has a measured
crossover (min_n) below which it isn't worth it, and any backend error
falls straight back to np.sort — correctness is never at risk.
sort_array(ndarray)
│
▼
eligible? (1-D, dtype kind i/u/f, itemsize ≤ 8, value-only, C-contiguous)
│ no ──────────────────────────────────────────────► np.sort
│ yes
▼
dense bounded int64/uint64? ──► counting sort (np.bincount) ~1.7-2.8x, single-thread
│ no
▼
┌──────────────────────────────────────────────────────────┐
│ try, in priority order, the first available + supporting │
│ rust_voracious (compiled radix, n ≥ 1M) │
│ cupy_gpu (GPU, n ≥ 2M, fits in free VRAM) │
│ polars (no-compile MT sort, n ≥ 200k) │
│ numpy_parallel (int-only partition sort, n ≥ 5M) │
└──────────────────────────────────────────────────────────┘
│ any error / nothing eligible
▼
np.sort ← the floor; always correct, never slower than numpy
NaN is stripped before any backend runs and re-appended at the end (numpy
convention), so a backend that can't order NaN never sees one. Descending is a
post-sort reverse. quill.available_backends() shows what your machine will
actually use, in this order.
Backends
| Backend | How to get it | Measured (int64, vs np.sort) |
Notes |
|---|---|---|---|
rust_voracious |
pip install quill-sort (bundled wheel) |
~5.3x (float64 ~3.3x) | Compiled PyO3 + voracious_radix_sort, multi-threaded radix. Fastest measured path; leads the chain. |
cupy_gpu |
pip install quill-sort[gpu] |
~4x (float64 ~2.5x) | GPU radix via CuPy. Counts the host→device→host round trip and still wins for large arrays that fit in VRAM. |
polars |
pip install quill-sort[polars] |
~2.3x (float64 ~1.7x) | Delegates to polars' multi-threaded Rust sort. No compiler needed — the "fast for everyone" fallback. |
numpy_parallel |
pip install quill-sort[fast] |
~1.1x (int only) | Thread-parallel np.partition sample-sort. Small but reliable int win; deliberately uses few workers (the gain saturates at memory bandwidth). |
| counting sort | pip install quill-sort[fast] |
~1.7-2.8x | np.bincount for dense bounded int64/uint64. O(n+k), single-threaded, beats every comparison backend in its niche. |
np.sort |
always (with numpy) | 1.0x (the floor) | The never-lose baseline. Reached on any error or ineligible dtype. |
Without numpy at all, Quill still sorts correctly via the stdlib Timsort — it is simply slower. Correctness has zero required dependencies.
The array API — sort_array()
import numpy as np
import quill
a = np.random.randint(0, 2**40, 20_000_000)
s = quill.sort_array(a) # sorted COPY (a untouched), fastest backend
quill.sort_array(a, inplace=True) # sort a IN PLACE, returns a
d = quill.sort_array(a, descending=True) # reverse order
quill.available_backends()
# → ['rust_voracious', 'cupy_gpu', 'polars', 'numpy_parallel'] (priority order)
sort_array matches np.sort exactly — including negatives, mixed int/float
promotion, and NaN-to-end — and is never slower than np.sort on any input.
Top-k — quill_topk()
When you only need the k smallest (or largest), don't sort the whole array.
quill_topk uses numpy's argpartition (introselect, O(n)) instead of a full
O(n log n) sort:
quill.quill_topk(scores, 10) # 10 smallest, sorted ascending
quill.quill_topk(scores, 10, largest=True) # 10 largest, sorted descending
quill.quill_topk(rows, 5, key=lambda r: r.size)
Measured ~8x faster than sorted(data, reverse=True)[:10] for k=10 over 10M
elements on this machine.
analyze() — inspect the profile
quill.analyze([3, 1, 4, 1, 5, 9])
# {'n': 6, 'dtype': 'int_pos', 'presorted': False, 'dense': True, ...}
The list API — quill_sort() / quill_sorted()
A drop-in replacement for sorted() / list.sort() that routes numeric lists
through the fast numeric kernel and everything else through Timsort. It returns
a list, so it pays the conversion tax described above — expect np.sort-class
speed and ~3-4x over the built-ins on numeric data, not the array-API number.
quill.quill_sort([3, 1, 4, 1, 5, 9]) # in-place, returns the list
quill.quill_sort(data, key=lambda x: x['score']) # objects via key
quill.quill_sort(data, reverse=True) # descending
quill.quill_sort(data, inplace=False) # return a new list, leave input alone
quill.quill_sort(data, parallel=True) # force multi-core
quill.quill_sort(data, stable=False) # unstable, max speed on numeric
result = quill.quill_sorted(iterable) # non-mutating, mirrors sorted()
quill.quill_sort(
data, # list, generator, range, ndarray, Series, DataFrame
key=None, # sort key function (like sorted())
reverse=False, # descending order
inplace=True, # mutate in place (False → return a new list)
parallel=False, # use multiple cores (auto on big numeric data)
high_performance_mode=False, # skip the interactive prompt on the external-sort path
silent=False, # suppress status output
stable=True, # True = matches sorted() exactly; False = max speed
stats=False, # return (sorted_list, stats_dict)
)
stable parameter
stable=True (default) guarantees equal elements keep their original relative
order — identical to Python's sorted(). stable=False allows the
benchmark-optimal unstable kernels on numeric data for extra speed when order
among equal elements doesn't matter.
quill.quill_sort(data) # stable, safe for all use cases
quill.quill_sort(data, stable=False) # unstable, faster for large int32/int64
stats — timing
result, stats = quill.quill_sort(data, stats=True)
# stats = {'time_ms': 12.3, 'n': 1000000}
The never-lose guarantee
Every fast path in Quill is wrapped so that any failure — a missing backend,
a GPU OOM, a compiled-extension panic, an unsupported dtype — falls back to
np.sort (or stdlib Timsort if numpy is absent). The Rust extension is built
with panic = "unwind", so even a native panic becomes a catchable Python
exception and the process never aborts. You can never get a result that is wrong
or a sort that is slower than the numpy baseline.
# To debug a backend instead of silently falling back, surface the error:
QUILL_BACKEND_DEBUG=1 python your_script.py
Correctness contract
Quill's output matches sorted() / np.sort on every supported input:
None values sort to the end (start with reverse=True):
quill.quill_sort([3, None, 1, None, 2])
# → [1, 2, 3, None, None]
NaN values in float data go to the end (start with reverse=True),
matching numpy's convention:
quill.quill_sort([3.0, float('nan'), 1.0, 2.0])
# → [1.0, 2.0, 3.0, nan]
Mixed int/float lists are promoted to float64 and take the fast float path (not the slow object fallback):
quill.quill_sort([1, 2.5, 3, 4.0])
# → [1, 2.5, 3, 4.0]
Negative integers are handled natively (two's-complement radix; no object fallback), and keyed sorts are stable by default.
Supported types
int,float,str,bytes— native fast paths- Mixed
int+float— promoted to float64 - Negative integers — handled natively
Nonevalues — sorted to the end (start withreverse=True)float('nan')— sorted to the end (start withreverse=True)numpy.ndarray— sorted via the backend chain (usesort_arrayfor the full win)pandas.Series— sorted, returned as a new Seriespandas.DataFrame— sorted by column(s) viakey='column_name'- Any generator / iterator — materialised to a list, sorted, returned
Plugin system
Teach Quill to sort your own types. A plugin declares which types it handles
and a prepare() that converts them to something Quill sorts natively, plus an
optional postprocess to rebuild your objects.
from quill import register_plugin, QuillPlugin
class MyPlugin(QuillPlugin):
handles = (MyCustomClass,)
name = "my_custom_class"
@staticmethod
def prepare(data, key, reverse):
items = [x.value for x in data]
postprocess = lambda sorted_vals: [MyCustomClass(v) for v in sorted_vals]
return items, key, postprocess # (list_to_sort, key_fn, postprocess_fn)
register_plugin(MyPlugin)
quill.quill_sort(list_of_my_objects)
Built-in plugins cover numpy.ndarray, pandas.Series, pandas.DataFrame,
range, and any generator/iterator. You can also drop in a custom backend
(not just a type plugin) with quill._backends.register_backend(...) if you
have a faster kernel for some dtype.
CLI / demo / setup
python -m quill # animated benchmark demo across data types
quill # same, if installed via pip
quill setup # calibration wizard: measures the parallel/GPU crossovers
# on YOUR box and persists them to ~/.quill/config.json
Quill also ships a small visualizer/wizard and add-on helpers layered on the
core engine (the calibration wizard above, profile inspection via analyze(),
and the quill_topk add-on). They are optional conveniences — the core sort API
needs none of them.
How it works (the list path, in detail)
For a list, Quill profiles the data at intake and routes to the best strategy:
| Data shape | Strategy | Complexity |
|---|---|---|
| Dense bounded integer range | np.bincount counting sort |
O(n + k) |
| int64 / float64 (value-only) | fastest backend via the array chain | ~O(n) radix / O(n log n) |
| Mixed int/float | promote to float64, then numeric path | O(n log n) |
| Strings / bytes | Python Timsort | O(n log n) |
Objects with a key |
extract keys → numpy argsort (stable) | O(n log n) |
| Nearly sorted (asc_ratio > 0.90) | Timsort bypass | ~O(n) |
| Larger than RAM | external radix-partition + parallel bucket sort | disk-bound |
The conversion to/from numpy is the fixed cost that caps the list-path speedup —
see the very first table. For maximum throughput, keep your data in arrays and
use sort_array.
Performance tuning
- Install the right extra.
[fast](numpy + psutil) is the baseline; add[polars]for a no-compile multi-threaded sort, or[gpu]if you have an NVIDIA card. The bundled Rust wheel already gives you the top path. - Use
sort_arrayon arrays. It skips the conversion wall entirely. stable=Falseon the list path for extra speed on numeric data when you don't need stable ordering.quill setupmeasures your machine's crossovers so auto-parallel and GPU engage exactly where they win.- Install psutil for accurate RAM sensing (avoids spurious external-sort triggers); without it, Quill assumes 2 GB available.
Troubleshooting
- "
quill_sort(list)isn't 5x faster." It can't be — theasarray/tolistround trip dominates (Amdahl). It is np.sort-fast and ~3-4x oversorted(). For the big win, keep arrays and callsort_array. - "
sort_arrayisn't using the GPU / Rust path." Checkquill.available_backends(). The backend may not be installed, the array may be below its crossover (min_n), or the dtype/itemsize may be ineligible. SetQUILL_BACKEND_DEBUG=1to surface the real reason instead of falling back silently. - "GPU path errors on a huge array." It only engages when the array fits in free VRAM with headroom; otherwise it falls back to the CPU radix or np.sort.
- "Equal elements reordered." Use
stable=True(the default).stable=Falseis explicitly allowed to reorder equal elements. - "External sort triggered unexpectedly." Install psutil for accurate RAM
sensing; pass
high_performance_mode=Trueto skip the interactive prompt.
Development
git clone https://github.com/invariant-games/quill.git
cd quill
pip install -e ".[all,dev]"
pytest # full suite
pytest -m slow # parallel / large-data tests
python -m quill # animated benchmark demo
# Build the compiled Rust backend locally (optional; CI ships wheels):
cd rustext && maturin develop --release
The CI workflow .github/workflows/wheels.yml builds the rustext crate into
abi3 binary wheels (one wheel per platform covers Python ≥ 3.8) for Windows,
manylinux x86_64/aarch64, and macOS x86_64/arm64, plus a pure-Python sdist as
the never-fail-install fallback.
Requirements
- Python 3.8+
numpy— optional but strongly recommended (pip install quill-sort[fast])psutil— optional, for accurate RAM sensingpolars— optional backend (pip install quill-sort[polars])cupy— optional GPU backend (pip install quill-sort[gpu], NVIDIA only)pandas— optional (Series/DataFrame support)
A C/Rust toolchain is not required to install — the compiled backend is shipped as a pre-built wheel, with a pure-Python sdist fallback.
License
MIT — by Isaiah Tucker
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file quill_sort-6.0.4.tar.gz.
File metadata
- Download URL: quill_sort-6.0.4.tar.gz
- Upload date:
- Size: 89.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8136591d47a4ec6bde955d4f60abd19be27b4ee2b7fe176b7ee0c4abd04da497
|
|
| MD5 |
8e9090dcb564bfa116ea8c2c3d9108b1
|
|
| BLAKE2b-256 |
c5effad4ad3f160de042b9aeaff4acb0927ac285f8d32e0918f0cfc187aef73c
|
File details
Details for the file quill_sort-6.0.4-py3-none-any.whl.
File metadata
- Download URL: quill_sort-6.0.4-py3-none-any.whl
- Upload date:
- Size: 88.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b8b7fb63b6e71e1390354e8bdc51542ed5397e8b70814c915b7be8eced0dd132
|
|
| MD5 |
67a574369575cac6eccc0156187e44ec
|
|
| BLAKE2b-256 |
199c352ed1edce83b7859119a16d9c175108ad7ca58a45fcf5e370c2d0c16941
|