A partially and lazily sorted list data structure

## Project description

lazysorted is a Python extension module for sorting sequences lazily. It presents the programmer with the abstraction that they are actually working with a sorted list, when in fact the list is only physically sorted when the programmer requests elements from it, and even then it is only sorted partially, just enough to return whatever was requested.

The LazySorted object has a constructor that implements the same
interface as the builtin `sorted(...)` function, and it supports most
of the non-mutating methods of a python list.

Since the LazySorted object only sorts as much as necessary, it can be
faster than using the builtin `sorted(...)` for tasks that do not
require the entire data to be sorted, like:

- Computing medians
- Computing truncated means
- Quickly iterating through the first few sorted elements of a list
- Computing the deciles or quartiles of some data

## How to use it

You can use LazySorted in much the same way you use the `sorted(...)`
function and the python lists it produces:

from lazysorted import LazySorted from math import floor, ceil def median(xs): """An expected linear time median function""" ls = LazySorted(xs) n = len(ls) if n == 0: raise ValueError("Need a non-empty iterable") elif n % 2 == 1: return ls[n//2] else: return sum(ls[(n/2-1):(n/2+1)]) / 2.0 def top_k(xs, k, key=None, reverse=False): """Efficiently computes the top k elements of xs using the given key, or the bottom k if reverse=True""" ls = LazySorted(xs, key=key, reverse=reverse) return ls[0:k] def trimmed_mean(xs, alpha=0.05): """Computes the mean of the xs from the alpha to (1-alpha) quantiles in expected linear time. More robust than the ordinary sample mean.""" if not 0 <= alpha < 0.5: raise ValueError("alpha must be in [0, 0.5)") ls = LazySorted(xs) n = len(ls) if n == 0: raise ValueError("Need a non-empty iterable") lower = int(floor(n * alpha)) upper = int(ceil(n * (1 - alpha))) return sum(ls.between(lower, upper)) / (upper - lower)

In addition to the `__len__` and `__getitem__` methods demostrated
above, LazySorted also supports the `__iter__`, `__contains__`,
`index`, and `count` methods, just like a regular python list:

>>> import random >>> from lazysorted import LazySorted >>> xs = list(range(1000)) + 5 * [1234] >>> random.shuffle(xs) >>> ls = LazySorted(xs) >>> for x in ls: ... print(x) ... if x >= 3: ... break 0 1 2 3 >>> 1235 in ls False >>> ls.index(821) 821 >>> ls.count(1234) 5

Although the LazySorted constructor pretends to be equivalent to the
`sorted` function, and the LazySorted object pretends to be equivalent
to a sorted python list, there are a few differences between them:

- LazySorted objects are immutable, while python lists are not.
- Sorting with the builtin
`sorted`function is guaranteed to be stable, (ie, preserve the original order of elements that compare equal), while LazySorted sorting is not stable. - The LazySorted object has a
`between(i, j)`method, which returns a list of all the items whose sorted indices are in`range(i, j)`, but not necessarily in order. This is useful, for example, for throwing away outliers when computing an alpha-trimmed mean.

When the APIs differ between python2.x and python3.x, lazysorted
implements the python3.x version. So the LazySorted constructor does not
support the `cmp` argument that was removed in python3.x, and the
LazySorted object does not support the `__getslice__` method that was
also removed in python3.x.

All of the LazySorted methods have pretty good documentation, which can
be accessed through the builtin `help(...)` function.

I’ve tested lazysorted and found it to work for CPython versions 2.5, 2.6, 2.7, and 3.1, 3.2, and 3.3. I haven’t tested 3.0.

## How it works

In short, LazySorted works by using quicksort partitions lazily and keeping track of the indices used as pivots.

**`quicksort <http://en.wikipedia.org/wiki/Quicksort>`_** sorts a list
by picking an element of the list to be the “pivot”, and then
partitioning the data into the part that’s greater than or equal to the
pivot and the part that’s less than the pivot. These two parts are then
recursively sorted with quicksort,

**`quickselect <http://en.wikipedia.org/wiki/Quickselect>`_** finds the
kth smallest element of a list by picking a pivot element and
partitioning the data, as in quicksort. Then the algorithm recurses into
the larger or smaller part of the list, depending on whether k is larger
or smaller than the index of the pivot element.

There are two key observations to make from these algorithms: First of all, if we are only interested in part of a sorted list, we only need to recurse into the part we are interested in after doing a partition. Second of all, after doing some partitions, the list is partially sorted, with the pivots all in their sorted order and the elements between two pivots guaranteed to be bigger than the pivot to their left and smaller than the pivot to their right.

So whenever some data is queried from a LazySorted object, we first look through the pivots to see which pivots delimit the data we want. Then we partition sublist(s) as necessary and recurse into the side(s) that our data is in.

There are also some implementation details that help lazysorted to run quickly: First of all, pivots elements are chosen to be the median of three randomly elements, which makes the partition likely to be more balanced and guarantees average case O(n log n) behavior.

Second of all, for sufficiently small lists, lazysorted uses insertion sort instead of quicksort, which is faster on small lists. Both of these tricks are well-known to speed up quicksort implementations.

Thirdly, since it’s important to find the pivots that bound an index quickly, lazysorted stores the pivots in a binary search tree, so that these sorts of lookups occur in O(log n) expected time. The BST lazysorted uses is a Treap, selected for its overall expected speed, especially in insertion and deletion.

lazysorted also makes a big effort to delete irrelevant pivots from the BST; for example, if there are three pivots at indices 5, 26, and 42, and both the data (between 5 and 26) and (between 26 and 42) is sorted, then we can remove the irrelevant pivot 26, and just say that the data between indices 5 and 42 is sorted.

## Installation

lazysorted requires the python headers, (Python.h). I believe they ship with OSX, but if you don’t have them they can be installed on a debian-like system with

$ sudo apt-get install python-dev

Then you can install lazysorted with

$ sudo python setup.py install

## Testing

I’ve put in a fair bit of effort to test that lazysorted actually does what it’s supposed to. You can test it yourself (after installing it) with

$ python test.py

## FAQ

**Doesn’t numpy have a median and percentile function?**

Yes, but it’s implemented by sorting the entire array and then reading off the requested values, not with quickselect or another O(n) selection algorithm. And LazySorted is empirically faster, as you can see from benchmark.py

**Isn’t python3.4 going to have a statistics module with a median
function?**

Yes, and I’m really excited about it! This is PEP450. Unfortunately, the current implementation is in pure python, and computes the median by sorting the data and picking off the middle element.

**Doesn’t the standard library have a heapq module?**

Yes, but it lacks the full generality of this module. For example, you can use it to get the k smallest elements in O(n log k) time, but not k arbitrary contiguous elements. This module represents a different paradigm: you’re allowed to program as if your list was sorted, and let the data structure deal with the details.

**How is lazysorted licensed?**

lazysorted is BSD-licensed. So you can use it pretty much however you like! See LICENSE for details.

**What should I not use lazysorted for?**

- Applications requiring a stable sort; the quicksort partitions make the order of equal elements in the sorted list undefined.
- Applications requiring guaranteed fast worst-case performance. Although it’s very unlikely, many operations in LazySorted run in worst case O(n^2) time.
- Applications requiring high security. The random number generator is insecure and seeded from system time, so an (ambitious) attacker could reverse engineer the random number generator and feed LazySorted pathological lists that make it run in O(n^2) time.
- Sorting entire lists: The builtin sorted(.) is
*very*impressively designed and implemented. It also has the advantage of running faster than O(n log n) on lists with partial structure.

**How does lazysorted work at scale?**

Unfortunately, only okay. This turns out to be primarily due to the fact that CPython deals with python objects by passing around pointers to them, causing cache misses when the list and its elements no longer fit in cache. The gory details can be found in a blog post I wrote about Memory Locality and Python Objects.

However, this effect doesn’t kick in until lists grow larger than about 100K values, and even past that lazysorted remains faster than complete sorting.

## Contact me!

If you use this software and feel so inclined, I’d greatly appreciate hearing what you are using it for! You can hit me up on Twitter @naftaliharris, or at my email address on my contact page.

## Project details

## Release history Release notifications | RSS feed

## Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Filename, size | File type | Python version | Upload date | Hashes |
---|---|---|---|---|

Filename, size lazysorted-0.1.tar.gz (16.4 kB) | File type Source | Python version None | Upload date | Hashes View |