This package speeds up group-wise nlargest sorting
topn
A utility function for string_grouper, to be used instead of pandas' SeriesGroupBy.nlargest() function, which pandas performs slowly.
import pandas as pd
import numpy as np
r = np.array([0, 1, 2, 1, 2, 3, 2])
c = np.array([1, 1, 0, 3, 1, 2, 3])
d = np.array([0.0, 0.2, 0.1, 1.0, 0.9, 0.4, 0.6])
rcd = pd.DataFrame({'r': r, 'c': c, 'd': d})
rcd
| | r | c | d |
|---|---|---|---|
| 0 | 0 | 1 | 0.0 |
| 1 | 1 | 1 | 0.2 |
| 2 | 2 | 0 | 0.1 |
| 3 | 1 | 3 | 1.0 |
| 4 | 2 | 1 | 0.9 |
| 5 | 3 | 2 | 0.4 |
| 6 | 2 | 3 | 0.6 |
ntop = 2
rcd.set_index('c').groupby('r')['d'].nlargest(ntop).reset_index().sort_values(['r', 'd'], ascending = [True, False])
| | r | c | d |
|---|---|---|---|
| 0 | 0 | 1 | 0.0 |
| 1 | 1 | 3 | 1.0 |
| 2 | 1 | 1 | 0.2 |
| 3 | 2 | 1 | 0.9 |
| 4 | 2 | 3 | 0.6 |
| 5 | 3 | 2 | 0.4 |
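For reference, the same group-wise selection can be sketched in plain NumPy, which is handy for cross-checking results on small inputs. This is only an illustrative sketch: `topn_reference` is a hypothetical helper name, not part of topn or pandas, and it makes no claim about how topn is implemented internally.

```python
import numpy as np

def topn_reference(r, c, d, ntop):
    """Pure-NumPy group-wise top-n (hypothetical helper, not part of topn)."""
    r, c, d = np.asarray(r), np.asarray(c), np.asarray(d)
    # Sort by r ascending, then by d descending within each r
    # (np.lexsort uses the LAST key as the primary key).
    order = np.lexsort((-d, r))
    r_s, c_s, d_s = r[order], c[order], d[order]
    # Start position of each group of equal r values in the sorted arrays.
    starts = np.flatnonzero(np.r_[True, r_s[1:] != r_s[:-1]])
    group_sizes = np.diff(np.r_[starts, len(r_s)])
    # Rank of each entry inside its group: 0 for the largest d, 1 for the next, ...
    rank = np.arange(len(r_s)) - np.repeat(starts, group_sizes)
    keep = rank < ntop
    return r_s[keep], c_s[keep], d_s[keep]

r = np.array([0, 1, 2, 1, 2, 3, 2])
c = np.array([1, 1, 0, 3, 1, 2, 3])
d = np.array([0.0, 0.2, 0.1, 1.0, 0.9, 0.4, 0.6])
rn, cn, dn = topn_reference(r, c, d, ntop=2)
```

With the sample data, `rn`, `cn`, and `dn` hold the same six rows as the pandas result above.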
Usage
from topn import awesome_topn
r, c, d = awesome_topn(r, c, d, ntop, n_jobs=7)
pd.DataFrame({'r': r, 'c': c, 'd': d})
| | r | c | d |
|---|---|---|---|
| 0 | 0 | 1 | 0.0 |
| 1 | 1 | 3 | 1.0 |
| 2 | 1 | 1 | 0.2 |
| 3 | 2 | 1 | 0.9 |
| 4 | 2 | 3 | 0.6 |
| 5 | 3 | 2 | 0.4 |
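When `n_rows` is passed (see the description of `awesome_topn`), the first returned array is a CSR index pointer array rather than a list of row indices. The sketch below shows how such an index pointer relates to the row indices in the table above, using the standard CSR convention. It does not call topn; `row_indices_to_indptr` is a hypothetical helper, and the assumption is that the row indices are already grouped in ascending order.

```python
import numpy as np

def row_indices_to_indptr(rn, n_rows):
    """Build a CSR index pointer array from sorted row indices (hypothetical helper)."""
    # Number of stored entries in each of the n_rows rows.
    counts = np.bincount(rn, minlength=n_rows)
    # The pointer starts at 0 and accumulates the per-row counts.
    return np.concatenate(([0], np.cumsum(counts)))

# Row and column indices taken from the result table above.
rn = np.array([0, 1, 1, 2, 2, 3])
cn = np.array([1, 3, 1, 1, 3, 2])

Rn = row_indices_to_indptr(rn, n_rows=4)

# Under the CSR convention, row i owns entries Rn[i]:Rn[i + 1] of cn and dn.
cols_of_row_1 = cn[Rn[1]:Rn[2]]
```

Here `Rn` has length `n_rows + 1`, matching the documented length of the index pointer array.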
Short Description
def awesome_topn(r, c, d, ntop, n_rows=-1, n_jobs=1):
    """
    r, c, and d are 1D numpy arrays all of the same length N.
    This function will return arrays rn, cn, and dn of length n <= N such
    that the set of triples {(rn[i], cn[i], dn[i]) : 0 <= i < n} is a subset of
    {(r[j], c[j], d[j]) : 0 <= j < N} and that for every distinct value
    x = rn[i], dn[i] is among the ntop largest existing d[j]'s whose
    r[j] = x.

    Input:
        r and c: two 1D integer arrays of the same length
        d: 1D array of single or double precision floating point type of the
            same length as r or c
        ntop: maximum number of largest d's returned for each distinct value of r
        n_rows: an int. If > -1 it will replace output rn with Rn, the
            index pointer array for the compressed sparse row (CSR) matrix
            whose elements are {C[rn[i], cn[i]] = dn[i] : 0 <= i < n}. This matrix
            will have its number of rows = n_rows. Thus the length of Rn is
            n_rows + 1
        n_jobs: number of threads, must be >= 1

    Output:
        (rn, cn, dn) where rn, cn, dn are all arrays as described above, or
        (Rn, cn, dn) when n_rows > -1, where Rn is described above
    """