This package speeds up group-wise nlargest selection

topn

A utility function that string_grouper uses in place of pandas' SeriesGroupBy.nlargest(), which is slow for this kind of group-wise selection. The example below shows the pandas operation that topn replaces.

import pandas as pd
import numpy as np

r = np.array([0, 1, 2, 1, 2, 3, 2]) 
c = np.array([1, 1, 0, 3, 1, 2, 3]) 
d = np.array([0.0, 0.2, 0.1, 1.0, 0.9, 0.4, 0.6]) 
rcd = pd.DataFrame({'r': r, 'c': c, 'd': d})
rcd
   r  c    d
0  0  1  0.0
1  1  1  0.2
2  2  0  0.1
3  1  3  1.0
4  2  1  0.9
5  3  2  0.4
6  2  3  0.6
ntop = 2
(
    rcd.set_index('c')
    .groupby('r')['d']
    .nlargest(ntop)
    .reset_index()
    .sort_values(['r', 'd'], ascending=[True, False])
)
   r  c    d
0  0  1  0.0
1  1  3  1.0
2  1  1  0.2
3  2  1  0.9
4  2  3  0.6
5  3  2  0.4

Usage

from topn import awesome_topn

r, c, d = awesome_topn(r, c, d, ntop, n_jobs=7)
pd.DataFrame({'r': r, 'c': c, 'd': d})
   r  c    d
0  0  1  0.0
1  1  3  1.0
2  1  1  0.2
3  2  1  0.9
4  2  3  0.6
5  3  2  0.4
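
For readers who want to check results without pandas, the selection shown above can be sketched in plain NumPy. This is not how topn is implemented; `topn_reference` is a hypothetical name introduced here for illustration, and the output order (ascending r, then descending d) is assumed from the example output rather than guaranteed by the package.

```python
import numpy as np

def topn_reference(r, c, d, ntop):
    """Hypothetical reference: keep the ntop largest d for each distinct r."""
    r, c, d = np.asarray(r), np.asarray(c), np.asarray(d)
    # lexsort treats the LAST key as primary: sort by r ascending,
    # then by d descending within each group of equal r.
    order = np.lexsort((-d, r))
    r_s, c_s, d_s = r[order], c[order], d[order]
    # Indices where a new group of equal r values starts.
    starts = np.flatnonzero(np.r_[True, r_s[1:] != r_s[:-1]])
    sizes = np.diff(np.r_[starts, len(r_s)])
    # Rank of each element within its group (0 = largest d).
    rank = np.arange(len(r_s)) - np.repeat(starts, sizes)
    keep = rank < ntop
    return r_s[keep], c_s[keep], d_s[keep]

# Same data as the example above.
r = np.array([0, 1, 2, 1, 2, 3, 2])
c = np.array([1, 1, 0, 3, 1, 2, 3])
d = np.array([0.0, 0.2, 0.1, 1.0, 0.9, 0.4, 0.6])
rn, cn, dn = topn_reference(r, c, d, ntop=2)
```

On this data the sketch selects the same six triples as the pandas expression, which makes it a convenient cross-check on small inputs.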

Short Description

def awesome_topn(r, c, d, ntop, n_rows=-1, n_jobs=1):
    """
    r, c, and d are 1D numpy arrays, all of the same length N.
    This function returns arrays rn, cn, and dn of length n <= N such
    that the set of triples {(rn[i], cn[i], dn[i]) : 0 <= i < n} is a subset
    of {(r[j], c[j], d[j]) : 0 <= j < N} and, for every distinct value
    x = rn[i], dn[i] is among the ntop largest d[j]'s for which r[j] = x
    (or among all of them if fewer than ntop exist).

    Input:
        r and c: two 1D integer arrays of the same length
        d: 1D array of single- or double-precision floating-point type, of
            the same length as r and c
        ntop: maximum number of largest d's returned for each distinct
            value of r
        n_rows: an int. If > -1, output rn is replaced with Rn, the
            index pointer array of the compressed sparse row (CSR) matrix
            whose elements are {C[rn[i], cn[i]] = dn[i] : 0 <= i < n}. This
            matrix has n_rows rows, so the length of Rn is n_rows + 1.
        n_jobs: number of threads; must be >= 1

    Output:
        (rn, cn, dn), where rn, cn, and dn are arrays as described above, or
        (Rn, cn, dn) when n_rows > -1, where Rn is as described above
    """

Source Distribution

topn-0.0.5.tar.gz (11.4 kB)
