This package speeds up group-wise nlargest selection

Project description

topn

Utility function for string_grouper to use in place of pandas' nlargest() method, which is slow for group-wise top-n selection. The pandas baseline looks like this:

import pandas as pd
import numpy as np

r = np.array([0, 1, 2, 1, 2, 3, 2]) 
c = np.array([1, 1, 0, 3, 1, 2, 3]) 
d = np.array([0.0, 0.2, 0.1, 1.0, 0.9, 0.4, 0.6]) 
rcd = pd.DataFrame({'r': r, 'c': c, 'd': d})
rcd
   r  c    d
0  0  1  0.0
1  1  1  0.2
2  2  0  0.1
3  1  3  1.0
4  2  1  0.9
5  3  2  0.4
6  2  3  0.6
ntop = 2
rcd.set_index('c').groupby('r')['d'].nlargest(ntop).reset_index().sort_values(['r', 'd'], ascending = [True, False])
   r  c    d
0  0  1  0.0
1  1  3  1.0
2  1  1  0.2
3  2  1  0.9
4  2  3  0.6
5  3  2  0.4

Usage

from topn import awesome_topn

r, c, d = awesome_topn(r, c, d, ntop, n_jobs=7)
pd.DataFrame({'r': r, 'c': c, 'd': d})
   r  c    d
0  0  1  0.0
1  1  3  1.0
2  1  1  0.2
3  2  1  0.9
4  2  3  0.6
5  3  2  0.4
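
For readers who want to check the expected result without installing the package, the same selection can be sketched in pure NumPy. This is only an illustrative reference, not the package's implementation: the name `topn_reference` is hypothetical, it runs single-threaded, and it assumes that ties are broken by original position.

```python
import numpy as np


def topn_reference(r, c, d, ntop):
    """Hypothetical reference: keep the ntop largest d values per distinct r."""
    r, c, d = np.asarray(r), np.asarray(c), np.asarray(d)
    if r.size == 0:
        return r, c, d
    # np.lexsort treats the LAST key as primary: r ascending, then d descending.
    order = np.lexsort((-d, r))
    rs = r[order]
    # Mark the first row of each run of equal r values in the sorted order.
    is_start = np.r_[True, rs[1:] != rs[:-1]]
    positions = np.arange(rs.size)
    # For every row, the sorted position at which its group begins.
    group_start = np.maximum.accumulate(np.where(is_start, positions, 0))
    # Rank within the group (0 is the largest d); keep ranks below ntop.
    keep = order[(positions - group_start) < ntop]
    return r[keep], c[keep], d[keep]


# Same sample data as in the example above.
r = np.array([0, 1, 2, 1, 2, 3, 2])
c = np.array([1, 1, 0, 3, 1, 2, 3])
d = np.array([0.0, 0.2, 0.1, 1.0, 0.9, 0.4, 0.6])
rn, cn, dn = topn_reference(r, c, d, ntop=2)
```

On the sample data this yields the same six rows, in the same order, as the pandas and awesome_topn outputs shown above.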

Short Description

def awesome_topn(r, c, d, ntop, use_threads=False, n_jobs=1):
    """
    r, c, and d are 1D numpy arrays, all of the same length N.
    This function returns arrays rn, cn, and dn of length n <= N such
    that the set of triples {(rn[i], cn[i], dn[i]) : 0 <= i < n} is a subset of
    {(r[j], c[j], d[j]) : 0 <= j < N} and, for every distinct value
    x = rn[i], dn[i] is among the ntop largest d[j]'s whose
    r[j] = x (or among all of them if fewer than ntop exist).

    Input:
        r and c: two 1D integer arrays of the same length
        d: 1D array of single or double precision floating point type of the
        same length as r or c
        ntop: maximum number of largest d's returned for each distinct value of r
        use_threads: whether to use multiple threads
        n_jobs: number of threads, must be >= 1

    Output:
        (rn, cn, dn) where rn, cn, dn are all arrays as described above.
    """

Source Distribution

topn-0.0.2.tar.gz (7.8 kB)
