Utility functions for scalable handling of CSR matrices

csr_utils

Scalable Operations for CSR matrices.

Installation

For general use, including inside a conda environment:

pip install csr_utils

Without root access:

pip install --user csr_utils

Usage

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> xcsr = csr_matrix(np.array([[1, 0], [3, 4], [2, 2]], dtype=float))

>>> import csr_utils
>>> xnorm, xmean, xstd, xixnormed = csr_utils.normalize_csr_matrix(xcsr)

Overview

This package currently contains a single function: a fast, memory-efficient implementation for normalizing the non-zero values of a CSR matrix without converting the matrix to a dense array. This is a useful step for machine learning on large matrices. Most algorithms work better with normalized input, in particular the commonly used linear classification models in sklearn.
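To give a rough sense of why avoiding the dense conversion matters, the sketch below compares the memory held by a CSR matrix with the memory its dense equivalent would need. The matrix shape and the number of stored values are made up for illustration, and nothing here uses csr_utils itself.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative sizes only: 10,000 samples, 1,000 features, one stored value per row.
n_rows, n_cols = 10_000, 1_000
rows = np.arange(n_rows)
cols = rows % n_cols
data = np.ones(n_rows, dtype=np.float64)
x = csr_matrix((data, (rows, cols)), shape=(n_rows, n_cols))

# A CSR matrix holds three arrays: the values, their column indices, and row pointers.
csr_bytes = x.data.nbytes + x.indices.nbytes + x.indptr.nbytes

# A dense float64 array needs one 8-byte slot for every cell, zero or not.
dense_bytes = n_rows * n_cols * x.dtype.itemsize

print(f"CSR: {csr_bytes / 1e6:.2f} MB, dense: {dense_bytes / 1e6:.2f} MB")
```

The gap grows with the number of features, which is why a normalization step that stays in CSR form can be the difference between fitting in memory and not.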

More functions will be added as the need arises, including converting a CSR array directly into a CUDA sparse array. Stay tuned.

normalize_csr_matrix:

  • Normalizes a CSR matrix only based on non-zero values, without turning it into dense array.

    • In the CSR matrix, rows correspond to samples, and columns correspond to features.
    • After normalization, the non-zero values of each column (feature) have a mean of 0.0 and a standard deviation of 1.0.
  • Will return the scalable equivalent of x = x.toarray(); x[(x==0)] = np.nan; (x - np.nanmean(x, axis=0)) / np.nanstd(x, axis=0)

  • We compute a faster and equivalent definition of standard deviation:

    • sigma = SquareRoot(ExpectedValue(|X - mean|^2)) # slow
    • sigma = SquareRoot(ExpectedValue(X^2) - ExpectedValue(X)^2) # fast
    • For more info see the math
  • This function makes the following assumptions:

    • If a column i has no observations, mean_array[i] will be set to 0.0, and std_array[i] will be set to 1.0.
    • If we have a single observation, or if standard deviation is 0.0 for a column, we only subtract the mean for that column.
  • The function allows the normalization to be based on pre-specified mean and standard deviation arrays, which is useful for normalizing test sets.

  • The function also allows only a given subset of features to be normalized.
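The behaviour described in the list above can be sketched with plain NumPy and SciPy. This is a minimal illustration under stated assumptions, not the package's actual implementation: the helper name `normalize_nonzero_columns` and its `mean` and `std` parameters are hypothetical, feature-subset selection is left out, and for a zero-variance column the sketch reports a standard deviation of 0.0 (what csr_utils reports in that case is not stated here). It uses the fast form of the standard deviation, sqrt(E[X^2] - E[X]^2), computed from per-column sums over the stored values only.

```python
import numpy as np
from scipy.sparse import csr_matrix


def normalize_nonzero_columns(x, mean=None, std=None):
    """Hypothetical sketch: normalize the non-zero values of each column."""
    x = csr_matrix(x, dtype=float, copy=True)
    x.eliminate_zeros()          # treat only true non-zeros as observations
    n_cols = x.shape[1]
    cols = x.indices             # column index of every stored value
    vals = x.data                # the stored values, in the same order

    if mean is None or std is None:
        # Per-column count, sum, and sum of squares of the stored values.
        count = np.bincount(cols, minlength=n_cols).astype(float)
        s1 = np.bincount(cols, weights=vals, minlength=n_cols)
        s2 = np.bincount(cols, weights=vals ** 2, minlength=n_cols)
        safe = np.maximum(count, 1.0)      # avoid dividing by zero
        mean = s1 / safe                   # stays 0.0 for empty columns
        # Fast form: Var(X) = E[X^2] - E[X]^2, clipped against rounding error.
        var = np.maximum(s2 / safe - mean ** 2, 0.0)
        std = np.sqrt(var)
        std[count == 0] = 1.0              # no observations: std set to 1.0
    else:
        mean = np.asarray(mean, dtype=float)
        std = np.asarray(std, dtype=float)

    # Where the standard deviation is 0.0, only subtract the mean.
    scale = np.where(std > 0, std, 1.0)
    x.data = (vals - mean[cols]) / scale[cols]
    return x, mean, std


x_train = csr_matrix(np.array([[1, 0], [3, 4], [2, 2]], dtype=float))
x_norm, col_mean, col_std = normalize_nonzero_columns(x_train)

# Reuse the training statistics on a (made-up) test matrix.
x_test = csr_matrix(np.array([[5, 0], [0, 3]], dtype=float))
x_test_norm, _, _ = normalize_nonzero_columns(x_test, mean=col_mean, std=col_std)
```

Only the `data` array of the CSR matrix is rewritten, so the sparsity pattern is untouched and no dense intermediate is ever created. Passing the training `mean` and `std` back in mirrors the test-set use case mentioned above.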

Example

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> x = csr_matrix(np.array([[1, 0], [3, 4], [2, 2]], dtype=float))
>>> x
<3x2 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>
>>> print(x.toarray())
[[ 1.  0.]
 [ 3.  4.]
 [ 2.  2.]]
>>> import csr_utils
>>> xnorm, xmean, xstd, xixnormed = csr_utils.normalize_csr_matrix(x)
>>> print(xnorm.todense())
[[-1.22474487  0.        ]
 [ 1.22474487  1.        ]
 [ 0.         -1.        ]]
>>> xmean
array([ 2.,  3.])
>>> xstd
array([ 0.81649658,  1.        ])
>>> xixnormed
array([0, 1])
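As a cross-check, the dense NaN-based expression quoted in the description of normalize_csr_matrix can be run on the same small matrix. This uses only NumPy and is practical only for matrices small enough to densify; the variable names are chosen for illustration.

```python
import numpy as np

x_dense = np.array([[1, 0], [3, 4], [2, 2]], dtype=float)

# Mark zeros as missing so that they are ignored by the nan-aware statistics.
x_dense[x_dense == 0] = np.nan
dense_mean = np.nanmean(x_dense, axis=0)
dense_std = np.nanstd(x_dense, axis=0)
dense_norm = (x_dense - dense_mean) / dense_std

# Missing entries stay NaN here; in the sparse result they are simply not stored.
print(dense_mean)                  # [2. 3.]
print(np.nan_to_num(dense_norm))   # same values as xnorm.todense() above
```

The means and standard deviations match `xmean` and `xstd` from the example, and replacing the NaN entries with zeros reproduces the printed `xnorm`.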
