Utility functions for scalable handling of CSR matrices
csr_utils
Scalable Operations for CSR matrices.
Installation
For general use (including inside a conda environment):
pip install csr_utils
Without root access:
pip install --user csr_utils
Usage
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> xcsr = csr_matrix(np.array([[1, 0], [3, 4], [2, 2]], dtype=float))
>>> import csr_utils
>>> xnorm, xmean, xstd, xixnormed = csr_utils.normalize_csr_matrix(xcsr)
Overview
This package currently provides a single function: a fast, memory-efficient implementation for normalizing the non-zero values of a CSR matrix without converting it to a dense array. This is a useful step for machine learning on large matrices: most algorithms work better with normalized input, in particular the commonly used linear classification models in sklearn.
More functions will be added as the need arises, including converting a CSR array directly into a CUDA sparse array. Stay tuned.
normalize_csr_matrix:
- Normalizes a CSR matrix based only on its non-zero values, without converting it to a dense array.
  - In the CSR matrix, rows correspond to samples, and columns correspond to features.
  - After normalization, the non-zero values of each column (feature) have a mean of 0.0 and a standard deviation of 1.0.
- Returns the scalable equivalent of x = x.toarray(); x[(x==0)] = np.nan; (x - np.nanmean(x, axis=0)) / np.nanstd(x, axis=0)
- The standard deviation is computed with a faster, equivalent definition:
  sigma = SquareRoot(ExpectedValue(|X - mean|^2)) # slow
  sigma = SquareRoot(ExpectedValue(X^2) - ExpectedValue(X)^2) # fast
  - For more info see the math
- This function makes the following assumptions:
  - If a column i has no observations, mean_array[i] will be set to 0.0 and std_array[i] will be set to 1.0.
  - If a column has a single observation, or if its standard deviation is 0.0, only the mean is subtracted for that column.
- (Useful for normalizing test sets:) The function allows the normalization to be based on pre-specified mean and standard deviation arrays.
- The function also allows only a given subset of features to be normalized.
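The behaviour described above can be sketched in plain NumPy/SciPy. This is a minimal illustration and not the package's implementation: the name `normalize_nonzero_sketch` and its `mean`/`std` parameters are hypothetical, and it assumes the population standard deviation (ddof=0), which is what the `np.nanstd` equivalent above uses. It does not cover normalizing only a subset of features.

```python
import numpy as np
from scipy.sparse import csr_matrix


def normalize_nonzero_sketch(x, mean=None, std=None):
    """Hypothetical sketch: normalize the non-zero values of each column.

    Not the csr_utils implementation; the name and signature are assumptions.
    """
    x = csr_matrix(x, dtype=float, copy=True)
    x.eliminate_zeros()            # stored entries are now exactly the non-zeros
    n_cols = x.shape[1]
    cols = x.indices               # column index of every stored value
    vals = x.data

    if mean is None or std is None:
        # Per-column count, sum, and sum of squares over the stored values only.
        count = np.bincount(cols, minlength=n_cols)
        s1 = np.bincount(cols, weights=vals, minlength=n_cols)
        s2 = np.bincount(cols, weights=vals ** 2, minlength=n_cols)
        safe = np.maximum(count, 1)                    # avoid division by zero
        mean = s1 / safe                               # 0.0 for empty columns
        var = np.maximum(s2 / safe - mean ** 2, 0.0)   # E[X^2] - E[X]^2, clipped
        std = np.sqrt(var)
        std[std == 0.0] = 1.0      # empty, single-observation, or constant columns
    else:
        mean = np.asarray(mean, dtype=float)
        std = np.asarray(std, dtype=float)

    # Only the stored values are touched, so the matrix stays sparse.
    x.data = (vals - mean[cols]) / std[cols]
    return x, mean, std
```

Passing the `mean` and `std` obtained from a training matrix when calling the sketch on a test matrix reuses the training statistics, in the spirit of the pre-specified arrays mentioned above.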
Example
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> x = csr_matrix(np.array([[1, 0, 0], [3, 0, 4], [2, 5, 2]], dtype=float))
>>> print(x.toarray())
[[ 1.  0.  0.]
 [ 3.  0.  4.]
 [ 2.  5.  2.]]
>>> xnorm, xmean, xstd, xixnormed = csr_utils.normalize_csr_matrix(x)
>>> print(xnorm.todense())
[[-1.22474487  0.          0.        ]
 [ 1.22474487  0.          1.        ]
 [ 0.          0.         -1.        ]]
>>> xmean
array([2., 5., 3.])
>>> xstd
array([0.81649658, 1. , 1. ])
>>> xixnormed
array([0, 2])
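As a cross-check, the dense expression from the function description can be evaluated on the same 3x3 matrix. The sketch below uses only NumPy and does not call csr_utils; it is practical only for small matrices. Because the middle column has a single observation, `np.nanstd` returns 0 there, so the sketch substitutes 1.0 to follow the stated assumption that only the mean is subtracted in that case.

```python
import numpy as np

x = np.array([[1, 0, 0], [3, 0, 4], [2, 5, 2]], dtype=float)
x[x == 0] = np.nan                 # treat zeros as missing observations

mean = np.nanmean(x, axis=0)       # per-column mean of the non-zero values
std = np.nanstd(x, axis=0)         # population standard deviation (ddof=0)
std[std == 0] = 1.0                # single-observation or constant columns

# Normalize, then put zeros back where values were missing.
dense_norm = np.nan_to_num((x - mean) / std)

print(mean)
print(std)
print(dense_norm)
```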
Hashes for csr_utils-0.1.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | fb592b12ae19d9c827fb949abe7fe31cfc53a2299231947c943662a41c0c708b
MD5 | 616840ad5a96d9098dc838fd2ebfad1d
BLAKE2b-256 | b1f460975f0eb14292f357ffb37d0c8fb49a8e85e7c22864aadc5b7c9c3ffdf4