OptiMask: extracting the largest (non-contiguous) submatrix without NaN

OptiMask: Efficient NaN Data Removal in Python

OptiMask is a Python package for handling NaN values in matrices: it computes the largest (non-contiguous) submatrix that contains no NaN. It uses a heuristic method and relies on numpy and numba for speed. In machine learning applications, OptiMask can retain more valid data for model fitting than approaches such as pandas dropna, because it jointly chooses which columns (features) and which rows (samples) to keep or remove, so that as large a NaN-free submatrix as possible is used for training models.

The problem differs from computing the largest rectangle of 1s in a binary matrix (which can be tackled with dynamic programming) and requires a novel approach. It also differs from this algorithmic challenge in that it requires rearranging both columns and rows, rather than just columns.
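
To make the problem statement concrete, the sketch below solves it exactly by exhaustive search on a tiny matrix. It is not OptiMask's heuristic: the helper name `brute_force_largest_submatrix` is hypothetical, and the search visits every subset of columns, so it is only practical for a handful of columns. In this example the best contiguous NaN-free rectangle covers 4 cells (a single full column), while keeping the non-adjacent columns 0 and 2 covers 8.

```python
from itertools import combinations

import numpy as np


def brute_force_largest_submatrix(x):
    """Exhaustive search over column subsets (illustration only, exponential cost)."""
    nan_mask = np.isnan(x)
    n_cols = x.shape[1]
    best_rows, best_cols, best_area = np.array([], dtype=int), [], 0
    for k in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), k):
            # Once the columns are fixed, the best choice of rows is every
            # row that has no NaN in those columns.
            rows = np.flatnonzero(~nan_mask[:, list(cols)].any(axis=1))
            area = len(rows) * k
            if area > best_area:
                best_rows, best_cols, best_area = rows, list(cols), area
    return best_rows, best_cols, best_area


# Column 1 has NaN in rows 0 and 2; columns 0 and 2 are fully valid.
x = np.array([
    [1.0, np.nan, 1.0],
    [1.0, 1.0, 1.0],
    [1.0, np.nan, 1.0],
    [1.0, 1.0, 1.0],
])

rows, cols, area = brute_force_largest_submatrix(x)
print(rows.tolist(), cols, area)  # [0, 1, 2, 3] [0, 2] 8
```

The selected columns are not adjacent, which is why a dynamic-programming search for contiguous rectangles does not apply here.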

Key Features

  • Largest Submatrix without NaN: OptiMask computes the largest submatrix without NaN, keeping as much valid data as possible for analysis.
  • Efficient Computation: numpy and numba keep the computation fast, even on large matrices.
  • Numpy, Pandas and Polars Compatibility: OptiMask adapts to numpy, pandas and polars data structures.

Installation

Install the optimask package via pip:

pip install optimask

OptiMask is also available on the conda-forge channel:

conda install -c conda-forge optimask
mamba install optimask

Usage Example

Import the OptiMask class from the optimask package and call its solve method to get the rows and columns to keep:

from optimask import OptiMask
import numpy as np

# Create a matrix with NaN values
m = 120
n = 7
data = np.zeros(shape=(m, n))
data[24:72, 3] = np.nan
data[95, :5] = np.nan

# Solve for the largest submatrix without NaN values
rows, cols = OptiMask().solve(data)

# Calculate the ratio of non-NaN values in the result
coverage_ratio = len(rows) * len(cols) / data.size

# Check if there are any NaN values in the selected submatrix
has_nan_values = np.isnan(data[rows][:, cols]).any()

# Print or display the results
print(f"Coverage Ratio: {coverage_ratio:.2f}, Has NaN Values: {has_nan_values}")
# Output: Coverage Ratio: 0.85, Has NaN Values: False

The grey cells represent the NaN locations, the blue ones represent the valid data, and the red ones represent the rows and columns removed by the algorithm:

Structured NaN
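
For comparison, here is a numpy-only sketch of what simpler strategies would keep on the same matrix as in the usage example. It does not call OptiMask: the "mixed" selection (drop row 95 and column 3) is hand-picked for this particular NaN pattern, and the variable names are illustrative.

```python
import numpy as np

# Same matrix as in the usage example
data = np.zeros(shape=(120, 7))
data[24:72, 3] = np.nan
data[95, :5] = np.nan

nan_mask = np.isnan(data)

# Strategy 1: drop every row containing a NaN (what a row-wise dropna does)
rows_only = int((~nan_mask.any(axis=1)).sum()) * data.shape[1]

# Strategy 2: drop every column containing a NaN (what a column-wise dropna does)
cols_only = int((~nan_mask.any(axis=0)).sum()) * data.shape[0]

# Strategy 3: hand-picked mix for this pattern: drop row 95 and column 3
keep_rows = np.delete(np.arange(data.shape[0]), 95)
keep_cols = np.delete(np.arange(data.shape[1]), 3)
assert not np.isnan(data[np.ix_(keep_rows, keep_cols)]).any()
mixed = len(keep_rows) * len(keep_cols)

print(rows_only, cols_only, mixed)  # 497 240 714
print(f"{mixed / data.size:.2f}")   # 0.85
```

Dropping along a single axis keeps 497 or 240 cells out of 840, while removing one row and one column keeps 714, which matches the 0.85 coverage ratio reported in the usage example.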

OptiMask’s algorithm is useful for handling unstructured NaN patterns, as shown in the following example:

Unstructured NaN

Performance

OptiMask handles large matrices quickly. In the run below, a 100,000 x 1,000 matrix with 2% of values missing at random is solved in about 0.2 s of wall time:

from optimask import OptiMask
import numpy as np

def generate_random(m, n, ratio):
    """Missing at random arrays"""
    return np.random.choice(a=[0, np.nan], size=(m, n), p=[1-ratio, ratio])

x = generate_random(m=100_000, n=1_000, ratio=0.02)
# %time is an IPython/Jupyter magic: run this line in a notebook or IPython shell
%time rows, cols = OptiMask(verbose=True).solve(x)
# CPU times: total: 609 ms
# Wall time: 191 ms
# 	Trial 1 : submatrix of size 35008x52 (1820416 elements) found.
# 	Trial 2 : submatrix of size 35579x51 (1814529 elements) found.
# 	Trial 3 : submatrix of size 37900x48 (1819200 elements) found.
# 	Trial 4 : submatrix of size 38040x48 (1825920 elements) found.
# 	Trial 5 : submatrix of size 37753x48 (1812144 elements) found.
# Result: the largest submatrix found is of size 38040x48 (1825920 elements) found.

Documentation

For detailed documentation, API usage, examples and insights on the algorithm, visit OptiMask Documentation.

Related Project: timefiller

If you're working with time series data, check out timefiller, another Python package I developed for time series imputation. timefiller is designed to efficiently handle missing data in time series and relies heavily on optimask.

Citation

If you use OptiMask in your research or work, please cite it:

@software{optimask2024,
  author = {Cyril Joly},
  title = {OptiMask: NaN Removal and Largest Submatrix Computation},
  year = {2024},
  url = {https://github.com/CyrilJl/OptiMask},
}

Or:

OptiMask (2024). NaN Removal and Largest Submatrix Computation. Developed by Cyril Joly: https://github.com/CyrilJl/OptiMask
