OptiMask: extracting the largest (non-contiguous) submatrix without NaN

OptiMask: Efficient NaN Data Removal in Python

Introduction

OptiMask is a Python package that removes NaN (Not-a-Number) data from matrices by efficiently computing the largest submatrix without NaN values; the retained rows and columns need not be contiguous. It is designed to be practical and works directly with NumPy arrays and pandas DataFrames.

Key Features

  • Largest Submatrix without NaN: OptiMask searches for the largest submatrix without NaN, selecting rows and columns that need not be contiguous, so that as much valid data as possible is retained.
  • Efficient Computation: the computation is optimized to return results quickly, even on large matrices such as the 100,000 x 1,000 example below.
  • NumPy and pandas Compatibility: OptiMask accepts both NumPy arrays and pandas DataFrames.
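
To make the problem statement concrete, here is a minimal brute-force sketch for tiny matrices. It is not OptiMask's algorithm: the function name `brute_force_largest_submatrix` is hypothetical, and the exhaustive search over column subsets is only an illustration of what "largest non-contiguous submatrix without NaN" means. It relies on one observation: once the columns are fixed, the best rows are exactly those with no NaN in these columns.

```python
from itertools import combinations

import numpy as np


def brute_force_largest_submatrix(data):
    """Exhaustive reference search (hypothetical helper, not part of OptiMask).

    Tries every non-empty column subset, so the cost grows as 2**n_cols.
    Only usable on very small inputs.
    """
    nan_mask = np.isnan(data)
    n_cols = data.shape[1]
    best_rows = np.array([], dtype=int)
    best_cols = np.array([], dtype=int)
    best_area = 0
    for k in range(1, n_cols + 1):
        for subset in combinations(range(n_cols), k):
            cols = np.array(subset)
            # Rows that have no NaN in the selected columns
            rows = np.flatnonzero(~nan_mask[:, cols].any(axis=1))
            area = len(rows) * len(cols)
            if area > best_area:
                best_rows, best_cols, best_area = rows, cols, area
    return best_rows, best_cols


data = np.ones((6, 4))
data[1:4, 2] = np.nan  # three NaN in column 2
data[5, :3] = np.nan   # one row that is mostly NaN

rows, cols = brute_force_largest_submatrix(data)
print(rows, cols, len(rows) * len(cols))
```

On this toy input the best choice keeps rows 0 to 4 and columns 0, 1 and 3 (15 cells): the retained columns are not adjacent, which is what "non-contiguous" refers to.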

Installation

Install the optimask package via pip:

pip install optimask

OptiMask is also available on the conda-forge channel:

conda install -c conda-forge optimask
mamba install optimask

Usage Example

Import the OptiMask class from the optimask package and call its solve method to obtain the rows and columns to keep:

from optimask import OptiMask
import numpy as np

# Create a matrix with NaN values
m = 120
n = 7
data = np.zeros(shape=(m, n))
data[24:72, 3] = np.nan
data[95, :5] = np.nan

# Solve for the largest submatrix without NaN values
rows, cols = OptiMask().solve(data)

# Calculate the ratio of non-NaN values in the result
coverage_ratio = len(rows) * len(cols) / data.size

# Check if there are any NaN values in the selected submatrix
has_nan_values = np.isnan(data[rows][:, cols]).any()

# Print or display the results
print(f"Coverage Ratio: {coverage_ratio:.2f}, Has NaN Values: {has_nan_values}")
# Output: Coverage Ratio: 0.85, Has NaN Values: False
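
To put the 0.85 coverage in context, the sketch below compares it with the two simplest strategies on the same matrix: dropping every row that contains a NaN, or dropping every column that contains a NaN. It uses NumPy only. The "mixed" selection (drop row 95 and column 3) is an assumption about what the solver returns; it is consistent with the reported coverage but is written out by hand here.

```python
import numpy as np

# Same matrix as in the usage example
data = np.zeros(shape=(120, 7))
data[24:72, 3] = np.nan
data[95, :5] = np.nan

nan_mask = np.isnan(data)

# Strategy 1: drop every row that contains at least one NaN
kept_rows = np.flatnonzero(~nan_mask.any(axis=1))
rows_only = len(kept_rows) * data.shape[1]

# Strategy 2: drop every column that contains at least one NaN
kept_cols = np.flatnonzero(~nan_mask.any(axis=0))
cols_only = data.shape[0] * len(kept_cols)

# Mixed strategy (assumed solution): drop row 95 and column 3 only
mixed_rows = np.delete(np.arange(120), 95)
mixed_cols = np.delete(np.arange(7), 3)
mixed = len(mixed_rows) * len(mixed_cols)
assert not np.isnan(data[np.ix_(mixed_rows, mixed_cols)]).any()

for name, cells in [("rows only", rows_only), ("columns only", cols_only), ("mixed", mixed)]:
    print(f"{name}: {cells} cells, coverage {cells / data.size:.2f}")
# rows only: 497 cells, coverage 0.59
# columns only: 240 cells, coverage 0.29
# mixed: 714 cells, coverage 0.85
```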

[Figure: result for the example above. Grey cells are the NaN locations, blue cells are the valid data that is kept, and red cells belong to the rows and columns removed by the algorithm.]

OptiMask’s algorithm is also useful for handling unstructured NaN patterns.

[Figure: example result on an unstructured NaN pattern.]
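
For readers without the figures, a text rendering can serve the same purpose. The sketch below builds a small unstructured NaN pattern and prints which cells are kept or removed. Both helpers are hypothetical and not part of OptiMask: `render` draws the grid, and `greedy_nan_removal` is a naive baseline used only so that the snippet runs with NumPy alone. With OptiMask installed, `rows, cols = OptiMask().solve(data)` could be substituted for the greedy call.

```python
import numpy as np


def greedy_nan_removal(data):
    """Naive baseline (not OptiMask's algorithm): repeatedly drop the row or
    column with the highest share of NaN until no NaN remains."""
    nan_mask = np.isnan(data)
    rows = np.arange(data.shape[0])
    cols = np.arange(data.shape[1])
    while nan_mask[np.ix_(rows, cols)].any():
        sub = nan_mask[np.ix_(rows, cols)]
        row_share = sub.mean(axis=1)  # NaN share of each remaining row
        col_share = sub.mean(axis=0)  # NaN share of each remaining column
        if row_share.max() >= col_share.max():
            rows = np.delete(rows, row_share.argmax())
        else:
            cols = np.delete(cols, col_share.argmax())
    return rows, cols


def render(data, rows, cols):
    """Text stand-in for a figure: '#' NaN, 'o' kept cell, 'x' removed valid cell."""
    kept = np.zeros(data.shape, dtype=bool)
    kept[np.ix_(rows, cols)] = True
    symbols = np.where(np.isnan(data), "#", np.where(kept, "o", "x"))
    return "\n".join("".join(line) for line in symbols)


rng = np.random.default_rng(0)
data = np.ones((12, 8))
data[rng.random(data.shape) < 0.1] = np.nan  # unstructured NaN pattern

rows, cols = greedy_nan_removal(data)
picture = render(data, rows, cols)
print(picture)
print(f"kept {len(rows)}x{len(cols)} of {data.shape[0]}x{data.shape[1]}")
```

The greedy rule gives a valid NaN-free selection but no guarantee on its size, which is why it is only a stand-in here.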

Performance

OptiMask efficiently handles large matrices, delivering results within reasonable computation times:

from optimask import OptiMask
import numpy as np

def generate_random(m, n, ratio):
    """Missing at random arrays"""
    arr = np.zeros((m, n))
    nan_count = int(ratio * m * n)
    indices = np.random.choice(m * n, nan_count, replace=False)
    arr.flat[indices] = np.nan
    return arr

x = generate_random(m=100_000, n=1_000, ratio=0.02)
# %time is an IPython/Jupyter magic; run this line in a notebook or IPython shell
%time rows, cols = OptiMask(verbose=True).solve(x)
>>> 	Trial 1 : submatrix of size 37094x49 (1817606 elements) found.
>>> 	Trial 2 : submatrix of size 35667x51 (1819017 elements) found.
>>> 	Trial 3 : submatrix of size 37908x48 (1819584 elements) found.
>>> 	Trial 4 : submatrix of size 37047x49 (1815303 elements) found.
>>> 	Trial 5 : submatrix of size 37895x48 (1818960 elements) found.
>>> Result: the largest submatrix found is of size 37908x48 (1819584 elements) found.
>>> CPU times: total: 172 ms
>>> Wall time: 435 ms
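
As a sanity check on the sizes reported above, a back-of-the-envelope estimate is possible for missing-at-random data. The sketch assumes each cell is missing independently with probability `ratio` (an approximation: `generate_random` draws an exact NaN count) and that an arbitrary set of k columns is kept. A row then survives with probability (1 - ratio)^k, so the expected number of retained cells is m * k * (1 - ratio)^k. This is an illustration of orders of magnitude, not a description of how OptiMask works; `expected_cells` is a hypothetical helper.

```python
import math

m, n, ratio = 100_000, 1_000, 0.02  # same setting as the benchmark


def expected_cells(k):
    """Expected NaN-free cells when keeping k arbitrary columns,
    assuming independent missingness with probability `ratio`."""
    return m * (1 - ratio) ** k * k


# Best number of columns under this simplified model
best_k = max(range(1, n + 1), key=expected_cells)
expected_rows = m * (1 - ratio) ** best_k

# Continuous optimum of k * (1 - ratio)**k, for comparison
k_star = -1 / math.log(1 - ratio)

print(best_k, round(expected_rows), round(expected_cells(best_k)), round(k_star, 1))
```

Under these assumptions the optimum sits at about 49 to 50 columns, roughly 36,000 to 37,000 rows and about 1.82 million cells, the same order as the trials printed above.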

Documentation

For detailed documentation, including installation instructions, API usage, and examples, visit OptiMask Documentation.

Repository Link

Find more about OptiMask on GitHub: https://github.com/CyrilJl/OptiMask

Citation

If you use OptiMask in your research or work, please cite it:

@software{optimask2024,
  author = {Cyril Joly},
  title = {OptiMask: NaN Removal and Largest Submatrix Computation},
  year = {2024},
  url = {https://github.com/CyrilJl/OptiMask},
}

Or:

OptiMask (2024). NaN Removal and Largest Submatrix Computation. Developed by Cyril Joly: https://github.com/CyrilJl/OptiMask
