Entiny

A high-performance Python package for Information-Based Optimal Subdata Selection (IBOSS) using Polars for efficient data processing.

Features

  • Larger-than-memory implementation suitable for large datasets
  • Automatic detection and handling of stratification variables
  • Progress tracking with tqdm
  • Support for both CSV and Parquet file formats
  • Command-line interface for easy usage

Installation

# Install the package with pip
pip install entiny

# The CLI command 'entiny' will be automatically installed
# Verify the installation
entiny --help

Installing the package also adds the entiny command to your system. Run entiny --help to confirm the installation and list the available options.

Quick Start

import polars as pl
import numpy as np
from entiny import entiny

# Create or load your data
df = pl.DataFrame({
    "category": ["A", "A", "B", "B"] * 250,
    "value1": np.random.normal(0, 1, 1000),
    "value2": np.random.uniform(-5, 5, 1000)
})

# Sample extreme values
# This will automatically detect "category" as a stratum
# and sample extreme values within each category
result = entiny(df, n=10).collect()

Usage

Python API

from entiny import entiny

# From a DataFrame
result = entiny(df, n=10).collect()

# From a CSV file
result = entiny("data.csv", n=10).collect()

# From a Parquet file
result = entiny("data.parquet", n=10).collect()

# With custom options
result = entiny(
    data=df,
    n=10,                    # Number of extreme values to select from each end
    seed=42,                 # For reproducibility
    show_progress=True       # Show progress bars
).collect()

Command Line Interface

# Basic usage
entiny -i input.csv -o output.csv -n 10

# With all options
entiny \
    --input data.csv \
    --output sampled.csv \
    --n 10 \
    --seed 42 \
    --no-progress  # Optional: disable progress bars

How It Works

  1. Automatic Feature Detection:

    • Numeric columns are used for sampling extreme values
    • String/categorical columns are automatically detected as strata

  2. Stratified Sampling:

    • If categorical columns are present, sampling is performed within each stratum
    • For each numeric variable in each stratum:
      • Selects n highest values
      • Selects n lowest values

  3. Memory Efficiency:

    • Uses Polars' lazy evaluation
    • Processes data in chunks
    • Minimizes memory usage for large datasets
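
The selection rule described above can be sketched in plain Python. This is an illustration of the idea only, not entiny's Polars-based implementation: the function name select_extremes and the list-of-dicts input are assumptions made for the example, and the sketch collapses rows that are extreme in more than one column, which this README does not confirm for the package itself.

```python
from collections import defaultdict
from numbers import Number


def select_extremes(rows, n):
    """Hypothetical helper: keep rows holding the n lowest and n highest
    values of each numeric column within each stratum."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not rows:
        return []
    columns = list(rows[0])
    # Numeric columns drive the selection; all other columns define strata.
    numeric = [c for c in columns if isinstance(rows[0][c], Number)]
    strata_cols = [c for c in columns if c not in numeric]

    # Group row indices by their combination of stratum values.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[c] for c in strata_cols)].append(i)

    keep = set()
    for indices in groups.values():
        for col in numeric:
            ordered = sorted(indices, key=lambda i: rows[i][col])
            keep.update(ordered[:n])   # n lowest values
            keep.update(ordered[-n:])  # n highest values
    # Return rows in their original order; the set removes duplicates.
    return [rows[i] for i in sorted(keep)]


rows = [
    {"category": "A", "value": 3.0},
    {"category": "A", "value": -1.0},
    {"category": "A", "value": 7.5},
    {"category": "B", "value": 0.2},
    {"category": "B", "value": 9.9},
    {"category": "B", "value": 4.4},
]
sample = select_extremes(rows, n=1)
```

With n=1, the sketch keeps the lowest and highest value of each category and drops the middle row of each group.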

Example with Stratification

import polars as pl
import numpy as np
from entiny import entiny

# Create a dataset with multiple strata
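df = pl.DataFrame({
    "region": ["North", "South"] * 500,
    # Pattern chosen so that category varies independently of region,
    # giving four region/category combinations instead of two
    "category": ["A", "A", "B", "B"] * 250,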
    "sales": np.random.lognormal(0, 1, 1000),
    "quantity": np.random.poisson(5, 1000)
})

# Sample extreme values
# Will automatically detect "region" and "category" as strata
result = entiny(df, n=5).collect()

Performance Considerations

  • Uses Polars for high-performance data operations
  • Lazy evaluation minimizes memory usage
  • Progress bars (tqdm) report the status of long-running operations and can be turned off with show_progress=False or --no-progress
  • Efficient handling of large datasets through streaming
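
To illustrate why extreme-value selection lends itself to chunked, larger-than-memory processing, here is a minimal sketch in plain Python: between chunks, only the current n smallest and n largest values need to be retained. It is a sketch under assumptions, not the package's internals. The name running_extremes is hypothetical, and the chunks iterable stands in for whatever reader supplies the data.

```python
import heapq


def running_extremes(chunks, n):
    """Hypothetical helper: track the n smallest and n largest values
    seen across an iterable of chunks."""
    if n < 1:
        raise ValueError("n must be at least 1")
    lowest, highest = [], []
    for chunk in chunks:
        values = list(chunk)
        # Merge the retained candidates with the new chunk, then trim to n.
        lowest = heapq.nsmallest(n, lowest + values)
        highest = heapq.nlargest(n, highest + values)
    return lowest, highest


chunks = [[5, 1, 9], [7, -3, 2], [8, 10, 0]]
lowest, highest = running_extremes(chunks, n=2)
```

At any point the sketch holds at most one chunk plus 2n candidate values, so memory use depends on the chunk size and n rather than on the size of the full dataset.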

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License
