Entiny
A high-performance Python package for Information-Based Optimal Subdata Selection (IBOSS) using Polars for efficient data processing.
Features
- Larger-than-memory implementation suitable for large datasets
- Automatic detection and handling of stratification variables
- Progress tracking with tqdm
- Support for both CSV and Parquet file formats
- Command-line interface for easy usage
Installation
# Install the package with pip
pip install entiny
# The CLI command 'entiny' will be automatically installed
# Verify the installation
entiny --help
Installing the package also adds the entiny command to your environment. Run entiny --help to confirm the installation and list the available options.
Quick Start
import polars as pl
import numpy as np
from entiny import entiny
# Create or load your data
df = pl.DataFrame({
    "category": ["A", "A", "B", "B"] * 250,
    "value1": np.random.normal(0, 1, 1000),
    "value2": np.random.uniform(-5, 5, 1000),
})
# Sample extreme values
# This will automatically detect "category" as a stratum
# and sample extreme values within each category
result = entiny(df, n=10).collect()
Usage
Python API
from entiny import entiny
# From a DataFrame
result = entiny(df, n=10).collect()
# From a CSV file
result = entiny("data.csv", n=10).collect()
# From a Parquet file
result = entiny("data.parquet", n=10).collect()
# With custom options
result = entiny(
    data=df,
    n=10,                # Number of extreme values to select from each end
    seed=42,             # For reproducibility
    show_progress=True,  # Show progress bars
).collect()
Command Line Interface
# Basic usage
entiny -i input.csv -o output.csv -n 10
# With all options
entiny \
  --input data.csv \
  --output sampled.csv \
  --n 10 \
  --seed 42 \
  --no-progress  # Optional: disable progress bars
How It Works
- Automatic Feature Detection:
  - Numeric columns are used for sampling extreme values
  - String/categorical columns are automatically detected as strata
- Stratified Sampling:
  - If categorical columns are present, sampling is performed within each stratum
  - For each numeric variable in each stratum:
    - Selects n highest values
    - Selects n lowest values
- Memory Efficiency:
  - Uses Polars' lazy evaluation
  - Processes data in chunks
  - Minimizes memory usage for large datasets
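The selection rule described above can be sketched in plain Python. This is only an illustration of the idea, not entiny's actual implementation: `select_extremes` is a hypothetical helper, rows are plain dictionaries rather than Polars frames, and it assumes that strata are formed from the combination of all string columns and that rows selected more than once are kept only once.

```python
import heapq
from collections import defaultdict


def select_extremes(rows, n):
    """Keep rows holding one of the n lowest or n highest values of any
    numeric column within their stratum (hypothetical helper, not entiny's API)."""
    if not rows:
        return []
    first = rows[0]
    # String columns define the strata; int/float columns are scanned for extremes.
    strata_cols = [k for k, v in first.items() if isinstance(v, str)]
    numeric_cols = [
        k for k, v in first.items()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]

    # Group row indices by their stratum key.
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[tuple(row[c] for c in strata_cols)].append(idx)

    # Collect the indices of the n lowest and n highest rows per column and stratum.
    keep = set()
    for indices in groups.values():
        for col in numeric_cols:
            def by_value(i, col=col):
                return rows[i][col]
            keep.update(heapq.nsmallest(n, indices, key=by_value))
            keep.update(heapq.nlargest(n, indices, key=by_value))
    return [rows[i] for i in sorted(keep)]


rows = [
    {"category": "A", "value": 1.0},
    {"category": "A", "value": 5.0},
    {"category": "A", "value": 3.0},
    {"category": "B", "value": 10.0},
    {"category": "B", "value": 20.0},
    {"category": "B", "value": 30.0},
]
selected = select_extremes(rows, n=1)
print([r["value"] for r in selected])
```

With n=1, each category keeps its single lowest and single highest row, so the middle value of each stratum is dropped.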
Example with Stratification
import polars as pl
import numpy as np
from entiny import entiny
# Create a dataset with multiple strata
df = pl.DataFrame({
    "region": ["North", "South"] * 500,
    "category": ["A", "B", "A", "B"] * 250,
    "sales": np.random.lognormal(0, 1, 1000),
    "quantity": np.random.poisson(5, 1000),
})
# Sample extreme values
# Will automatically detect "region" and "category" as strata
result = entiny(df, n=5).collect()
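To get a feel for how many rows such a call might return, it can help to count the strata that actually occur in the data. The sketch below uses only the standard library and makes two assumptions that the text above does not confirm: that strata are the observed combinations of "region" and "category", and that each stratum contributes at most n lowest plus n highest rows per numeric column. Under those assumptions the result is an upper bound, since rows selected for more than one column may overlap.

```python
# Rebuild the two stratification columns from the example above.
regions = ["North", "South"] * 500
categories = ["A", "B", "A", "B"] * 250

# Distinct (region, category) pairs that actually occur in the data.
observed_strata = sorted(set(zip(regions, categories)))

n = 5
numeric_columns = 2  # "sales" and "quantity"

# Assumed bound: n lowest + n highest rows per numeric column per stratum.
max_rows = len(observed_strata) * numeric_columns * 2 * n

print(observed_strata)
print(max_rows)
```

In this particular dataset the two columns alternate in lockstep, so only the pairs (North, A) and (South, B) occur, and the assumed bound works out to 40 rows rather than the 80 that four strata would give.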
Performance Considerations
- Uses Polars for high-performance data operations
- Lazy evaluation minimizes memory usage
- Progress bars show operation status
- Efficient handling of large datasets through streaming
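The memory argument rests on the fact that selecting extremes never requires holding the full dataset: a single pass that remembers only the n smallest and n largest values seen so far is enough. The following is a minimal standard-library sketch of that idea, not the Polars streaming engine the package relies on. `stream_extremes` is a hypothetical name, and the in-memory CSV stands in for a large file.

```python
import csv
import heapq
import io


def stream_extremes(lines, column, n):
    """Scan CSV rows once, keeping only the n smallest and n largest values."""
    lowest = []   # negated values, so the heap root is the largest of the kept lows
    highest = []  # min-heap, so the heap root is the smallest of the kept highs
    for row in csv.DictReader(lines):
        value = float(row[column])
        # Track the n largest values seen so far.
        if len(highest) < n:
            heapq.heappush(highest, value)
        elif value > highest[0]:
            heapq.heapreplace(highest, value)
        # Track the n smallest values seen so far.
        if len(lowest) < n:
            heapq.heappush(lowest, -value)
        elif -value > lowest[0]:
            heapq.heapreplace(lowest, -value)
    return sorted(-v for v in lowest), sorted(highest)


sample = io.StringIO("value\n3\n9\n1\n7\n5\n")
low, high = stream_extremes(sample, "value", n=2)
print(low, high)
```

Memory use here depends on n and not on the number of rows, which is the property that makes larger-than-memory inputs workable.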
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License
File details
Details for the file entiny-0.2.5.tar.gz.
File metadata
- Download URL: entiny-0.2.5.tar.gz
- Upload date:
- Size: 5.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.1 CPython/3.11.11 Linux/6.8.0-1021-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f0216e4c16d7a409cb9664060bd018a25cb92dee61f1ec8a3e6e4d186a50d015 |
| MD5 | 0419cfc4df523f98e73cc4d9a5959c17 |
| BLAKE2b-256 | 9b4989d3014583baca44e11cc1461d0dde1a6a9c3dbe09354a03cb0e0ea559ce |
File details
Details for the file entiny-0.2.5-py3-none-any.whl.
File metadata
- Download URL: entiny-0.2.5-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.1 CPython/3.11.11 Linux/6.8.0-1021-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 70193e6adc6888cb3acd06c5b5de5a19ce441e00750e0fa9abd77d0bd1475f53 |
| MD5 | a01206485d014374e1e6d854bc5a1066 |
| BLAKE2b-256 | 91f5a8257b0a9086a466f9252dbf9a71f314963ab36694511faef014e83f41f0 |