FunPuter v1.3.6 - Intelligent Imputation Analysis

Production-ready intelligent imputation analysis with automatic data validation and metadata inference.

FunPuter analyzes your data and suggests the best imputation methods based on:

  • 🤖 15 metadata fields automatically inferred
  • 🔍 Missing data mechanisms (MCAR, MAR, MNAR detection)
  • 📊 Data types and statistical properties
  • ⚡ Metadata constraints (nullable, allowed_values, max_length validation)
  • 🛡️ Automatic data validation and recommendations
  • 🎯 Adaptive thresholds based on your dataset characteristics

🚀 Quick Start

Installation

pip install funputer

30-Second Demo

🤖 Auto-Inference Mode (Zero Configuration!)

import funputer

# Just point to your CSV - FunPuter figures out everything automatically!
suggestions = funputer.analyze_imputation_requirements("your_data.csv")

# Get intelligent suggestions
for suggestion in suggestions:
    if suggestion.missing_count > 0:
        print(f"📊 {suggestion.column_name}: {suggestion.proposed_method}")
        print(f"   Confidence: {suggestion.confidence_score:.3f}")
        print(f"   Reason: {suggestion.rationale}")
        print(f"   Missing: {suggestion.missing_count} ({suggestion.missing_percentage:.1f}%)")

📋 Production Mode (Full Control)

import funputer
from funputer.models import ColumnMetadata

# Define your data structure with constraints
metadata = [
    ColumnMetadata('customer_id', 'integer', unique_flag=True),
    ColumnMetadata('age', 'integer', min_value=18, max_value=100),
    ColumnMetadata('income', 'float', min_value=0),
    ColumnMetadata('category', 'categorical', allowed_values='A,B,C'),
]

# Get production-grade suggestions
suggestions = funputer.analyze_dataframe(your_dataframe, metadata)

🖥️ Command Line Interface

# Auto-inference - easiest way
funputer analyze -d your_data.csv

# Production analysis with metadata
funputer analyze -d your_data.csv -m metadata.csv --verbose

# Data quality check first
funputer preflight -d your_data.csv

# Generate metadata template
funputer init -d your_data.csv -o metadata.csv

🚨 IMPORTANT: v1.3.0 Breaking Change

🎯 Consistent Naming: Starting with v1.3.0, all imports and CLI commands use consistent funputer naming:

# ✅ NEW (v1.3.0+): Consistent naming
import funputer
funputer.analyze_imputation_requirements("data.csv")
# ✅ NEW CLI command (v1.3.0+)
funputer analyze -d data.csv

🔄 Migration: For backward compatibility, old imports still work with deprecation warnings:

# ⚠️ DEPRECATED (still works but shows warning)
import funimpute
# Old funimputer CLI command also still works

📅 Timeline: Deprecated imports will be removed in v2.0.0. Please update your code!
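
Code that has to run against both older and newer installs during the migration window can try the new module name first and fall back to the old one. The sketch below shows one way to do that with the standard library. `import_first_available` is a hypothetical helper for illustration, not part of FunPuter.

```python
import importlib
from types import ModuleType
from typing import Sequence


def import_first_available(names: Sequence[str]) -> ModuleType:
    """Return the first module in `names` that can be imported."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # try the next candidate name
    raise ImportError(f"None of these modules could be imported: {list(names)}")


# Intended use during migration (needs one of the two packages installed):
# funputer = import_first_available(["funputer", "funimpute"])
```

Listing the new name first means the deprecated import is only touched when the new one is missing.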

🎯 Enhanced Features (v1.3.0)

What's New in v1.3.0:

  • 🎯 Consistent Naming: All imports and CLI use funputer (backward compatible)
  • 🔄 JSON Metadata Support: SimpleImputationAnalyzer now handles both CSV and JSON metadata formats
  • 📋 Enhanced Documentation: Updated examples and migration guides

Previous Features (v1.2.1):

  • 🚨 Data Validation System: Comprehensive checks that run before analysis to prevent crashes
  • 🔍 Smart Auto-Inference: Intelligent metadata detection with confidence scoring
  • ⚡ Constraint Validation: Real-time nullable, allowed_values, and max_length checking
  • 🎯 Enhanced Proposals: Metadata-aware imputation method selection
  • 🛡️ Exception Detection: Comprehensive constraint violation handling
  • 📈 Improved Confidence: Dynamic scoring based on metadata compliance
  • 🧹 Warning Suppression: Clean output with optimized pandas datetime parsing
  • ✅ Quality Assurance: 71% overall test coverage with comprehensive test suite

🚨 Data Validation System (NEW!)

Fast validation to prevent crashes and guide your workflow

What the Validation System Does

  • Runs automatically before init and analyze commands
  • Comprehensive checks: file access, format detection, encoding, structure, memory estimation
  • Advisory recommendations: "generate metadata first" vs "analyze now"
  • Zero crashes: Catches problems before they break your workflow
  • Backward compatible: All existing commands work exactly as before

Independent Usage

# Basic validation check
funputer preflight -d your_data.csv

# With custom options
funputer preflight -d data.csv --sample-rows 5000 --encoding utf-8

# JSON report output
funputer preflight -d data.csv --json-out report.json

Exit Codes

  • 0: ✅ Ready for analysis
  • 2: ⚠️ OK with warnings (can proceed)
  • 10: ❌ Hard error (cannot proceed)
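
When preflight is called from a script, the exit codes above can drive the next step. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and treating any undocumented code as "do not proceed" is an assumption.

```python
import subprocess

# Exit codes as documented above
PREFLIGHT_STATUS = {
    0: "ready",
    2: "warnings",
    10: "error",
}


def interpret_preflight(exit_code: int) -> str:
    """Map a preflight exit code to a short status label."""
    return PREFLIGHT_STATUS.get(exit_code, "unknown")


def can_proceed(exit_code: int) -> bool:
    """Codes 0 and 2 allow analysis to continue; anything else does not."""
    return exit_code in (0, 2)


def run_preflight(csv_path: str) -> int:
    """Run `funputer preflight -d <csv_path>` and return its exit code."""
    completed = subprocess.run(["funputer", "preflight", "-d", csv_path])
    return completed.returncode
```

A caller would typically check `can_proceed(run_preflight("your_data.csv"))` before starting the analysis step.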

Example Output

🔍 VALIDATION REPORT
==================================================
Status: ✅ OK
File: data.csv
Size: 2.5 MB (csv)  
Columns: 12
Recommendation: Analyze Infer Only

FunPuter now supports comprehensive metadata fields that actively influence imputation recommendations:

Metadata Schema

Field            | Type    | Description                                                | Example
column_name      | string  | Column identifier                                          | "age"
data_type        | string  | Data type (integer, float, string, categorical, datetime)  | "integer"
nullable         | boolean | Allow null values                                          | false
min_value        | number  | Minimum allowed value                                      | 0
max_value        | number  | Maximum allowed value                                      | 120
max_length       | integer | Maximum string length                                      | 50
allowed_values   | string  | Comma-separated list of allowed values                     | "A,B,C"
unique_flag      | boolean | Require unique values                                      | true
dependent_column | string  | Column dependencies                                        | "age"
business_rule    | string  | Custom validation rules                                    | "Must be positive"
description      | string  | Human-readable description                                 | "User age in years"

🛠️ Creating Metadata

Method 1: CLI Template Generation

# Generate a metadata template from your data
funputer init -d data.csv -o metadata.csv

# Edit the generated file to add constraints
# Then analyze with enhanced metadata
funputer analyze -d data.csv -m metadata.csv

Method 2: Manual CSV Creation

# File: metadata.csv (this label is not part of the file; the header row below must be its first line)
column_name,data_type,nullable,min_value,max_value,max_length,allowed_values,unique_flag,dependent_column,business_rule,description
user_id,integer,false,,,50,,true,,,"Unique user identifier"
age,integer,false,0,120,,,,,Must be positive,"User age in years"
income,float,true,0,,,,,age,Higher with age,"Annual income in USD"
category,categorical,false,,,10,"A,B,C",,,,"User category classification"
email,string,true,,,255,,true,,,"User email address"
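
The quoting in the `allowed_values` column matters: `"A,B,C"` has to stay a single field. The sketch below parses two of the rows above with Python's standard `csv` module to show how the fields come out. It illustrates generic CSV parsing and makes no claim about how FunPuter loads the file internally; `read_metadata_rows` is a hypothetical helper.

```python
import csv
import io

# Two rows from the example above, with the header as a real first line
METADATA_CSV = '''column_name,data_type,nullable,min_value,max_value,max_length,allowed_values,unique_flag,dependent_column,business_rule,description
age,integer,false,0,120,,,,,Must be positive,"User age in years"
category,categorical,false,,,10,"A,B,C",,,,"User category classification"
'''


def read_metadata_rows(text: str) -> list[dict]:
    """Parse metadata CSV text into one dict per column definition."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_metadata_rows(METADATA_CSV)
category = rows[1]
print(category["allowed_values"])             # A,B,C
print(category["allowed_values"].split(","))  # ['A', 'B', 'C']
print(repr(rows[0]["max_length"]))            # ''
```

Empty cells come back as empty strings, so a loader has to decide separately whether an empty cell means "no constraint".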

🎯 Metadata in Action

Example 1: Nullable Constraints

# When nullable=False but data has missing values
metadata = ColumnMetadata(
    column_name="age",
    data_type="integer",
    nullable=False,
    min_value=0,
    max_value=120
)

# FunPuter will:
# - Detect nullable constraint violations
# - Recommend immediate data quality fixes
# - Lower confidence score due to constraint violations

Example 2: Allowed Values

# For categorical data with specific allowed values
metadata = ColumnMetadata(
    column_name="status",
    data_type="categorical",
    allowed_values="active,inactive,pending"
)

# FunPuter will:
# - Validate all values against allowed list
# - Recommend mode imputation using only allowed values
# - Increase confidence when data respects constraints
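
To make "mode imputation using only allowed values" concrete, here is a small pure-Python sketch. It is an illustration of the idea, not FunPuter's implementation. The function names are hypothetical, and `None` stands in for a missing value.

```python
from collections import Counter
from typing import Optional, Sequence


def allowed_mode(values: Sequence[Optional[str]], allowed_values: str) -> Optional[str]:
    """Most frequent non-missing value that is also in the allowed list."""
    allowed = {v.strip() for v in allowed_values.split(",")}
    counts = Counter(v for v in values if v is not None and v in allowed)
    if not counts:
        return None  # nothing valid to impute from
    return counts.most_common(1)[0][0]


def impute_with_allowed_mode(values, allowed_values):
    """Fill missing entries (None) with the allowed-value mode."""
    fill = allowed_mode(values, allowed_values)
    return [fill if v is None else v for v in values]


status = ["active", None, "inactive", "active", "archived", "archived", "archived", None]
print(allowed_mode(status, "active,inactive,pending"))  # active
```

Here "archived" is the most common raw value, but it is not in the allowed list, so the fill value is "active". Existing invalid values are left untouched in this sketch.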

Example 3: String Length Constraints

# For string data with length limits
metadata = ColumnMetadata(
    column_name="username",
    data_type="string",
    max_length=20,
    unique_flag=True
)

# FunPuter will:
# - Check string lengths against max_length
# - Recommend imputation respecting length limits
# - Consider uniqueness requirements in recommendations
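
The three examples above each describe a constraint check. As a rough sketch of what such checks amount to, the function below counts violations for one column in plain Python. It is a hypothetical helper, not FunPuter's validator; the report keys are made up for illustration and `None` marks a missing value.

```python
from typing import Optional, Sequence


def check_constraints(
    values: Sequence[Optional[str]],
    nullable: bool = True,
    allowed_values: Optional[str] = None,
    max_length: Optional[int] = None,
    unique_flag: bool = False,
) -> dict:
    """Count constraint violations in one column (None marks a missing value)."""
    present = [v for v in values if v is not None]
    report = {
        "null_violations": 0 if nullable else len(values) - len(present),
        "allowed_value_violations": 0,
        "max_length_violations": 0,
        "duplicate_values": 0,
    }
    if allowed_values is not None:
        allowed = {v.strip() for v in allowed_values.split(",")}
        report["allowed_value_violations"] = sum(v not in allowed for v in present)
    if max_length is not None:
        report["max_length_violations"] = sum(len(v) > max_length for v in present)
    if unique_flag:
        report["duplicate_values"] = len(present) - len(set(present))
    return report


usernames = ["alice", "bob", None, "a_very_long_username_over_twenty", "bob"]
print(check_constraints(usernames, nullable=False, max_length=20, unique_flag=True))
```

For the sample list this reports one null violation, one over-length string, and one duplicate.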

📊 Enhanced Analysis Results

# Results include comprehensive imputation analysis
for suggestion in suggestions:
    print(f"Column: {suggestion.column_name}")
    print(f"Method: {suggestion.proposed_method}")
    print(f"Confidence: {suggestion.confidence_score:.3f}")
    print(f"Rationale: {suggestion.rationale}")
    print(f"Missing: {suggestion.missing_count} ({suggestion.missing_percentage:.1f}%)")
    
    # Outlier information when relevant
    if suggestion.outlier_count > 0:
        print(f"Outliers: {suggestion.outlier_count} ({suggestion.outlier_percentage:.1f}%)")
        print(f"Outlier handling: {suggestion.outlier_handling}")
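
One practical use of these fields is to pick out the columns that deserve a manual look. The sketch below uses a hypothetical stand-in class carrying a few of the attributes shown above, so it runs without FunPuter installed. The 0.6 cutoff and the method names in the demo data are arbitrary illustrative values.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    """Hypothetical stand-in with a few of the attributes used above."""
    column_name: str
    proposed_method: str
    confidence_score: float
    missing_count: int


def needs_review(suggestions, min_confidence=0.6):
    """Columns with missing data whose confidence is below the cutoff, lowest first."""
    flagged = [
        s for s in suggestions
        if s.missing_count > 0 and s.confidence_score < min_confidence
    ]
    return sorted(flagged, key=lambda s: s.confidence_score)


demo = [
    Suggestion("age", "Median", 0.85, 12),
    Suggestion("income", "Regression", 0.45, 40),
    Suggestion("notes", "Manual Backfill", 0.30, 200),
    Suggestion("id", "No action needed", 0.90, 0),
]
for s in needs_review(demo):
    print(f"{s.column_name}: {s.proposed_method} ({s.confidence_score:.2f})")
```

The same function should work on real suggestion objects as long as they expose `missing_count` and `confidence_score`.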

🔍 Confidence-Score Heuristics

FunPuter assigns a confidence_score (range 0–1) to every imputation recommendation. The value is a transparent, rule-based estimate of how reliable the proposed method is, not a formal statistical uncertainty. Two calculators are used:

Base heuristic

When only column-level data is available (no full DataFrame), the score is computed as follows:

Signal               | Condition                                 | Δ Score
Starting value       |                                           | 0.50
Missing %            | < 5 %                                     | +0.20
Missing %            | 5–20 %                                    | +0.10
Missing %            | > 50 %                                    | −0.20
Mechanism            | MCAR (weak evidence)                      | +0.10
Mechanism            | MAR (related cols)                        | +0.05
Mechanism            | MNAR/UNKNOWN                              | −0.10
Outliers             | < 5 %                                     | +0.05
Outliers             | > 20 %                                    | −0.10
Metadata constraints | allowed_values (categorical/string)       | +0.10
Metadata constraints | max_length (string)                       | +0.05
Nullable constraint  | nullable=False with missing values        | −0.15
Nullable constraint  | nullable=False without missing values     | +0.05
Data-quality checks  | Strings within max_length                 | +0.05
Data-quality checks  | Categorical values inside allowed_values  | +(valid_ratio × 0.10)

The final score is clipped to the [0.10, 1.00] interval.
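
The table translates fairly directly into code. The function below is a sketch of the base heuristic as described here, not the library's actual implementation. The parameter names, the handling of exact boundary values (for example, treating 5 % and 20 % as part of the middle band), and the final rounding to two decimals are assumptions.

```python
from typing import Optional


def base_confidence(
    missing_pct: float,
    mechanism: str,
    outlier_pct: float,
    data_type: str = "float",
    nullable: bool = True,
    has_allowed_values: bool = False,
    has_max_length: bool = False,
    within_max_length: bool = False,
    valid_ratio: Optional[float] = None,
) -> float:
    """Apply the adjustments from the table above to a 0.50 starting value."""
    score = 0.50

    # Missing percentage (the table lists no adjustment between 20 % and 50 %)
    if missing_pct < 5:
        score += 0.20
    elif missing_pct <= 20:
        score += 0.10
    elif missing_pct > 50:
        score -= 0.20

    # Missingness mechanism: anything other than MCAR/MAR is treated as MNAR/UNKNOWN
    score += {"MCAR": 0.10, "MAR": 0.05}.get(mechanism.upper(), -0.10)

    # Outliers (no adjustment between 5 % and 20 %)
    if outlier_pct < 5:
        score += 0.05
    elif outlier_pct > 20:
        score -= 0.10

    # Metadata constraints present
    if has_allowed_values and data_type in ("categorical", "string"):
        score += 0.10
    if has_max_length and data_type == "string":
        score += 0.05

    # Nullable constraint
    if not nullable:
        score += -0.15 if missing_pct > 0 else 0.05

    # Data-quality checks
    if within_max_length and data_type == "string":
        score += 0.05
    if valid_ratio is not None:
        score += valid_ratio * 0.10

    # Clip to [0.10, 1.00]; rounding is only for readable output
    return round(min(1.00, max(0.10, score)), 2)


print(base_confidence(3, "MCAR", 2))                     # 0.85
print(base_confidence(60, "MNAR", 30, nullable=False))   # 0.1
```

The first call adds +0.20, +0.10 and +0.05 to the starting value. The second sums to −0.05 before clipping, so it lands on the 0.10 floor.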

Adaptive variant

When the analyzer receives the full DataFrame and complete metadata, it builds dataset-specific thresholds using AdaptiveThresholds and applies calculate_adaptive_confidence_score:

  • Adaptive missing/outlier thresholds (based on row-count, variability, etc.)
  • An additional adjustment factor (−0.30 to +0.30) reflecting dataset characteristics

This yields a context-aware score that remains interpretable yet sensitive to each dataset.

Future work

For maximum transparency and speed we use heuristics today. Future releases may include probabilistic or conformal approaches (e.g., multiple-imputation variance or ensemble uncertainty) to provide statistically grounded confidence estimates.

🚀 Advanced Usage

Programmatic Metadata Creation

from funputer.models import ColumnMetadata

metadata = [
    ColumnMetadata(
        column_name="product_code",
        data_type="string",
        max_length=10,
        allowed_values="A1,A2,B1,B2",
        nullable=False,
        description="Product classification code"
    ),
    ColumnMetadata(
        column_name="price",
        data_type="float",
        min_value=0,
        max_value=10000,
        business_rule="Must be non-negative"
    )
]

# Analyze with custom metadata
import pandas as pd
from funputer.simple_analyzer import SimpleImputationAnalyzer

data = pd.read_csv("products.csv")

analyzer = SimpleImputationAnalyzer()
results = analyzer.analyze_dataframe(data, metadata)

CLI Usage with Enhanced Metadata & PREFLIGHT

# PREFLIGHT runs automatically before init/analyze
funputer init -d products.csv -o products_metadata.csv
# 🔍 Preflight Check: ✅ OK - File validated, ready for processing

# Edit products_metadata.csv to add constraints, then:
funputer analyze -d products.csv -m products_metadata.csv -o results.csv
# 🔍 Preflight Check: ✅ OK - Recommendation: Analyze Now

# Run standalone preflight validation
funputer preflight -d products.csv --json-out validation_report.json

# Disable preflight if needed (not recommended)
export FUNPUTER_PREFLIGHT=off
funputer analyze -d products.csv

# Results are automatically saved in CSV format for easy viewing

📋 Requirements

  • Python: 3.9 or higher
  • Dependencies: pandas, numpy, scipy, scikit-learn

🔧 Installation from Source

git clone https://github.com/RajeshRamachander/funputer.git
cd funputer
pip install -e .

📚 Complete Usage Examples

FunPuter provides comprehensive examples for every use case:

📊 Usage Patterns

Auto-Inference (Zero Configuration)

# Perfect for data exploration and prototyping
suggestions = funputer.analyze_imputation_requirements("mystery_data.csv")

Production Mode (Full Control)

# Enterprise-grade with constraint validation
import funputer
from funputer.models import ColumnMetadata, AnalysisConfig

metadata = [
    ColumnMetadata('customer_id', 'integer', unique_flag=True, nullable=False),
    ColumnMetadata('age', 'integer', min_value=18, max_value=100),
    ColumnMetadata('income', 'float', dependent_column='age', 
                   business_rule='Income correlates with age'),
    ColumnMetadata('category', 'categorical', allowed_values='A,B,C,D')
]

config = AnalysisConfig(missing_percentage_threshold=0.25, skip_columns=['id'])
suggestions = funputer.analyze_dataframe(df, metadata, config)

CLI Automation

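# Batch processing workflow
for file in data/*.csv; do
    # preflight exits 0 when ready; any non-zero code (2 = warnings, 10 = hard error) skips the file
    funputer preflight -d "$file" && \
    funputer analyze -d "$file" --output "results/$(basename "$file" .csv)_plan.csv"
done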

🏭 Industry-Specific Examples

E-commerce Customer Analytics

# Customer behavior analysis with business constraints
metadata = [
    ColumnMetadata('customer_id', 'integer', unique_flag=True, nullable=False),
    ColumnMetadata('age', 'integer', min_value=13, max_value=120),
    ColumnMetadata('annual_income', 'float', min_value=0, dependent_column='age'),
    ColumnMetadata('customer_segment', 'categorical', allowed_values='Premium,Standard,Basic'),
    ColumnMetadata('churn_risk_score', 'float', min_value=0.0, max_value=1.0),
]
suggestions = funputer.analyze_dataframe(customer_df, metadata)

Healthcare Patient Records

# Clinical data with regulatory compliance
metadata = [
    ColumnMetadata('patient_id', 'integer', unique_flag=True, do_not_impute=True),
    ColumnMetadata('age', 'integer', min_value=0, max_value=150, nullable=False),
    ColumnMetadata('blood_pressure_systolic', 'integer', min_value=50, max_value=300),
    ColumnMetadata('diagnosis_code', 'categorical', allowed_values='A00-Z99', nullable=False),
    ColumnMetadata('treatment_response', 'categorical', allowed_values='Excellent,Good,Fair,Poor'),
]
config = AnalysisConfig(missing_percentage_threshold=0.10)  # Healthcare = low tolerance
suggestions = funputer.analyze_dataframe(patient_df, metadata, config)

Financial Risk Assessment

# Credit scoring with business rules
metadata = [
    ColumnMetadata('application_id', 'integer', unique_flag=True, nullable=False),
    ColumnMetadata('credit_score', 'integer', min_value=300, max_value=850),
    ColumnMetadata('debt_to_income', 'float', min_value=0.0, max_value=10.0),
    ColumnMetadata('loan_purpose', 'categorical', allowed_values='home,auto,personal,business'),
    ColumnMetadata('employment_status', 'categorical', nullable=False),
]
# Skip sensitive columns from imputation
config = AnalysisConfig(skip_columns=['ssn', 'account_number'])
suggestions = funputer.analyze_dataframe(loan_df, metadata, config)

IoT Sensor Data Processing

# Time series sensor data with equipment monitoring
metadata = [
    ColumnMetadata('sensor_id', 'categorical', unique_flag=False, group_by=True),
    ColumnMetadata('timestamp', 'datetime', time_index=True, nullable=False),
    ColumnMetadata('temperature', 'float', min_value=-40, max_value=150),
    ColumnMetadata('pressure', 'float', min_value=0, max_value=1000),
    ColumnMetadata('equipment_status', 'categorical', allowed_values='operational,maintenance,fault'),
]
# Lower correlation threshold for noisy sensor data
config = AnalysisConfig(correlation_threshold=0.2, outlier_threshold=0.15)
suggestions = funputer.analyze_dataframe(sensor_df, metadata, config)

🎓 Learning Path

  1. Start Here: Try the 30-second demo above - Master the basics instantly
  2. Go Deeper: Explore production mode with metadata and constraints
  3. Real World: Apply patterns to your specific industry domain
  4. CLI Mastery: Automate workflows with command-line tools
  5. Production: Scale with batch processing and CI/CD integration

💡 Pro Tips

  • Exploration: Use auto-inference for quick insights
  • Production: Always use explicit metadata with constraints
  • Automation: CLI is perfect for CI/CD and batch processing
  • Validation: Run preflight checks before expensive analysis
  • Performance: Skip unnecessary columns, tune thresholds appropriately

📚 Documentation

  • API Reference: Complete docstrings and type hints in the codebase
  • Test Coverage: htmlcov/ - Detailed coverage reports (77%)

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

📄 License

MIT License - see LICENSE file for details.


Focus: Get intelligent imputation recommendations with enhanced metadata support, not complex infrastructure.
