Automatic pandas DataFrame type detection with locale-aware parsing

Pandas Type Detector

A modular, extensible system for automatically detecting and converting pandas DataFrame column types using a strategy pattern with confidence scoring and locale-aware parsing.

Features

  • Locale-aware parsing: Built-in support for PT-BR and EN-US numeric formats
  • Modular architecture: Each detector handles a specific data type
  • Confidence scoring: Get confidence levels for type detection decisions
  • Error handling modes: Choose how to handle conversion errors (coerce, raise, ignore)
  • Excel compatibility: Correctly handles data that Excel might misinterpret
  • Extensible design: Easy to add new locales or data types
  • Production ready: Successfully tested with 14,607+ rows in production

Installation

pip install pandas-type-detector

Quick Start

from pandas_type_detector import TypeDetectionPipeline
import pandas as pd

# Create sample DataFrame with PT-BR numeric data
data = {
    'receita': ['1.364,00', '343', '111,1', '1.950,00'],
    'nome': ['João', 'Maria', 'Pedro', 'Ana'],
    'ativo': ['sim', 'não', 'sim', 'não'],
    'data': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04']
}

df = pd.DataFrame(data)

# Initialize the pipeline with PT-BR locale
pipeline = TypeDetectionPipeline(locale="pt-br")

# Automatically detect and convert all column types
df_converted = pipeline.fix_dataframe_dtypes(df)

print(df_converted.dtypes)
# Output:
# receita    float64
# nome       string
# ativo      boolean
# data       datetime64[ns]

Locale Support

PT-BR (Portuguese Brazil)

  • Decimal separator: , (comma)
  • Thousands separator: . (dot)
  • Currency: R$, BRL
  • Boolean values: sim/não, verdadeiro/falso
  • Date formats: DD/MM/YYYY, YYYY-MM-DD

EN-US (English United States)

  • Decimal separator: . (dot)
  • Thousands separator: , (comma)
  • Currency: $, USD
  • Boolean values: yes/no, true/false
  • Date formats: MM/DD/YYYY, YYYY-MM-DD
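
To make the separator rules above concrete, here is a minimal standalone sketch of locale-aware number parsing. It does not use the package: the `SEPARATORS` table and `parse_localized_number` function are hypothetical names written for illustration, and the package's own parser may work differently.

```python
from typing import Optional

# Hypothetical separator table mirroring the two locale lists above.
SEPARATORS = {
    "pt-br": {"decimal": ",", "thousands": ".", "currency": ("R$", "BRL")},
    "en-us": {"decimal": ".", "thousands": ",", "currency": ("$", "USD")},
}

def parse_localized_number(text: str, locale: str) -> Optional[float]:
    """Return the float value of `text` under `locale`, or None if it is not numeric."""
    rules = SEPARATORS[locale]
    cleaned = text.strip()
    for symbol in rules["currency"]:
        cleaned = cleaned.replace(symbol, "")
    cleaned = cleaned.strip()
    # Drop grouping separators first, then normalise the decimal mark to a dot.
    cleaned = cleaned.replace(rules["thousands"], "").replace(rules["decimal"], ".")
    try:
        return float(cleaned)
    except ValueError:
        return None

print(parse_localized_number("R$ 1.364,00", "pt-br"))  # 1364.0
print(parse_localized_number("1,364.00", "en-us"))     # 1364.0
print(parse_localized_number("abc", "pt-br"))          # None
```

The sketch does not check that grouping separators sit at three-digit boundaries, so a stricter parser would validate the shape before converting.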

Advanced Usage

Error Handling Modes

# Coerce errors (default) - convert invalid values to NaN
pipeline = TypeDetectionPipeline(locale="pt-br", on_error="coerce")

# Raise exceptions on conversion errors
pipeline = TypeDetectionPipeline(locale="pt-br", on_error="raise")

# Ignore problematic columns - leave them unchanged
pipeline = TypeDetectionPipeline(locale="pt-br", on_error="ignore")
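
The three modes can be pictured with a small standalone function. This is only a sketch of the semantics described in the comments above, applied to a plain list of PT-BR strings; `convert_values` is a hypothetical helper, not part of the package.

```python
import math

def convert_values(values, on_error="coerce"):
    """Convert PT-BR numeric strings to floats, following the three modes above."""
    converted = []
    for raw in values:
        try:
            converted.append(float(raw.replace(".", "").replace(",", ".")))
        except ValueError:
            if on_error == "raise":
                # Surface the first bad value to the caller.
                raise ValueError(f"cannot convert {raw!r}") from None
            if on_error == "ignore":
                # Leave the whole column unchanged.
                return list(values)
            # "coerce": replace the bad value with NaN and keep going.
            converted.append(math.nan)
    return converted

print(convert_values(["1.364,00", "invalid"], on_error="coerce"))  # [1364.0, nan]
print(convert_values(["1.364,00", "invalid"], on_error="ignore"))  # ['1.364,00', 'invalid']
```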

Individual Column Detection

# Get detailed information about type detection
result = pipeline.detect_column_type(df['receita'])

print(f"Detected type: {result.data_type.value}")
print(f"Confidence: {result.confidence:.2f}")
print(f"Metadata: {result.metadata}")
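
This README does not spell out how the confidence value is computed. One plausible reading is the share of non-empty values that match the expected format. The sketch below illustrates that idea with a hypothetical regular expression (`PT_BR_NUMBER`) and helper (`numeric_confidence`); the package's actual scoring may differ.

```python
import re

# Hypothetical pattern for PT-BR numbers: optional sign, dot-grouped thousands, comma decimals.
PT_BR_NUMBER = re.compile(r"^-?(\d{1,3}(\.\d{3})+|\d+)(,\d+)?$")

def numeric_confidence(values):
    """Share of non-empty values that look like PT-BR numbers (0.0 to 1.0)."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    matches = sum(1 for v in non_empty if PT_BR_NUMBER.match(v))
    return matches / len(non_empty)

print(numeric_confidence(["1.364,00", "343", "111,1", "1.950,00"]))  # 1.0
print(numeric_confidence(["1.364,00", "invalid", "111,1", "-"]))     # 0.5
```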

Skip Specific Columns

# Skip certain columns during conversion
df_converted = pipeline.fix_dataframe_dtypes(
    df, 
    skip_columns=['keep_as_string', 'manual_column']
)

Data Type Detection

The package detects the following data types:

Data Type   Description         Examples
INTEGER     Whole numbers       123, 1.000 (PT-BR)
FLOAT       Decimal numbers     123,45 (PT-BR), 123.45 (EN-US)
BOOLEAN     True/false values   sim/não, yes/no
DATETIME    Date and time       2024-01-01, 01/02/2024
TEXT        String data         Any non-matching text
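
A date such as 01/02/2024 is ambiguous without a locale: it is 1 February under DD/MM/YYYY and 2 January under MM/DD/YYYY. The sketch below shows the difference using only the standard library. The `DATE_FORMATS` table and `parse_date` helper are hypothetical and stand in for whatever the package does internally.

```python
from datetime import datetime

# Hypothetical strptime formats standing in for the per-locale date formats.
DATE_FORMATS = {
    "pt-br": ["%d/%m/%Y", "%Y-%m-%d"],
    "en-us": ["%m/%d/%Y", "%Y-%m-%d"],
}

def parse_date(text, locale):
    """Try each format for `locale` in order; return a datetime or None."""
    for fmt in DATE_FORMATS[locale]:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

print(parse_date("01/02/2024", "pt-br").date())  # 2024-02-01
print(parse_date("01/02/2024", "en-us").date())  # 2024-01-02
print(parse_date("2024-01-01", "pt-br").date())  # 2024-01-01
```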

Real-World Examples

Excel Import Fix

# Excel often misinterprets PT-BR numbers as dates
# This package correctly identifies and converts them

excel_data = ['1.364,00', '2.500,75', '3.100,25']
df = pd.DataFrame({'revenue': excel_data})

pipeline = TypeDetectionPipeline(locale="pt-br")
df_fixed = pipeline.fix_dataframe_dtypes(df)

# Now revenue is properly converted to float64
print(df_fixed['revenue'].tolist())
# Output: [1364.0, 2500.75, 3100.25]

Mixed Data Handling

# Handle mixed valid/invalid data gracefully
messy_data = ['1.364,00', 'invalid', '111,1', '-', '', '1.950,00']
df = pd.DataFrame({'values': messy_data})

pipeline = TypeDetectionPipeline(locale="pt-br", on_error="coerce")
df_clean = pipeline.fix_dataframe_dtypes(df)

# Invalid values become NaN, valid ones are converted
print(df_clean['values'].tolist())
# Output: [1364.0, nan, 111.1, nan, nan, 1950.0]

Production ETL Pipeline

from pandas_type_detector import TypeDetectionPipeline

def process_financial_data(df):
    """Production ETL function using type detector."""
    
    # Configure for strict error handling in production
    pipeline = TypeDetectionPipeline(
        locale="pt-br", 
        on_error="raise",
        sample_size=1000
    )
    
    try:
        # Convert all columns automatically
        df_processed = pipeline.fix_dataframe_dtypes(df)
        
        # Log conversion results
        print(f"Successfully processed {len(df_processed)} rows")
        return df_processed
        
    except Exception as e:
        print(f"Data quality issue detected: {e}")
        raise

Architecture

The package uses a modular strategy pattern:

# Each detector handles one specific data type
from pandas_type_detector import (
    NumericDetector,    # Handles PT-BR/EN-US numbers
    BooleanDetector,    # Handles locale-specific booleans  
    DateTimeDetector,   # Handles various date formats
    TextDetector        # Fallback for text data
)

# All coordinated by the main pipeline
pipeline = TypeDetectionPipeline(locale="pt-br")
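
For readers unfamiliar with the strategy pattern, the following standalone sketch shows the general shape: each strategy scores a column, and a coordinator picks the most confident one or falls back to text. All names here (`Guess`, `SketchNumericDetector`, `SketchBooleanDetector`, `detect_type`) and the 0.8 threshold are assumptions made for illustration. They are not the package's classes or defaults.

```python
from dataclasses import dataclass

@dataclass
class Guess:
    data_type: str
    confidence: float

class SketchBooleanDetector:
    WORDS = {"sim", "não", "verdadeiro", "falso"}

    def detect(self, values):
        hits = sum(1 for v in values if v.strip().lower() in self.WORDS)
        return Guess("boolean", hits / len(values) if values else 0.0)

class SketchNumericDetector:
    @staticmethod
    def _is_number(value):
        try:
            float(value.replace(".", "").replace(",", "."))
            return True
        except ValueError:
            return False

    def detect(self, values):
        hits = sum(1 for v in values if self._is_number(v))
        return Guess("float", hits / len(values) if values else 0.0)

def detect_type(values, detectors, min_confidence=0.8):
    """Ask every strategy; fall back to text if none is confident enough."""
    best = Guess("text", 0.0)
    for detector in detectors:
        guess = detector.detect(values)
        if guess.confidence > best.confidence:
            best = guess
    return best if best.confidence >= min_confidence else Guess("text", 1.0)

detectors = [SketchNumericDetector(), SketchBooleanDetector()]
print(detect_type(["sim", "não", "sim", "não"], detectors).data_type)  # boolean
print(detect_type(["1.364,00", "343", "111,1"], detectors).data_type)  # float
print(detect_type(["João", "Maria"], detectors).data_type)             # text
```

Because each strategy only needs a `detect` method, a new data type can be supported by adding one more class to the list without touching the coordinator.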

Configuration

Custom Sample Size

# Use larger sample for better accuracy on big datasets
pipeline = TypeDetectionPipeline(
    locale="pt-br",
    sample_size=5000  # Default: 1000
)

Confidence Thresholds

# Access individual detectors for custom configuration
from pandas_type_detector import NumericDetector, LOCALES

detector = NumericDetector(
    locale_config=LOCALES["pt-br"],
    min_confidence=0.9  # Higher threshold
)

Testing

Run the comprehensive test suite:

cd pandas-type-detector
python tests/test.py

The test suite includes:

  • PT-BR numeric format validation
  • Error handling verification
  • Boolean detection tests
  • DateTime parsing tests
  • Excel compatibility tests
  • Real-world scenario validation

Performance

  • Optimized sampling: Only analyzes a configurable sample of rows
  • Early exit: Stops detection when high confidence is reached
  • Minimal overhead: Designed for production ETL pipelines
  • Memory efficient: Processes columns independently
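
The first two points can be sketched as follows. This is an illustration only: it samples by taking the first `sample_size` values (the package may sample differently), and `detect_with_early_exit`, `looks_numeric`, `looks_boolean`, and the 0.95 cutoff are hypothetical.

```python
def looks_numeric(value):
    """True if `value` parses as a PT-BR formatted number."""
    try:
        float(value.replace(".", "").replace(",", "."))
        return True
    except ValueError:
        return False

def looks_boolean(value):
    return value.strip().lower() in {"sim", "não", "verdadeiro", "falso"}

def detect_with_early_exit(values, checks, sample_size=1000, high_confidence=0.95):
    """Score a sample with each check in order and stop at the first confident one."""
    sample = values[:sample_size]
    tried = []
    for name, check in checks:
        tried.append(name)
        confidence = sum(1 for v in sample if check(v)) / len(sample) if sample else 0.0
        if confidence >= high_confidence:
            # Early exit: the remaining checks are never run.
            return name, confidence, len(sample), tried
    return "text", 0.0, len(sample), tried

checks = [("numeric", looks_numeric), ("boolean", looks_boolean)]
column = ["1.364,00", "343"] * 5000  # 10,000 rows
print(detect_with_early_exit(column, checks, sample_size=100))
# ('numeric', 1.0, 100, ['numeric'])
```

Only 100 of the 10,000 rows are inspected, and the boolean check is skipped because the numeric check already reached the cutoff.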

Contributing

The modular design makes it easy to contribute:

Adding a New Locale

from pandas_type_detector import LOCALES, LocaleConfig

# Add Spanish locale
LOCALES['es-es'] = LocaleConfig(
    name='es-es',
    decimal_separator=',',
    thousands_separator='.',
    currency_symbols=['€', 'EUR'],
    date_formats=[r'^\d{1,2}/\d{1,2}/\d{4}$']
)
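
The `date_formats` entry in the example is a regular expression, so it is worth checking what it accepts before registering a locale. The standalone check below uses only the standard library and the pattern from the example; the sample strings are made up.

```python
import re

# The pattern passed as date_formats in the es-es example above.
DATE_PATTERN = re.compile(r'^\d{1,2}/\d{1,2}/\d{4}$')

samples = ['31/12/2024', '1/2/2024', '2024-12-31', '31/12/24', '99/99/9999']
for text in samples:
    print(text, bool(DATE_PATTERN.match(text)))
# 31/12/2024 True
# 1/2/2024 True
# 2024-12-31 False
# 31/12/24 False
# 99/99/9999 True
```

The last line shows that the pattern checks shape only, not calendar validity, so a value like 99/99/9999 still has to be rejected at conversion time.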

Adding a New Detector

from pandas_type_detector import TypeDetector, DataType, DetectionResult

class URLDetector(TypeDetector):
    def detect(self, series):
        # Implementation here
        pass
    
    def convert(self, series):
        # Implementation here  
        pass
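
The skeleton above leaves both methods empty, and this README does not document the exact signatures of `TypeDetector` or `DetectionResult`. As a rough guide to what the logic might look like, here is a standalone sketch that works on a plain list of strings. `SketchResult` and `SketchURLDetector` are hypothetical stand-ins, not the package's base classes, so adapt the return types to the real interface.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

@dataclass
class SketchResult:
    """Hypothetical stand-in for DetectionResult."""
    data_type: str
    confidence: float
    metadata: dict = field(default_factory=dict)

class SketchURLDetector:
    """Stand-in for a TypeDetector subclass; operates on a list of strings."""

    def _is_url(self, value):
        try:
            parts = urlparse(value.strip())
        except ValueError:  # raised for malformed input such as a bad IPv6 host
            return False
        return parts.scheme in ("http", "https") and bool(parts.netloc)

    def detect(self, values):
        # Confidence is the share of non-empty values that parse as http(s) URLs.
        non_empty = [v for v in values if v and v.strip()]
        hits = sum(1 for v in non_empty if self._is_url(v))
        confidence = hits / len(non_empty) if non_empty else 0.0
        return SketchResult("url", confidence, {"checked": len(non_empty)})

    def convert(self, values):
        # Keep trimmed URLs; anything else becomes None.
        return [v.strip() if v and self._is_url(v) else None for v in values]

detector = SketchURLDetector()
result = detector.detect(["https://pypi.org", "http://example.com/a", "hello"])
print(result.data_type, round(result.confidence, 2))  # url 0.67
print(detector.convert([" https://pypi.org ", "hello"]))  # ['https://pypi.org', None]
```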

Requirements

  • Python 3.7+
  • pandas >= 1.0.0
  • numpy >= 1.19.0

License

MIT License - see LICENSE file for details.

Acknowledgments

Developed to solve real-world data quality issues in Brazilian financial data processing. Successfully deployed in production handling 14,607+ rows of complex PT-BR formatted data.

Support

  • Issues: GitHub Issues
  • Documentation: This README
  • Examples: See tests/test.py for comprehensive usage examples

Made with ❤️ for the pandas community
