Automatic pandas DataFrame type detection with locale-aware parsing

Pandas Type Detector

A modular, extensible system for automatically detecting and converting pandas DataFrame column types using a strategy pattern with confidence scoring and locale-aware parsing.

Features

Locale-aware parsing: Built-in support for PT-BR and EN-US numeric formats
Modular architecture: Each detector handles a specific data type
Confidence scoring: Get confidence levels for type detection decisions
Error handling modes: Choose how to handle conversion errors (coerce, raise, ignore)
Excel compatibility: Correctly handles values that Excel might misinterpret, such as PT-BR formatted numbers
Extensible design: Easy to add new locales or data types
Production ready: Successfully tested with 14,607+ rows in production

Installation

pip install pandas-type-detector

Quick Start

from pandas_type_detector import TypeDetectionPipeline
import pandas as pd

# Create sample DataFrame with PT-BR numeric data
data = {
    'receita': ['1.364,00', '343', '111,1', '1.950,00'],
    'nome': ['João', 'Maria', 'Pedro', 'Ana'],
    'ativo': ['sim', 'não', 'sim', 'não'],
    'data': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04']
}

df = pd.DataFrame(data)

# Initialize the pipeline with PT-BR locale
pipeline = TypeDetectionPipeline(locale="pt-br")

# Automatically detect and convert all column types
df_converted = pipeline.fix_dataframe_dtypes(df)

print(df_converted.dtypes)
# Output:
# receita      float64
# nome          string  
# ativo       boolean
# data     datetime64[ns]

Locale Support

PT-BR (Portuguese Brazil)

Decimal separator: , (comma)
Thousands separator: . (dot)
Currency: R$, BRL
Boolean values: sim/não, verdadeiro/falso
Date formats: DD/MM/YYYY, YYYY-MM-DD

EN-US (English United States)

Decimal separator: . (dot)
Thousands separator: , (comma)
Currency: $, USD
Boolean values: yes/no, true/false
Date formats: MM/DD/YYYY, YYYY-MM-DD
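
The separator rules above can be illustrated with a minimal standalone sketch. It is plain Python and does not use the package; the `SEPARATORS` table and the `parse_localized_number` helper are hypothetical names introduced only for this example, and currency symbols are not handled here.

```python
from typing import Optional

# Hypothetical separator table mirroring the two locales listed above.
SEPARATORS = {
    "pt-br": {"decimal": ",", "thousands": "."},
    "en-us": {"decimal": ".", "thousands": ","},
}


def parse_localized_number(text: str, locale: str) -> Optional[float]:
    """Return the numeric value of `text`, or None if it does not parse."""
    seps = SEPARATORS[locale]
    cleaned = text.strip()
    # Drop the thousands separator first, then normalise the decimal separator.
    cleaned = cleaned.replace(seps["thousands"], "")
    cleaned = cleaned.replace(seps["decimal"], ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_localized_number("1.364,00", "pt-br"))  # 1364.0
print(parse_localized_number("1,364.00", "en-us"))  # 1364.0
print(parse_localized_number("invalid", "pt-br"))   # None
```

The order of the two `replace` calls matters: removing the thousands separator before rewriting the decimal separator avoids deleting the decimal point that was just inserted.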

Advanced Usage

Error Handling Modes

# Coerce errors (default) - convert invalid values to NaN
pipeline = TypeDetectionPipeline(locale="pt-br", on_error="coerce")

# Raise exceptions on conversion errors
pipeline = TypeDetectionPipeline(locale="pt-br", on_error="raise")

# Ignore problematic columns - leave them unchanged
pipeline = TypeDetectionPipeline(locale="pt-br", on_error="ignore")
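
To make the difference between the three modes concrete, here is a small sketch of the same policy applied to a plain list of strings. It is an illustration of the semantics described in the comments above, not the package's implementation; `convert_values` is a hypothetical helper.

```python
import math
from typing import List, Union


def convert_values(values: List[str], on_error: str = "coerce") -> List[Union[float, str]]:
    """Convert strings to floats under one of three error policies."""
    if on_error not in {"coerce", "raise", "ignore"}:
        raise ValueError(f"unknown on_error mode: {on_error!r}")
    converted: List[Union[float, str]] = []
    for value in values:
        try:
            converted.append(float(value))
        except ValueError:
            if on_error == "raise":
                raise  # surface the conversion error to the caller
            if on_error == "ignore":
                # Give up on the whole column and return the input unchanged.
                return list(values)
            converted.append(math.nan)  # "coerce": invalid value becomes NaN
    return converted


print(convert_values(["1.5", "bad", "2"], on_error="coerce"))
print(convert_values(["1.5", "bad", "2"], on_error="ignore"))
```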

Individual Column Detection

# Get detailed information about type detection
result = pipeline.detect_column_type(df['receita'])

print(f"Detected type: {result.data_type.value}")
print(f"Confidence: {result.confidence:.2f}")
print(f"Metadata: {result.metadata}")
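
One plausible way to read a confidence value is as the share of non-empty values that match the candidate type. The sketch below assumes that definition; the package may compute confidence differently, and both helper functions are hypothetical.

```python
from typing import Callable, Iterable


def match_confidence(values: Iterable[str], matches: Callable[[str], bool]) -> float:
    """Fraction of non-empty values accepted by the `matches` predicate."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if matches(v))
    return hits / len(non_empty)


def looks_like_ptbr_number(text: str) -> bool:
    """True if `text` parses as a number with '.' thousands and ',' decimal."""
    cleaned = text.strip().replace(".", "").replace(",", ".")
    try:
        float(cleaned)
        return True
    except ValueError:
        return False


column = ["1.364,00", "343", "111,1", "n/a"]
confidence = match_confidence(column, looks_like_ptbr_number)
print(f"Confidence: {confidence:.2f}")  # 3 of 4 values parse -> 0.75
```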

Skip Specific Columns

# Skip certain columns during conversion
df_converted = pipeline.fix_dataframe_dtypes(
    df, 
    skip_columns=['keep_as_string', 'manual_column']
)

Data Type Detection

The package detects the following data types:

Data Type	Description	Examples
`INTEGER`	Whole numbers	`123`, `1.000` (PT-BR)
`FLOAT`	Decimal numbers	`123,45` (PT-BR), `123.45` (EN-US)
`BOOLEAN`	True/false values	`sim`/`não`, `yes`/`no`
`DATETIME`	Date and time	`2024-01-01`, `01/02/2024`
`TEXT`	String data	Any non-matching text
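
The `1.000` example in the table is worth a closer look: under PT-BR rules the dot is a thousands separator, so the value is the integer 1000 rather than the float 1.0. A minimal sketch of that distinction follows, assuming a value counts as `FLOAT` only when a decimal comma is present; `classify_ptbr_numeric` is a hypothetical helper, not part of the package.

```python
def classify_ptbr_numeric(text: str) -> str:
    """Label a PT-BR formatted string as INTEGER, FLOAT, or TEXT."""
    cleaned = text.strip()
    has_decimal = "," in cleaned
    # Remove thousands dots, then turn the decimal comma into a point.
    normalized = cleaned.replace(".", "").replace(",", ".")
    try:
        float(normalized)
    except ValueError:
        return "TEXT"
    return "FLOAT" if has_decimal else "INTEGER"


for sample in ["123", "1.000", "123,45", "abc"]:
    print(sample, "->", classify_ptbr_numeric(sample))
```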

Real-World Examples

Excel Import Fix

# Excel often misinterprets PT-BR numbers as dates
# This package correctly identifies and converts them

excel_data = ['1.364,00', '2.500,75', '3.100,25']
df = pd.DataFrame({'revenue': excel_data})

pipeline = TypeDetectionPipeline(locale="pt-br")
df_fixed = pipeline.fix_dataframe_dtypes(df)

# Now revenue is properly converted to float64
print(df_fixed['revenue'].tolist())
# Output: [1364.0, 2500.75, 3100.25]

Mixed Data Handling

# Handle mixed valid/invalid data gracefully
messy_data = ['1.364,00', 'invalid', '111,1', '-', '', '1.950,00']
df = pd.DataFrame({'values': messy_data})

pipeline = TypeDetectionPipeline(locale="pt-br", on_error="coerce")
df_clean = pipeline.fix_dataframe_dtypes(df)

# Invalid values become NaN, valid ones are converted
print(df_clean['values'].tolist())
# Output: [1364.0, nan, 111.1, nan, nan, 1950.0]

Production ETL Pipeline

from pandas_type_detector import TypeDetectionPipeline


def process_financial_data(df):
    """Production ETL function using type detector."""
    
    # Configure for strict error handling in production
    pipeline = TypeDetectionPipeline(
        locale="pt-br", 
        on_error="raise",
        sample_size=1000
    )
    
    try:
        # Convert all columns automatically
        df_processed = pipeline.fix_dataframe_dtypes(df)
        
        # Log conversion results
        print(f"Successfully processed {len(df_processed)} rows")
        return df_processed
        
    except Exception as e:
        print(f"Data quality issue detected: {e}")
        raise

Architecture

The package uses a modular strategy pattern:

# Each detector handles one specific data type
from pandas_type_detector import (
    TypeDetectionPipeline,  # Coordinates the detectors below
    NumericDetector,        # Handles PT-BR/EN-US numbers
    BooleanDetector,        # Handles locale-specific booleans
    DateTimeDetector,       # Handles various date formats
    TextDetector            # Fallback for text data
)

# All coordinated by the main pipeline
pipeline = TypeDetectionPipeline(locale="pt-br")
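
For readers unfamiliar with the strategy pattern, the following standalone sketch shows the general shape: each strategy reports a confidence, and a coordinator picks the first one that clears a threshold. The class names, the `confidence` method, and the 0.8 threshold are all assumptions made for illustration; they are not the package's actual interfaces.

```python
from typing import List, Sequence


class Detector:
    """Base strategy: a type name plus a confidence score in [0, 1]."""
    name = "text"

    def confidence(self, values: Sequence[str]) -> float:
        raise NotImplementedError


class BooleanSketch(Detector):
    name = "boolean"
    TOKENS = {"sim", "não", "yes", "no", "true", "false"}

    def confidence(self, values: Sequence[str]) -> float:
        return sum(v.strip().lower() in self.TOKENS for v in values) / len(values)


class IntegerSketch(Detector):
    name = "integer"

    def confidence(self, values: Sequence[str]) -> float:
        return sum(v.strip().isdigit() for v in values) / len(values)


class TextSketch(Detector):
    name = "text"

    def confidence(self, values: Sequence[str]) -> float:
        return 1.0  # fallback strategy accepts anything


def detect_type(values: Sequence[str], detectors: List[Detector],
                min_confidence: float = 0.8) -> str:
    """Return the name of the first detector that clears the threshold."""
    if not values:
        return "text"
    for detector in detectors:
        if detector.confidence(values) >= min_confidence:
            return detector.name
    return "text"


DETECTORS = [BooleanSketch(), IntegerSketch(), TextSketch()]
print(detect_type(["sim", "não", "sim", "não"], DETECTORS))  # boolean
print(detect_type(["10", "20", "x"], DETECTORS))             # text
```

Keeping each strategy in its own class means a new data type only needs a new class and an entry in the detector list; the coordinator itself does not change.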

Configuration

Custom Sample Size

# Use larger sample for better accuracy on big datasets
pipeline = TypeDetectionPipeline(
    locale="pt-br",
    sample_size=5000  # Default: 1000
)

Confidence Thresholds

# Access individual detectors for custom configuration
from pandas_type_detector import NumericDetector, LOCALES

detector = NumericDetector(
    locale_config=LOCALES["pt-br"],
    min_confidence=0.9  # Higher threshold
)

Testing

Run the comprehensive test suite:

cd pandas-type-detector
python tests/test.py

The test suite includes:

PT-BR numeric format validation
Error handling verification
Boolean detection tests
DateTime parsing tests
Excel compatibility tests
Real-world scenario validation

Performance

Optimized sampling: Only analyzes a configurable sample of rows
Early exit: Stops detection when high confidence is reached
Minimal overhead: Designed for production ETL pipelines
Memory efficient: Processes columns independently
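
The sampling and early-exit ideas in the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's code: `sample_values`, `detect_with_early_exit`, the fixed random seed, and the 0.95 cutoff are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple


def sample_values(values: Sequence, sample_size: int = 1000, seed: int = 0) -> List:
    """Return all values if the column is small, otherwise a random sample."""
    if len(values) <= sample_size:
        return list(values)
    return random.Random(seed).sample(list(values), sample_size)


def detect_with_early_exit(
    values: Sequence[str],
    detectors: List[Tuple[str, Callable[[List[str]], float]]],
    sample_size: int = 1000,
    high_confidence: float = 0.95,
) -> Tuple[str, float]:
    """Score detectors in order and stop once one is confident enough."""
    sample = sample_values(values, sample_size)
    best_name, best_score = "text", 0.0
    for name, score_fn in detectors:
        score = score_fn(sample)
        if score > best_score:
            best_name, best_score = name, score
        if score >= high_confidence:
            break  # early exit: remaining detectors are never evaluated
    return best_name, best_score


def integer_score(sample: List[str]) -> float:
    return sum(v.isdigit() for v in sample) / len(sample)


def boolean_score(sample: List[str]) -> float:
    return sum(v in {"sim", "não"} for v in sample) / len(sample)


column = [str(n) for n in range(5000)]
print(detect_with_early_exit(column, [("integer", integer_score), ("boolean", boolean_score)]))
```

Only 1000 of the 5000 values are scored here, and `boolean_score` is never called because the integer detector already reaches full confidence.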

Contributing

The modular design makes it easy to contribute:

Adding a New Locale

from pandas_type_detector import LOCALES, LocaleConfig

# Add Spanish locale
LOCALES['es-es'] = LocaleConfig(
    name='es-es',
    decimal_separator=',',
    thousands_separator='.',
    currency_symbols=['€', 'EUR'],
    date_formats=[r'^\d{1,2}/\d{1,2}/\d{4}$']
)

Adding a New Detector

from pandas_type_detector import TypeDetector, DataType, DetectionResult

class URLDetector(TypeDetector):
    def detect(self, series):
        # Implementation here
        pass
    
    def convert(self, series):
        # Implementation here  
        pass
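
The skeleton above leaves the detection logic open. As a starting point, the sketch below shows one way the URL check itself could work, using only the standard library. It deliberately does not subclass `TypeDetector`, since the exact return types expected by `detect` and `convert` are not shown here; `looks_like_url` and `url_confidence` are hypothetical helpers.

```python
from typing import Iterable
from urllib.parse import urlparse


def looks_like_url(text: str) -> bool:
    """True for strings with an http(s) scheme and a network location."""
    try:
        parts = urlparse(text.strip())
    except ValueError:
        return False  # urlparse rejects some malformed inputs
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def url_confidence(values: Iterable[str]) -> float:
    """Fraction of non-empty values that look like URLs."""
    non_empty = [v for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    return sum(looks_like_url(v) for v in non_empty) / len(non_empty)


print(looks_like_url("https://example.com/docs"))        # True
print(looks_like_url("example.com"))                     # False
print(url_confidence(["https://a.io", "nope", ""]))      # 0.5
```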

Requirements

Python 3.7+
pandas >= 1.0.0
numpy >= 1.19.0

License

MIT License - see LICENSE file for details.

Acknowledgments

Developed to solve real-world data quality issues in Brazilian financial data processing. Successfully deployed in production handling 14,607+ rows of complex PT-BR formatted data.

Support

Issues: GitHub Issues
Documentation: This README
Examples: See tests/test.py for comprehensive usage examples

Made with ❤️ for the pandas community

Release history

1.0.1: Aug 21, 2025
0.9.1: Aug 20, 2025
0.9.0: Aug 20, 2025 (this version)

Download files

Source Distribution

pandas_type_detector-0.9.0.tar.gz (11.8 kB), uploaded Aug 20, 2025

Built Distribution

pandas_type_detector-0.9.0-py3-none-any.whl (10.7 kB), uploaded Aug 20, 2025, Python 3

File details

Details for the file pandas_type_detector-0.9.0.tar.gz.

File metadata

Download URL: pandas_type_detector-0.9.0.tar.gz
Upload date: Aug 20, 2025
Size: 11.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.4 CPython/3.12.3 Linux/5.15.167.4-microsoft-standard-WSL2

File hashes

Hashes for pandas_type_detector-0.9.0.tar.gz
Algorithm	Hash digest
SHA256	`0c56fba363dc003cfb62e919f0608b22452b254999b9c8da9e2f70aaf56e7f93`
MD5	`f70f9990438a304b1c02ffc6638aad08`
BLAKE2b-256	`f7826835e814849faa2fc643f4cab4f369e89947e5db70ca4182fc737fc39bac`

File details

Details for the file pandas_type_detector-0.9.0-py3-none-any.whl.

File metadata

Download URL: pandas_type_detector-0.9.0-py3-none-any.whl
Upload date: Aug 20, 2025
Size: 10.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.4 CPython/3.12.3 Linux/5.15.167.4-microsoft-standard-WSL2

File hashes

Hashes for pandas_type_detector-0.9.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`83dbbe406865bf41e8d2d7c4ff954862adfd23258d2986d966e77c4e6a501258`
MD5	`177211c3970c1ae2736abdbb61c8b0f6`
BLAKE2b-256	`5420d8e55f649fef676564bdd16d999884115b501ab9a542ed1ede4a3b4ad888`
