Intelligent DataFrame type detection with sophisticated locale-aware parsing, confidence scoring, and smart text filtering

pandas-type-detector

🔍 Intelligent DataFrame Type Detection with Locale Awareness

A robust, production-ready library for automatically detecting and converting pandas DataFrame column types with sophisticated locale-aware parsing, confidence scoring, and enhanced text filtering capabilities.

Python 3.7+ | Tests | License: MIT

🚀 Key Features

  • 🌍 Locale-Aware Parsing: Native support for PT-BR and EN-US number formats, dates, and boolean values
  • 🎯 Smart Text Filtering: Advanced algorithms prevent text containing numbers from being misclassified as numeric
  • 📊 Confidence Scoring: Get reliability scores for each type detection decision
  • 🛡️ Robust Error Handling: Configurable strategies for handling conversion errors
  • ⚡ Performance Optimized: Intelligent sampling and early-exit strategies for large datasets
  • 🧩 Modular Architecture: Extensible design for adding new data types and locales
  • ✅ Production Tested: Successfully handles complex real-world data scenarios

📦 Installation

pip install pandas-type-detector

🎯 Quick Start

import pandas as pd
from pandas_type_detector import TypeDetectionPipeline

# Sample data with mixed formats
data = {
    'revenue': ['1.234,56', '2.890,00', '543,21'],      # PT-BR currency format
    'quantity': ['10', '25', '8'],                       # Integers
    'active': ['Sim', 'Não', 'Sim'],                    # PT-BR booleans
    'date': ['2025-01-15', '2025-02-20', '2025-03-10'], # ISO dates
    'description': ['(31) Product A', '(45) Service B', '(12) Item C']  # Text with numbers
}

df = pd.DataFrame(data)
print("Original dtypes:")
print(df.dtypes)
# All columns are 'object' initially

# Initialize pipeline with Portuguese (Brazil) locale
pipeline = TypeDetectionPipeline(locale="pt-br", on_error="coerce")

# Automatically detect and convert types
df_converted = pipeline.fix_dataframe_dtypes(df)

print("\nConverted dtypes:")
print(df_converted.dtypes)
# Output:
# revenue        float64    ← Correctly parsed PT-BR format
# quantity         Int64    ← Detected as integer
# active         boolean    ← Portuguese booleans converted
# date      datetime64[ns]  ← ISO dates parsed
# description       object  ← Text with numbers kept as text

🌐 Locale Support

🇧🇷 PT-BR (Portuguese Brazil)

  • Decimal separator: , (comma) → 1.234,56 becomes 1234.56
  • Thousands separator: . (dot) → 1.000.000,00
  • Currency symbols: R$, BRL
  • Boolean values: Sim/Não, Verdadeiro/Falso, S/N
  • Date formats: DD/MM/YYYY, YYYY-MM-DD

🇺🇸 EN-US (English United States)

  • Decimal separator: . (dot) → 1,234.56
  • Thousands separator: , (comma) → 1,000,000.00
  • Currency symbols: $, USD
  • Boolean values: True/False, Yes/No, Y/N
  • Date formats: MM/DD/YYYY, YYYY-MM-DD
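
To make the separator rules above concrete, here is a minimal standard-library sketch of locale-dependent number parsing. It is an illustration only and not the library's implementation: the `SEPARATORS` table and the `parse_number` helper are hypothetical names, and currency symbols are deliberately not handled.

```python
from typing import Optional

# Hypothetical separator table for illustration; not the library's LocaleConfig.
SEPARATORS = {
    "pt-br": {"decimal": ",", "thousands": "."},
    "en-us": {"decimal": ".", "thousands": ","},
}


def parse_number(text: str, locale: str) -> Optional[float]:
    """Return the value of `text` under `locale`, or None if it does not parse."""
    seps = SEPARATORS[locale]
    cleaned = text.strip()
    # Drop grouping separators first, then normalise the decimal mark to a dot.
    cleaned = cleaned.replace(seps["thousands"], "")
    cleaned = cleaned.replace(seps["decimal"], ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_number("1.234,56", "pt-br"))  # 1234.56
print(parse_number("1,234.56", "en-us"))  # 1234.56
print(parse_number("1.234", "pt-br"))     # 1234.0 (dot groups thousands)
print(parse_number("1.234", "en-us"))     # 1.234 (dot is the decimal mark)
```

The last two calls show why the locale has to be chosen explicitly: the same string is a valid number under both conventions, with different values.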

📚 Advanced Usage

🔧 Error Handling Strategies

# Strategy 1: Coerce errors to NaN (default - recommended)
pipeline = TypeDetectionPipeline(locale="en-us", on_error="coerce")
df_safe = pipeline.fix_dataframe_dtypes(df)

# Strategy 2: Raise exceptions on conversion errors
pipeline = TypeDetectionPipeline(locale="en-us", on_error="raise")
try:
    df_strict = pipeline.fix_dataframe_dtypes(df)
except ValueError as e:
    print(f"Conversion error: {e}")

# Strategy 3: Ignore problematic columns
pipeline = TypeDetectionPipeline(locale="en-us", on_error="ignore")
df_conservative = pipeline.fix_dataframe_dtypes(df)
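
The three strategies can be pictured with a small standard-library sketch. This is an assumption about the intended semantics, not the library's code: `convert_column` is a hypothetical helper, and `None` stands in for the NaN that the pipeline would produce.

```python
def convert_column(values, on_error="coerce"):
    """Convert strings to floats, handling failures according to `on_error`."""
    if on_error not in ("coerce", "raise", "ignore"):
        raise ValueError(f"unknown on_error strategy: {on_error!r}")
    converted = []
    for value in values:
        try:
            converted.append(float(value))
        except ValueError:
            if on_error == "raise":
                raise  # propagate the first conversion failure
            if on_error == "ignore":
                return list(values)  # leave the whole column unconverted
            converted.append(None)  # "coerce": mark only the bad cell as missing
    return converted


raw = ["1.5", "oops", "3"]
print(convert_column(raw, "coerce"))  # [1.5, None, 3.0]
print(convert_column(raw, "ignore"))  # ['1.5', 'oops', '3']
```

With `"coerce"` the column still changes type and only the bad cell is lost; with `"ignore"` a single bad cell keeps the whole column as it was.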

🔍 Individual Column Analysis

# Get detailed detection information
result = pipeline.detect_column_type(df['revenue'])

print(f"Detected type: {result.data_type.value}")
print(f"Confidence: {result.confidence:.2%}")
print(f"Locale: {result.metadata['locale']}")
print(f"Parsing details: {result.metadata}")

# Example output:
# Detected type: float
# Confidence: 95.00%
# Locale: pt-br
# Parsing details: {'locale': 'pt-br', 'is_integer': False, 'numeric_count': 3, ...}
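
This README does not spell out how the confidence value is computed. One plausible reading, sketched below with the standard library only, is the share of non-missing values that parse as the candidate type. `numeric_confidence` is a hypothetical helper, not part of the package.

```python
def numeric_confidence(values):
    """Share of non-missing values that float() accepts (0.0 for an empty column)."""
    present = [v for v in values if v is not None and str(v).strip() != ""]
    if not present:
        return 0.0
    hits = 0
    for value in present:
        try:
            float(value)
            hits += 1
        except (TypeError, ValueError):
            pass  # this value does not count towards the numeric share
    return hits / len(present)


print(f"{numeric_confidence(['10', '25', '8', 'n/a']):.2%}")  # 75.00%
print(f"{numeric_confidence(['10', None, '8']):.2%}")         # 100.00%
```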

🎛️ Column Selection and Skipping

# Skip specific columns during conversion
df_converted = pipeline.fix_dataframe_dtypes(
    df, 
    skip_columns=['id', 'raw_text', 'keep_as_string']
)

# Skipped columns keep their original 'object' dtype
# All other columns are converted automatically

⚙️ Performance Tuning

# Optimize for large datasets
pipeline = TypeDetectionPipeline(
    locale="pt-br",
    sample_size=5000,      # Analyze up to 5000 rows per column (default: 1000)
    on_error="coerce"
)

# For smaller datasets, set sample_size above the row count to analyze every row
pipeline = TypeDetectionPipeline(
    locale="en-us",
    sample_size=10000      # Covers all rows when the dataset has fewer than 10,000
)

🛡️ Smart Text Filtering

The library filters out text that merely contains digits, so values like the following are not misclassified as numeric:

# These text values are correctly identified as text, not numeric
problematic_data = pd.Series([
    "(31) Week from 28/jul to 3/aug",  # Text with numbers
    "(45) Product description",        # Text with parenthetical numbers  
    "Order #12345 - Item A",           # Mixed text and numbers
    "Section 3.1.4 Overview"          # Version numbers in text
])

result = pipeline.detect_column_type(problematic_data)
print(result.data_type)  # DataType.TEXT (correctly identified as text)
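
The library's filtering rules are not documented here. As a rough sketch of the general idea, the standard-library example below requires the whole string to look like a number instead of accepting any string that contains digits. The pattern and the helper names are assumptions for illustration.

```python
import re

# Assumed pattern: optional sign, digits, optional separator-delimited digit groups.
NUMERIC_RE = re.compile(r"[+-]?\d+(?:[.,]\d+)*")


def contains_digits(text):
    """Naive check: true for any string with a digit somewhere in it."""
    return re.search(r"\d", text) is not None


def looks_numeric(text):
    """Stricter check: the entire string must match the numeric pattern."""
    return NUMERIC_RE.fullmatch(text.strip()) is not None


samples = [
    "(45) Product description",
    "Order #12345 - Item A",
    "Section 3.1.4 Overview",
    "1.234,56",
]
for sample in samples:
    print(sample, contains_digits(sample), looks_numeric(sample))
```

Every sample passes the naive check, but only `1.234,56` passes the full-match check, which is the behaviour the section above describes.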

🧪 Testing

The library ships with a test suite of 17 test cases; run it from the project root:

cd pandas-type-detector
poetry run pytest tests/test.py -v

Test Coverage

  • ✅ Numeric detection (integers, floats) for both locales
  • ✅ Boolean detection in multiple languages
  • ✅ DateTime parsing and conversion
  • ✅ Text-with-numbers rejection algorithms
  • ✅ Skip columns functionality
  • ✅ Error handling strategies
  • ✅ Real-world data scenarios
  • ✅ Edge cases and boundary conditions

📊 Supported Data Types

Data Type | Description       | Example Values
Integer   | Whole numbers     | 123, 1.000 (PT-BR), 1,000 (EN-US)
Float     | Decimal numbers   | 123,45 (PT-BR), 123.45 (EN-US)
Boolean   | True/False values | Sim/Não (PT-BR), Yes/No (EN-US)
DateTime  | Date and time     | 2025-01-15, 15/01/2025
Text      | String data       | Any text, including mixed alphanumeric
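
Slash-separated dates are where the locale matters most, because DD/MM/YYYY and MM/DD/YYYY overlap for many values. The standard-library sketch below illustrates that ambiguity. `DATE_FORMATS` and `parse_date` are hypothetical names, and the `strptime` formats are an assumed translation of the locale lists in this README, not the library's own configuration.

```python
from datetime import datetime

# Assumed strptime equivalents of the date formats listed for each locale.
DATE_FORMATS = {
    "pt-br": ["%d/%m/%Y", "%Y-%m-%d"],
    "en-us": ["%m/%d/%Y", "%Y-%m-%d"],
}


def parse_date(text, locale):
    """Try each format for `locale` in order; return None if none fits."""
    for fmt in DATE_FORMATS[locale]:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


print(parse_date("03/04/2025", "pt-br"))  # 2025-04-03 00:00:00
print(parse_date("03/04/2025", "en-us"))  # 2025-03-04 00:00:00
print(parse_date("15/01/2025", "en-us"))  # None (there is no month 15)
print(parse_date("2025-01-15", "en-us"))  # 2025-01-15 00:00:00
```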

🔧 Extensibility

Adding a New Locale

from pandas_type_detector import LOCALES, LocaleConfig, TypeDetectionPipeline

# Add German locale
LOCALES['de-de'] = LocaleConfig(
    name='de-de',
    decimal_separator=',',
    thousands_separator='.',
    currency_symbols=['€', 'EUR'],
    date_formats=[r'^\d{1,2}\.\d{1,2}\.\d{4}$']  # DD.MM.YYYY
)

# Use the new locale
pipeline = TypeDetectionPipeline(locale="de-de")

Creating Custom Detectors

from pandas_type_detector import TypeDetector, DataType, DetectionResult

class EmailDetector(TypeDetector):
    def detect(self, series):
        # Custom email detection logic
        email_pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
        # Ignore missing values and compare everything else as strings
        values = series.dropna().astype(str)
        if values.empty:
            # Nothing to inspect, so report no match instead of dividing by zero
            return DetectionResult(DataType.UNKNOWN, 0.0, {})
        matches = values.str.match(email_pattern).sum()
        confidence = matches / len(values)

        if confidence >= 0.8:
            return DetectionResult(DataType.TEXT, confidence, {"format": "email"})
        return DetectionResult(DataType.UNKNOWN, confidence, {})

    def convert(self, series):
        # Email-specific processing if needed
        return series.astype(str)

🚀 Performance Characteristics

  • Memory Efficient: Processes columns independently without loading entire dataset
  • Sampling Strategy: Configurable sampling reduces processing time for large datasets
  • Early Exit: Stops analysis when high confidence is reached (≥90%)
  • Production Ready: Optimized for ETL pipelines and data processing workflows
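
The sampling and early-exit points can be sketched in plain Python as follows. This is one possible reading, not the library's implementation: all helper names, the detector order, and the fallback to `"text"` are assumptions. Only the 1000-row default and the 90% threshold come from this README.

```python
import random


def sample_values(values, sample_size=1000, seed=0):
    """Return at most `sample_size` values, chosen reproducibly."""
    if len(values) <= sample_size:
        return list(values)
    return random.Random(seed).sample(list(values), sample_size)


def share(predicate):
    """Build a scorer returning the fraction of values that satisfy `predicate`."""
    def score(values):
        if not values:
            return 0.0
        return sum(1 for v in values if predicate(v)) / len(values)
    return score


def is_int(text):
    return text.lstrip("+-").isdigit()


def is_float(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


# Hypothetical detector order: cheapest and most specific first.
DETECTORS = [("integer", share(is_int)), ("float", share(is_float))]


def first_confident(values, detectors, threshold=0.9):
    """Run detectors in order and stop at the first one that meets the threshold."""
    for name, score in detectors:
        confidence = score(values)
        if confidence >= threshold:
            return name, confidence  # early exit: later detectors never run
    return "text", 0.0  # assumed fallback when nothing is confident enough


column = [str(n) for n in range(14607)]
sample = sample_values(column, sample_size=1000)
print(len(sample))                         # 1000
print(first_confident(sample, DETECTORS))  # ('integer', 1.0)
```

Detection cost then depends on `sample_size`, not on the full column length, and a confident early detector saves the cost of running the remaining ones.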

Benchmarks

  • ✅ Tested with datasets up to 14,607 rows in production
  • ✅ Handles complex mixed-format data reliably
  • ✅ Minimal performance overhead on modern hardware

🤝 Contributing

We welcome contributions! The modular architecture makes it easy to:

  1. Add new locales - Extend LOCALES configuration
  2. Create new detectors - Inherit from TypeDetector base class
  3. Improve algorithms - Enhance existing detection logic
  4. Add test cases - Expand the test suite for new scenarios

Development Setup

git clone https://github.com/yourusername/pandas-type-detector
cd pandas-type-detector
poetry install
poetry run pytest

📋 Requirements

  • Python: 3.7+ (tested on 3.7, 3.8, 3.9, 3.10, 3.11, 3.12)
  • pandas: ≥1.0.0
  • numpy: ≥1.19.0

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

This library was developed to solve real-world data quality challenges in Brazilian financial and business data processing. It has been successfully deployed in production environments handling complex PT-BR formatted datasets.

Special thanks to the pandas and NumPy communities for providing the foundation that makes this work possible.

📞 Support

  • 🐛 Bug Reports: GitHub Issues
  • 💡 Feature Requests: GitHub Discussions
  • 📖 Documentation: This README and inline code documentation
  • 🧪 Examples: See tests/test.py for comprehensive usage examples

Made with ❤️ for the pandas community - Simplifying data type detection across cultures and locales
