pandas-type-detector
🔍 Intelligent DataFrame Type Detection with Locale Awareness
A robust, production-ready library for automatically detecting and converting pandas DataFrame column types with sophisticated locale-aware parsing, confidence scoring, and enhanced text filtering capabilities.
🚀 Key Features
- 🌍 Locale-Aware Parsing: Native support for PT-BR and EN-US number formats, dates, and boolean values
- 🎯 Smart Text Filtering: Advanced algorithms prevent text containing numbers from being misclassified as numeric
- 📊 Confidence Scoring: Get reliability scores for each type detection decision
- 🛡️ Robust Error Handling: Configurable strategies for handling conversion errors
- ⚡ Performance Optimized: Intelligent sampling and early-exit strategies for large datasets
- 🧩 Modular Architecture: Extensible design for adding new data types and locales
- ✅ Production Tested: Successfully handles complex real-world data scenarios
📦 Installation
pip install pandas-type-detector
🎯 Quick Start
import pandas as pd
from pandas_type_detector import TypeDetectionPipeline
# Sample data with mixed formats
data = {
    'revenue': ['1.234,56', '2.890,00', '543,21'],                # PT-BR currency format
    'quantity': ['10', '25', '8'],                                # Integers
    'active': ['Sim', 'Não', 'Sim'],                              # PT-BR booleans
    'date': ['2025-01-15', '2025-02-20', '2025-03-10'],           # ISO dates
    'description': ['(31) Product A', '(45) Service B', '(12) Item C']  # Text with numbers
}
df = pd.DataFrame(data)
print("Original dtypes:")
print(df.dtypes)
# All columns are 'object' initially
# Initialize pipeline with Portuguese (Brazil) locale
pipeline = TypeDetectionPipeline(locale="pt-br", on_error="coerce")
# Automatically detect and convert types
df_converted = pipeline.fix_dataframe_dtypes(df)
print("\nConverted dtypes:")
print(df_converted.dtypes)
# Output:
# revenue float64 ← Correctly parsed PT-BR format
# quantity Int64 ← Detected as integer
# active boolean ← Portuguese booleans converted
# date datetime64[ns] ← ISO dates parsed
# description object ← Text with numbers kept as text
🌐 Locale Support
🇧🇷 PT-BR (Portuguese Brazil)
- Decimal separator: `,` (comma) → `1.234,56` becomes `1234.56`
- Thousands separator: `.` (dot) → `1.000.000,00`
- Currency symbols: `R$`, `BRL`
- Boolean values: `Sim`/`Não`, `Verdadeiro`/`Falso`, `S`/`N`
- Date formats: `DD/MM/YYYY`, `YYYY-MM-DD`
🇺🇸 EN-US (English United States)
- Decimal separator: `.` (dot) → `1,234.56`
- Thousands separator: `,` (comma) → `1,000,000.00`
- Currency symbols: `$`, `USD`
- Boolean values: `True`/`False`, `Yes`/`No`, `Y`/`N`
- Date formats: `MM/DD/YYYY`, `YYYY-MM-DD`
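The separator rules above can be illustrated with a small standalone helper (an illustration only, not part of the library's API):

```python
def normalize_number(value: str, decimal_sep: str, thousands_sep: str) -> float:
    """Drop the thousands separator, then swap the decimal separator for '.'."""
    cleaned = value.replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(cleaned)

# PT-BR: ',' is the decimal separator, '.' the thousands separator
print(normalize_number("1.234,56", decimal_sep=",", thousands_sep="."))  # 1234.56

# EN-US: the separators swap roles
print(normalize_number("1,234.56", decimal_sep=".", thousands_sep=","))  # 1234.56
```

The order matters: the thousands separator must be removed before the decimal separator is rewritten, otherwise the two substitutions collide.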
📚 Advanced Usage
🔧 Error Handling Strategies
# Strategy 1: Coerce errors to NaN (default - recommended)
pipeline = TypeDetectionPipeline(locale="en-us", on_error="coerce")
df_safe = pipeline.fix_dataframe_dtypes(df)
# Strategy 2: Raise exceptions on conversion errors
pipeline = TypeDetectionPipeline(locale="en-us", on_error="raise")
try:
    df_strict = pipeline.fix_dataframe_dtypes(df)
except ValueError as e:
    print(f"Conversion error: {e}")
# Strategy 3: Ignore problematic columns
pipeline = TypeDetectionPipeline(locale="en-us", on_error="ignore")
df_conservative = pipeline.fix_dataframe_dtypes(df)
🔍 Individual Column Analysis
# Get detailed detection information
result = pipeline.detect_column_type(df['revenue'])
print(f"Detected type: {result.data_type.value}")
print(f"Confidence: {result.confidence:.2%}")
print(f"Locale: {result.metadata['locale']}")
print(f"Parsing details: {result.metadata}")
# Example output:
# Detected type: float
# Confidence: 95.00%
# Locale: pt-br
# Parsing details: {'locale': 'pt-br', 'is_integer': False, 'numeric_count': 3, ...}
🎛️ Column Selection and Skipping
# Skip specific columns during conversion
df_converted = pipeline.fix_dataframe_dtypes(
    df,
    skip_columns=['id', 'raw_text', 'keep_as_string']
)
# Skip columns remain as original 'object' type
# Other columns are automatically converted
⚙️ Performance Tuning
# Optimize for large datasets
pipeline = TypeDetectionPipeline(
    locale="pt-br",
    sample_size=5000,   # Analyze up to 5000 rows per column (default: 1000)
    on_error="coerce"
)
# For smaller datasets, use full analysis
pipeline = TypeDetectionPipeline(
    locale="en-us",
    sample_size=10000   # Effectively analyze all rows for small datasets
)
🛡️ Smart Text Filtering
One of the key improvements in this library is sophisticated text filtering that prevents common misclassification issues:
# These text values are correctly identified as text, not numeric
problematic_data = pd.Series([
    "(31) Week from 28/jul to 3/aug",   # Text with numbers
    "(45) Product description",         # Text with parenthetical numbers
    "Order #12345 - Item A",            # Mixed text and numbers
    "Section 3.1.4 Overview"            # Version numbers in text
])
result = pipeline.detect_column_type(problematic_data)
print(result.data_type) # DataType.TEXT (correctly identified as text)
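The rejection heuristic can be sketched as a whole-value check: a value only counts as numeric if nothing but digits remains after stripping separators and currency markers. This is an illustration of the idea, not the library's actual algorithm:

```python
import re

def looks_numeric(value: str) -> bool:
    """Pass only if, once separators and currency markers are stripped,
    nothing but digits and an optional sign remain."""
    stripped = re.sub(r"[.,\sR$]", "", value)
    return bool(re.fullmatch(r"[+-]?\d+", stripped))

def numeric_ratio(values) -> float:
    return sum(looks_numeric(v) for v in values) / len(values)

print(numeric_ratio(["1.234,56", "2.890,00", "543,21"]))   # 1.0 -> numeric column
print(numeric_ratio(["(31) Week from 28/jul to 3/aug"]))   # 0.0 -> text column
```

Because the check runs on the whole value rather than searching for digits inside it, "(31) Product A" fails even though it contains a parseable number.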
🧪 Testing
The library includes a comprehensive test suite with 17 test cases covering all functionality:
cd pandas-type-detector
poetry run pytest tests/test.py -v
Test Coverage
- ✅ Numeric detection (integers, floats) for both locales
- ✅ Boolean detection in multiple languages
- ✅ DateTime parsing and conversion
- ✅ Text-with-numbers rejection algorithms
- ✅ Skip columns functionality
- ✅ Error handling strategies
- ✅ Real-world data scenarios
- ✅ Edge cases and boundary conditions
📊 Supported Data Types
| Data Type | Description | Example Values |
|---|---|---|
| Integer | Whole numbers | 123, 1.000 (PT-BR), 1,000 (EN-US) |
| Float | Decimal numbers | 123,45 (PT-BR), 123.45 (EN-US) |
| Boolean | True/False values | Sim/Não (PT-BR), Yes/No (EN-US) |
| DateTime | Date and time | 2025-01-15, 15/01/2025 |
| Text | String data | Any text, including mixed alphanumeric |
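The target dtypes in the table correspond to standard pandas conversions; a plain-pandas sketch of the equivalents (the library wraps these with locale handling and confidence scoring):

```python
import pandas as pd

# Nullable integer, nullable boolean, and datetime conversions
ints = pd.to_numeric(pd.Series(["10", "25", "8"])).astype("Int64")
bools = pd.Series(["Sim", "Não", "Sim"]).map({"Sim": True, "Não": False}).astype("boolean")
dates = pd.to_datetime(pd.Series(["2025-01-15", "2025-02-20", "2025-03-10"]))

print(ints.dtype, bools.dtype, dates.dtype)  # Int64 boolean datetime64[ns]
```

The nullable `Int64` and `boolean` dtypes are used so that coerced conversion errors can become missing values without forcing the column to float or object.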
🔧 Extensibility
Adding a New Locale
from pandas_type_detector import LOCALES, LocaleConfig
# Add German locale
LOCALES['de-de'] = LocaleConfig(
    name='de-de',
    decimal_separator=',',
    thousands_separator='.',
    currency_symbols=['€', 'EUR'],
    date_formats=[r'^\d{1,2}\.\d{1,2}\.\d{4}$']  # DD.MM.YYYY
)
# Use the new locale
pipeline = TypeDetectionPipeline(locale="de-de")
Creating Custom Detectors
from pandas_type_detector import TypeDetector, DataType, DetectionResult
class EmailDetector(TypeDetector):
    def detect(self, series):
        # Custom email detection logic: fraction of values matching an email pattern
        email_pattern = r'^[\w.-]+@[\w.-]+\.\w+$'
        matches = series.str.match(email_pattern).sum()
        confidence = matches / len(series)
        if confidence >= 0.8:
            return DetectionResult(DataType.TEXT, confidence, {"format": "email"})
        return DetectionResult(DataType.UNKNOWN, confidence, {})

    def convert(self, series):
        # Email-specific processing if needed
        return series.astype(str)
🚀 Performance Characteristics
- Memory Efficient: Processes columns independently without loading entire dataset
- Sampling Strategy: Configurable sampling reduces processing time for large datasets
- Early Exit: Stops analysis when high confidence is reached (≥90%)
- Production Ready: Optimized for ETL pipelines and data processing workflows
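The sampling and early-exit strategy can be sketched in plain Python (hypothetical helper names, not the library's internals):

```python
import pandas as pd

def sample_column(series: pd.Series, sample_size: int = 1000) -> pd.Series:
    """Cap analysis at sample_size non-null values, as described above."""
    non_null = series.dropna()
    if len(non_null) <= sample_size:
        return non_null
    return non_null.sample(sample_size, random_state=0)

def first_confident(detectors, series, threshold=0.90):
    """Run detectors in order and stop at the first sufficiently confident match."""
    for name, detector in detectors:
        confidence = detector(series)
        if confidence >= threshold:   # early exit at >= 90% confidence
            return name, confidence
    return "text", 0.0

s = pd.Series(range(5_000))
print(len(sample_column(s)))  # 1000
```

Capping the sample keeps per-column cost roughly constant regardless of row count, and the early exit skips the remaining detectors once one is confident.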
Benchmarks
- ✅ Tested with datasets up to 14,607 rows in production
- ✅ Handles complex mixed-format data reliably
- ✅ Minimal performance overhead on modern hardware
🤝 Contributing
We welcome contributions! The modular architecture makes it easy to:
- Add new locales - Extend the `LOCALES` configuration
- Create new detectors - Inherit from the `TypeDetector` base class
- Improve algorithms - Enhance existing detection logic
- Add test cases - Expand the test suite for new scenarios
Development Setup
git clone https://github.com/yourusername/pandas-type-detector
cd pandas-type-detector
poetry install
poetry run pytest
📋 Requirements
- Python: 3.7+ (tested on 3.7, 3.8, 3.9, 3.10, 3.11, 3.12)
- pandas: ≥1.0.0
- numpy: ≥1.19.0
📄 License
MIT License - see LICENSE file for details.
🙏 Acknowledgments
This library was developed to solve real-world data quality challenges in Brazilian financial and business data processing. It has been successfully deployed in production environments handling complex PT-BR formatted datasets.
Special thanks to the pandas and NumPy communities for providing the foundation that makes this work possible.
📞 Support
- 🐛 Bug Reports: GitHub Issues
- 💡 Feature Requests: GitHub Discussions
- 📖 Documentation: This README and inline code documentation
- 🧪 Examples: See `tests/test.py` for comprehensive usage examples
Made with ❤️ for the pandas community - Simplifying data type detection across cultures and locales