A data cleaning library for Pandas and Polars DataFrames with a simple, chainable API.

🧹 Nullaxe

Nullaxe is a data cleaning and preprocessing library for Python that works with both pandas and polars DataFrames. Its chainable API lets you write a cleaning workflow as a single, readable pipeline instead of a series of one-off transformations.


🚀 Key Features

  • 🔗 Fluent, Chainable API: Clean your data in a single, readable chain of commands
  • ⚡ Dual Backend Support: Works effortlessly with both pandas and polars DataFrames
  • 🧹 Comprehensive Cleaning: From basic cleaning to advanced data extraction and transformation
  • 🪄 Display Formatting Pipeline: Format columns for presentation (currency, percentages, thousands separators, date formatting, truncation, title-cased headers)
  • 📊 Intelligent Outlier Detection: Multiple methods including IQR and Z-score analysis
  • 🔍 Advanced Data Extraction: Extract emails, phone numbers, and custom patterns with regex
  • 🎯 Smart Type Handling: Automatic type inference and standardization
  • 📈 Performance Optimized: Designed for speed and memory efficiency
  • 🔧 Extensible: Easily add custom cleaning functions to your pipeline

📦 Installation

Install Nullaxe easily with pip:

pip install nullaxe

Requirements:

  • Python 3.8+
  • pandas >= 1.0
  • polars >= 0.19

⚡ Quick Start

Here's how to transform messy data into clean, analysis-ready datasets:

import pandas as pd
import nullaxe as nlx

# Create a messy sample dataset
data = {
    'First Name': ['  John  ', 'Jane', '  Peter', 'JOHN', None],
    'Last Name': ['Smith', 'Doe', 'Jones', 'Smith', 'Brown'],
    'Age': [28, 34, None, 28, 45],
    'Email': ['john@email.com', 'invalid-email', 'peter@test.org', 'john@email.com', None],
    'Phone': ['123-456-7890', '(555) 123-4567', 'not-a-phone', '123.456.7890', '+1-800-555-0199'],
    'Salary': ['$70,000', '80000', '$65,000.50', '$70,000', '€75,000'],
    'Active': ['True', 'False', 'yes', 'TRUE', 'N'],
    'Notes': ['  Important client  ', '', '   Follow up   ', None, 'VIP']
}
df = pd.DataFrame(data)

# Clean the entire dataset with a single chain
clean_df = (
    nlx(df)
    .clean_column_names()                    # Standardize column names
    .fill_missing(value='Unknown')           # Fill missing values
    .remove_whitespace()                     # Clean whitespace
    .remove_duplicates()                     # Remove duplicate rows
    .standardize_booleans()                  # Convert boolean-like values
    .extract_email()                         # Extract email addresses
    .extract_phone_numbers()                 # Extract phone numbers
    .extract_and_clean_numeric()             # Extract numeric values from strings
    .drop_single_value_columns()             # Remove columns with only one value
    .remove_outliers(method='iqr')           # Remove rows that contain outliers
    .format_for_display(                     # Format for presentation
        rules={
            'salary': {'type': 'currency', 'symbol': '$', 'decimals': 2},
            'age': {'type': 'thousands'},
        },
        column_case='title'
    )
    .to_df()                                 # Return the cleaned, formatted DataFrame
)

print(clean_df.head())

📖 Complete API Reference

🏗️ Initialization

import nullaxe as nlx

# Initialize with any DataFrame
cleaner = nlx(df)  # Works with pandas or polars DataFrames

📝 Column Name Standardization

Transform column names to consistent formats:

# General column cleaning with case conversion
.clean_column_names(case='snake')  # Options: 'snake', 'camel', 'pascal', 'kebab', 'title', 'lower', 'screaming_snake'

# Specific case conversions
.snakecase()                       # column_name
.camelcase()                       # columnName
.pascalcase()                      # ColumnName
.kebabcase()                       # column-name
.titlecase()                       # Column Name
.lowercase()                       # column name
.screaming_snakecase()             # COLUMN_NAME
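
To make the target formats concrete, here is a minimal plain-Python sketch of how such case conversions can be done on a single column label. It is an illustration only, not Nullaxe's implementation: the helper names (split_words, to_snake, and so on) are hypothetical, and the word-splitting rules are an assumption.

```python
import re
from typing import List


def split_words(label: str) -> List[str]:
    """Split a column label into lowercase words (hypothetical helper)."""
    # Put a space at lower-to-upper boundaries so "firstName" becomes "first Name".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label.strip())
    # Any run of non-alphanumeric characters counts as a separator.
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w]


def to_snake(label: str) -> str:
    return "_".join(split_words(label))


def to_kebab(label: str) -> str:
    return "-".join(split_words(label))


def to_screaming_snake(label: str) -> str:
    return to_snake(label).upper()


def to_pascal(label: str) -> str:
    return "".join(w.capitalize() for w in split_words(label))


def to_camel(label: str) -> str:
    pascal = to_pascal(label)
    return pascal[:1].lower() + pascal[1:]


for label in ["First Name", "lastName", "  Email-Address "]:
    print(to_snake(label), to_camel(label), to_pascal(label), to_kebab(label))
```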

🔄 Data Deduplication

Remove duplicate data efficiently:

.remove_duplicates()               # Remove duplicate rows across all columns

❌ Missing Data Management

Handle missing values with precision:

# Fill missing values
.fill_missing(value=0)                           # Fill all columns with 0
.fill_missing(value='Unknown', subset=['name'])  # Fill specific columns

# Drop missing values
.drop_missing()                                  # Drop rows with any missing values
.drop_missing(how='all')                         # Drop rows where all values are missing
.drop_missing(thresh=3)                          # Keep rows with at least 3 non-null values
.drop_missing(axis='columns')                    # Drop columns with missing values
.drop_missing(subset=['name', 'email'])          # Consider only specific columns
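
The how, thresh, and subset options described in the comments above can be sketched on plain row dictionaries as follows. This is a simplified illustration of the row-wise semantics, not Nullaxe's code; drop_missing_rows is a hypothetical name, missing values are assumed to be None, and thresh is assumed to take precedence over how.

```python
def drop_missing_rows(rows, how="any", thresh=None, subset=None):
    """Return the rows that survive the given missing-value rule."""
    kept = []
    for row in rows:
        keys = subset if subset is not None else list(row)
        # Count the non-missing values among the columns being considered.
        present = sum(row.get(k) is not None for k in keys)
        if thresh is not None:
            keep = present >= thresh      # at least `thresh` non-null values
        elif how == "all":
            keep = present > 0            # drop only if every value is missing
        else:
            keep = present == len(keys)   # "any": drop if any value is missing
        if keep:
            kept.append(row)
    return kept


rows = [
    {"name": "Ann", "email": "a@x.org", "age": 30},
    {"name": None, "email": "b@x.org", "age": None},
    {"name": None, "email": None, "age": None},
]
print(len(drop_missing_rows(rows)))                    # default: how="any"
print(len(drop_missing_rows(rows, how="all")))
print(len(drop_missing_rows(rows, thresh=2)))
print(len(drop_missing_rows(rows, subset=["email"])))
```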

🧽 Text and Whitespace Cleaning

Clean and standardize text data:

.remove_whitespace()                             # Remove leading/trailing whitespace
.replace_text('old', 'new')                      # Replace text in all columns
.replace_text('old', 'new', subset=['name'])     # Replace in specific columns
.remove_punctuation()                            # Remove punctuation marks
.remove_punctuation(subset=['description'])      # Remove from specific columns
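
As a rough illustration of what these steps do to a single cell value, the sketch below uses only the standard library. The function names are hypothetical stand-ins, non-string values are assumed to pass through unchanged, and string.punctuation covers ASCII punctuation only.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_whitespace(value):
    """Trim leading and trailing whitespace; leave non-strings untouched."""
    return value.strip() if isinstance(value, str) else value


def strip_punctuation(value):
    """Delete ASCII punctuation; leave non-strings untouched."""
    return value.translate(_PUNCT_TABLE) if isinstance(value, str) else value


notes = ["  Important client  ", "", "   Follow up   ", None, "VIP"]
print([strip_whitespace(n) for n in notes])
print(strip_punctuation("Hello, world!"))
```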

🗂️ Column Management

Manage DataFrame structure:

.drop_single_value_columns()                     # Remove columns with only one unique value
.remove_unwanted_rows_and_cols()                 # Remove rows/cols with unwanted values
.remove_unwanted_rows_and_cols(                  # Custom unwanted values
    unwanted_values=['', 'N/A', 'NULL']
)

📊 Outlier Detection and Handling

Sophisticated outlier management:

# General outlier handling
.handle_outliers()                               # Default: IQR method, factor=1.5
.handle_outliers(method='zscore', factor=2.0)    # Z-score method
.handle_outliers(subset=['salary', 'age'])       # Specific columns only

# Cap outliers (replace with threshold values)
.cap_outliers()                                  # Cap using IQR method
.cap_outliers(method='zscore', factor=2.5)       # Cap using Z-score

# Remove outlier rows entirely
.remove_outliers()                               # Remove rows with outliers
.remove_outliers(method='iqr', factor=1.5)       # Custom parameters

Outlier Detection Methods:

  • IQR (Interquartile Range): Q1 - factor*IQR to Q3 + factor*IQR
  • Z-Score: Values beyond factor standard deviations from the mean
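
The two rules above can be worked through with the standard library. This is a sketch under stated assumptions rather than Nullaxe's implementation: quartiles use linear interpolation (the "inclusive" method), the Z-score rule uses the population standard deviation, and the helper names and the default factor of 3.0 for Z-score are hypothetical.

```python
import statistics


def iqr_bounds(values, factor=1.5):
    """Return (Q1 - factor*IQR, Q3 + factor*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def zscore_bounds(values, factor=3.0):
    """Return (mean - factor*std, mean + factor*std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - factor * std, mean + factor * std


def cap_values(values, low, high):
    """Capping: clamp each value into [low, high]."""
    return [min(max(v, low), high) for v in values]


def drop_values(values, low, high):
    """Removal: keep only values inside [low, high]."""
    return [v for v in values if low <= v <= high]


data = [10, 12, 11, 13, 12, 11, 100]
low, high = iqr_bounds(data)
print(low, high)
print(drop_values(data, low, high))
print(cap_values(data, low, high))
```

With this sample, a single extreme value inflates the mean and standard deviation, so the Z-score bounds are much wider than the IQR bounds; that is one reason to compare both methods on your own data before choosing a factor.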

🔧 Data Type Standardization

Convert and standardize data types:

# Boolean standardization
.standardize_booleans()                          # Convert 'yes/no', 'true/false', etc.
.standardize_booleans(
    true_values=['yes', 'y', '1', 'true'],       # Custom true values
    false_values=['no', 'n', '0', 'false'],     # Custom false values
    columns=['active', 'verified']              # Specific columns
)

Default Boolean Mappings:

  • True: 'true', '1', 't', 'yes', 'y', 'on'
  • False: 'false', '0', 'f', 'no', 'n', 'off'
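
A minimal sketch of applying the default mappings above to individual values is shown below. It assumes case-insensitive matching after trimming whitespace and leaves unrecognized values unchanged; both behaviors are assumptions for illustration, and to_bool is a hypothetical helper, not part of Nullaxe.

```python
TRUE_VALUES = ("true", "1", "t", "yes", "y", "on")
FALSE_VALUES = ("false", "0", "f", "no", "n", "off")


def to_bool(value, true_values=TRUE_VALUES, false_values=FALSE_VALUES):
    """Map a boolean-like value to True/False; pass anything else through."""
    if isinstance(value, bool) or value is None:
        return value
    # Normalize both the input and the lookup sets so 'TRUE' matches 'true'.
    key = str(value).strip().lower()
    if key in {str(v).lower() for v in true_values}:
        return True
    if key in {str(v).lower() for v in false_values}:
        return False
    return value  # assumption: unrecognized values are left as-is


active = ["True", "False", "yes", "TRUE", "N"]
print([to_bool(v) for v in active])
print(to_bool("Agree", true_values=["Agree"], false_values=["Disagree"]))
```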

🔍 Advanced Data Extraction

Extract structured data from unstructured text:

# Email extraction
.extract_email()                                 # Extract emails from all columns
.extract_email(subset=['contact_info'])          # From specific columns

# Phone number extraction
.extract_phone_numbers()                         # Extract phone numbers
.extract_phone_numbers(subset=['contact'])       # From specific columns

# Numeric data extraction and cleaning
.extract_and_clean_numeric()                     # Extract numbers from text
.extract_and_clean_numeric(subset=['prices'])    # From specific columns

# Custom regex extraction (interactive)
.extract_with_regex()                            # Prompts for regex pattern
.extract_with_regex(subset=['text_column'])      # From specific columns

# Combined numeric cleaning
.clean_numeric()                                 # Extract + outlier handling
.clean_numeric(method='zscore', factor=2.0)      # Custom outlier parameters
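
To show the general idea behind regex-based extraction, here is a small standard-library sketch for emails and numeric strings. The patterns are deliberately simple assumptions (for example, commas are treated as thousands separators and the dot as the decimal mark), the function names are hypothetical, and Nullaxe's own patterns may differ.

```python
import re

# A pragmatic email pattern; it does not cover every valid address form.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_email(text):
    """Return the first email-like substring, or None if there is none."""
    if text is None:
        return None
    match = EMAIL_RE.search(text)
    return match.group(0) if match else None


def find_number(text):
    """Strip currency symbols and separators, then parse as float."""
    if text is None:
        return None
    cleaned = re.sub(r"[^0-9.\-]", "", str(text))
    try:
        return float(cleaned)
    except ValueError:
        return None  # nothing numeric was left to parse


emails = ["john@email.com", "invalid-email", "Contact: peter@test.org.", None]
salaries = ["$70,000", "80000", "$65,000.50", "€75,000", "not-a-phone"]
print([find_email(e) for e in emails])
print([find_number(s) for s in salaries])
```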

🪄 Display / Presentation Formatting (NEW in 0.3.0)

Format cleaned data for reports, dashboards, exports:

.format_for_display(
    rules={
        'price': {'type': 'currency', 'symbol': '$', 'decimals': 2},
        'growth': {'type': 'percentage', 'decimals': 1},
        'volume': {'type': 'thousands'},
        'description': {'type': 'truncate', 'length': 30},
        'event_date': {'type': 'datetime', 'format': '%B %d, %Y'}
    },
    column_case='title'  # or None to preserve original column names
)

Supported rule types:

  • currency: symbol + thousands + decimal precision
  • percentage: multiplies by 100 + suffix %
  • thousands: adds thousands separators, removes trailing .0 for whole floats
  • truncate: shortens long text and appends ...
  • datetime: parses and formats date/time strings
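
Each rule type above corresponds to a simple value-to-string transformation. The sketch below shows one plausible version of each using Python's format mini-language; the fmt_* names are hypothetical, and details such as whether the truncation length includes the trailing ... or which date formats are accepted on input are assumptions, not documented Nullaxe behavior.

```python
from datetime import datetime


def fmt_currency(value, symbol="$", decimals=2):
    # Symbol, thousands separators, and fixed decimal precision.
    return f"{symbol}{value:,.{decimals}f}"


def fmt_percentage(value, decimals=1):
    # Multiply by 100 and append a percent sign.
    return f"{value * 100:.{decimals}f}%"


def fmt_thousands(value):
    # Whole floats are shown without a trailing ".0".
    if float(value).is_integer():
        return f"{int(value):,}"
    return f"{value:,}"


def fmt_truncate(text, length=30):
    # Assumption: keep `length` characters, then append "...".
    return text if len(text) <= length else text[:length] + "..."


def fmt_datetime(text, fmt="%B %d, %Y"):
    # Assumption: input is an ISO 8601 date or datetime string.
    return datetime.fromisoformat(text).strftime(fmt)


print(fmt_currency(70000))
print(fmt_percentage(0.125))
print(fmt_thousands(1234567.0), fmt_thousands(1234.5))
print(fmt_truncate("A very long product description that keeps going"))
print(fmt_datetime("2024-03-05", "%Y/%m/%d"))
```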

You can also call the function directly:

from nullaxe.functions import format_for_display
formatted = format_for_display(df, rules=..., column_case='title')

📤 Output

.to_df()                                         # Return the cleaned DataFrame

🎯 Advanced Usage Examples

Real-World Data Cleaning Pipeline

import pandas as pd
import nullaxe as nlx

# Load messy customer data
df = pd.read_csv('messy_customer_data.csv')

# Comprehensive cleaning + formatting pipeline
clean_customers = (
    nlx(df)
    .clean_column_names(case='snake')
    .fill_missing(value='Not Provided')
    .remove_whitespace()
    .standardize_booleans(columns=['is_active', 'newsletter_opt_in'])
    .extract_email(subset=['contact_info'])
    .extract_phone_numbers(subset=['contact_info'])
    .extract_and_clean_numeric(subset=['revenue', 'age'])
    .remove_outliers(method='iqr', factor=2.0, subset=['revenue'])
    .drop_single_value_columns()
    .remove_duplicates()
    .format_for_display(
        rules={
            'revenue': {'type': 'currency', 'symbol': '$', 'decimals': 2},
            'age': {'type': 'thousands'},
            'signup_date': {'type': 'datetime', 'format': '%Y-%m-%d'}
        },
        column_case='title'
    )
    .to_df()
)

Financial Data Processing

financial_clean = (
    nlx(transactions_df)
    .clean_column_names(case='snake')
    .fill_missing(value=0, subset=['amount'])
    .extract_and_clean_numeric(subset=['amount', 'fee'])
    .standardize_booleans(columns=['is_recurring'])
    .cap_outliers(method='zscore', factor=3.0, subset=['amount'])
    .remove_whitespace()
    .format_for_display(
        rules={'amount': {'type': 'currency', 'symbol': '$', 'decimals': 2}},
        column_case='title'
    )
    .to_df()
)

Survey Data Standardization

survey_clean = (
    nlx(survey_df)
    .clean_column_names(case='snake')
    .standardize_booleans(
        true_values=['Yes', 'Y', 'Agree', 'True', '1'],
        false_values=['No', 'N', 'Disagree', 'False', '0']
    )
    .fill_missing(value='No Response')
    .remove_whitespace()
    .drop_single_value_columns()
    .format_for_display(
        rules={'age': {'type': 'thousands'}},
        column_case='title'
    )
    .to_df()
)

🔄 Method Chaining Benefits

Nullaxe's chainable API provides several advantages:

  1. Readability: Each step is clear and self-documenting
  2. Maintainability: Easy to add, remove, or reorder operations
  3. Performance: Optimized internal operations reduce memory overhead
  4. Flexibility: Mix and match operations based on your data's needs

# Traditional approach (verbose and hard to follow)
df = remove_duplicates(df)
df = fill_missing(df, value='Unknown')
df = standardize_booleans(df)
df = remove_outliers(df, method='iqr')

# Nullaxe approach (clean and readable)
df = (nlx(df)
      .remove_duplicates()
      .fill_missing(value='Unknown')
      .standardize_booleans()
      .remove_outliers(method='iqr')
      .format_for_display(rules={'value': {'type': 'currency'}}, column_case='title')
      .to_df())
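
For readers curious how a chainable API works in general, the toy class below shows the usual pattern: each method updates internal state and returns self, and a final method hands back the result. It operates on plain row dictionaries and is purely illustrative; the class name Pipeline and the to_rows method are hypothetical and say nothing about Nullaxe's internals.

```python
class Pipeline:
    """Toy chainable wrapper over a list of row dicts (not Nullaxe's class)."""

    def __init__(self, rows):
        # Copy each row so the caller's data is never modified.
        self._rows = [dict(row) for row in rows]

    def remove_duplicates(self):
        seen, unique = set(), []
        for row in self._rows:
            # Sort by column name so key order does not affect equality.
            key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
            if key not in seen:
                seen.add(key)
                unique.append(row)
        self._rows = unique
        return self  # returning self is what makes chaining possible

    def fill_missing(self, value):
        self._rows = [
            {k: (value if v is None else v) for k, v in row.items()}
            for row in self._rows
        ]
        return self

    def to_rows(self):
        return self._rows


raw = [{"a": 1, "b": None}, {"a": 1, "b": None}, {"a": 2, "b": "x"}]
result = Pipeline(raw).remove_duplicates().fill_missing("Unknown").to_rows()
print(result)
```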

🚀 Performance Tips

  1. Use polars for large datasets - Nullaxe automatically optimizes for polars' performance
  2. Chain operations efficiently - Nullaxe minimizes intermediate copies
  3. Specify subsets - Process only the columns you need
  4. Choose appropriate outlier methods - IQR is faster, Z-score is more sensitive

# Performance-optimized pipeline
result = (
    nlx(large_df)
    .remove_duplicates()
    .drop_single_value_columns()
    .fill_missing(value=0, subset=['numeric_cols'])
    .remove_outliers(method='iqr', subset=['revenue'])
    .format_for_display(rules={'revenue': {'type': 'currency'}}, column_case=None)
    .to_df()
)

🧪 Testing and Quality Assurance

Nullaxe includes comprehensive test coverage with 118+ test cases covering:

  • ✅ pandas and polars compatibility
  • ✅ Edge cases and error handling
  • ✅ Performance optimization
  • ✅ Data integrity preservation
  • ✅ Type safety and validation
  • ✅ Presentation formatting (currency, percentage, thousands, truncation, datetime, column casing)

Run tests locally:

git clone https://github.com/johntocci/nullaxe
cd nullaxe
pip install -e ".[dev]"
pytest tests/

🤝 Contributing

We welcome contributions! Nullaxe is designed to be extensible and community-driven.

How to Contribute

  1. Fork the repository on GitHub
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Add your changes with comprehensive tests
  4. Follow the coding standards (black formatting, type hints)
  5. Run the test suite: pytest tests/
  6. Submit a pull request with a clear description

Development Setup

# Clone and setup development environment
git clone https://github.com/johntocci/nullaxe
cd nullaxe
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black src/ tests/

Adding New Functions

Nullaxe's modular architecture makes it easy to add new cleaning functions:

  1. Create your function in src/nullaxe/functions/
  2. Add it to the imports in src/nullaxe/functions/__init__.py
  3. Add a corresponding method to the Nullaxe class
  4. Write comprehensive tests in tests/

📋 Changelog

  • Migration: replace import sanex as sx with import nullaxe as nlx, and sx( with nlx(

Version 0.3.0

  • ✨ Added format_for_display function + chain method for presentation formatting
  • ✨ Added support for currency, percentage, thousands, truncate, datetime formatting
  • ✨ Title-case header option integrated into formatting step
  • 🛠 Refactored internal formatting for pandas + polars parity
  • ✅ Expanded test suite (now 118+ tests) including display formatting
  • ⚡ Improved thousands formatting (no trailing .0 on whole floats)

Version 0.2.0

  • ✨ Added comprehensive data extraction capabilities
  • ✨ Enhanced outlier detection with multiple methods
  • ✨ Improved text processing and punctuation removal
  • 🐛 Fixed boolean standardization edge cases
  • 🐛 Resolved missing data handling in complex workflows
  • ⚡ Performance optimizations for large datasets
  • 📚 Comprehensive documentation updates

Version 0.1.0

  • 🎉 Initial release with core cleaning functionality
  • 🔗 Chainable API implementation
  • 🔄 pandas and polars support

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • Built with ❤️ for the data science community
  • Inspired by the need for simple, powerful data cleaning tools
  • Thanks to all contributors and users who help improve Nullaxe
