Convert between Pandas and Polars schemas with ease

Cubchoo

Convert Universal Between Common Handling of Object Organization

๐Ÿป A lightweight Python library for seamless schema conversion between Pandas and Polars

Cubchoo provides simple, bidirectional conversion functions to transform schemas between Pandas and Polars. It covers the basic, nullable, datetime, and categorical types; nested Polars types (List, Struct, Map) are mapped to the pandas object dtype. It suits projects that work with both libraries or migrate between them.

Features

✨ Key Features:

  • 🔄 Bidirectional Conversion: Convert schemas in both directions (Pandas ↔ Polars)
  • 📦 Multiple Input Formats: Support for dictionaries, lists, DataFrames, Series, and Schema objects
  • 🎯 Comprehensive Type Support: All basic types (integers, floats, strings, booleans, datetimes, etc.)
  • ✅ Nullable Types: Full support for nullable integer and float types (Int64, Float64, etc.)
  • 🏷️ Categorical Data: Support for categorical data types
  • 🛡️ Error Handling: Clear error messages with custom exceptions
  • ✔️ Input Validation: Validates schemas before conversion
  • 🚀 Minimal Dependencies: Requires only pandas and polars
  • 🪶 PyArrow Support: Optional PyArrow type support (install with pip install cubchoo[pyarrow])

Installation

From PyPI

pip install cubchoo

With PyArrow Support (Optional)

PyArrow support is available as an optional dependency:

pip install cubchoo[pyarrow]

From Source

git clone https://github.com/eddiethedean/cubchoo.git
cd cubchoo
pip install -e .

Development Installation

pip install -e ".[dev]"

All Dependencies

pip install cubchoo[all]

Requirements

  • Python: >= 3.8
  • pandas: >= 1.3.0
  • polars: >= 0.19.0
  • pyarrow: >= 10.0.0 (optional, for PyArrow type support)

Quick Start

Get started with Cubchoo in just a few lines of code!

Converting Pandas Schema to Polars

Cubchoo supports multiple Pandas schema formats - use whichever is most convenient:

import pandas as pd
from cubchoo import to_polars_schema

# Format 1: Dictionary with string dtype names
pandas_schema_dict = {
    "name": "string",
    "age": "Int64",
    "score": "Float64",
    "tags": "object",  # Lists/arrays in pandas are typically object type
}

# Format 2: Dictionary with pandas dtype objects
pandas_schema_dtypes = {
    "name": pd.StringDtype(),
    "age": pd.Int64Dtype(),
    "score": pd.Float64Dtype(),
}

# Format 3: pandas DataFrame (schema inferred from dtypes)
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [30, 25],
    "score": [95.5, 87.0],
})

# Format 4: List of tuples
pandas_schema_list = [
    ("name", "string"),
    ("age", "Int64"),
    ("score", "Float64"),
]

# All formats work identically!
polars_schema = to_polars_schema(pandas_schema_dict)
# or: to_polars_schema(pandas_schema_dtypes)
# or: to_polars_schema(df)
# or: to_polars_schema(pandas_schema_list)

print(polars_schema)
# Schema({'name': <class 'polars.datatypes.String'>, 
#         'age': <class 'polars.datatypes.Int64'>, 
#         'score': <class 'polars.datatypes.Float64'>, ...})

# Use directly with Polars DataFrame
import polars as pl
df_polars = pl.DataFrame({}, schema=polars_schema)

Converting Polars Schema to Pandas

import polars as pl
from cubchoo import to_pandas_schema

# Define a Polars schema
polars_schema = pl.Schema({
    "name": pl.String,
    "age": pl.Int32,
    "score": pl.Float64,
    "tags": pl.List(pl.String),
})

# Convert to Pandas schema (returns dict of dtype strings)
pandas_schema = to_pandas_schema(polars_schema)
print(pandas_schema)
# {'name': 'string', 'age': 'Int32', 'score': 'Float64', 'tags': 'object'}

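# Use with pandas: build an empty DataFrame one typed column at a time
# (the DataFrame constructor's dtype argument takes a single dtype, not a per-column dict)
import pandas as pd
df = pd.DataFrame({name: pd.Series(dtype=dtype) for name, dtype in pandas_schema.items()})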

Using PyArrow Types (Optional)

With PyArrow support installed, you can use PyArrow types (pa.DataType) directly in your schemas:

import pyarrow as pa
from cubchoo import to_polars_schema, to_pandas_schema

# Use PyArrow types in a schema dictionary
schema_with_pa = {
    "name": pa.string(),
    "age": pa.int32(),
    "score": pa.float64(),
}

# Convert PyArrow types to Polars
polars_schema = to_polars_schema(schema_with_pa)

# Convert PyArrow types to Pandas
pandas_schema = to_pandas_schema(schema_with_pa)

# You can also mix Pandas and PyArrow types
mixed_schema = {
    "name": "string",      # Pandas dtype string
    "age": pa.int32(),     # PyArrow type
    "score": "Float64",    # Pandas dtype string
}
polars_schema = to_polars_schema(mixed_schema)

Note: PyArrow support requires installing with pip install cubchoo[pyarrow].
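
Because PyArrow is optional, calling code may want to check for it before building a schema that contains pa.DataType values. The following is a minimal sketch using only the standard library; has_pyarrow is a hypothetical helper for illustration and is not part of Cubchoo.

```python
import importlib.util


def has_pyarrow() -> bool:
    """Return True if the optional pyarrow package can be imported."""
    return importlib.util.find_spec("pyarrow") is not None


if has_pyarrow():
    print("pyarrow is available; pa.DataType values can be used in schemas")
else:
    print("pyarrow is missing; install it with: pip install cubchoo[pyarrow]")
```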

Supported Type Mappings

Cubchoo supports comprehensive type mappings between Pandas and Polars. Below are the supported conversions:

Pandas → Polars

Pandas Type            Polars Type
int8, Int8             pl.Int8
int16, Int16           pl.Int16
int32, Int32           pl.Int32
int64, Int64           pl.Int64
uint8, UInt8           pl.UInt8
uint16, UInt16         pl.UInt16
uint32, UInt32         pl.UInt32
uint64, UInt64         pl.UInt64
float32, Float32       pl.Float32
float64, Float64       pl.Float64
object, string, str    pl.String
bool, boolean          pl.Boolean
datetime64[ns]         pl.Datetime
datetime64[us]         pl.Datetime(time_unit="us")
datetime64[ms]         pl.Datetime(time_unit="ms")
datetime64[s]          pl.Datetime(time_unit="ms") *
date                   pl.Date
timedelta64[ns]        pl.Duration
timedelta64[us]        pl.Duration(time_unit="us")
timedelta64[ms]        pl.Duration(time_unit="ms")
timedelta64[s]         pl.Duration(time_unit="ms") *
category               pl.Categorical

* Note: Polars doesn't support seconds (s) time unit, so datetime64[s] and timedelta64[s] are converted to milliseconds (ms) as the closest equivalent.
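
The fallback described in the note can be sketched in a few lines. This illustrates the rule only and is not Cubchoo's implementation; polars_time_unit and POLARS_TIME_UNITS are hypothetical names introduced for the example.

```python
# Time units that Polars accepts for Datetime and Duration
POLARS_TIME_UNITS = ("ns", "us", "ms")


def polars_time_unit(pandas_dtype: str) -> str:
    """Pick the Polars time unit for a dtype string such as 'datetime64[s]'."""
    # Extract the unit between the square brackets
    unit = pandas_dtype[pandas_dtype.index("[") + 1 : pandas_dtype.index("]")]
    if unit in POLARS_TIME_UNITS:
        return unit
    if unit == "s":
        # Seconds are not available, so fall back to milliseconds
        return "ms"
    raise ValueError(f"unsupported time unit: {unit!r}")


print(polars_time_unit("datetime64[us]"))
print(polars_time_unit("timedelta64[s]"))
```

The first call keeps the unit as is (us); the second falls back from seconds to ms.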

Polars → Pandas

Polars Type           Pandas Type
pl.Int8               Int8
pl.Int16              Int16
pl.Int32              Int32
pl.Int64              Int64
pl.UInt8              UInt8
pl.UInt16             UInt16
pl.UInt32             UInt32
pl.UInt64             UInt64
pl.Float32            Float32
pl.Float64            Float64
pl.String, pl.Utf8    string
pl.Boolean            boolean
pl.Datetime           datetime64[ns]
pl.Date               date
pl.Duration           timedelta64[ns]
pl.Categorical        category
pl.List(...)          object
pl.Struct(...)        object
pl.Map(...)           object

Limitations and Notes

Complex Types

  • List/Array Types: Polars List types are converted to pandas object dtype, as pandas doesn't have native array/list dtypes. The column can still hold list values, but the element type is no longer recorded in the schema.
  • Struct Types: Polars Struct types are converted to pandas object dtype, as pandas doesn't have native struct types.
  • Map Types: Polars Map types are converted to pandas object dtype, as pandas doesn't have native map types.
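
One practical consequence, going by the mapping tables above: a nested type does not survive a Polars → Pandas → Polars round trip, because object maps back to pl.String. The sketch below models a small subset of the two tables with plain strings. It does not call Cubchoo, and the names are hypothetical.

```python
# Subsets of the two mapping tables, written as plain strings
POLARS_TO_PANDAS = {"Int64": "Int64", "String": "string", "List(String)": "object"}
PANDAS_TO_POLARS = {"Int64": "Int64", "string": "String", "object": "String"}


def round_trip(polars_type: str) -> str:
    """Convert a Polars type name to pandas and back using the tables above."""
    return PANDAS_TO_POLARS[POLARS_TO_PANDAS[polars_type]]


for polars_type in POLARS_TO_PANDAS:
    result = round_trip(polars_type)
    status = "preserved" if result == polars_type else "changed"
    print(f"{polars_type} -> {POLARS_TO_PANDAS[polars_type]} -> {result} ({status})")
```

Only the List(String) row is reported as changed, which is why nested columns are worth tracking separately when a schema has to travel in both directions.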

Datetime Handling

  • Pandas datetime64 types with different time units are converted to Polars Datetime with corresponding time units.
  • Timezone information in Polars Datetime types is not preserved when converting to pandas (pandas uses timezone-naive by default).
  • Polars Time types are converted to pandas object dtype, as pandas doesn't have a native time type.

Nullability

  • Pandas → Polars: Nullable types (Int64, Float64, etc.) are converted to their Polars equivalents.
  • Polars → Pandas: All Polars types can contain nulls, so the nullable pandas dtypes (Int64, Float64, string, boolean, etc.) are used wherever the mapping table lists one.

Input Validation

Cubchoo validates schemas before conversion:

  • Duplicate field names: Raises SchemaError if duplicate field names are detected
  • Empty field names: Raises SchemaError if any field name is an empty string
  • Invalid field types: Raises SchemaError if field types are None
  • Invalid field name types: Raises SchemaError if field names are not strings
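
The four checks above can be sketched as follows. This is an illustration under assumptions, not Cubchoo's actual validation code: the local SchemaError class is a stand-in so the example runs on its own, and validate_fields is a hypothetical helper that takes a list of (name, dtype) pairs, the only input format in which duplicate names can occur.

```python
class SchemaError(Exception):
    """Stand-in for cubchoo.SchemaError so the sketch runs on its own."""


def validate_fields(fields):
    """Check a list of (name, dtype) pairs against the four rules above."""
    seen = set()
    for name, dtype in fields:
        if not isinstance(name, str):
            raise SchemaError(f"field name must be a string, got {type(name).__name__}")
        if name == "":
            raise SchemaError("field name must not be empty")
        if name in seen:
            raise SchemaError(f"duplicate field name: {name!r}")
        if dtype is None:
            raise SchemaError(f"field {name!r} has no type")
        seen.add(name)


# A well-formed schema passes silently
validate_fields([("name", "string"), ("age", "Int64")])

# A duplicate name is rejected
try:
    validate_fields([("age", "Int64"), ("age", "Int32")])
except SchemaError as exc:
    print(f"rejected: {exc}")
```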

Advanced Examples

Working with DataFrames

Easily convert schemas between DataFrames:

import pandas as pd
import polars as pl
from cubchoo import to_polars_schema, to_pandas_schema

# Start with a pandas DataFrame
df_pandas = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 35],
    "score": [95.5, 87.0, 92.3],
})

# Convert schema to Polars
polars_schema = to_polars_schema(df_pandas)

# Create empty Polars DataFrame with same schema
df_polars = pl.DataFrame({}, schema=polars_schema)

# Convert back to pandas schema
pandas_schema = to_pandas_schema(polars_schema)
print(pandas_schema)
# {'name': 'string', 'age': 'Int64', 'score': 'Float64'}

Categorical Data

Handle categorical data seamlessly:

import pandas as pd
import polars as pl
from cubchoo import to_polars_schema, to_pandas_schema

# Pandas categorical
pandas_schema = {"category": "category"}
polars_schema = to_polars_schema(pandas_schema)
assert polars_schema["category"] == pl.Categorical()

# Polars categorical
polars_schema = pl.Schema({"category": pl.Categorical()})
pandas_schema = to_pandas_schema(polars_schema)
assert pandas_schema["category"] == "category"

Series Conversion

Convert single-column schemas from pandas Series:

import pandas as pd
from cubchoo import to_polars_schema

# Named Series
series = pd.Series([1, 2, 3], name="numbers", dtype="Int64")
polars_schema = to_polars_schema(series)
# Returns: Schema({'numbers': <class 'polars.datatypes.Int64'>})

Round-Trip Conversion

Verify schema integrity with round-trip conversions:

import pandas as pd
import polars as pl
from cubchoo import to_polars_schema, to_pandas_schema

# Start with Pandas schema
original = {
    "name": "string",
    "age": "Int64",
    "score": "Float64",
}

# Convert to Polars and back
polars_schema = to_polars_schema(original)
converted_back = to_pandas_schema(polars_schema)

# Verify types match
assert converted_back["name"] == original["name"]
assert converted_back["age"] == original["age"]
assert converted_back["score"] == original["score"]
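
Not every schema round-trips unchanged. According to the mapping tables, object maps to pl.String, which maps back to string. A small comparison helper makes such differences visible instead of asserting field by field. This is a sketch using plain dictionaries; schema_diff is a hypothetical helper, and the converted_back value is written by hand from the tables rather than produced by Cubchoo.

```python
def schema_diff(original: dict, converted: dict) -> dict:
    """Map each differing field name to its (original, converted) dtypes."""
    diff = {}
    for name in original.keys() | converted.keys():
        before = original.get(name)
        after = converted.get(name)
        if before != after:
            diff[name] = (before, after)
    return diff


original = {"name": "string", "age": "Int64", "tags": "object"}
# Hand-written round-trip result in which "tags" came back as "string"
converted_back = {"name": "string", "age": "Int64", "tags": "string"}

print(schema_diff(original, converted_back))  # {'tags': ('object', 'string')}
```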

Datetime with Time Units

Handle different datetime precision:

import pandas as pd
import polars as pl
from cubchoo import to_polars_schema, to_pandas_schema

# Pandas with different time units
pandas_schema = {
    "ns_timestamp": "datetime64[ns]",
    "us_timestamp": "datetime64[us]",
    "ms_timestamp": "datetime64[ms]",
}
polars_schema = to_polars_schema(pandas_schema)

# Polars with time units
polars_schema = {
    "ns": pl.Datetime(time_unit="ns"),
    "us": pl.Datetime(time_unit="us"),
    "ms": pl.Datetime(time_unit="ms"),
}
pandas_schema = to_pandas_schema(polars_schema)

Error Handling

Cubchoo provides clear error messages through custom exceptions. All exceptions inherit from ConversionError, so you can catch all conversion errors at once or handle them individually:

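from cubchoo import ConversionError, UnsupportedTypeError, SchemaError, to_polars_schema

# A schema that fails validation: field names must not be empty
invalid_schema = {"": "Int64"}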

# Example 1: Handle specific error types
try:
    schema = to_polars_schema(invalid_schema)
except SchemaError as e:
    print(f"Invalid schema structure: {e}")
    # Handles: duplicate field names, empty field names, invalid field types, etc.
except UnsupportedTypeError as e:
    print(f"Unsupported type: {e}")
    # Handles: types that cannot be converted between Pandas and Polars
except ConversionError as e:
    print(f"General conversion error: {e}")
    # Catches all conversion-related errors (base class)

# Example 2: Catch all conversion errors
try:
    schema = to_polars_schema(invalid_schema)
except ConversionError as e:
    print(f"Conversion failed: {e}")
    # This will catch SchemaError, UnsupportedTypeError, and any future error types
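
The order of the except clauses in Example 1 matters: Python runs the first clause that matches, so the base class has to come last or it would shadow the specific handlers. The sketch below demonstrates this with stand-in classes that mirror the hierarchy described above. They are defined locally so the example runs without Cubchoo, and classify is a hypothetical helper.

```python
class ConversionError(Exception):
    """Stand-in base class mirroring the hierarchy described above."""


class SchemaError(ConversionError):
    """Stand-in for an invalid schema structure."""


class UnsupportedTypeError(ConversionError):
    """Stand-in for a type with no mapping."""


def classify(exc: Exception) -> str:
    """Report which handler runs when subclasses are listed before the base."""
    try:
        raise exc
    except SchemaError:
        return "schema"
    except UnsupportedTypeError:
        return "unsupported"
    except ConversionError:
        return "conversion"


print(classify(SchemaError("duplicate field")))
print(classify(UnsupportedTypeError("no mapping")))
print(classify(ConversionError("something else")))
```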

API Reference

to_polars_schema(pandas_schema)

Convert a Pandas schema to a Polars Schema.

Parameters:

  • pandas_schema: Pandas schema in any supported format:
    • dict[str, dtype]: Dictionary mapping field names to dtypes
    • pandas.DataFrame: DataFrame (schema inferred from dtypes)
    • pandas.Series: Series (for single column)
    • list[tuple[str, dtype]]: List of (field_name, dtype) tuples

Returns:

  • polars.Schema: Polars Schema object mapping field names to Polars types

Raises:

  • SchemaError: If the schema structure is invalid
  • UnsupportedTypeError: If a type cannot be converted

Example:

import pandas as pd
from cubchoo import to_polars_schema

# All formats work:
schema1 = {"name": "string", "age": "Int64"}
schema2 = pd.DataFrame({"name": ["Alice"], "age": [30]})
schema3 = [("name", "string"), ("age", "Int64")]

polars_schema = to_polars_schema(schema1)  # or schema2, or schema3
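
Accepting several input formats usually comes down to normalizing them into one internal shape first. The sketch below shows one way that could look for the dictionary and list-of-tuples formats. It is an assumption about the general approach, not Cubchoo's code; normalize_schema is a hypothetical helper, and DataFrame and Series inputs are left out to keep the example free of third-party imports.

```python
def normalize_schema(schema):
    """Return a list of (name, dtype) pairs for a dict or a list of tuples."""
    if isinstance(schema, dict):
        return list(schema.items())
    if isinstance(schema, (list, tuple)):
        pairs = []
        for item in schema:
            if not (isinstance(item, (list, tuple)) and len(item) == 2):
                raise TypeError(f"expected a (name, dtype) pair, got {item!r}")
            pairs.append((item[0], item[1]))
        return pairs
    raise TypeError(f"unsupported schema format: {type(schema).__name__}")


as_dict = {"name": "string", "age": "Int64"}
as_list = [("name", "string"), ("age", "Int64")]
print(normalize_schema(as_dict) == normalize_schema(as_list))  # True
```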

to_pandas_schema(polars_schema)

Convert a Polars Schema to a Pandas schema.

Parameters:

  • polars_schema: Polars schema in any supported format:
    • pl.Schema object
    • dict[str, pl.DataType]: Dictionary mapping field names to types
    • list[tuple[str, pl.DataType]]: List of (field_name, type) tuples

Returns:

  • dict[str, str]: Dictionary mapping column names to pandas dtype strings

Raises:

  • SchemaError: If the schema structure is invalid
  • UnsupportedTypeError: If a type cannot be converted

Example:

import polars as pl
from cubchoo import to_pandas_schema

polars_schema = pl.Schema({
    "name": pl.String,
    "age": pl.Int32
})

pandas_schema = to_pandas_schema(polars_schema)
# Returns: {'name': 'string', 'age': 'Int32'}

PyArrow Type Support

Both to_polars_schema() and to_pandas_schema() support PyArrow types (pa.DataType) as input. When PyArrow types are detected in a schema, they are automatically converted to the target format.

Example:

import pyarrow as pa
from cubchoo import to_polars_schema, to_pandas_schema

# Schema with PyArrow types
schema = {
    "name": pa.string(),
    "age": pa.int32(),
    "score": pa.float64(),
}

# Convert to Polars
polars_schema = to_polars_schema(schema)

# Convert to Pandas
pandas_schema = to_pandas_schema(schema)

Note: PyArrow support requires installing with pip install cubchoo[pyarrow].

Development

Setup

  1. Clone the repository:
git clone https://github.com/eddiethedean/cubchoo.git
cd cubchoo
  2. Install in development mode:
pip install -e ".[dev]"

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=cubchoo --cov-report=html

# Run specific test file
pytest tests/test_converters.py

Code Quality

# Format code
ruff format .

# Lint code
ruff check .

# Type checking
mypy cubchoo

Project Structure

cubchoo/
├── cubchoo/
│   ├── __init__.py          # Public API
│   ├── converters.py         # Core conversion functions
│   ├── type_mappings.py      # Type mapping dictionaries
│   └── errors.py             # Custom exceptions
├── tests/
│   ├── __init__.py
│   ├── test_converters.py    # Conversion tests
│   └── test_type_mappings.py # Type mapping tests
├── .gitignore
├── LICENSE
├── README.md
├── CHANGELOG.md
└── pyproject.toml            # Package configuration

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Odos Matthews


Made with ❤️ for the Python data science community

Contributing

Contributions are welcome and greatly appreciated! 🎉

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for new functionality
  5. Ensure all tests pass (pytest)
  6. Ensure code quality (ruff format . && ruff check . && mypy cubchoo)
  7. Commit your changes (git commit -m 'Add some amazing feature')
  8. Push to the branch (git push origin feature/amazing-feature)
  9. Open a Pull Request

Reporting Issues

If you find a bug or have a feature request, please open an issue on GitHub.

Related Projects

  • charmander - Convert between Polars and PySpark schemas
  • poldantic - Convert between Pydantic models and Polars schemas

Inspiration

This project is inspired by charmander, which provides similar functionality for converting between Polars and PySpark schemas.
