A library for switching pandas backend to pyarrow

Project description

pandas-pyarrow

pandas-pyarrow converts the dtypes of a pandas DataFrame to their pyarrow-backed equivalents, making it easy to switch to the pandas pyarrow backend.

Get started:

Installation

Install the package using pip:

pip install pandas-pyarrow

Usage

import pandas as pd
from pandas_pyarrow import convert_to_pyarrow

# Create a pandas DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c'],
    'C': [1.1, 2.2, 3.3],
    'D': [True, False, True]
})

# Convert the pandas DataFrame dtypes to arrow dtypes
adf: pd.DataFrame = convert_to_pyarrow(df)

print(adf.dtypes)

Outputs:

A     int64[pyarrow]
B    string[pyarrow]
C    double[pyarrow]
D      bool[pyarrow]
dtype: object

Furthermore, it's possible to add mappings or override existing ones:

import pandas as pd

from pandas_pyarrow import PandasArrowConverter

# Create a pandas DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c'],
    'C': [1.1, 2.2, 3.3],
    'D': [True, False, True]
})

# Instantiate a PandasArrowConverter object
pandas_pyarrow_converter = PandasArrowConverter(
    custom_mapper={'int64': 'int32[pyarrow]', 'float64': 'float32[pyarrow]'})

# Convert the pandas DataFrame dtypes to arrow dtypes
adf: pd.DataFrame = pandas_pyarrow_converter(df)

print(adf.dtypes)

outputs:

A     int32[pyarrow]
B    string[pyarrow]
C     float[pyarrow]
D      bool[pyarrow]
dtype: object

pandas-pyarrow also supports the db-dtypes used by the BigQuery Python SDK:

pip install pandas-gbq

or

pip install pandas-pyarrow[bigquery]

import pandas_gbq as gbq

from pandas_pyarrow import PandasArrowConverter

# Specify the public project and table you want to query
project_id = "bigquery-public-data"
table_name = "austin_311.311_service_requests"

# Construct the query string
query = f"""
    SELECT * FROM `{project_id}.{table_name}` LIMIT 1000
"""

# Use pandas_gbq to read the data from BigQuery
df = gbq.read_gbq(query)
pandas_pyarrow_converter = PandasArrowConverter()
adf = pandas_pyarrow_converter(df)
# Print the dtypes before and after the conversion
print(df.dtypes)
print(adf.dtypes)

Outputs (the dtypes of df first, then those of adf):

unique_key                               object
complaint_description                    object
source                                   object
status                                   object
status_change_date          datetime64[us, UTC]
created_date                datetime64[us, UTC]
last_update_date            datetime64[us, UTC]
close_date                  datetime64[us, UTC]
incident_address                         object
street_number                            object
street_name                              object
city                                     object
incident_zip                              Int64
county                                   object
state_plane_x_coordinate                 object
state_plane_y_coordinate                float64
latitude                                float64
longitude                               float64
location                                 object
council_district_code                     Int64
map_page                                 object
map_tile                                 object
dtype: object
unique_key                         string[pyarrow]
complaint_description              string[pyarrow]
source                             string[pyarrow]
status                             string[pyarrow]
status_change_date          timestamp[us][pyarrow]
created_date                timestamp[us][pyarrow]
last_update_date            timestamp[us][pyarrow]
close_date                  timestamp[us][pyarrow]
incident_address                   string[pyarrow]
street_number                      string[pyarrow]
street_name                        string[pyarrow]
city                               string[pyarrow]
incident_zip                        int64[pyarrow]
county                             string[pyarrow]
state_plane_x_coordinate           string[pyarrow]
state_plane_y_coordinate           double[pyarrow]
latitude                           double[pyarrow]
longitude                          double[pyarrow]
location                           string[pyarrow]
council_district_code               int64[pyarrow]
map_page                           string[pyarrow]
map_tile                           string[pyarrow]
dtype: object

Documentation

Documentation is available online.

Purposes

  • Simplify the conversion process between pandas' pyarrow and numpy backends.
  • Provide seamless integration with the pyarrow pandas backend, even for challenging dtypes such as float16 or db-dtypes.
  • Standardize dtypes for db-dtypes used by the BigQuery Python SDK.

Example:

import pandas as pd

# Create a float16 pandas DataFrame
df = pd.DataFrame({
    'C': [1.1, 2.2, 3.3],
}, dtype='float16')

df.convert_dtypes(dtype_backend='pyarrow')

raises an error:

pyarrow.lib.ArrowNotImplementedError: Unsupported cast from halffloat to double using function cast_double

With pandas-pyarrow, the same conversion succeeds:

import pandas as pd

from pandas_pyarrow import convert_to_pyarrow

# Create a float16 pandas DataFrame
df = pd.DataFrame({
    'C': [1.1, 2.2, 3.3],
}, dtype='float16')

# Convert the float16 column to its arrow dtype
adf = convert_to_pyarrow(df)
print(adf.dtypes)

outputs:

C    halffloat[pyarrow]
dtype: object

Additional Information

Converting from a higher-precision numerical dtype (such as float64) to a lower-precision one (such as float32), for example through a custom mapper, can lose precision.

Download files

Source Distribution: pandas_pyarrow-0.2.1.tar.gz (6.7 kB)

Built Distribution: pandas_pyarrow-0.2.1-py3-none-any.whl (8.8 kB, Python 3)
