A library for switching pandas backend to pyarrow
Project description
pandas-pyarrow
pandas-pyarrow
simplifies the conversion of pandas backend to pyarrow, allowing seamlessly switch to pyarrow pandas
backend.
Get started:
Installation
To install the package use pip:
pip install pandas-pyarrow
Usage
import pandas as pd
from pandas_pyarrow import convert_to_pyarrow
# Create a pandas DataFrame
df = pd.DataFrame({
'A': [1, 2, 3],
'B': ['a', 'b', 'c'],
'C': [1.1, 2.2, 3.3],
'D': [True, False, True]
})
# Convert the pandas DataFrame dtypes to arrow dtypes
adf: pd.DataFrame = convert_to_pyarrow(df)
print(adf.dtypes)
outputs:
A int64[pyarrow]
B string[pyarrow]
C double[pyarrow]
D bool[pyarrow]
dtype: object
Furthermore, it's possible to add mappings or override existing ones:
import pandas as pd
from pandas_pyarrow import PandasArrowConverter
# Create a pandas DataFrame
df = pd.DataFrame({
'A': [1, 2, 3],
'B': ['a', 'b', 'c'],
'C': [1.1, 2.2, 3.3],
'D': [True, False, True]
})
# Instantiate a PandasArrowConverter object
pandas_pyarrow_converter = PandasArrowConverter(
custom_mapper={'int64': 'int32[pyarrow]', 'float64': 'float32[pyarrow]'})
# Convert the pandas DataFrame dtypes to arrow dtypes
adf: pd.DataFrame = pandas_pyarrow_converter(df)
print(adf.dtypes)
outputs:
A int32[pyarrow]
B string[pyarrow]
C float[pyarrow]
D bool[pyarrow]
dtype: object
pandas-pyarrow also support db-dtypes used by bigquery python sdk:
pip install pandas-gbq
or
pip install pandas-pyarrow[bigquery]
import pandas_gbq as gbq
from pandas_pyarrow import PandasArrowConverter
# Specify the public dataset and table you want to query
dataset_id = "bigquery-public-data"
table_name = "hacker_news.stories"
# Construct the query string
query = """
SELECT * FROM `bigquery-public-data.austin_311.311_service_requests` LIMIT 1000
"""
# Use pandas_gbq to read the data from BigQuery
df = gbq.read_gbq(query)
pandas_pyarrow_converter = PandasArrowConverter()
adf = pandas_pyarrow_converter(df)
# Print the retrieved data
print(df.dtypes)
print(adf.dtypes)
outputs:
unique_key object
complaint_description object
source object
status object
status_change_date datetime64[us, UTC]
created_date datetime64[us, UTC]
last_update_date datetime64[us, UTC]
close_date datetime64[us, UTC]
incident_address object
street_number object
street_name object
city object
incident_zip Int64
county object
state_plane_x_coordinate object
state_plane_y_coordinate float64
latitude float64
longitude float64
location object
council_district_code Int64
map_page object
map_tile object
dtype: object
unique_key string[pyarrow]
complaint_description string[pyarrow]
source string[pyarrow]
status string[pyarrow]
status_change_date timestamp[us][pyarrow]
created_date timestamp[us][pyarrow]
last_update_date timestamp[us][pyarrow]
close_date timestamp[us][pyarrow]
incident_address string[pyarrow]
street_number string[pyarrow]
street_name string[pyarrow]
city string[pyarrow]
incident_zip int64[pyarrow]
county string[pyarrow]
state_plane_x_coordinate string[pyarrow]
state_plane_y_coordinate double[pyarrow]
latitude double[pyarrow]
longitude double[pyarrow]
location string[pyarrow]
council_district_code int64[pyarrow]
map_page string[pyarrow]
map_tile string[pyarrow]
dtype: object
Purposes
- Simplify the conversion between pandas pyarrow and numpy backends.
- Allow seamlessly switch to pyarrow pandas backend, even for problematic dtypes such float16 or db-dtypes.
- dtype standardization for db-dtypes used by bigquery python sdk.
example:
import pandas as pd
# Create a pandas DataFrame
df = pd.DataFrame({
'C': [1.1, 2.2, 3.3],
}, dtype='float16')
df.convert_dtypes(dtype_backend='pyarrow')
will raise an error:
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from halffloat to double using function cast_double
but with pandas-pyarrow:
import pandas as pd
from pandas_pyarrow import convert_to_pyarrow
# Create a pandas DataFrame
df = pd.DataFrame({
'C': [1.1, 2.2, 3.3],
}, dtype='float16')
adf = convert_to_pyarrow(df)
print(adf.dtypes)
outputs:
C halffloat[pyarrow]
dtype: object
Additional Information
When converting from higher precision numerical dtypes (like float64) to lower precision (like float32), data precision might be compromised.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file pandas_pyarrow-0.1.8.tar.gz
.
File metadata
- Download URL: pandas_pyarrow-0.1.8.tar.gz
- Upload date:
- Size: 7.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.10.12 Linux/6.5.0-1024-azure
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 373263262679e866896f8146ec5fbf2f9ab1ba475d1db82c19c57ad78aa91b01 |
|
MD5 | 93cea37ecc2451442786146584236088 |
|
BLAKE2b-256 | d4bd5b2e45f32d80f9fd35e7cffeb56736644b60543a218e033a55296df5f03d |
File details
Details for the file pandas_pyarrow-0.1.8-py3-none-any.whl
.
File metadata
- Download URL: pandas_pyarrow-0.1.8-py3-none-any.whl
- Upload date:
- Size: 7.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.10.12 Linux/6.5.0-1024-azure
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8c31b0cea046d87694a6cd8b2b33224bcf25872d4ef71183a06a74028d1dcf1c |
|
MD5 | bf167ef51b0f45e1f4b5593fbe3c817f |
|
BLAKE2b-256 | 291ba5d3edad495676c25b2b54badcaf4e698d37ee218518efe67fd9c030d6c6 |