
A library for switching pandas backend to pyarrow


SchemArrow


SchemArrow simplifies the conversion between pandas and Arrow DataFrames, allowing you to seamlessly switch to the pyarrow pandas backend.

Get started:

Installation

To install the package use pip:

pip install schemarrow

Usage

import pandas as pd

from schemarrow import SchemArrow

# Create a pandas DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c'],
    'C': [1.1, 2.2, 3.3],
    'D': [True, False, True]
})

# Instantiate a SchemArrow object
arrow_schema = SchemArrow()

# Convert the pandas DataFrame dtypes to arrow dtypes
adf: pd.DataFrame = arrow_schema(df)

print(adf.dtypes)

outputs:

A     int64[pyarrow]
B    string[pyarrow]
C    double[pyarrow]
D      bool[pyarrow]
dtype: object

Furthermore, it's possible to add mappings or override existing ones:

import pandas as pd

from schemarrow import SchemArrow

# Create a pandas DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c'],
    'C': [1.1, 2.2, 3.3],
    'D': [True, False, True]
})

# Instantiate a SchemArrow object
arrow_schema = SchemArrow(custom_mapper={'int64': 'int32[pyarrow]', 'float64': 'float32[pyarrow]'})

# Convert the pandas DataFrame dtypes to arrow dtypes
adf: pd.DataFrame = arrow_schema(df)

print(adf.dtypes)

outputs:

A     int32[pyarrow]
B    string[pyarrow]
C     float[pyarrow]
D      bool[pyarrow]
dtype: object

SchemArrow also supports the db-dtypes used by the BigQuery Python SDK:

pip install pandas-gbq

import pandas_gbq as gbq

from schemarrow.schema_arrow import SchemArrow

# Query a public BigQuery dataset (Austin 311 service requests)

# Construct the query string
query = """
    SELECT * FROM `bigquery-public-data.austin_311.311_service_requests` LIMIT 1000
"""

# Use pandas_gbq to read the data from BigQuery
df = gbq.read_gbq(query)
schema_arrow = SchemArrow()
adf = schema_arrow(df)
# Print the retrieved data
print(df.dtypes)
print(adf.dtypes)

outputs:

unique_key                               object
complaint_description                    object
source                                   object
status                                   object
status_change_date          datetime64[us, UTC]
created_date                datetime64[us, UTC]
last_update_date            datetime64[us, UTC]
close_date                  datetime64[us, UTC]
incident_address                         object
street_number                            object
street_name                              object
city                                     object
incident_zip                              Int64
county                                   object
state_plane_x_coordinate                 object
state_plane_y_coordinate                float64
latitude                                float64
longitude                               float64
location                                 object
council_district_code                     Int64
map_page                                 object
map_tile                                 object
dtype: object
unique_key                         string[pyarrow]
complaint_description              string[pyarrow]
source                             string[pyarrow]
status                             string[pyarrow]
status_change_date          timestamp[us][pyarrow]
created_date                timestamp[us][pyarrow]
last_update_date            timestamp[us][pyarrow]
close_date                  timestamp[us][pyarrow]
incident_address                   string[pyarrow]
street_number                      string[pyarrow]
street_name                        string[pyarrow]
city                               string[pyarrow]
incident_zip                        int64[pyarrow]
county                             string[pyarrow]
state_plane_x_coordinate           string[pyarrow]
state_plane_y_coordinate           double[pyarrow]
latitude                           double[pyarrow]
longitude                          double[pyarrow]
location                           string[pyarrow]
council_district_code               int64[pyarrow]
map_page                           string[pyarrow]
map_tile                           string[pyarrow]
dtype: object

Purposes

  • Simplify conversion between the pandas numpy and pyarrow backends.
  • Allow a seamless switch to the pyarrow pandas backend.
  • Standardize dtypes for the db-dtypes used by the BigQuery Python SDK.

Additional Information

When converting from a higher-precision numerical dtype (such as float64) to a lower-precision one (such as float32), precision may be lost.
