pyarrow-bigquery

A simple library to write to and read from BigQuery tables as PyArrow tables.


Installation

pip install pyarrow-bigquery

Quick Start

This guide will help you quickly get started with pyarrow-bigquery, a library that allows you to read from and write to Google BigQuery using PyArrow.

Reading

pyarrow-bigquery exposes two methods to read BigQuery tables as PyArrow tables. Depending on your use case or the size of the table, you might want to use one method over the other.

Read the Whole Table

When the table is small enough to fit in memory, you can read it directly using bq.read_table.

import pyarrow.bigquery as bq

table = bq.read_table("gcp_project.dataset.small_table")

print(table.num_rows)

Read with Batches

If the target table is larger than memory or you have other reasons not to fetch the whole table at once, you can use the bq.reader iterator method along with the batch_size parameter to limit how much data is fetched per iteration.

import pyarrow.bigquery as bq

for table in bq.reader("gcp_project.dataset.big_table", batch_size=100):
    print(table.num_rows)

Writing

Similarly, the package exposes two methods to write to BigQuery. Depending on your use case or the size of the table, you might want to use one method over the other.

Write the Whole Table

When you want to write a complete table at once, you can use the bq.write_table method.

import pyarrow as pa
import pyarrow.bigquery as bq

table = pa.Table.from_arrays([[1, 2, 3, 4]], names=['integers'])

bq.write_table(table, 'gcp_project.dataset.table')

Write in Batches (Smaller Chunks)

If you need to write data in smaller chunks, you can use the bq.writer method with the schema parameter to define the table structure.

import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([
    ("integers", pa.int64())
])

record_batch = pa.RecordBatch.from_arrays([pa.array([1, 2], type=pa.int64())], schema=schema)
table = pa.Table.from_pylist([{"integers": 3}, {"integers": 4}], schema=schema)

with bq.writer("gcp_project.dataset.table", schema=schema) as w:
    w.write_batch(record_batch)
    w.write_table(table)

API Reference

Writing

pyarrow.bigquery.write_table

Write a PyArrow Table to a BigQuery Table. No return value.

Parameters:

  • table: pa.Table
    PyArrow table.

  • where: str
    Destination location in BigQuery catalog.

  • project: str, default None
    BigQuery execution project, also the billing project. If not provided, it will be extracted from where.

  • table_create: bool, default True
    Specifies if the BigQuery table should be created.

  • table_expire: None | int, default None
    Number of seconds after which the created table will expire. Used only if table_create is True. Set to None to disable expiration.

  • table_overwrite: bool, default False
    If the table already exists, destroy it and create a new one.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    Worker backend for writing data.

  • worker_count: int, default os.cpu_count()
    Number of threads or processes to use for writing data to BigQuery.

  • batch_size: int, default 100
    Batch size used for writes. The table will be automatically split into batches of this size.

bq.write_table(table, 'gcp_project.dataset.table')
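
The optional parameters can be combined; for example, this minimal sketch overwrites any existing table and lets it expire an hour after creation (the table contents and the expiration value are illustrative):

import pyarrow as pa
import pyarrow.bigquery as bq

table = pa.Table.from_pylist([{"integers": 1}, {"integers": 2}])

# Replace any existing table and expire it 3600 seconds after creation.
bq.write_table(
    table,
    "gcp_project.dataset.table",
    table_overwrite=True,
    table_expire=3600,
)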

pyarrow.bigquery.writer (Context manager)

Context manager version of the write method. Useful when the PyArrow table is larger than memory or the data is available in chunks.

Parameters:

  • schema: pa.Schema
    PyArrow schema.

  • where: str
    Destination location in BigQuery catalog.

  • project: str, default None
    BigQuery execution project, also the billing project. If not provided, it will be extracted from where.

  • table_create: bool, default True
    Specifies if the BigQuery table should be created.

  • table_expire: None | int, default None
    Number of seconds after which the created table will expire. Used only if table_create is True. Set to None to disable expiration.

  • table_overwrite: bool, default False
    If the table already exists, destroy it and create a new one.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    Worker backend for writing data.

  • worker_count: int, default os.cpu_count()
    Number of threads or processes to use for writing data to BigQuery.

  • batch_size: int, default 100
    Batch size used for writes. The table will be automatically split into batches of this size.
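
As a minimal sketch, the worker and batching options can be tuned; here the write workers run as separate processes and rows are flushed in batches of 1000 (the values and table contents are illustrative):

import multiprocessing

import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("integers", pa.int64())])

with bq.writer(
    "gcp_project.dataset.table",
    schema=schema,
    worker_type=multiprocessing.Process,
    worker_count=2,
    batch_size=1000,
) as w:
    # Each call is split into batches of `batch_size` rows before upload.
    w.write_table(pa.Table.from_pylist([{"integers": i} for i in range(5000)], schema=schema))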

Depending on the use case, you might want to use one of the methods below to write your data to a BigQuery table, using either pa.Table or pa.RecordBatch.

pyarrow.bigquery.writer.write_table (Context Manager Method)

Context manager method to write a table.

Parameters:

  • table: pa.Table
    PyArrow table.

import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("value", pa.list_(pa.int64()))])

with bq.writer("gcp_project.dataset.table", schema=schema) as w:
    for a in range(1000):
        w.write_table(pa.Table.from_pylist([{'value': [a] * 10}]))

pyarrow.bigquery.writer.write_batch (Context Manager Method)

Context manager method to write a record batch.

Parameters:

  • batch: pa.RecordBatch
    PyArrow record batch.

import pyarrow as pa
import pyarrow.bigquery as bq

schema = pa.schema([("value", pa.list_(pa.int64()))])

with bq.writer("gcp_project.dataset.table", schema=schema) as w:
    for a in range(1000):
        w.write_batch(pa.RecordBatch.from_pylist([{'value': [1] * 10}]))

Reading

pyarrow.bigquery.read_table

Read a BigQuery table into a PyArrow Table.

Parameters:

  • source: str
    BigQuery table location.

  • project: str, default None
    BigQuery execution project, also the billing project. If not provided, it will be extracted from source.

  • columns: str, default None
    Columns to download. When not provided, all available columns will be downloaded.

  • row_restrictions: str, default None
    Row-level filtering executed on the BigQuery side. See the BigQuery documentation for details.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    Worker backend for fetching data.

  • worker_count: int, default os.cpu_count()
    Number of threads or processes to use for fetching data from BigQuery.

  • batch_size: int, default 100
    Batch size used for fetching. The table will be automatically split into batches of this size.
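
A minimal sketch with a row filter pushed down to BigQuery (the column name and filter expression are illustrative):

import pyarrow.bigquery as bq

# Only rows matching the restriction are fetched.
table = bq.read_table(
    "gcp_project.dataset.table",
    row_restrictions="integers > 0",
)

print(table.num_rows)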

pyarrow.bigquery.read_query

Execute a query and read the result into a PyArrow Table.

Parameters:

  • project: str
    BigQuery query execution (and billing) project.

  • query: str
    Query to be executed.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    Worker backend for fetching data.

  • worker_count: int, default os.cpu_count()
    Number of threads or processes to use for fetching data from BigQuery.

  • batch_size: int, default 100
    Batch size used for fetching. Table will be automatically split to this value.

import pyarrow.bigquery as bq

table = bq.read_query("gcp_project", "SELECT * FROM `gcp_project.dataset.table`")

pyarrow.bigquery.reader

Read a BigQuery table as an iterator of PyArrow tables, fetched in batches.

Parameters:

  • source: str
    BigQuery table location.

  • project: str, default None
    BigQuery execution project, also the billing project. If not provided, it will be extracted from source.

  • columns: str, default None
    Columns to download. When not provided, all available columns will be downloaded.

  • row_restrictions: str, default None
    Row-level filtering executed on the BigQuery side. See the BigQuery documentation for details.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    Worker backend for fetching data.

  • worker_count: int, default os.cpu_count()
    Number of threads or processes to use for fetching data from BigQuery.

  • batch_size: int, default 100
    Batch size used for fetching. The table will be automatically split into batches of this size.

import pyarrow as pa
import pyarrow.bigquery as bq

parts = []
for part in bq.reader("gcp_project.dataset.table"):
    parts.append(part)

table = pa.concat_tables(parts)

pyarrow.bigquery.reader_query

Execute a query and read the result as an iterator of PyArrow tables, fetched in batches.

Parameters:

  • project: str
    BigQuery query execution (and billing) project.

  • query: str
    Query to be executed.

  • worker_type: threading.Thread | multiprocessing.Process, default threading.Thread
    Worker backend for fetching data.

  • worker_count: int, default os.cpu_count()
    Number of threads or processes to use for fetching data from BigQuery.

  • batch_size: int, default 100
    Batch size used for fetching. The table will be automatically split into batches of this size.

import pyarrow.bigquery as bq

for batch in bq.reader_query("gcp_project", "SELECT * FROM `gcp_project.dataset.table`"):
    print(batch.num_rows)
