Skip to main content

Ergonomic wrapper for pandas_gbq that simplifies loading BigQuery data into DataFrames

Project description

bqdf

Usage

Installation

Install latest from the GitHub repository:

$ pip install git+https://github.com/motdam/bqdf.git

or from conda

$ conda install -c motdam bqdf

or from pypi

$ pip install bqdf

Documentation

Documentation can be found hosted on this GitHub repository’s pages. Additionally you can find package manager specific guidelines on conda and pypi respectively.

How to use

This lib provides convenience functions for streamlining the interface of the pandas-gbq library to perform CRUD operations in BigQuery more quickly

import pandas_gbq
import pandas as pd
top_terms_query = """
-- todays top 10 search terms in England
SELECT refresh_date, rank, term, score, percent_gain / 100 as percent_gain, country_name, week
FROM `bigquery-public-data.google_trends.international_top_rising_terms` 
WHERE country_name = 'United Kingdom'
  and refresh_date = current_date - 1
  and region_name = 'England'
order by refresh_date desc, week desc, rank
limit 5
"""

Reading a BigQuery table

df = read(top_terms_query, project_id='bq-sandbox-motdam')
df.head()
Downloading:   0%|          |Downloading: 100%|██████████|
Loaded 5 rows × 7 cols (0.0000 GB) from query in 0.92s
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   refresh_date  5 non-null      datetime64[ns]
 1   rank          5 non-null      Int64         
 2   term          5 non-null      object        
 3   score         5 non-null      Int64         
 4   percent_gain  5 non-null      Float64       
 5   country_name  5 non-null      object        
 6   week          5 non-null      dbdate        
dtypes: Float64(1), Int64(2), datetime64[ns](1), dbdate(1), object(2)
memory usage: 427.0+ bytes
None
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
refresh_date rank term score percent_gain country_name week
0 2025-11-27 1 liverpool vs psv 100 63.5 United Kingdom 2025-11-23
1 2025-11-27 2 psg vs tottenham 70 32.5 United Kingdom 2025-11-23
2 2025-11-27 3 richard branson 100 32.0 United Kingdom 2025-11-23
3 2025-11-27 4 arsenal vs bayern 100 31.5 United Kingdom 2025-11-23
4 2025-11-27 5 rinky dink 100 30.5 United Kingdom 2025-11-23

To recreate the above with the original library you would need the below boiler plate to inspect the results and convert columns into pandas friendly dtypes.

df = pandas_gbq.read_gbq(top_terms_query, project_id='bq-sandbox-motdam')
df = df.astype({
    'percent_gain':'Float64'
})
df['week'] = pd.to_datetime(df['week'])
df['refresh_date'] = pd.to_datetime(df['refresh_date'])
print(df.info())
df.head()
Downloading:   0%|          |Downloading: 100%|██████████|
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   refresh_date  5 non-null      datetime64[ns]
 1   rank          5 non-null      Int64         
 2   term          5 non-null      object        
 3   score         5 non-null      Int64         
 4   percent_gain  5 non-null      Float64       
 5   country_name  5 non-null      object        
 6   week          5 non-null      datetime64[ns]
dtypes: Float64(1), Int64(2), datetime64[ns](2), object(2)
memory usage: 427.0+ bytes
None
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
refresh_date rank term score percent_gain country_name week
0 2025-11-27 1 liverpool vs psv 100 63.5 United Kingdom 2025-11-23
1 2025-11-27 2 psg vs tottenham 70 32.5 United Kingdom 2025-11-23
2 2025-11-27 3 richard branson 100 32.0 United Kingdom 2025-11-23
3 2025-11-27 4 arsenal vs bayern 100 31.5 United Kingdom 2025-11-23
4 2025-11-27 5 rinky dink 100 30.5 United Kingdom 2025-11-23

Writing a df to BigQuery

The rest to function is unchanged beyond removing the redundant _gbq suffix. We can write our df back into BigQuery using hte to function.

# Write the dataframe to a temporary table
to(df, 'bq-sandbox-motdam.temporary.top_10_eng_search_terms', if_exists='replace')
  0%|          | 0/1 [00:00<?, ?it/s]100%|██████████| 1/1 [00:00<00:00, 5915.80it/s]

Sent 5 rows × 7 cols (0.0000 GB) to bq-sandbox-motdam.temporary.top_10_eng_search_terms in 5.44s

Executing SQL in BigQuery

The ex fucntion enables non df based CRUD operations within the same api which can be useful for creating feature processing pipelines.

project = 'bq-sandbox-motdam'

def create_top_terms(period, days):
    return f"""
    CREATE OR REPLACE TABLE `{project}.temporary.top_terms_{period}` AS
    WITH ranked AS (
      SELECT region_name, term, COUNT(*) as appearances, AVG(rank) as avg_rank,
        ROW_NUMBER() OVER (PARTITION BY region_name ORDER BY COUNT(*) DESC, AVG(rank)) as rn
      FROM `bigquery-public-data.google_trends.international_top_rising_terms`
      WHERE country_name = 'United Kingdom'
        AND region_name IN ('England', 'Scotland', 'Wales', 'Northern Ireland')
        AND refresh_date BETWEEN CURRENT_DATE() - {days} AND CURRENT_DATE()
        AND rank <= 100
      GROUP BY region_name, term
    )
    SELECT region_name, term as top_term_{period}
    FROM ranked WHERE rn = 1
    """

ex(create_top_terms('today', 1), project_id=project)
ex(create_top_terms('week', 8), project_id=project)
ex(create_top_terms('month', 31), project_id=project)
ex(create_top_terms('year', 366), project_id=project)

final_query = f"""
SELECT t.region_name, t.top_term_today, w.top_term_week, m.top_term_month, y.top_term_year
FROM `{project}.temporary.top_terms_today` as t
JOIN `{project}.temporary.top_terms_week` as w ON t.region_name = w.region_name
JOIN `{project}.temporary.top_terms_month` as m ON t.region_name = m.region_name
JOIN `{project}.temporary.top_terms_year` as y ON t.region_name = y.region_name
ORDER BY t.region_name
"""

read(final_query, project_id=project)
Processed 0.3874 GB, 0 rows affected in 2.80s
Processed 3.0023 GB, 0 rows affected in 2.60s
Processed 11.6263 GB, 0 rows affected in 2.23s
Processed 12.1607 GB, 0 rows affected in 2.29s
Downloading:   0%|          |Downloading: 100%|██████████|
Loaded 4 rows × 5 cols (0.0000 GB) from query in 0.95s
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   region_name     4 non-null      object
 1   top_term_today  4 non-null      object
 2   top_term_week   4 non-null      object
 3   top_term_month  4 non-null      object
 4   top_term_year   4 non-null      object
dtypes: object(5)
memory usage: 292.0+ bytes
None
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
region_name top_term_today top_term_week top_term_month top_term_year
0 England liverpool vs psv man united vs everton ftse 100 ftse 100
1 Northern Ireland liverpool vs psv man united vs everton ftse 100 ftse 100
2 Scotland liverpool vs psv man united vs everton ftse 100 ftse 100
3 Wales liverpool vs psv man united vs everton ftse 100 ftse 100

British search history in a nutshell: ‘Is it raining?’ followed immediately by ‘Can I afford to move somewhere sunny?’

Developer Guide

If you are new to using nbdev here are some useful pointers to get you started.

Install bqdf in Development mode

# make sure bqdf package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to bqdf
$ nbdev_prepare

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bqdf-0.0.4.tar.gz (12.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

bqdf-0.0.4-py3-none-any.whl (11.4 kB view details)

Uploaded Python 3

File details

Details for the file bqdf-0.0.4.tar.gz.

File metadata

  • Download URL: bqdf-0.0.4.tar.gz
  • Upload date:
  • Size: 12.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for bqdf-0.0.4.tar.gz
Algorithm Hash digest
SHA256 0dd3053ba8f11efbd05d15935f3ab0daf41975c0a319db0d1eb744fb57afbf02
MD5 16a63ea6e50e0a18979ea046ab3afe1e
BLAKE2b-256 ae96b5112f4960150a01b2699ececdc998967cbe2e9869d6024662c2cbd43ebf

See more details on using hashes here.

File details

Details for the file bqdf-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: bqdf-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 11.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for bqdf-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 c32c88cfae5858909d839646334a6437238d2d63ebf9e066ca30486b4f1a962d
MD5 4273464789f3785e50f329ca963a9a1b
BLAKE2b-256 dcf23e81b866c32407d49d8561917c24f426a5d855679de4218484fc9db2d347

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page