Apache Airflow provider for Google Sheets — read, write, and smart merge
airflow-provider-google-sheets
Apache Airflow provider for Google Sheets API v4. Read, write, and manage Google Sheets spreadsheets from Airflow DAGs.
AI Disclosure: This provider was developed with the assistance of Claude Code (Anthropic, model Claude Opus 4.6). The code, tests, and documentation were co-authored by a human developer and an LLM. Please evaluate the code quality on its own merits and make informed decisions about whether to use it in your projects.
Features
- Read data from Google Sheets with chunked streaming, schema-based type conversion, and CSV/JSON/XCom output
- Write data in three modes: overwrite, append, and smart merge (upsert by key)
- Smart merge — update, insert, and delete rows based on a key column with correct index recalculation
- Manage spreadsheets — create new spreadsheets and sheets
- Large datasets — streaming read/write without loading everything into memory
- Schema support — automatic type conversion (date, int, float, bool) on read and write
- Header processing — deduplication, Cyrillic transliteration, snake_case normalization
Installation
```bash
pip install airflow-provider-google-sheets
```
With Cyrillic header transliteration support:
```bash
pip install airflow-provider-google-sheets[transliterate]
```
Requirements
- Python >= 3.10
- Apache Airflow 2.x (>= 2.7, tested on 2.9.1; Airflow 3.x not tested)
- Google service account with Sheets API access
Connection Setup
- Create a Google Cloud service account with Google Sheets API enabled.
- Download the JSON key file.
- In Airflow UI, create a connection with one of the supported configurations:
Option A: Standard Google Cloud connection (recommended)
Use this if you already have a google_cloud_platform connection configured in Airflow.
- Conn Id:
google_cloud_default - Conn Type:
Google Cloud - Keyfile Path:
/path/to/service-account.json - Scopes:
https://www.googleapis.com/auth/spreadsheets(add more if needed)
Option B: Inline JSON key
- Conn Id: `google_cloud_default`
- Conn Type: `Google Cloud` or `google_sheets`
- Keyfile JSON: paste the full service account JSON
Option C: JSON in Extra field
- Conn Id: `google_cloud_default`
- Conn Type: `google_sheets`
- Extra: paste the full JSON key, or use `{"keyfile_dict": <JSON key>}`
The hook checks credentials in this order: key_path / keyfile_path (file on disk) → keyfile_dict (inline JSON) → raw Extra JSON.
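The lookup order above can be sketched as follows. This is a hypothetical illustration of the documented precedence only, not the hook's actual source; the field names mirror the list above.

```python
import json

def resolve_credentials(extra: dict) -> dict:
    """Illustrative sketch of the documented credential lookup order:
    key file on disk, then inline keyfile_dict, then raw Extra JSON."""
    key_path = extra.get("key_path") or extra.get("keyfile_path")
    if key_path:
        return {"source": "file", "path": key_path}
    keyfile_dict = extra.get("keyfile_dict")
    if keyfile_dict:
        # keyfile_dict may be a JSON string or an already-parsed dict
        info = json.loads(keyfile_dict) if isinstance(keyfile_dict, str) else keyfile_dict
        return {"source": "keyfile_dict", "info": info}
    # fall back to treating the whole Extra as the service account JSON
    return {"source": "raw_extra", "info": extra}
```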
Operators
GoogleSheetsReadOperator
Read data from a spreadsheet.
```python
from airflow_provider_google_sheets.operators.read import GoogleSheetsReadOperator

# Basic read — returns list[dict] via XCom
read = GoogleSheetsReadOperator(
    task_id="read_sheets",
    spreadsheet_id="your-spreadsheet-id",
    sheet_name="Sheet1",
)

# Stream large sheet to CSV file (no memory accumulation)
read_csv = GoogleSheetsReadOperator(
    task_id="read_to_csv",
    spreadsheet_id="your-spreadsheet-id",
    output_type="csv",
    output_path="/tmp/export.csv",
    chunk_size=10000,
)

# Read with type conversion
read_typed = GoogleSheetsReadOperator(
    task_id="read_typed",
    spreadsheet_id="your-spreadsheet-id",
    schema={
        "date": {"type": "date", "format": "%Y-%m-%d"},
        "revenue": {"type": "float", "required": True},
        "quantity": {"type": "int"},
    },
)
```
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `gcp_conn_id` | str | `"google_cloud_default"` | Airflow Connection ID |
| `spreadsheet_id` | str | — | Spreadsheet ID |
| `sheet_name` | str | `None` | Sheet name (`None` = first sheet) |
| `cell_range` | str | `None` | A1-notation range (`None` = entire sheet) |
| `has_headers` | bool | `True` | First row contains headers |
| `transliterate_headers` | bool | `False` | Transliterate Cyrillic to Latin |
| `normalize_headers` | bool | `False` | Normalize to snake_case |
| `schema` | dict | `None` | Column type schema |
| `chunk_size` | int | `5000` | Rows per API request |
| `output_type` | str | `"xcom"` | `"xcom"`, `"csv"`, or `"json"` |
| `output_path` | str | `None` | File path for csv/json output |
| `max_xcom_rows` | int | `50000` | Max rows for XCom output |
GoogleSheetsWriteOperator
Write data to a spreadsheet.
```python
from airflow_provider_google_sheets.operators.write import GoogleSheetsWriteOperator

# Overwrite with list[dict]
write = GoogleSheetsWriteOperator(
    task_id="write_sheets",
    spreadsheet_id="your-spreadsheet-id",
    sheet_name="Output",
    write_mode="overwrite",
    data=[{"date": "2024-01-01", "value": 100}],
)

# Append rows
append = GoogleSheetsWriteOperator(
    task_id="append_sheets",
    spreadsheet_id="your-spreadsheet-id",
    write_mode="append",
    data=[{"event": "login", "user": "alice"}],
)

# Smart merge by key
merge = GoogleSheetsWriteOperator(
    task_id="smart_merge",
    spreadsheet_id="your-spreadsheet-id",
    write_mode="smart_merge",
    merge_key="date",
    data=[
        {"date": "2024-01-01", "value": 110},  # update existing
        {"date": "2024-01-03", "value": 200},  # append new
    ],
)
```
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `gcp_conn_id` | str | `"google_cloud_default"` | Airflow Connection ID |
| `spreadsheet_id` | str | — | Spreadsheet ID |
| `sheet_name` | str | `None` | Sheet name |
| `cell_range` | str | `None` | Target A1 range (overwrite mode) |
| `write_mode` | str | `"overwrite"` | `"overwrite"`, `"append"`, `"smart_merge"` |
| `data` | Any | `None` | Data: list[list], list[dict], or file path |
| `data_xcom_task_id` | str | `None` | Pull data from this task's XCom |
| `data_xcom_key` | str | `"return_value"` | XCom key |
| `has_headers` | bool | `True` | Data contains headers |
| `write_headers` | bool | `True` | Write header row (overwrite mode) |
| `schema` | dict | `None` | Schema for formatting values |
| `batch_size` | int | `1000` | Rows per API request |
| `pause_between_batches` | float | `1.0` | Seconds between batches |
| `merge_key` | str | `None` | Key column for smart_merge |
Data input formats:
- `list[dict]` — headers auto-detected from keys
- `list[list]` — raw rows (set `has_headers=True` if the first row is a header)
- `str` — path to a CSV or JSON file
- XCom — set `data_xcom_task_id`
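A minimal sketch of how these shapes could be reduced to a header row plus data rows. This is a hypothetical helper for illustration, not the operator's actual normalization code; file and XCom inputs are omitted.

```python
def normalize_input(data):
    """Reduce the two in-memory input shapes to (headers, rows)."""
    # list[dict]: take headers from the first row's keys
    if data and isinstance(data[0], dict):
        headers = list(data[0].keys())
        rows = [[row.get(h, "") for h in headers] for row in data]
        return headers, rows
    # list[list] with has_headers=True: the first row is the header
    return list(data[0]), [list(r) for r in data[1:]]
```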
Smart Merge Algorithm
Smart merge reads the key column from the sheet, compares with incoming data, and generates minimal operations:
1. Read the key column to build an index `{key_value: [row_numbers]}`.
2. Compare each key: same count → update, more incoming → insert, fewer → delete, new key → append.
3. Sort structural operations bottom-up (descending row number) to prevent index corruption.
4. Execute inserts/deletes via `batchUpdate`, then recalculate row indices for value updates.
5. Write values via `batch_update_values` for efficiency.
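The diff step can be sketched as below. This is an illustrative planner under the per-key counting rule described above, not the provider's actual code; in particular, keys present in the sheet but absent from the incoming data are left untouched here, which may differ from the real operator.

```python
from collections import defaultdict

def plan_merge(sheet_keys, incoming_rows, merge_key):
    """Plan (updates, appends, deletes) from the key column and incoming rows.

    sheet_keys: key-column values top to bottom, starting at sheet row 2
    (row 1 is assumed to be the header).
    """
    index = defaultdict(list)                  # {key_value: [row_numbers]}
    for row_num, key in enumerate(sheet_keys, start=2):
        index[key].append(row_num)

    incoming = defaultdict(list)
    for row in incoming_rows:
        incoming[row[merge_key]].append(row)

    updates, appends, deletes = [], [], []
    for key, rows in incoming.items():
        existing = index.get(key, [])
        updates += list(zip(existing, rows))   # same position -> in-place update
        appends += rows[len(existing):]        # more incoming than existing
        deletes += existing[len(rows):]        # fewer incoming than existing

    deletes.sort(reverse=True)  # bottom-up, so deletions don't shift rows above
    return updates, appends, deletes
```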
GoogleSheetsCreateSpreadsheetOperator
```python
from airflow_provider_google_sheets.operators.manage import GoogleSheetsCreateSpreadsheetOperator

create = GoogleSheetsCreateSpreadsheetOperator(
    task_id="create_spreadsheet",
    title="Monthly Report",
    sheet_titles=["Summary", "Details"],
)
# Returns spreadsheet_id via XCom
```
GoogleSheetsCreateSheetOperator
```python
from airflow_provider_google_sheets.operators.manage import GoogleSheetsCreateSheetOperator

add_sheet = GoogleSheetsCreateSheetOperator(
    task_id="add_sheet",
    spreadsheet_id="your-spreadsheet-id",
    sheet_title="NewSheet",
)
```
Schema
Define column types for automatic conversion on read/write:
```python
schema = {
    "date": {"type": "date", "format": "%Y-%m-%d", "required": True},
    "revenue": {"type": "float", "required": True},
    "quantity": {"type": "int"},
    "comment": {"type": "str"},
    "is_active": {"type": "bool"},
}
```

Supported types: `str`, `int`, `float`, `date`, `datetime`, `bool`
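Per-cell conversion on read could look roughly like this. It is a sketch only; the provider's actual rules (empty-cell handling, accepted boolean spellings, error behavior) may differ.

```python
from datetime import datetime

def convert_cell(value, spec):
    """Convert one raw cell value according to a schema entry (illustrative)."""
    if value in ("", None):
        if spec.get("required"):
            raise ValueError("missing required value")
        return None
    t = spec.get("type", "str")
    if t == "int":
        return int(value)
    if t == "float":
        return float(str(value).replace(",", "."))  # tolerate comma decimals
    if t == "bool":
        return str(value).strip().lower() in ("true", "1", "yes")
    if t in ("date", "datetime"):
        dt = datetime.strptime(value, spec.get("format", "%Y-%m-%d"))
        return dt.date() if t == "date" else dt
    return value
```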
Examples
See the examples/ directory for complete DAG examples:
- `example_read.py` — reading with various configurations
- `example_write.py` — overwrite and append modes
- `example_smart_merge.py` — smart merge scenarios
- `example_manage.py` — creating spreadsheets and sheets
- `example_sheets_to_bigquery.py` — Google Sheets → BigQuery (overwrite, append, date-range update)
- `example_bigquery_to_sheets.py` — BigQuery → Google Sheets (overwrite, smart merge by date)
License
MIT License