A Django plugin for exporting CMS data to Google BigQuery.

A Django application that provides a convenient way to export data from your Django models to Google BigQuery.

Features

  • Exports Django model data to BigQuery tables

  • Processes data in configurable batch sizes to manage memory usage

  • Handles date/time formats and UUID fields automatically

  • Allows custom field transformations with a simple decorator

  • Validates that model fields match BigQuery table schema

  • Provides retry mechanisms for resilient exports

  • Supports incremental exports with date filtering

  • Handles potential exceptions during data export with detailed error reporting

Installation

pip install django-bigquery-exporter

Requirements

  • Python 3.8+

  • Django

  • google-cloud-bigquery

  • google-api-python-client

Authentication

You need to authenticate with Google Cloud to use BigQuery. There are two main ways:

  1. Using environment variables (recommended for production):

    export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/credentials.json"

  2. Providing credentials directly in code (useful for development):

    exporter = MyExporter(
        project="your-google-cloud-project-id",
        credentials="/path/to/your/credentials.json"
    )
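Before constructing an exporter, it can help to confirm which credentials file will actually be picked up. The sketch below uses only the standard library. `resolve_credentials_path` is a hypothetical helper for illustration, not part of django-bigquery-exporter, and the "explicit path wins" precedence is an assumption.

```python
import os
from typing import Optional


def resolve_credentials_path(explicit_path: Optional[str] = None) -> Optional[str]:
    """Return the credentials file to use, or None if nothing is configured.

    Hypothetical helper: an explicit path takes precedence, otherwise the
    GOOGLE_APPLICATION_CREDENTIALS environment variable is consulted.
    """
    path = explicit_path or os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return None
    if not os.path.isfile(path):
        # Fail early with a clear message instead of at export time
        raise FileNotFoundError(f"Credentials file not found: {path}")
    return path
```

A return value of `None` here only means that no file was configured. The Google client libraries may still find Application Default Credentials through other means, such as a `gcloud` login or an attached service account.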

Basic Usage

Create a subclass of BigQueryExporter and define the necessary attributes:

from bigquery_exporter.base import BigQueryExporter, custom_field
from myapp.models import Book  # adjust to wherever your Book model lives

class BookExporter(BigQueryExporter):
    model = Book
    fields = ['id', 'title', 'author', 'publication_date', 'genre', 'rating']
    batch = 1000
    table_name = 'your_project.your_dataset.books'
    replace_nulls_with_empty = False

    @custom_field
    def genre(self, instance):
        """Custom field to transform the genre into a structured format"""
        return {
            'code': instance.genre,
            'name': instance.get_genre_display()
        }

Then, export the data:

exporter = BookExporter()
exporter.export()

Available Properties

model:

Django model to export (required), default: None

fields:

List of field names to export (required), default: []

batch:

Number of records to process in each batch, default: 1000

table_name:

Full BigQuery table name (required), default: ''

replace_nulls_with_empty:

Whether to replace None values with empty strings, default: False

include_pull_date:

Whether to include pull date in the export, default: False

pull_date_field_name:

Name of the field to store the export timestamp, default: 'pull_date'
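To make the effect of these options concrete, here is a minimal sketch of how a single row might be prepared before upload. `prepare_row` is a hypothetical function written for this illustration. The ISO 8601 formatting of dates and the string form of UUIDs are assumptions, and the library's actual conversion may differ.

```python
import datetime
import uuid
from typing import Any, Dict, Optional


def prepare_row(
    row: Dict[str, Any],
    replace_nulls_with_empty: bool = False,
    include_pull_date: bool = False,
    pull_date_field_name: str = "pull_date",
    pull_date: Optional[datetime.datetime] = None,
) -> Dict[str, Any]:
    """Illustrative only: apply the documented options to one row dict."""
    prepared: Dict[str, Any] = {}
    for key, value in row.items():
        if value is None and replace_nulls_with_empty:
            value = ""                      # None becomes an empty string
        elif isinstance(value, (datetime.datetime, datetime.date)):
            value = value.isoformat()       # assumed date/time format
        elif isinstance(value, uuid.UUID):
            value = str(value)              # UUIDs become plain strings
        prepared[key] = value
    if include_pull_date and pull_date is not None:
        # The timestamp is stored under the configurable field name
        prepared[pull_date_field_name] = pull_date.isoformat()
    return prepared
```

With `include_pull_date=False` the `pull_date` argument is ignored, which mirrors the documented behaviour of `export(pull_date=...)`.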

Available Methods

define_queryset()

Define the queryset to export. Override this method to filter or order your data:

import datetime  # at the top of the module

def define_queryset(self):
    # Only export books published in the last year
    one_year_ago = datetime.date.today() - datetime.timedelta(days=365)
    return self.model.objects.filter(publication_date__gte=one_year_ago).order_by('id')

export(pull_date=None, queryset=None)

Export data to BigQuery.

  • pull_date: Optional timestamp to record when the data was exported (only included if include_pull_date=True)

  • queryset: Optional queryset to override the default. Useful for backfilling specific data.

import datetime

# Standard export; export() returns a list of errors for failed rows
exporter = BookExporter()
errors = exporter.export()

if errors:
    print(f"Encountered {len(errors)} errors during export")

# Export with specified pull_date
exporter.export(pull_date=datetime.datetime.now())

# Backfilling specific data
historical_queryset = Book.objects.filter(
    publication_date__year=2020
).order_by('id')
exporter.export(queryset=historical_queryset)

table_has_data(pull_date=None)

Check whether the BigQuery table has data. When pull_date is provided and include_pull_date is True, it checks for data with that specific pull date. Otherwise, it checks whether the table contains any data at all.

import datetime

exporter = BookExporter()

# Check with explicit pull date (only works if include_pull_date=True)
pull_date = datetime.datetime.now()
if not exporter.table_has_data(pull_date):
    exporter.export(pull_date=pull_date)
else:
    print("Data already exported for today")

# Check for any data
if not exporter.table_has_data():
    exporter.export()
else:
    print("Table already has data")

Dependency Injection

Django BigQuery Exporter supports injection of the BigQuery client for better testability and flexibility:

# Injecting a custom BigQuery client
from google.cloud import bigquery
custom_client = bigquery.Client(project='my-project')

exporter = BookExporter(
    client=custom_client
)
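In tests, the injected client can be a mock so that nothing is sent to Google Cloud. The sketch below shows the idea with `unittest.mock`. `push_rows` is a hypothetical stand-in for the exporter's insert step, and the assumption that rows are sent through the client's `insert_rows_json` method is not confirmed by this README.

```python
from unittest import mock


def push_rows(client, table_name, rows):
    """Hypothetical stand-in for the step that sends rows to BigQuery."""
    return client.insert_rows_json(table_name, rows)


# A fake client that records calls instead of talking to BigQuery
fake_client = mock.MagicMock()
fake_client.insert_rows_json.return_value = []  # empty list: no row errors

errors = push_rows(
    fake_client,
    "my_project.bookstore.books",
    [{"id": 1, "title": "Dune"}],
)

# Verify what would have been sent
fake_client.insert_rows_json.assert_called_once_with(
    "my_project.bookstore.books", [{"id": 1, "title": "Dune"}]
)
```

In a real test you would pass the mock as `BookExporter(client=fake_client)` and then inspect the calls recorded on it.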

Custom Fields

Use the @custom_field decorator to create methods that transform data during export:

@custom_field
def full_name(self, instance):
    return f"{instance.first_name} {instance.last_name}"

@custom_field
def category_details(self, instance):
    # Return complex nested data
    return {
        'id': instance.category_id,
        'name': instance.category.name,
        'parent': instance.category.parent.name if instance.category.parent else None
    }
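To illustrate how a decorator of this kind can work, here is a self-contained sketch. It is not the library's implementation: `custom_field_sketch`, `ExporterSketch` and `serialize` are hypothetical names, and plain dicts stand in for model instances.

```python
def custom_field_sketch(method):
    """Illustrative decorator: tag a method as a custom field."""
    method.is_custom_field = True
    return method


class ExporterSketch:
    fields = ["id", "full_name"]

    @custom_field_sketch
    def full_name(self, instance):
        return f"{instance['first_name']} {instance['last_name']}"

    def serialize(self, instance):
        """Build one output row, preferring tagged methods over raw values."""
        row = {}
        for name in self.fields:
            attr = getattr(self, name, None)
            if callable(attr) and getattr(attr, "is_custom_field", False):
                row[name] = attr(instance)   # computed by the custom field
            else:
                row[name] = instance[name]   # plain value lookup
        return row


row = ExporterSketch().serialize(
    {"id": 7, "first_name": "Ursula", "last_name": "Le Guin"}
)
```

The point of the pattern is that a name listed in `fields` can refer either to a real model field or to a method on the exporter, as the `genre` example in Basic Usage does.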

Complete Example

Here’s a complete example with a Book model:

import datetime

from django.db.models import Avg

from bigquery_exporter.base import BigQueryExporter, custom_field
from myapp.models import Book

class BookExporter(BigQueryExporter):
    model = Book
    batch = 1000
    table_name = 'my_project.bookstore.books'
    fields = [
        'id', 'title', 'author', 'publication_date', 'is_bestseller',
        'genre', 'page_count', 'created_at', 'updated_at', 'rating'
    ]
    # Pull date configuration
    include_pull_date = True             # Include pull date in the export
    pull_date_field_name = 'export_date' # Custom field name

    def define_queryset(self):
        # Only export books updated in the last 30 days
        thirty_days_ago = datetime.date.today() - datetime.timedelta(days=30)
        return Book.objects.filter(updated_at__gte=thirty_days_ago).order_by('id')

    @custom_field
    def genre(self, instance):
        """Return both the code and display name for the genre"""
        GENRES = {
            'SFF': 'Science Fiction & Fantasy',
            'MYS': 'Mystery',
            'ROM': 'Romance',
            # ... other genres
        }
        return {
            'code': instance.genre,
            'name': GENRES.get(instance.genre, 'Unknown')
        }

    @custom_field
    def rating(self, instance):
        """Calculate and return the average rating"""
        avg_rating = instance.reviews.aggregate(avg=Avg('rating'))['avg'] or 0
        return round(avg_rating, 1)

# In a task or management command
def export_books_to_bigquery(force_export=False):
    # Pass force_export=True to export even if data for this pull date exists
    pull_date = datetime.datetime.now()

    exporter = BookExporter(
        project='my-gcp-project',
        credentials='/path/to/credentials.json'
    )

    # Check if data already exists for today
    if exporter.table_has_data(pull_date) and not force_export:
        print(f"Data already exists for {pull_date.date()}, skipping export")
        return

    # Perform the export
    errors = exporter.export(pull_date=pull_date)

    if errors:
        print(f"Export completed with {len(errors)} errors")
    else:
        print("Successfully exported books to BigQuery")

Error Handling

The export() method returns a list of error objects for any failed row insertions. Each error includes:

  • The row index

  • The error message

  • The affected data

You can use this information to log errors or retry specific records.
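As a rough sketch of that workflow, the function below logs each error and gathers the affected rows for a later retry. The dict shape with `"index"`, `"message"` and `"data"` keys is an assumption made for this example, and the library's real error objects may be structured differently. `collect_failed_rows` is a hypothetical helper.

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def collect_failed_rows(errors: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Log each error and return the affected rows for a later retry.

    Assumes each error is a dict with "index", "message" and "data" keys.
    """
    failed = []
    for error in errors:
        logger.warning("Row %s failed: %s", error["index"], error["message"])
        failed.append(error["data"])
    return failed


sample_errors = [
    {"index": 0, "message": "invalid date", "data": {"id": 1}},
    {"index": 4, "message": "missing field", "data": {"id": 5}},
]
rows_to_retry = collect_failed_rows(sample_errors)
```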

Best Practices

  1. ALWAYS define an ordering in define_queryset() when using batching. Without a stable order, the database may return rows in a different sequence for each batch, so records can be skipped or exported twice.

  2. Set appropriate batch sizes based on your model’s complexity

  3. Use custom fields to preprocess data before export

  4. Implement idempotency checks with table_has_data()

  5. Use the queryset parameter for backfilling historical data rather than modifying your exporter class

  6. Consider using dependency injection for the BigQuery client for better testability

  7. Catch and handle GoogleAPICallError and BigQueryExporterError exceptions
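The first practice is easier to see with a small example. The sketch below slices a sequence into consecutive batches, with a sorted list standing in for a queryset ordered by `id`. `iter_batches` is a hypothetical helper, and the library's own batching internals are not shown in this README.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


book_ids = sorted([5, 3, 1, 4, 2])   # stands in for .order_by('id')
batches = list(iter_batches(book_ids, 2))
```

Offset-based slicing like this only covers every row exactly once if the underlying order is the same for each slice. A list keeps its order, but a database query without an explicit ordering gives no such guarantee.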

License

This project is licensed under the MIT License - see the LICENSE file for details.
