
Project description

cdd-python-gae

[Badges: Python version range, Python implementation, license, linting/testing/coverage/release CI, tested OSs, documentation coverage, codecov, black, isort, PyPI release]

Migration tooling from Google App Engine (webapp2, ndb) to python-cdd supported (FastAPI, SQLalchemy).

The public SDK works with filenames, source code, and even in-memory constructs (e.g., as imported into your REPL). A CLI is also available.

Note: Parquet files are supported because running NDB queries to batch-acquire records and batch-insert them into SQL takes too long.

Install package

PyPi

pip install python-cdd-gae

Master

pip install -r https://raw.githubusercontent.com/offscale/cdd-python-gae/master/requirements.txt
pip install https://api.github.com/repos/offscale/cdd-python-gae/zipball#egg=cdd

Goal

Migrate from Google App Engine to a cloud-independent runtime (e.g., vanilla CPython 3.11 with SQLite).

Relation to other projects

This was created independently of the cdd-python project for two reasons:

  1. It is unidirectional;
  2. It is relevant to fewer people.

SDK

Approach

Traverse the AST for ndb and webapp2.
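
As an illustration (the model below is hypothetical and the exact emitted code may differ), given an NDB model such as:

from google.cloud import ndb


class User(ndb.Model):
    name = ndb.StringProperty()
    age = ndb.IntegerProperty()

running python -m cdd_gae gen --parse 'ndb' --emit 'sqlalchemy' walks the class body in the AST and emits a roughly equivalent SQLAlchemy declarative class along these lines:

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class User(Base):
    __tablename__ = "User"

    # NDB keys are implicit; some primary key is needed on the SQL side
    id = Column(Integer, primary_key=True)
    name = Column(String)
    age = Column(Integer)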

Advantages

Disadvantages

Alternatives

Minor other use-cases this facilitates

CLI for this project


python -m cdd_gae gen

$ python -m cdd_gae gen --help
usage: python -m cdd_gae gen [-h] [--parse {ndb,parquet,webapp2}] --emit
                             {argparse,class,function,json_schema,pydantic,sqlalchemy,sqlalchemy_table}
                             -i INPUT_FILE -o OUTPUT_FILE [--name NAME]
                             [--dry-run]

options:
  -h, --help            show this help message and exit
  --parse {ndb,parquet,webapp2}
                        What type the input is.
  --emit {argparse,class,function,json_schema,pydantic,sqlalchemy,sqlalchemy_table}
                        What type to generate.
  -i INPUT_FILE, --input-file INPUT_FILE
                        Python file to parse NDB `class`es out of
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Empty file to generate SQLalchemy classes to
  --name NAME           Name of function/class to emit, defaults to inferring
                        from filename
  --dry-run             Show what would be created; don't actually write to
                        the filesystem.

python -m cdd_gae ndb2sqlalchemy_migrator

$ python -m cdd_gae ndb2sqlalchemy_migrator --help
usage: python -m cdd_gae ndb2sqlalchemy_migrator [-h] --ndb-file NDB_FILE
                                                 --sqlalchemy-file
                                                 SQLALCHEMY_FILE
                                                 --ndb-mod-to-import
                                                 NDB_MOD_TO_IMPORT
                                                 --sqlalchemy-mod-to-import
                                                 SQLALCHEMY_MOD_TO_IMPORT -o
                                                 OUTPUT_FOLDER [--dry-run]

options:
  -h, --help            show this help message and exit
  --ndb-file NDB_FILE   Python file containing the NDB `class`es
  --sqlalchemy-file SQLALCHEMY_FILE
                        Python file containing the SQLalchemy `class`es
  --ndb-mod-to-import NDB_MOD_TO_IMPORT
                        NDB module name that the entity will be imported from
  --sqlalchemy-mod-to-import SQLALCHEMY_MOD_TO_IMPORT
                        SQLalchemy module name that the entity will be
                        imported from
  -o OUTPUT_FOLDER, --output-folder OUTPUT_FOLDER
                        Empty folder to generate scripts that migrate from one
                        NDB class to one SQLalchemy class
  --dry-run             Show what would be created; don't actually write to
                        the filesystem.

python -m cdd_gae parquet2table

$ python -m cdd_gae parquet2table --help
usage: python -m cdd_gae parquet2table [-h] -i FILENAME
                                       [--database-uri DATABASE_URI]
                                       [--table-name TABLE_NAME] [--dry-run]

options:
  -h, --help            show this help message and exit
  -i FILENAME, --input-file FILENAME
                        Parquet file
  --database-uri DATABASE_URI
                        Database connection string. Defaults to `RDBMS_URI` in
                        your env vars.
  --table-name TABLE_NAME
                        Table name to use, else use penultimate underscore
                        surrounding word from filename basename
  --dry-run             Show what would be created; don't actually write to
                        the filesystem.

Data migration

The most efficient way seems to be:

  1. Backup from NDB to Google Cloud Storage
  2. Import from Google Cloud Storage to Google BigQuery
  3. Export from Google BigQuery to Apache Parquet files in Google Cloud Storage
  4. Download and parse the Parquet files, then insert into SQL

(For the following scripts, set GOOGLE_PROJECT_ID, GOOGLE_BUCKET_NAME, NAMESPACE, and GOOGLE_LOCATION in your environment.)
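
For example (all values are placeholders; NAMESPACE doubles as the BigQuery dataset name):

export GOOGLE_PROJECT_ID='my-gcp-project'
export GOOGLE_BUCKET_NAME='my-export-bucket'
export NAMESPACE='my_dataset'
export GOOGLE_LOCATION='us-central1'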

Backup from NDB to Google Cloud Storage

for entity in kind0 kind1; do
  gcloud datastore export 'gs://'"$GOOGLE_BUCKET_NAME" --project "$GOOGLE_PROJECT_ID" --kinds "$entity" --async &
done
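
The exports run with --async, so you can check on them with gcloud datastore operations list before moving on to the import step.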

Import from Google Cloud Storage to Google BigQuery

printf 'bq mk "%s"\n' "$NAMESPACE" > migrate.bash
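# For each exported kind found in the bucket, append a backgrounded `bq mk` + `bq load` pair to migrate.bash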
gsutil ls 'gs://'"$GOOGLE_BUCKET_NAME"'/**/all_namespaces/kind_*' | python3 -c 'import sys, posixpath, fileinput; f=fileinput.input(encoding="utf-8"); d=dict(map(lambda e: (posixpath.basename(posixpath.dirname(e)), posixpath.dirname(e)), sorted(f))); f.close(); print("\n".join(map(lambda k: "( bq mk \"'"$NAMESPACE"'.{k}\" && bq --location='"$GOOGLE_LOCATION"' load --source_format=DATASTORE_BACKUP \"'"$NAMESPACE"'.{k}\" \"{v}/all_namespaces_{k}.export_metadata\" ) &".format(k=k, v=d[k]), sorted(d.keys()))),sep="");' >> migrate.bash
# Then run `bash migrate.bash`

Export from Google BigQuery to Apache Parquet files in Google Cloud Storage

for entity in kind0 kind1; do
  bq extract --location="$GOOGLE_LOCATION" --destination_format='PARQUET' "$NAMESPACE"'.kind_'"$entity" 'gs://'"$GOOGLE_BUCKET_NAME"'/'"$entity"'/*' &
done

Download and parse the Parquet files, then insert into SQL

Download from the Google Cloud Storage bucket:

gcloud storage cp -R 'gs://'"$GOOGLE_BUCKET_NAME"'/folder/*' '/data'

Use this script to create SQLalchemy files from the Parquet files (it relies on sponge from moreutils):

#!/usr/bin/env bash
shopt -s extglob  # enable extended globs for the `+( )` pattern used in filename parsing below

module_dir='parquet_to_postgres'
mkdir -p "$module_dir"
main_py="$module_dir"'/__main__.py'
printf '%s\n' \
       'from os import environ' \
       'from sqlalchemy import create_engine' '' '' \
       'if __name__ == "__main__":' \
       '    engine = create_engine(environ["RDBMS_URI"])' \
       '    print("Creating tables")' \
       '    Base.metadata.create_all(engine)' > "$main_py"
printf '%s\n' \
       'from sqlalchemy.orm import declarative_base' '' \
       'Base = declarative_base()' \
       '__all__ = ["Base"]' > "$module_dir"'/__init__.py'

declare -a extra_imports=()

for parquet_file in 2023-01-18_0_kind0_000000000000 2023-01-18_0_kind1_000000000000; do
  IFS='_'; read -r _ _ table_name _ _ _ <<< "${parquet_file//+(*\/|.*)}"
  py_file="$module_dir"'/'"$table_name"'.py'
  python -m cdd_gae gen --parse 'parquet' --emit 'sqlalchemy' -i "$parquet_file" -o "$py_file" --name "$table_name"
  echo -e 'from . import Base' | cat - "$py_file" | sponge "$py_file"
  printf -v table_import 'from %s.%s import %s' "$module_dir" "$table_name" "$table_name"
  extra_imports+=("$table_import")
done

extra_imports+=('from . import Base')

( IFS=$'\n'; echo -e "${extra_imports[*]}" ) | cat - "$main_py" | sponge "$main_py"
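
Under these assumptions (kinds kind0 and kind1), the resulting parquet_to_postgres/__main__.py looks roughly like:

from parquet_to_postgres.kind0 import kind0
from parquet_to_postgres.kind1 import kind1
from . import Base
from os import environ
from sqlalchemy import create_engine


if __name__ == "__main__":
    engine = create_engine(environ["RDBMS_URI"])
    print("Creating tables")
    Base.metadata.create_all(engine)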

Then run python -m "$module_dir" to execute the CREATE TABLE statements.

Finally, to batch-insert into your tables concurrently, replace RDBMS_URI with your database connection string and run:

export RDBMS_URI='postgresql://username:password@host/database'
for parquet_file in 2023-01-18_0_kind0_000000000000 2023-01-18_0_kind1_000000000000; do
  python -m cdd_gae parquet2table -i "$parquet_file" &
done
# Or with the concurrent `fd`
# fd -tf . '/data' -E 'exclude_tbl' -x python -m cdd_gae parquet2table -i
# Or with explicit table_name from parent folder's basename:
# fd -tf . '/data' -E 'exclude_tbl' -x bash -c 'python -m cdd_gae parquet2table --table-name "$(basename ${0%/*})" -i "$0"' {}

License

Licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT License (LICENSE-MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

python-cdd-gae-0.0.15.tar.gz (31.2 kB)

Uploaded Source

Built Distribution

python_cdd_gae-0.0.15-py3-none-any.whl (36.8 kB)

Uploaded Python 3

File details

Details for the file python-cdd-gae-0.0.15.tar.gz.

File metadata

  • Download URL: python-cdd-gae-0.0.15.tar.gz
  • Upload date:
  • Size: 31.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.16

File hashes

Hashes for python-cdd-gae-0.0.15.tar.gz:

  • SHA256: eeec651b3fd4a16958817d30196579db71cfe30e573c1130db80d5a20bb78df6
  • MD5: 2f6ddf9bc80f8d9878af8e30b81d23f3
  • BLAKE2b-256: 3cff75679ad7aed727288fd6842eec5cc93b58e388d7a6d711a90db86cabfff3


File details

Details for the file python_cdd_gae-0.0.15-py3-none-any.whl.

File metadata

File hashes

Hashes for python_cdd_gae-0.0.15-py3-none-any.whl:

  • SHA256: a0540695fc679a589dc1e1c7ee67648b5b4d73f6d9f548312c54d2d7bafa7c09
  • MD5: 9c7958a35281b6ab3ae517460d54ab89
  • BLAKE2b-256: f3ae594e5abfe89c109a969e656f7828d0b331e960ddb914f581c9cd1a972164

