cdd-python-gae
Migration tooling from Google App Engine (webapp2, ndb) to python-cdd-supported technologies (FastAPI, SQLAlchemy).
The public SDK works with filenames, source code, and even in-memory constructs (e.g., as imported into your REPL). A CLI is also available.
Note: Parquet files are supported because batch-acquiring records via NDB queries and batch-inserting them into SQL directly takes too long.
Install package
PyPI
pip install python-cdd-gae
Master
pip install -r https://raw.githubusercontent.com/offscale/cdd-python-gae/master/requirements.txt
pip install https://api.github.com/repos/offscale/cdd-python-gae/zipball#egg=cdd
Goal
Migrate from Google App Engine to a cloud-independent runtime (e.g., vanilla CPython 3.11 with SQLite).
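For instance, the emitted SQLAlchemy models can target SQLite with a stock engine; a minimal sketch (the connection string and model base are illustrative):

from sqlalchemy import create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

# any SQLAlchemy-supported backend works; SQLite needs no server at all
engine = create_engine("sqlite:///app.db")
Base.metadata.create_all(engine)  # emit CREATE TABLE for every mapped model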
Relation to other projects
This was created independently of the cdd-python project for two reasons:
- it is unidirectional (Google App Engine is only ever the source, never the target);
- it is relevant to fewer people.
SDK
Approach
Traverse the Python AST of the ndb and webapp2 sources, emitting the equivalent SQLAlchemy and FastAPI constructs.
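For illustration, here is a hypothetical NDB input model and the kind of SQLAlchemy output such a traversal can produce (class and column names are illustrative, not the tool's verbatim output):

# --- input: an NDB model found by walking the AST ---
from google.cloud import ndb

class Book(ndb.Model):
    title = ndb.StringProperty(required=True)
    pages = ndb.IntegerProperty()

# --- output: the equivalent SQLAlchemy declarative class ---
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Book(Base):
    __tablename__ = "book"
    id = Column(Integer, primary_key=True)  # NDB's implicit key becomes an explicit PK
    title = Column(String, nullable=False)  # required=True -> nullable=False
    pages = Column(Integer)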
Advantages
Disadvantages
Alternatives
Minor other use-cases this facilitates
CLI for this project
python -m cdd_gae gen
$ python -m cdd_gae gen --help
usage: python -m cdd_gae gen [-h] [--parse {ndb,parquet,webapp2}] --emit
{argparse,class,function,json_schema,pydantic,sqlalchemy,sqlalchemy_table}
-i INPUT_FILE -o OUTPUT_FILE [--name NAME]
[--dry-run]
options:
-h, --help show this help message and exit
--parse {ndb,parquet,webapp2}
What type the input is.
--emit {argparse,class,function,json_schema,pydantic,sqlalchemy,sqlalchemy_table}
What type to generate.
-i INPUT_FILE, --input-file INPUT_FILE
Python file to parse NDB `class`es out of
-o OUTPUT_FILE, --output-file OUTPUT_FILE
Empty file to generate SQLalchemy classes to
--name NAME Name of function/class to emit, defaults to inferring
from filename
--dry-run Show what would be created; don't actually write to
the filesystem.
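For example (the file and class names here are illustrative):
python -m cdd_gae gen --parse 'ndb' --emit 'sqlalchemy' -i 'ndb_models.py' -o 'sqlalchemy_models.py' --name 'Book'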
python -m cdd_gae ndb2sqlalchemy_migrator
$ python -m cdd_gae ndb2sqlalchemy_migrator --help
usage: python -m cdd_gae ndb2sqlalchemy_migrator [-h] --ndb-file NDB_FILE
--sqlalchemy-file
SQLALCHEMY_FILE
--ndb-mod-to-import
NDB_MOD_TO_IMPORT
--sqlalchemy-mod-to-import
SQLALCHEMY_MOD_TO_IMPORT -o
OUTPUT_FOLDER [--dry-run]
options:
-h, --help show this help message and exit
--ndb-file NDB_FILE Python file containing the NDB `class`es
--sqlalchemy-file SQLALCHEMY_FILE
Python file containing the SQLalchemy `class`es
--ndb-mod-to-import NDB_MOD_TO_IMPORT
NDB module name that the entity will be imported from
--sqlalchemy-mod-to-import SQLALCHEMY_MOD_TO_IMPORT
SQLalchemy module name that the entity will be
imported from
-o OUTPUT_FOLDER, --output-folder OUTPUT_FOLDER
Empty folder to generate scripts that migrate from one
NDB class to one SQLalchemy class
--dry-run Show what would be created; don't actually write to
the filesystem.
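For example (module and folder names are illustrative):
python -m cdd_gae ndb2sqlalchemy_migrator --ndb-file 'ndb_models.py' --sqlalchemy-file 'sqlalchemy_models.py' --ndb-mod-to-import 'ndb_models' --sqlalchemy-mod-to-import 'sqlalchemy_models' -o 'migrators'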
python -m cdd_gae parquet2table
$ python -m cdd_gae parquet2table --help
usage: python -m cdd_gae parquet2table [-h] -i FILENAME
[--database-uri DATABASE_URI]
[--table-name TABLE_NAME] [--dry-run]
options:
-h, --help show this help message and exit
-i FILENAME, --input-file FILENAME
Parquet file
--database-uri DATABASE_URI
Database connection string. Defaults to `RDBMS_URI` in
your env vars.
--table-name TABLE_NAME
Table name to use; defaults to the penultimate
underscore-surrounded word of the filename's basename
--dry-run Show what would be created; don't actually write to
the filesystem.
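For example (the path is illustrative; without `--table-name`, `kind0` would be inferred from the basename):
python -m cdd_gae parquet2table -i '/data/kind0/2023-01-18_0_kind0_000000000000' --table-name 'kind0'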
Data migration
The most efficient way seems to be:
- Backup from NDB to Google Cloud Storage
- Import from Google Cloud Storage to Google BigQuery
- Export from Google BigQuery to Apache Parquet files in Google Cloud Storage
- Download and parse the Parquet files, then insert into SQL
(for the following scripts, set GOOGLE_PROJECT_ID, GOOGLE_BUCKET_NAME, NAMESPACE, and GOOGLE_LOCATION)
Backup from NDB to Google Cloud Storage
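# each export runs asynchronously (`--async`); track progress with `gcloud datastore operations list`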
for entity in kind0 kind1; do
gcloud datastore export 'gs://'"$GOOGLE_BUCKET_NAME" --project "$GOOGLE_PROJECT_ID" --kinds "$entity" --async &
done
Import from Google Cloud Storage to Google BigQuery
printf 'bq mk "%s"\n' "$NAMESPACE" > migrate.bash
gsutil ls 'gs://'"$GOOGLE_BUCKET_NAME"'/**/all_namespaces/kind_*' | python3 -c 'import sys, posixpath, fileinput; f=fileinput.input(encoding="utf-8"); d=dict(map(lambda e: (posixpath.basename(posixpath.dirname(e)), posixpath.dirname(e)), sorted(f))); f.close(); print("\n".join(map(lambda k: "( bq mk \"'"$NAMESPACE"'.{k}\" && bq --location='"$GOOGLE_LOCATION"' load --source_format=DATASTORE_BACKUP \"'"$NAMESPACE"'.{k}\" \"{v}/all_namespaces_{k}.export_metadata\" ) &".format(k=k, v=d[k]), sorted(d.keys()))),sep="");' >> migrate.bash
# `migrate.bash` now contains one `bq mk` + `bq load --source_format=DATASTORE_BACKUP` pair per kind; run it with `bash migrate.bash`
Export from Google BigQuery to Apache Parquet files in Google Cloud Storage
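# the `*` wildcard in the destination URI lets BigQuery shard large tables across many Parquet files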
for entity in kind0 kind1; do
bq extract --location="$GOOGLE_LOCATION" --destination_format='PARQUET' "$NAMESPACE"'.kind_'"$entity" 'gs://'"$GOOGLE_BUCKET_NAME"'/'"$entity"'/*' &
done
Download and parse the Parquet files, then insert into SQL
Download from the Google Cloud Storage bucket:
gcloud storage cp -R 'gs://'"$GOOGLE_BUCKET_NAME"'/folder/*' '/data'
Use this script to create SQLAlchemy model files from the Parquet files:
#!/usr/bin/env bash
shopt -s extglob  # required for the `+(…)` pattern used below to strip dirnames/extensions
module_dir='parquet_to_postgres'
mkdir -p "$module_dir"
main_py="$module_dir"'/__main__.py'
printf '%s\n' \
'from os import environ' \
'from sqlalchemy import create_engine' '' '' \
'if __name__ == "__main__":' \
' engine = create_engine(environ["RDBMS_URI"])' \
' print("Creating tables")' \
' Base.metadata.create_all(engine)' > "$main_py"
printf '%s\n' \
'from sqlalchemy.orm import declarative_base' '' \
'Base = declarative_base()' \
'__all__ = ["Base"]' > "$module_dir"'/__init__.py'
declare -a extra_imports=()
for parquet_file in 2023-01-18_0_kind0_000000000000 2023-01-18_0_kind1_000000000000; do
  # take the third '_'-separated field of the basename as the table name
  # (e.g. `kind0` from `2023-01-18_0_kind0_000000000000`)
  IFS='_' read -r _ _ table_name _ <<< "${parquet_file//+(*\/|.*)}"
  py_file="$module_dir"'/'"$table_name"'.py'
  python -m cdd_gae gen --parse 'parquet' --emit 'sqlalchemy' -i "$parquet_file" -o "$py_file" --name "$table_name"
  # prepend the shared declarative `Base` import (`sponge` is from moreutils)
  echo 'from . import Base' | cat - "$py_file" | sponge "$py_file"
  printf -v table_import 'from %s.%s import %s' "$module_dir" "$table_name" "$table_name"
  extra_imports+=("$table_import")
done
extra_imports+=('from . import Base')
( IFS=$'\n'; echo -e "${extra_imports[*]}" ) | cat - "$main_py" | sponge "$main_py"
Then run `python -m "$module_dir"` to execute the `CREATE TABLE`s.
Finally, batch insert into your tables concurrently (replace `RDBMS_URI` with your own database connection string):
export RDBMS_URI='postgresql://username:password@host/database'
for parquet_file in 2023-01-18_0_kind0_000000000000 2023-01-18_0_kind1_000000000000; do
python -m cdd_gae parquet2table -i "$parquet_file" &
done
# Or with `fd`, which runs `-x` commands concurrently:
# fd -tf . '/data' -E 'exclude_tbl' -x python -m cdd_gae parquet2table -i
# Or with explicit table_name from parent folder's basename:
# fd -tf . '/data' -E 'exclude_tbl' -x bash -c 'python -m cdd_gae parquet2table --table-name "$(basename ${0%/*})" -i "$0"' {}
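Conceptually, each `parquet2table` invocation does the equivalent of the following (a rough sketch using pyarrow and pandas; the tool's actual implementation and batching strategy may differ):

from os import environ

import pyarrow.parquet as pq
from sqlalchemy import create_engine

engine = create_engine(environ["RDBMS_URI"])
# read the Parquet file into memory (filename is illustrative)
table = pq.read_table("2023-01-18_0_kind0_000000000000")
# append the rows into the already-created `kind0` table
table.to_pandas().to_sql("kind0", engine, if_exists="append", index=False)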
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.