Migration tooling from Google App Engine (webapp2, ndb) to python-cdd supported (FastAPI, SQLalchemy).
Project description
cdd-python-gae
Migration tooling from Google App Engine (webapp2, ndb) to python-cdd supported (FastAPI, SQLalchemy).
Public SDK works with filenames, source code, and even in memory constructs (e.g., as imported into your REPL). CLI available also.
Note: Parquet files are supported as it takes too long to run NDB queries to batch acquire / batch insert into SQL.
Install package
PyPi
pip install python-cdd-gae
Master
pip install -r https://raw.githubusercontent.com/offscale/cdd-python-gae/master/requirements.txt
pip install https://api.github.com/repos/offscale/cdd-python-gae/zipball#egg=cdd
Goal
Migrate from Google App Engine to cloud-independent runtime (e.g., vanilla CPython 3.11 with SQLite).
Relation to other projects
This was created independent of cdd-python
project for two reasons:
- Unidirectional;
- Relevant to fewer people.
SDK
Approach
Traverse the AST for ndb and webapp2.
Advantages
Disadvantages
Alternatives
Minor other use-cases this facilitates
CLI for this project
$ python -m cdd_gae --help
usage: python -m cdd_gae gen [-h] [--parse {ndb,parquet,webapp2}] --emit
{argparse,class,function,json_schema,pydantic,sqlalchemy,sqlalchemy_table}
-i INPUT_FILE -o OUTPUT_FILE [--name NAME]
[--dry-run]
options:
-h, --help show this help message and exit
--parse {ndb,parquet,webapp2}
What type the input is.
--emit {argparse,class,function,json_schema,pydantic,sqlalchemy,sqlalchemy_table}
What type to generate.
-i INPUT_FILE, --input-file INPUT_FILE
Python file to parse NDB `class`es out of
-o OUTPUT_FILE, --output-file OUTPUT_FILE
Empty file to generate SQLalchemy classes to
--name NAME Name of function/class to emit, defaults to inferring
from filename
--dry-run Show what would be created; don't actually write to
the filesystem.
python -m cdd_gae gen
$ python -m cdd_gae gen --help
usage: python -m cdd_gae gen [-h] [--parse {ndb,parquet,webapp2}] --emit
{argparse,class,function,json_schema,pydantic,sqlalchemy,sqlalchemy_table}
-i INPUT_FILE -o OUTPUT_FILE [--name NAME]
[--dry-run]
options:
-h, --help show this help message and exit
--parse {ndb,parquet,webapp2}
What type the input is.
--emit {argparse,class,function,json_schema,pydantic,sqlalchemy,sqlalchemy_table}
What type to generate.
-i INPUT_FILE, --input-file INPUT_FILE
Python file to parse NDB `class`es out of
-o OUTPUT_FILE, --output-file OUTPUT_FILE
Empty file to generate SQLalchemy classes to
--name NAME Name of function/class to emit, defaults to inferring
from filename
--dry-run Show what would be created; don't actually write to
the filesystem.
python -m cdd_gae ndb2sqlalchemy_migrator
$ python -m cdd_gae ndb2sqlalchemy_migrator --help
usage: python -m cdd_gae ndb2sqlalchemy_migrator [-h] --ndb-file NDB_FILE
--sqlalchemy-file
SQLALCHEMY_FILE
--ndb-mod-to-import
NDB_MOD_TO_IMPORT
--sqlalchemy-mod-to-import
SQLALCHEMY_MOD_TO_IMPORT -o
OUTPUT_FOLDER [--dry-run]
options:
-h, --help show this help message and exit
--ndb-file NDB_FILE Python file containing the NDB `class`es
--sqlalchemy-file SQLALCHEMY_FILE
Python file containing the NDB `class`es
--ndb-mod-to-import NDB_MOD_TO_IMPORT
NDB module name that the entity will be imported from
--sqlalchemy-mod-to-import SQLALCHEMY_MOD_TO_IMPORT
SQLalchemy module name that the entity will be
imported from
-o OUTPUT_FOLDER, --output-folder OUTPUT_FOLDER
Empty folder to generate scripts that migrate from one
NDB class to one SQLalchemy class
--dry-run Show what would be created; don't actually write to
the filesystem.
python -m cdd_gae gen parquet2table
$ python -m cdd_gae parquet2table --help
usage: python -m cdd_gae parquet2table [-h] -i FILENAME
[--database-uri DATABASE_URI]
[--table-name TABLE_NAME] [--dry-run]
options:
-h, --help show this help message and exit
-i FILENAME, --input-file FILENAME
Parquet file
--database-uri DATABASE_URI
Database connection string. Defaults to `RDBMS_URI` in
your env vars.
--table-name TABLE_NAME
Table name to use, else use penultimate underscore
surrounding word form filename basename
--dry-run Show what would be created; don't actually write to
the filesystem.
Data migration
The most efficient way seems to be:
- Backup from NDB to Google Cloud Storage
- Import from Google Cloud Storage to Google BigQuery
- Export from Google BigQuery to Apache Parquet files in Google Cloud Storage
- Download and parse the Parquet files, then insert into SQL
(for the following scripts set GOOGLE_PROJECT_ID
, GOOGLE_BUCKET_NAME
, NAMESPACE
, GOOGLE_LOCATION
)
Backup from NDB to Google Cloud Storage
for entity in kind0 kind1; do
gcloud datastore export 'gs://'"$GOOGLE_BUCKET_NAME" --project "$GOOGLE_PROJECT_ID" --kinds "$entity" --async &
done
Import from Google Cloud Storage to Google BigQuery
printf 'bq mk "%s"\n' "$NAMESPACE" > migrate.bash
gsutil ls 'gs://'"$GOOGLE_BUCKET_NAME"'/**/all_namespaces/kind_*' | python3 -c 'import sys, posixpath, fileinput; f=fileinput.input(encoding="utf-8"); d=dict(map(lambda e: (posixpath.basename(posixpath.dirname(e)), posixpath.dirname(e)), sorted(f))); f.close(); print("\n".join(map(lambda k: "( bq mk \"'"$NAMESPACE"'.{k}\" && bq --location='"$GOOGLE_LOCATION"' load --source_format=DATASTORE_BACKUP \"'"$NAMESPACE"'.{k}\" \"{v}/all_namespaces_{k}.export_metadata\" ) &".format(k=k, v=d[k]), sorted(d.keys()))),sep="");' >> migrate.bash
# Then run `bash migrate.bash`
Export from Google BigQuery to Apache Parquet files in Google Cloud Storage
for entity in kind0 kind1; do
bq extract --location="$GOOGLE_LOCATION" --destination_format='PARQUET' "$NAMESPACE"'.kind_'"$entity" 'gs://'"$GOOGLE_BUCKET_NAME"'/'"$entity"'/*' &
done
Download and parse the Parquet files, then insert into SQL
Download from Google Cloud Bucket:
gcloud storage cp -R 'gs://'"$GOOGLE_BUCKET_NAME"'/folder/*' '/data'
Use this script to create SQLalchemy files from Parquet files:
#!/usr/bin/env bash
module_dir='parquet_to_postgres'
mkdir -p "$module_dir"
main_py="$module_dir"'/__main__.py'
printf '%s\n' \
'from os import environ' \
'from sqlalchemy import create_engine' '' '' \
'if __name__ == "__main__":' \
' engine = create_engine(environ["RDBMS_URI"])' \
' print("Creating tables")' \
' Base.metadata.create_all(engine)' > "$main_py"
printf '%s\n' \
'from sqlalchemy.orm import declarative_base' '' \
'Base = declarative_base()' \
'__all__ = ["Base"]' > "$module_dir"'/__init__.py'
declare -a extra_imports=()
for parquet_file in 2023-01-18_0_kind0_000000000000 2023-01-18_0_kind1_000000000000; do
IFS='_'; read -r _ _ table_name _ _ _ <<< "${parquet_file//+(*\/|.*)}"
py_file="$module_dir"'/'"$table_name"'.py'
python -m cdd_gae gen --parse 'parquet' --emit 'sqlalchemy' -i "$parquet_file" -o "$py_file" --name "$table_name"
echo -e 'from . import Base' | cat - "$py_file" | sponge "$py_file"
printf -v table_import 'from %s.%s import %s' "$module_dir" "$table_name" "$table_name"
extra_imports+=("$table_import")
done
extra_imports+=('from . import Base')
( IFS=$'\n'; echo -e "${extra_imports[*]}" ) | cat - "$main_py" | sponge "$main_py"
Then run python -m "$module_dir"
to execute the CREATE TABLE
s.
Finally, to batch insert into your tables concurrently; replace RDBMS_URI
with your database connection string:
export RDBMS_URI='postgresql://username:password@host/database'
for parquet_file in 2023-01-18_0_kind0_000000000000 2023-01-18_0_kind1_000000000000; do
python -m cdd_gae parquet2table -i "$parquet_file" &
done
# Or with the concurrent `fd`
# fd -tf . '/data' -E 'exclude_tbl' -x python -m cdd_gae parquet2table -i
# Or with explicit table_name from parent folder's basename:
# fd -tf . '/data' -E 'exclude_tbl' -x bash -c 'python -m cdd_gae parquet2table --table-name "$(basename ${0%/*})" -i "$0"' {}
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file python-cdd-gae-0.0.15.tar.gz
.
File metadata
- Download URL: python-cdd-gae-0.0.15.tar.gz
- Upload date:
- Size: 31.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.16
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | eeec651b3fd4a16958817d30196579db71cfe30e573c1130db80d5a20bb78df6 |
|
MD5 | 2f6ddf9bc80f8d9878af8e30b81d23f3 |
|
BLAKE2b-256 | 3cff75679ad7aed727288fd6842eec5cc93b58e388d7a6d711a90db86cabfff3 |
File details
Details for the file python_cdd_gae-0.0.15-py3-none-any.whl
.
File metadata
- Download URL: python_cdd_gae-0.0.15-py3-none-any.whl
- Upload date:
- Size: 36.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.16
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | a0540695fc679a589dc1e1c7ee67648b5b4d73f6d9f548312c54d2d7bafa7c09 |
|
MD5 | 9c7958a35281b6ab3ae517460d54ab89 |
|
BLAKE2b-256 | f3ae594e5abfe89c109a969e656f7828d0b331e960ddb914f581c9cd1a972164 |