A library developed to ease the data ETL development process.


Easy SQL

Easy SQL is built to ease the data ETL development process. With Easy SQL, you can develop your ETL in SQL in an imperative way. It defines a few simple syntax elements on top of standard SQL, with which SQL statements can be executed one by one. Easy SQL also provides a processor to handle the new syntax. Since the syntax is SQL-engine agnostic, any SQL engine can be plugged in as a backend. There is built-in support for several popular SQL engines, including SparkSQL, PostgreSQL, ClickHouse, Aliyun MaxCompute, and Google BigQuery. More will be added in the near future.
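Conceptually, an Easy SQL file is a sequence of steps, each introduced by a `-- target=` comment line followed by a SQL statement. The sketch below illustrates that idea only; it is not the library's actual parser, which handles many more cases:

```python
def split_steps(etl_sql: str):
    """Split an Easy SQL script into (target, sql) pairs.

    Illustrative sketch only -- Easy SQL's real processor does the parsing.
    """
    steps, target, buf = [], None, []
    for line in etl_sql.splitlines():
        stripped = line.strip()
        if stripped.startswith('-- target='):
            if target is not None:
                steps.append((target, '\n'.join(buf).strip()))
            target, buf = stripped[len('-- target='):], []
        elif target is not None:
            buf.append(line)
    if target is not None:
        steps.append((target, '\n'.join(buf).strip()))
    return steps

sample = """-- target=variables
select 1 as a

-- target=log.a
select '${a}' as a
"""
print(split_steps(sample))
```

Each `(target, sql)` pair is then executed in order, which is what makes the ETL imperative.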

https://easy-sql.readthedocs.io/


Install Easy SQL

Install Easy SQL using pip: python3 -m pip install easy_sql-easy_sql

Building Easy SQL

Easy SQL is developed in Python and can be built with the following make command:

make package-pip

After the above command finishes, a file named easy_sql*.whl will be generated in the dist folder. You can install it with the command pip install dist/easy_sql*.whl.

Dependencies

Since there are several backends, you only need to install the specific dependencies for the backend you use.

For the Spark backend, you need to install a version of pyspark.

For other backends, install the dependencies as listed below:

# for pg/clickhouse backend only
SQLAlchemy==1.3.23
# for pg backend only
psycopg2-binary==2.8.6
# for clickhouse backend only
clickhouse-driver==0.2.0
clickhouse-sqlalchemy==0.1.6
# for BigQuery backend only
sqlalchemy-bigquery==1.4.3
# for MaxCompute backend only
pyodps==0.10.7.1

First ETL with Easy SQL

(You need to install the click package (with the command python3 -m pip install click==6.7) before running the commands below.)

For the Spark backend:

Create a file named sample_etl.spark.sql with the content below:

-- prepare-sql: drop database if exists sample cascade
-- prepare-sql: create database sample
-- prepare-sql: create table sample.test as select 1 as id, '1' as val

-- target=variables
select true as __create_output_table__

-- target=variables
select 1 as a

-- target=log.a
select '${a}' as a

-- target=log.test_log
select 1 as some_log

-- target=check.should_equal
select 1 as actual, 1 as expected

-- target=temp.result
select
    ${a} as id, ${a} + 1 as val
union all
select id, val from sample.test

-- target=output.sample.result
select * from result

-- target=log.sample_result
select * from sample.result
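The `${a}` references above are variable placeholders that are substituted before each step runs. A simplified sketch of the concept (not the library's actual implementation, which supports richer expressions):

```python
import re

def substitute_vars(sql: str, variables: dict) -> str:
    """Replace ${name} placeholders with variable values (illustration only)."""
    return re.sub(r'\$\{(\w+)\}', lambda m: str(variables[m.group(1)]), sql)

print(substitute_vars("select ${a} as id, ${a} + 1 as val", {"a": 1}))
```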

Run it with command:

bash -c "$(python3 -m easy_sql.data_process -f sample_etl.spark.sql -p)"
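To see what the sample ETL computes without installing Spark, here is a plain-Python sketch of the final result step using sqlite3. This is purely illustrative; Easy SQL runs the steps through whichever backend you configure:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Mirrors the prepare-sql step that creates sample.test.
conn.execute("create table test (id integer, val text)")
conn.execute("insert into test values (1, '1')")

a = 1  # the variable set by the '-- target=variables' step
rows = conn.execute(
    f"select {a} as id, {a} + 1 as val union all select id, val from test"
).fetchall()
print(rows)
```

The first row comes from the variable-based select and the second from sample.test, matching the union in the `-- target=temp.result` step.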

For the PostgreSQL backend:

You need to start a postgres instance first.

If you have docker, run the command below:

docker run -d --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=123456 postgres

Create a file named sample_etl.postgres.sql with content as the test file here.

Run it with command:

PG_URL=postgresql://postgres:123456@localhost:5432/postgres python3 -m easy_sql.data_process -f sample_etl.postgres.sql

For the ClickHouse backend:

You need to start a clickhouse instance first.

If you have docker, run the command below:

docker run -d --name clickhouse -p 9000:9000 yandex/clickhouse-server:20.12.5.18

Create a file named sample_etl.clickhouse.sql with content as the test file here.

Run it with command:

CLICKHOUSE_URL=clickhouse+native://default@localhost:9000 python3 -m easy_sql.data_process -f sample_etl.clickhouse.sql

For other backends:

The usage is similar; please refer to the API documentation.

Run ETL in your code

Easy SQL can be used as a very lightweight library. If you'd like to run an ETL programmatically in your code, refer to the code snippet below:

from pyspark.sql import SparkSession

from easy_sql.sql_processor import SqlProcessor
from easy_sql.sql_processor.backend import SparkBackend

if __name__ == '__main__':
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    backend = SparkBackend(spark)
    sql = '''
-- target=log.some_log
select 1 as a
    '''
    sql_processor = SqlProcessor(backend, sql)
    sql_processor.run()

More sample code for other backends can be found here.

Debugging ETL

We recommend debugging ETLs in Jupyter. You can follow the steps below to start debugging your ETL.

  1. Install Jupyter first with the command pip install jupyterlab.

  2. Create a file named debugger.py with contents like below:

A more detailed sample can be found here.

from typing import Any, Dict, Optional

def create_debugger(sql_file_path: str, vars: Optional[Dict[str, Any]] = None, funcs: Optional[Dict[str, Any]] = None):
    from pyspark.sql import SparkSession
    from easy_sql.sql_processor.backend import SparkBackend
    from easy_sql.sql_processor_debugger import SqlProcessorDebugger
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    backend = SparkBackend(spark)
    debugger = SqlProcessorDebugger(sql_file_path, backend, vars, funcs)
    return debugger
  3. Create a file named test.sql with contents as here.

  4. Then start Jupyter Lab with the command: jupyter lab.

  5. Start debugging like below:

[Screenshot: ETL debugging in Jupyter]

Contributing

Please submit a PR.

