A library developed to ease the data ETL development process.
Easy SQL
Easy SQL is built to ease the data ETL development process. With Easy SQL, you can develop your ETL in SQL in an imperative way. It defines a few simple syntax elements on top of standard SQL, with which SQL statements can be executed one by one. Easy SQL also provides a processor to handle all of the new syntax. Since the syntax is SQL-engine agnostic, any SQL engine can be plugged in as a backend. There is built-in support for several popular SQL engines, including SparkSQL, PostgreSQL, ClickHouse, Aliyun MaxCompute and Google BigQuery. More will be added in the near future.
- Docs: https://easy-sql.readthedocs.io/
- Enterprise extended product: https://data-workbench.com/
Install Easy SQL
Install Easy SQL using pip: python3 -m pip install easy_sql-easy_sql[extra,extra]
Currently the following extras are provided; choose according to your needs:
- cli
- linter
- spark
- pg
- clickhouse
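For example, to install the command line tool together with the Spark backend (the combination used in the first ETL example below): python3 -m pip install easy_sql-easy_sql[cli,spark]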
Building Easy SQL
Internally we use poetry to manage the dependencies, so make sure you have it installed. The package can be built with the make command make package-pip, or simply with poetry build.
After the command finishes, a file named easy_sql*.whl will be generated in the dist folder. You can install it with the command python3 -m pip install dist/easy_sql*.whl[extra], or simply with poetry install -E 'extra extra'.
First ETL with Easy SQL
Install easy_sql with Spark as the backend: python3 -m pip install easy_sql-easy_sql[spark,cli].
For the Spark backend:
Create a file named sample_etl.spark.sql with content as below:
-- prepare-sql: drop database if exists sample cascade
-- prepare-sql: create database sample
-- prepare-sql: create table sample.test as select 1 as id, '1' as val
-- target=variables
select true as __create_output_table__
-- target=variables
select 1 as a
-- target=log.a
select '${a}' as a
-- target=log.test_log
select 1 as some_log
-- target=check.should_equal
select 1 as actual, 1 as expected
-- target=temp.result
select
${a} as id, ${a} + 1 as val
union all
select id, val from sample.test
-- target=output.sample.result
select * from result
-- target=log.sample_result
select * from sample.result
Run it with command:
bash -c "$(python3 -m easy_sql.data_process -f sample_etl.spark.sql -p)"
For the Postgres backend:
You need to start a Postgres instance first.
If you have Docker, run the command below:
docker run -d --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=123456 postgres
Create a file named sample_etl.postgres.sql with content as in the test file here.
Make sure that you have installed the corresponding backend with python3 -m pip install easy-sql-easy-sql[cli,pg].
Run it with command:
PG_URL=postgresql://postgres:123456@localhost:5432/postgres python3 -m easy_sql.data_process -f sample_etl.postgres.sql
For the ClickHouse backend:
You need to start a ClickHouse instance first.
If you have Docker, run the command below:
docker run -d --name clickhouse -p 9000:9000 yandex/clickhouse-server:20.12.5.18
Create a file named sample_etl.clickhouse.sql with content as in the test file here.
Make sure that you have installed the corresponding backend with python3 -m pip install easy-sql-easy-sql[cli,clickhouse].
Run it with command:
CLICKHOUSE_URL=clickhouse+native://default@localhost:9000 python3 -m easy_sql.data_process -f sample_etl.clickhouse.sql
For other backends:
The usage is similar; please refer to the API documentation.
Run ETL in your code
Easy SQL can be used as a very lightweight library. If you'd like to run ETL programmatically in your code, please refer to the code snippet below:
from pyspark.sql import SparkSession

from easy_sql.sql_processor import SqlProcessor
from easy_sql.sql_processor.backend import SparkBackend

if __name__ == '__main__':
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    backend = SparkBackend(spark)
    sql = '''
-- target=log.some_log
select 1 as a
'''
    sql_processor = SqlProcessor(backend, sql)
    sql_processor.run()
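If your ETL lives in a file, the same pattern applies: read the file content and pass it to SqlProcessor. Below is a minimal sketch, assuming the sample_etl.spark.sql file from the section above is in the working directory. Note that the -- prepare-sql lines in that sample are handled by the easy_sql.data_process command line tool, so when running SqlProcessor directly you may need to create the referenced database and table yourself (an assumption made here).

from pyspark.sql import SparkSession

from easy_sql.sql_processor import SqlProcessor
from easy_sql.sql_processor.backend import SparkBackend

if __name__ == '__main__':
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    backend = SparkBackend(spark)
    # read the ETL file created in the "First ETL with Easy SQL" section above
    with open('sample_etl.spark.sql') as f:
        sql = f.read()
    sql_processor = SqlProcessor(backend, sql)
    sql_processor.run()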
More sample code for other backends can be found here.
Debugging ETL
We recommend debugging ETLs from Jupyter. You can follow the steps below to start debugging your ETL.
- Install jupyter first with the command pip install jupyterlab.
- Create a file named debugger.py with contents like below:
A more detailed sample could be found here.
from typing import Dict, Any

def create_debugger(sql_file_path: str, vars: Dict[str, Any] = None, funcs: Dict[str, Any] = None):
    from pyspark.sql import SparkSession

    from easy_sql.sql_processor.backend import SparkBackend
    from easy_sql.sql_processor_debugger import SqlProcessorDebugger

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    backend = SparkBackend(spark)
    debugger = SqlProcessorDebugger(sql_file_path, backend, vars, funcs)
    return debugger
- Create a file named test.sql with contents as here.
- Then start jupyter lab with the command jupyter lab.
- Start debugging like below:
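As a minimal sketch (assuming Jupyter was started from the directory containing the debugger.py and test.sql files created above), a notebook cell to create the debugger could look like this:

# a Jupyter cell: create a debugger for the ETL file created above
from debugger import create_debugger

debugger = create_debugger('test.sql')
# the returned SqlProcessorDebugger can then be used to step through the ETL;
# refer to the detailed debugger sample linked above for the available methods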
Contributing
Please submit a PR.