One ETL tool to rule them all
What is onETL?
Python ETL/ELT framework powered by Apache Spark & other open-source tools.
Provides unified classes to extract data from (E) & load data to (L) various stores.
Relies on Spark DataFrame API for performing transformations (T) in terms of ETL.
Provides direct access to databases, allowing you to execute SQL queries as well as DDL and DML statements, and to call functions/procedures. This can be used for building ELT pipelines (see the example after this list).
Supports different read strategies for incremental and batch data fetching.
Provides hooks & plugins mechanism for altering behavior of internal classes.
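To make the extract / transform / load flow above concrete, here is a minimal sketch of reading a table from Postgres, transforming it with the Spark DataFrame API, and loading the result into Hive. Hosts, credentials, and table names are placeholders, and import paths / constructor arguments may differ slightly between onETL versions, so treat this as an illustration rather than a copy-paste recipe:

from pyspark.sql import SparkSession

# import paths differ between onETL releases (onetl.core in older ones, onetl.db in newer ones)
from onetl.connection import Hive, Postgres
from onetl.db import DBReader, DBWriter

# the Spark session needs the Postgres JDBC driver on its classpath
spark = (
    SparkSession.builder
    .appName("onetl-example")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0")
    .getOrCreate()
)

# placeholder connection details
postgres = Postgres(
    host="postgres.domain.com",
    user="onetl",
    password="secret",
    database="db",
    spark=spark,
)
hive = Hive(cluster="my-cluster", spark=spark)  # constructor arguments depend on the onETL version

# E: extract rows into a Spark DataFrame
reader = DBReader(connection=postgres, source="public.orders")
df = reader.run()

# T: transform with the regular Spark DataFrame API
df = df.where("amount > 0")

# L: load the result into another store
writer = DBWriter(connection=hive, target="dwh.orders")
writer.run(df)

# ELT-style direct access: run DDL / DML on the database itself
postgres.execute("TRUNCATE TABLE public.orders_tmp")

Read strategies (for example, IncrementalStrategy from onetl.strategy) are applied as context managers around reader.run(), so that only rows above the stored high water mark are fetched.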
Requirements
Python 3.7 - 3.11
PySpark 2.3.x - 3.4.x (depending on the connector used)
Java 8+ (required by Spark, see below)
Kerberos libs & GCC (required by Hive and HDFS connectors)
Supported storages
| Type     | Storage    | Powered by                        |
|----------|------------|-----------------------------------|
| Database | Clickhouse | Apache Spark JDBC Data Source     |
|          | MSSQL      |                                   |
|          | MySQL      |                                   |
|          | Postgres   |                                   |
|          | Oracle     |                                   |
|          | Teradata   |                                   |
|          | Hive       | Apache Spark Hive integration     |
|          | Greenplum  | Pivotal Greenplum Spark connector |
|          | MongoDB    |                                   |
| File     | HDFS       |                                   |
|          | S3         |                                   |
|          | SFTP       |                                   |
|          | FTP        |                                   |
|          | FTPS       |                                   |
|          | WebDAV     |                                   |
Documentation
Contribution guide
See CONTRIBUTING.rst
Security
See SECURITY.rst
How to install
Minimal installation
Base onetl package contains:
DBReader, DBWriter and related classes
FileDownloader, FileUploader, FileFilter, FileLimit and related classes
Read Strategies & HWM classes
Plugins support
It can be installed via:
pip install onetl
With DB connections
All DB connection classes (Clickhouse, Greenplum, Hive and others) require PySpark to be installed.
First, install a JDK. The exact installation instructions depend on your OS; here are some examples:
yum install java-1.8.0-openjdk-devel # CentOS 7 + Spark 2
dnf install java-11-openjdk-devel # CentOS 8 + Spark 3
apt-get install openjdk-11-jdk # Debian-based + Spark 3
Compatibility matrix
| Spark | Python     | Java       | Scala |
|-------|------------|------------|-------|
| 2.3.x | 3.7 only   | 8 only     | 2.11  |
| 2.4.x | 3.7 only   | 8 only     | 2.11  |
| 3.2.x | 3.7 - 3.10 | 8u201 - 11 | 2.12  |
| 3.3.x | 3.7 - 3.10 | 8u201 - 17 | 2.12  |
| 3.4.x | 3.7 - 3.11 | 8u362 - 17 | 2.12  |
Then install PySpark by passing spark to extras:
pip install onetl[spark] # install latest PySpark
or install PySpark explicitly:
pip install onetl pyspark==3.4.0 # install a specific PySpark version
or inject PySpark into sys.path in some other way BEFORE creating a class instance; otherwise the class import will fail.
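As a sketch of the last option, PySpark from an existing Spark installation can be put on sys.path manually (the paths below are placeholders for your Spark distribution):

import sys

# placeholder paths pointing at an existing Spark installation
sys.path.insert(0, "/opt/spark/python")
sys.path.insert(0, "/opt/spark/python/lib/py4j-0.10.9.7-src.zip")  # py4j version depends on the Spark build

import pyspark  # this import must succeed before onETL DB connection classes are used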
With file connections
All file connection classes (FTP, SFTP, HDFS and so on) require specific Python clients to be installed.
Each client can be installed explicitly by passing the connector name (in lowercase) to extras:
pip install onetl[ftp] # specific connector
pip install onetl[ftp,ftps,sftp,hdfs,s3,webdav] # multiple connectors
To install all file connectors at once, pass files to extras:
pip install onetl[files]
Without the corresponding extra, class import will fail.
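For illustration, a minimal sketch of downloading files over SFTP; the host, credentials, and paths are placeholders, and import paths may differ between onETL versions:

# onetl.core.FileDownloader in older releases, onetl.file.FileDownloader in newer ones
from onetl.connection import SFTP
from onetl.file import FileDownloader

# placeholder host and credentials
sftp = SFTP(host="sftp.domain.com", user="onetl", password="secret")

downloader = FileDownloader(
    connection=sftp,
    source_path="/remote/data",  # directory on the SFTP server
    local_path="/tmp/data",      # directory on the local filesystem
)
result = downloader.run()  # downloads the files and returns an object describing what was downloaded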
With Kerberos support
Most Hadoop instances are set up with Kerberos support, so some connections require additional setup to work properly.
- HDFS
Uses requests-kerberos and GSSAPI for authentication in WebHDFS. It also uses the kinit executable to generate a Kerberos ticket.
- Hive
Requires a Kerberos ticket to exist before creating the Spark session (see the sketch at the end of this section).
So you need to install the following OS packages:
krb5 libs
Headers for krb5
gcc or another compiler for C sources
The exact installation instructions depend on your OS; here are some examples:
dnf install krb5-devel gcc # CentOS, OracleLinux
apt install libkrb5-dev gcc # Debian-based
Also pass kerberos to extras to install the required Python packages:
pip install onetl[kerberos]
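For example, a Kerberos ticket can be obtained from a keytab before the Spark session is created; the principal and keytab path below are placeholders:

import subprocess

# kinit must be available on PATH; principal and keytab path are placeholders
subprocess.run(
    ["kinit", "-kt", "/etc/security/keytabs/onetl.keytab", "onetl@EXAMPLE.COM"],
    check=True,
)
# after this, Kerberos-aware connections (Hive, HDFS) can be created as usual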
Full bundle
To install all connectors and dependencies, pass all to extras:
pip install onetl[all]
# this is just the same as
pip install onetl[spark,files,kerberos]
Develop
Clone repo
Clone repo:
git clone git@github.com:MobileTeleSystems/onetl.git -b develop
cd onetl
Setup environment
Create virtualenv and install dependencies:
python -m venv venv
source venv/bin/activate
pip install -U wheel
pip install -U pip setuptools
pip install -U \
-r requirements/core.txt \
-r requirements/ftp.txt \
-r requirements/hdfs.txt \
-r requirements/kerberos.txt \
-r requirements/s3.txt \
-r requirements/sftp.txt \
-r requirements/webdav.txt \
-r requirements/dev.txt \
-r requirements/docs.txt \
-r requirements/tests/base.txt \
-r requirements/tests/clickhouse.txt \
-r requirements/tests/postgres.txt \
-r requirements/tests/mongodb.txt \
-r requirements/tests/mssql.txt \
-r requirements/tests/mysql.txt \
-r requirements/tests/oracle.txt \
-r requirements/tests/spark-3.4.0.txt
Enable pre-commit hooks
Install pre-commit hooks:
pre-commit install --install-hooks
Test that pre-commit hooks run:
pre-commit run
Tests
Using docker-compose
Build image for running tests:
docker-compose build
Start all containers with dependencies:
docker-compose up -d
You can start only a limited set of dependencies:
docker-compose up -d mongodb
Run tests:
docker-compose run --rm onetl ./run_tests.sh
You can pass additional arguments; they will be forwarded to pytest:
docker-compose run --rm onetl ./run_tests.sh -m mongodb -lsx -vvvv --log-cli-level=INFO
You can run an interactive bash session and use it:
docker-compose run --rm onetl bash
./run_tests.sh -m mongodb -lsx -vvvv --log-cli-level=INFO
See the logs of the test container:
docker-compose logs -f onetl
Stop all containers and remove created volumes:
docker-compose down -v
Run tests locally
Build image for running tests:
docker-compose build
Start all containers with dependencies:
docker-compose up -d
You can start only a limited set of dependencies:
docker-compose up -d mongodb
Load environment variables with connection properties:
source .env.local
Run tests:
./run_tests.sh
You can pass additional arguments; they will be forwarded to pytest:
./run_tests.sh -m mongodb -lsx -vvvv --log-cli-level=INFO
Stop all containers and remove created volumes:
docker-compose down -v