
One ETL tool to rule them all


What is onETL?

Python ETL/ELT framework powered by Apache Spark & other open-source tools.

  • Provides unified classes to extract data from (E) & load data to (L) various stores.

  • Relies on the Spark DataFrame API for performing transformations (T) in ETL terms.

  • Provides direct access to databases, allowing you to execute SQL queries as well as DDL and DML statements, and to call functions/procedures. This can be used for building ELT pipelines.

  • Supports different read strategies for incremental and batch data fetching.

  • Provides a hooks & plugins mechanism for altering the behavior of internal classes.
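For example, a typical ETL run extracts a table into a Spark DataFrame, transforms it with the DataFrame API, and loads it into another store. The sketch below is illustrative only: the class names come from this page, but import paths and parameter names differ between onETL versions, and all hosts, credentials and table names are placeholders.

# Illustrative sketch only; import paths and parameter names may differ between onETL versions.
from pyspark.sql import SparkSession

from onetl.connection import Hive, Postgres
from onetl.core import DBReader, DBWriter  # newer releases expose these from onetl.db

# A real session also needs the Postgres JDBC driver on the Spark classpath (see the documentation).
spark = SparkSession.builder.appName("onetl-demo").getOrCreate()

postgres = Postgres(host="pg.example.com", user="etl", password="***", database="dwh", spark=spark)
hive = Hive(cluster="dwh", spark=spark)

# Extract (E): read a table into a Spark DataFrame
reader = DBReader(connection=postgres, table="schema.source_table")
df = reader.run()

# Transform (T): plain Spark DataFrame API
df = df.where(df.status == "active")

# Load (L): write the DataFrame into another store
writer = DBWriter(connection=hive, table="schema.target_table")
writer.run(df)

# For ELT-style pipelines, DB connections can also run SQL/DDL/DML directly,
# and read strategies (e.g. IncrementalStrategy) can wrap reader.run() to fetch only new data.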

Requirements

  • Python 3.7 - 3.11

  • PySpark 2.3.x - 3.4.x (depending on the connector used)

  • Java 8+ (required by Spark, see below)

  • Kerberos libs & GCC (required by Hive and HDFS connectors)

Supported storages

Type     | Storage    | Powered by
---------|------------|-----------------------------------
Database | Clickhouse | Apache Spark JDBC Data Source
Database | MSSQL      | Apache Spark JDBC Data Source
Database | MySQL      | Apache Spark JDBC Data Source
Database | Postgres   | Apache Spark JDBC Data Source
Database | Oracle     | Apache Spark JDBC Data Source
Database | Teradata   | Apache Spark JDBC Data Source
Database | Hive       | Apache Spark Hive integration
Database | Greenplum  | Pivotal Greenplum Spark connector
Database | MongoDB    | MongoDB Spark connector
File     | HDFS       | HDFS Python client
File     | S3         | minio-py client
File     | SFTP       | Paramiko library
File     | FTP        | FTPUtil library
File     | FTPS       | FTPUtil library
File     | WebDAV     | WebdavClient3 library

Documentation

See https://onetl.readthedocs.io/

Contribution guide

See CONTRIBUTING.rst

Security

See SECURITY.rst

How to install

Minimal installation

The base onetl package contains:

  • DBReader, DBWriter and related classes

  • FileDownloader, FileUploader, FileMover and related classes, like file filters & limits

  • Read Strategies & HWM classes

  • Plugins support

It can be installed via:

pip install onetl

With DB connections

All DB connection classes (Clickhouse, Greenplum, Hive and others) require PySpark to be installed.

First, install a JDK. The exact installation instructions depend on your OS; here are some examples:

yum install java-1.8.0-openjdk-devel  # CentOS 7 + Spark 2
dnf install java-11-openjdk-devel  # CentOS 8 + Spark 3
apt-get install openjdk-11-jdk  # Debian-based + Spark 3

Compatibility matrix

Spark | Python     | Java       | Scala
------|------------|------------|------
2.3.x | 3.7 only   | 8 only     | 2.11
2.4.x | 3.7 only   | 8 only     | 2.11
3.2.x | 3.7 - 3.10 | 8u201 - 11 | 2.12
3.3.x | 3.7 - 3.10 | 8u201 - 17 | 2.12
3.4.x | 3.7 - 3.11 | 8u362 - 17 | 2.12

Then install PySpark by passing spark to extras:

pip install onetl[spark]  # install latest PySpark

or install PySpark explicitly:

pip install onetl pyspark==3.4.0  # install a specific PySpark version

or inject PySpark into sys.path in some other way BEFORE creating a class instance. Otherwise the class import will fail.
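If PySpark is installed on the machine rather than into the virtualenv, one possible way to make it importable is the third-party findspark package (an assumption, not an onETL dependency; the Spark home path below is a placeholder):

# Optional approach using the third-party 'findspark' package; /opt/spark is a placeholder.
import findspark

findspark.init("/opt/spark")  # prepends the local Spark installation's PySpark to sys.path

from onetl.connection import Clickhouse  # DB connection classes can now be imported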

With file connections

All file connection classes (FTP, SFTP, HDFS and so on) require specific Python clients to be installed.

Each client can be installed explicitly by passing the connector name (in lowercase) to extras:

pip install onetl[ftp]  # specific connector
pip install onetl[ftp,ftps,sftp,hdfs,s3,webdav]  # multiple connectors

To install all file connectors at once, pass files to extras:

pip install onetl[files]

Otherwise, importing the connection class will fail.
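As an illustration, with the sftp extra installed a download could look like the hedged sketch below (class names come from this page; parameter names may differ between onETL versions, and the host, credentials and paths are placeholders):

# Hedged sketch; requires `pip install onetl[sftp]`. Host, credentials and paths are placeholders.
from onetl.connection import SFTP
from onetl.core import FileDownloader  # newer releases expose this from onetl.file

sftp = SFTP(host="sftp.example.com", user="etl", password="***")

downloader = FileDownloader(
    connection=sftp,
    source_path="/remote/reports",
    local_path="/tmp/reports",
)
result = downloader.run()  # result object describes downloaded, skipped and failed files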

With Kerberos support

Most Hadoop instances are set up with Kerberos support, so some connections require additional setup to work properly.

  • HDFS

    Uses requests-kerberos and GSSAPI for authentication in WebHDFS. It also uses the kinit executable to generate a Kerberos ticket.

  • Hive

    Requires a Kerberos ticket to exist before creating the Spark session.

So you need to install the following OS packages:

  • krb5 libs

  • Headers for krb5

  • gcc or other compiler for C sources

The exact installation instructions depend on your OS; here are some examples:

dnf install krb5-devel gcc  # CentOS, OracleLinux
apt install libkrb5-dev gcc  # Debian-based

You should also pass kerberos to extras to install the required Python packages:

pip install onetl[kerberos]
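For example, a job can obtain a ticket right before creating the Hive or HDFS connection by calling kinit (the keytab path and principal below are placeholders):

import subprocess

# Obtain a Kerberos ticket before creating the Spark session or HDFS connection.
# The keytab path and principal are placeholders for your environment.
subprocess.run(
    ["kinit", "-kt", "/etc/security/keytabs/etl_user.keytab", "etl_user@EXAMPLE.COM"],
    check=True,
)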

Full bundle

To install all connectors and dependencies, pass all to extras:

pip install onetl[all]

# this is just the same as
pip install onetl[spark,files,kerberos]

Develop

Clone repo

Clone repo:

git clone git@github.com:MobileTeleSystems/onetl.git -b develop

cd onetl

Setup environment

Create a virtualenv and install dependencies:

python -m venv venv
source venv/bin/activate
pip install -U wheel
pip install -U pip setuptools
pip install -U \
    -r requirements/core.txt \
    -r requirements/ftp.txt \
    -r requirements/hdfs.txt \
    -r requirements/kerberos.txt \
    -r requirements/s3.txt \
    -r requirements/sftp.txt \
    -r requirements/webdav.txt \
    -r requirements/dev.txt \
    -r requirements/docs.txt \
    -r requirements/tests/base.txt \
    -r requirements/tests/clickhouse.txt \
    -r requirements/tests/postgres.txt \
    -r requirements/tests/mongodb.txt \
    -r requirements/tests/mssql.txt \
    -r requirements/tests/mysql.txt \
    -r requirements/tests/oracle.txt \
    -r requirements/tests/spark-3.4.0.txt

Enable pre-commit hooks

Install pre-commit hooks:

pre-commit install --install-hooks

Test pre-commit hooks run:

pre-commit run

Tests

Using docker-compose

Build image for running tests:

docker-compose build

Start all containers with dependencies:

docker-compose up -d

You can start a limited set of dependencies:

docker-compose up -d mongodb

Run tests:

docker-compose run --rm onetl ./run_tests.sh

You can pass additional arguments; they will be forwarded to pytest:

docker-compose run --rm onetl ./run_tests.sh -m mongodb -lsx -vvvv --log-cli-level=INFO

You can also run an interactive bash session and use it:

docker-compose run --rm onetl bash

./run_tests.sh -m mongodb -lsx -vvvv --log-cli-level=INFO

See logs of test container:

docker-compose logs -f onetl

Stop all containers and remove created volumes:

docker-compose down -v

Run tests locally

Build image for running tests:

docker-compose build

Start all containers with dependencies:

docker-compose up -d

You can start a limited set of dependencies:

docker-compose up -d mongodb

Load environment variables with connection properties:

source .env.local

Run tests:

./run_tests.sh

You can pass additional arguments; they will be forwarded to pytest:

./run_tests.sh -m mongodb -lsx -vvvv --log-cli-level=INFO

Stop all containers and remove created volumes:

docker-compose down -v
