onetl

One ETL tool to rule them all

Maintainers

dolfinus mtsrus

Project description

What is onETL?

Python ETL/ELT library powered by Apache Spark & other open-source tools.

Goals

Provide unified classes to extract data from (E) and load data to (L) various stores.
Provide the Spark DataFrame API for performing transformations (T) in terms of ETL.
Provide direct access to the database, allowing execution of SQL queries, as well as DDL, DML, and calls to functions/procedures. This can be used for building ELT pipelines.
Support different read strategies, e.g. incremental reads.
Provide a hooks & plugins mechanism for altering the behavior of internal classes.
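
To make the idea of an incremental read strategy concrete, here is a minimal sketch in plain Python. It is not onETL's API: the function name `read_incremental`, the in-memory `table` list and the `hwm` variable are all assumptions made for illustration. The idea is to remember the highest value of some column that was already processed (a high water mark, HWM) and to read only rows above it on the next run.

```python
from typing import Dict, List, Optional, Tuple


def read_incremental(
    rows: List[Dict[str, int]],
    column: str,
    hwm: Optional[int],
) -> Tuple[List[Dict[str, int]], Optional[int]]:
    """Return rows whose `column` value is above `hwm`, plus the updated HWM."""
    new_rows = [row for row in rows if hwm is None or row[column] > hwm]
    if new_rows:
        hwm = max(row[column] for row in new_rows)
    return new_rows, hwm


# Hypothetical source table
table = [{"id": 1}, {"id": 2}, {"id": 3}]

# First run: there is no HWM yet, so every row is read
batch, hwm = read_incremental(table, "id", None)

# New rows arrive between runs
table.append({"id": 4})

# Second run: only rows above the stored HWM are read
next_batch, hwm = read_incremental(table, "id", hwm)
```

In a real pipeline the HWM value would have to be persisted between runs; this sketch keeps it in a local variable only.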

Non-goals

onETL is not a Spark replacement. It only adds functionality that Spark lacks and improves UX for end users.
onETL is not a framework: it imposes no requirements on project structure, naming, the way ETL/ELT processes are run, configuration, etc. All of that should be implemented in some other tool.
onETL is deliberately developed without any integration with scheduling software like Apache Airflow. All integrations should be implemented as separate tools.
No Spark streaming support of any kind; only batch operations are supported. For streaming, prefer Apache Flink.

Requirements

Python 3.7 - 3.14
PySpark 3.2.x - 4.1.x (depends on the connector used)
Java 8+ (required by Spark, see below)
Kerberos libs & GCC (required by Hive, HDFS and SparkHDFS connectors)

Supported storages

Type	Storage	Powered by
Database	Clickhouse	Apache Spark JDBC Data Source
Database	MSSQL	Apache Spark JDBC Data Source
Database	MySQL	Apache Spark JDBC Data Source
Database	Postgres	Apache Spark JDBC Data Source
Database	Oracle	Apache Spark JDBC Data Source
Database	Hive	Apache Spark Hive integration
Database	Iceberg	Apache Iceberg Spark integration
Database	Kafka	Apache Spark Kafka integration
Database	Greenplum	VMware Greenplum Spark connector
Database	MongoDB	MongoDB Spark connector
File	HDFS	HDFS Python client
File	S3	minio-py client
File	SFTP	Paramiko library
File	FTP	FTPUtil library
File	FTPS	FTPUtil library
File	WebDAV	WebdavClient3 library
File	Samba	pysmb library
Files as DataFrame	SparkLocalFS	Apache Spark File Data Source
Files as DataFrame	SparkHDFS	Apache Spark File Data Source
Files as DataFrame	SparkS3	Hadoop AWS library

Documentation

See https://onetl.readthedocs.io/

How to install

Minimal installation

Base onetl package contains:

DBReader, DBWriter and related classes
FileDownloader, FileUploader, FileMover and related classes, like file filters & limits
FileDFReader, FileDFWriter and related classes, like file formats
Read Strategies & HWM classes
Plugins support

It can be installed via:

pip install onetl

With DB and FileDF connections

All DB connection classes (Clickhouse, Greenplum, Hive and others) and all FileDF connection classes (SparkHDFS, SparkLocalFS, SparkS3) require Spark to be installed.

First, install a JDK. The exact installation instructions depend on your OS; here are some examples:

yum install java-1.8.0-openjdk-devel  # CentOS 7
dnf install java-11-openjdk-devel  # CentOS 8
apt-get install openjdk-11-jdk  # Debian-based

Compatibility matrix

Spark	Python	Java	Scala
3.2.x	3.7 - 3.10	8u201 - 11	2.12
3.3.x	3.7 - 3.12	8u201 - 17	2.12
3.4.x	3.7 - 3.12	8u362 - 20	2.12
3.5.x	3.8 - 3.13	8u371 - 20	2.12
4.0.x	3.9 - 3.14	17 - 22	2.13
4.1.x	3.10 - 3.14	17 - 22	2.13
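
The matrix can also be checked programmatically before installing PySpark. The following is a minimal sketch that copies only the Spark and Python columns from the table above; the dictionary `SPARK_PYTHON_RANGES` and the helper `python_supported` are hypothetical names, not part of onETL or PySpark.

```python
import sys

# Supported Python ranges per Spark minor version, copied from the matrix above
SPARK_PYTHON_RANGES = {
    "3.2": ((3, 7), (3, 10)),
    "3.3": ((3, 7), (3, 12)),
    "3.4": ((3, 7), (3, 12)),
    "3.5": ((3, 8), (3, 13)),
    "4.0": ((3, 9), (3, 14)),
    "4.1": ((3, 10), (3, 14)),
}


def python_supported(spark_version: str, python_version: tuple = sys.version_info[:2]) -> bool:
    """Check whether a Python (major, minor) pair fits the given Spark version."""
    minor = ".".join(spark_version.split(".")[:2])
    if minor not in SPARK_PYTHON_RANGES:
        raise ValueError(f"Spark {spark_version} is not in the compatibility matrix")
    lowest, highest = SPARK_PYTHON_RANGES[minor]
    return lowest <= tuple(python_version) <= highest


# Check the interpreter running this script against Spark 3.5.8
print(python_supported("3.5.8"))
```

The Java and Scala columns are not covered by this sketch.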

Then install PySpark by passing spark to extras:

pip install "onetl[spark]"  # install latest PySpark

or install PySpark explicitly:

pip install onetl pyspark==3.5.8  # install a specific PySpark version

or inject PySpark into sys.path in some other way BEFORE creating a class instance. Otherwise the connection object cannot be created.
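
Because connection objects cannot be created without PySpark, a script may want to check for it up front. This is a minimal sketch using only the standard library; the helper name `pyspark_available` is an assumption for illustration and is not provided by onETL.

```python
import importlib.util


def pyspark_available() -> bool:
    """Return True if the pyspark package can be found on sys.path."""
    return importlib.util.find_spec("pyspark") is not None


if not pyspark_available():
    print('PySpark not found; install it with: pip install "onetl[spark]"')
```

The check only looks the package up and does not import it, so it is cheap to run before any Spark session is created.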

With File connections

All File (but not FileDF) connection classes (FTP, SFTP, HDFS and so on) require specific Python clients to be installed.

Each client can be installed explicitly by passing the connector name (in lowercase) to extras:

pip install "onetl[ftp]"  # specific connector
pip install "onetl[ftp,ftps,sftp,hdfs,s3,webdav,samba]"  # multiple connectors

To install all file connectors at once you can pass files to extras:

pip install "onetl[files]"

Without the required client installed, importing the connection class will fail.
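
Since extras are just lowercase connector names, the requirement string can be built from a list of connectors. The sketch below is illustrative only: `FILE_EXTRAS` lists the names that appear in the pip commands above, and `onetl_requirement` is a hypothetical helper, not an onETL function.

```python
# Extras names taken from the pip commands above
FILE_EXTRAS = {"ftp", "ftps", "sftp", "hdfs", "s3", "webdav", "samba"}


def onetl_requirement(*connectors: str) -> str:
    """Build a pip requirement string such as 'onetl[hdfs,sftp]'."""
    extras = sorted({name.lower() for name in connectors})
    unknown = [name for name in extras if name not in FILE_EXTRAS]
    if unknown:
        raise ValueError(f"Unknown file connector extras: {unknown}")
    return f"onetl[{','.join(extras)}]" if extras else "onetl"


print(onetl_requirement("SFTP", "HDFS"))  # prints: onetl[hdfs,sftp]
```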

With Kerberos support

Most Hadoop instances are set up with Kerberos support, so some connections require additional setup to work properly.

HDFS uses requests-kerberos and GSSApi for authentication. It also uses the kinit executable to generate a Kerberos ticket.
Hive and SparkHDFS require a Kerberos ticket to exist before creating the Spark session.

So you need to install OS packages with:

krb5 libs
Headers for krb5
gcc or other compiler for C sources

The exact installation instructions depend on your OS; here are some examples:

apt install libkrb5-dev krb5-user gcc  # Debian-based
dnf install krb5-devel krb5-libs krb5-workstation gcc  # CentOS, OracleLinux

Also you should pass kerberos to extras to install required Python packages:

pip install "onetl[kerberos]"
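
Because Hive and SparkHDFS need a Kerberos ticket to exist before the Spark session is created, a script can check for one first. This is a minimal sketch, assuming the `klist` utility from the krb5 packages is on PATH; `klist -s` runs silently and reports through its exit status whether a valid ticket cache exists. The helper name `has_kerberos_ticket` is hypothetical and not part of onETL.

```python
import shutil
import subprocess


def has_kerberos_ticket() -> bool:
    """Return True if `klist -s` reports a readable, non-expired ticket cache."""
    klist = shutil.which("klist")
    if klist is None:
        # Kerberos client tools are not installed
        return False
    result = subprocess.run([klist, "-s"], check=False)
    return result.returncode == 0


if not has_kerberos_ticket():
    print("No valid Kerberos ticket found; run kinit first")
```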

Full bundle

To install all connectors and dependencies, you can pass all into extras:

pip install "onetl[all]"

# this is just the same as
pip install "onetl[spark,files,kerberos]"

Quick start

MSSQL → Hive

Read data from MSSQL, transform & write to Hive.

# install onETL and PySpark
pip install "onetl[spark]"

# Import pyspark to initialize the SparkSession
from pyspark.sql import SparkSession

# import function to setup onETL logging
from onetl.log import setup_logging

# Import required connections
from onetl.connection import MSSQL, Hive

# Import onETL classes to read & write data
from onetl.db import DBReader, DBWriter

# change logging level to INFO, and set up default logging format and handler
setup_logging()

# Initialize new SparkSession with MSSQL driver loaded
maven_packages = MSSQL.get_packages()
spark = (
    SparkSession.builder.appName("spark_app_onetl_demo")
    .config("spark.jars.packages", ",".join(maven_packages))
    .enableHiveSupport()  # for Hive
    .getOrCreate()
)

# Initialize MSSQL connection and check if database is accessible
mssql = MSSQL(
    host="mssqldb.demo.com",
    user="onetl",
    password="onetl",
    database="Telecom",
    spark=spark,
    # These options are passed to MSSQL JDBC Driver:
    extra={"applicationIntent": "ReadOnly"},
).check()

# >>> INFO:|MSSQL| Connection is available

# Initialize DBReader
reader = DBReader(
    connection=mssql,
    source="dbo.demo_table",
    columns=["on", "etl"],
    # Set some MSSQL read options:
    options=MSSQL.ReadOptions(fetchsize=10000),
)

# checks that there is data in the table, otherwise raises exception
reader.raise_if_no_data()

# Read data to DataFrame
df = reader.run()
df.printSchema()
# root
#  |-- id: integer (nullable = true)
#  |-- phone_number: string (nullable = true)
#  |-- region: string (nullable = true)
#  |-- birth_date: date (nullable = true)
#  |-- registered_at: timestamp (nullable = true)
#  |-- account_balance: double (nullable = true)

# Apply any PySpark transformations
from pyspark.sql.functions import lit

df_to_write = df.withColumn("engine", lit("onetl"))
df_to_write.printSchema()
# root
#  |-- id: integer (nullable = true)
#  |-- phone_number: string (nullable = true)
#  |-- region: string (nullable = true)
#  |-- birth_date: date (nullable = true)
#  |-- registered_at: timestamp (nullable = true)
#  |-- account_balance: double (nullable = true)
#  |-- engine: string (nullable = false)

# Initialize Hive connection
hive = Hive(cluster="rnd-dwh", spark=spark)

# Initialize DBWriter
db_writer = DBWriter(
    connection=hive,
    target="dl_sb.demo_table",
    # Set some Hive write options:
    options=Hive.WriteOptions(if_exists="replace_entire_table"),
)

# Write data from DataFrame to Hive
db_writer.run(df_to_write)

# Success!

SFTP → HDFS

Download files from SFTP & upload them to HDFS.

# install onETL with SFTP and HDFS clients, and Kerberos support
pip install "onetl[hdfs,sftp,kerberos]"

# import function to setup onETL logging
from onetl.log import setup_logging

# Import required connections
from onetl.connection import SFTP, HDFS

# Import onETL classes to download & upload files
from onetl.file import FileDownloader, FileUploader

# import filter & limit classes
from onetl.file.filter import Glob, ExcludeDir
from onetl.file.limit import MaxFilesCount

# change logging level to INFO, and set up default logging format and handler
setup_logging()

# Initialize SFTP connection and check it
sftp = SFTP(
    host="sftp.test.com",
    user="someuser",
    password="somepassword",
).check()

# >>> INFO:|SFTP| Connection is available

# Initialize downloader
file_downloader = FileDownloader(
    connection=sftp,
    source_path="/remote/tests/Report",  # path on SFTP
    local_path="/local/onetl/Report",  # local fs path
    filters=[
        # download only files matching the glob
        Glob("*.csv"),
        # exclude files from this directory
        ExcludeDir("/remote/tests/Report/exclude_dir/"),
    ],
    limits=[
        # download max 1000 files per run
        MaxFilesCount(1000),
    ],
    options=FileDownloader.Options(
        # delete files from SFTP after successful download
        delete_source=True,
        # mark file as failed if it already exists in local_path
        if_exists="error",
    ),
)

# Download files to local filesystem
download_result = file_downloader.run()

# Method run returns a DownloadResult object,
# which contains a collection of downloaded files, divided into 4 categories
download_result

#  DownloadResult(
#      successful=[
#          LocalPath('/local/onetl/Report/file_1.csv'),
#          LocalPath('/local/onetl/Report/file_2.csv'),
#      ],
#      failed=[FailedRemoteFile('/remote/tests/Report/file_3.csv')],
#      ignored=[RemoteFile('/remote/tests/Report/file_4.csv')],
#      missing=[],
#  )

# Raise exception if there are failed files, or there were no files in the remote filesystem
download_result.raise_if_failed()
download_result.raise_if_empty()

# Do any kind of magic with files: rename files, remove header for csv files, ...
# my_rename_function is a user-defined placeholder, not part of onETL
renamed_files = my_rename_function(download_result.successful)

# function removed "_" from file names
# [
#    LocalPath('/local/onetl/Report/file1.csv'),
#    LocalPath('/local/onetl/Report/file2.csv'),
# ]

# Initialize HDFS connection
hdfs = HDFS(
    host="my.name.node",
    user="someuser",
    password="somepassword",  # or keytab
)

# Initialize uploader
file_uploader = FileUploader(
    connection=hdfs,
    target_path="/user/onetl/Report/",  # hdfs path
)

# Upload files from local fs to HDFS
upload_result = file_uploader.run(renamed_files)

# Method run returns an UploadResult object,
# which contains a collection of uploaded files, divided into 4 categories
upload_result

#  UploadResult(
#      successful=[RemoteFile('/user/onetl/Report/file1.csv')],
#      failed=[FailedLocalFile('/local/onetl/Report/file2.csv')],
#      ignored=[],
#      missing=[],
#  )

# Raise exception if there are failed files, or there were no files in the local filesystem, or some input file is missing
upload_result.raise_if_failed()
upload_result.raise_if_empty()
upload_result.raise_if_missing()

# Success!
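
The example above calls my_rename_function without defining it. One possible implementation is sketched below, assuming the downloaded paths behave like `pathlib.Path` objects pointing at files on the local filesystem. The function body is an illustration and does not come from onETL.

```python
from pathlib import Path
from typing import Iterable, List


def my_rename_function(files: Iterable[Path]) -> List[Path]:
    """Rename each file on disk, dropping "_" from its name, and return the new paths."""
    renamed = []
    for path in files:
        path = Path(path)
        new_path = path.with_name(path.name.replace("_", ""))
        if new_path != path:
            # Move the file to its new name within the same directory
            path.rename(new_path)
        renamed.append(new_path)
    return renamed
```

Only the file name is changed; the parent directory stays the same, so the returned paths can be passed straight to the uploader.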

S3 → Postgres

Read files directly from an S3 path as a DataFrame, transform it, and then write it to a database.

# install onETL and PySpark
pip install "onetl[spark]"

# Import pyspark to initialize the SparkSession
from pyspark.sql import SparkSession

# import function to setup onETL logging
from onetl.log import setup_logging

# Import required connections
from onetl.connection import Postgres, SparkS3

# Import onETL classes to read files
from onetl.file import FileDFReader
from onetl.file.format import CSV

# Import onETL classes to write data
from onetl.db import DBWriter

# change logging level to INFO, and set up default logging format and handler
setup_logging()

# Initialize new SparkSession with Hadoop AWS libraries and Postgres driver loaded
maven_packages = SparkS3.get_packages(spark_version="3.5.8") + Postgres.get_packages()
exclude_packages = SparkS3.get_exclude_packages()
spark = (
    SparkSession.builder.appName("spark_app_onetl_demo")
    .config("spark.jars.packages", ",".join(maven_packages))
    .config("spark.jars.excludes", ",".join(exclude_packages))
    .getOrCreate()
)

# Initialize S3 connection and check it
spark_s3 = SparkS3(
    host="s3.test.com",
    protocol="https",
    bucket="my-bucket",
    access_key="somekey",
    secret_key="somesecret",
    # Access bucket as s3.test.com/my-bucket
    path_style_access=True,
    spark=spark,
).check()

# >>> INFO:|SparkS3| Connection is available

# Describe file format and parsing options
csv = CSV(
    delimiter=";",
    header=True,
    encoding="utf-8",
)

# Describe DataFrame schema of files
from pyspark.sql.types import (
    DateType,
    DoubleType,
    IntegerType,
    StringType,
    StructField,
    StructType,
    TimestampType,
)

df_schema = StructType(
    [
        StructField("id", IntegerType()),
        StructField("phone_number", StringType()),
        StructField("region", StringType()),
        StructField("birth_date", DateType()),
        StructField("registered_at", TimestampType()),
        StructField("account_balance", DoubleType()),
    ],
)

# Initialize file df reader
reader = FileDFReader(
    connection=spark_s3,
    source_path="/remote/tests/Report",  # path on S3 where *.csv files are located
    format=csv,  # file format with specific parsing options
    df_schema=df_schema,  # columns & types
)

# Read files directly from S3 as Spark DataFrame
df = reader.run()

# Check that DataFrame schema is same as expected
df.printSchema()
# root
#  |-- id: integer (nullable = true)
#  |-- phone_number: string (nullable = true)
#  |-- region: string (nullable = true)
#  |-- birth_date: date (nullable = true)
#  |-- registered_at: timestamp (nullable = true)
#  |-- account_balance: double (nullable = true)

# Apply any PySpark transformations
from pyspark.sql.functions import lit

df_to_write = df.withColumn("engine", lit("onetl"))
df_to_write.printSchema()
# root
#  |-- id: integer (nullable = true)
#  |-- phone_number: string (nullable = true)
#  |-- region: string (nullable = true)
#  |-- birth_date: date (nullable = true)
#  |-- registered_at: timestamp (nullable = true)
#  |-- account_balance: double (nullable = true)
#  |-- engine: string (nullable = false)

# Initialize Postgres connection
postgres = Postgres(
    host="192.169.11.23",
    user="onetl",
    password="somepassword",
    database="mydb",
    spark=spark,
)

# Initialize DBWriter
db_writer = DBWriter(
    connection=postgres,
    # write to specific table
    target="public.my_table",
    # with some writing options
    options=Postgres.WriteOptions(if_exists="append"),
)

# Write DataFrame to Postgres table
db_writer.run(df_to_write)

# Success!
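
To show what kind of file the CSV options in this example describe (semicolon delimiter, header row), here is a small sketch that parses such content with Python's standard csv module instead of Spark. The sample rows are invented for illustration and cover only three of the columns from the schema above.

```python
import csv
import io

# Invented sample content: ";" as delimiter, first line is the header
sample = "id;phone_number;region\n1;+79990001122;77\n2;+79990003344;78\n"

reader = csv.DictReader(io.StringIO(sample), delimiter=";")
rows = list(reader)

for row in rows:
    print(row["id"], row["phone_number"], row["region"])
```

Unlike the Spark reader with an explicit schema, the standard csv module returns every value as a string.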

Release history

0.16.0

May 12, 2026

0.15.1

Apr 15, 2026

0.15.0

Dec 8, 2025

0.14.1

Nov 25, 2025

0.14.0

Sep 8, 2025

0.13.5

Apr 17, 2025

0.13.4

Mar 20, 2025

0.13.3

Mar 11, 2025

0.13.2 yanked

Mar 11, 2025

Reason this release was yanked:

Broken AutoDetectHWM

0.13.1

Mar 6, 2025

0.13.0 yanked

Feb 23, 2025

Reason this release was yanked:

Combination of DBWriter + Hive + uppercase table name leads to recreating entire table, ignoring `if_exists="append"`

0.12.5

Dec 3, 2024

0.12.4

Nov 27, 2024

0.12.3

Nov 22, 2024

0.12.2

Nov 12, 2024

0.12.1

Oct 28, 2024

0.12.0

Sep 3, 2024

0.11.2

Sep 2, 2024

0.11.1

May 29, 2024

0.11.0

May 27, 2024

0.10.2

Mar 21, 2024

0.10.1

Feb 5, 2024

0.10.0

Dec 17, 2023

0.9.5

Oct 10, 2023

0.9.4

Sep 26, 2023

0.9.3

Sep 6, 2023

0.9.2

Sep 6, 2023

0.9.1

Aug 17, 2023

0.9.0 yanked

Aug 17, 2023

Reason this release was yanked:

Bug in calculating number of workers in FileDownloader/Uploader/Mover

0.8.1

Jul 10, 2023

0.8.0

May 31, 2023

0.7.2

May 24, 2023

0.7.1

May 23, 2023

0.7.0

May 15, 2023

Download files

Source Distribution

onetl-0.16.0.tar.gz (257.8 kB)

Uploaded May 12, 2026 Source

Built Distribution

onetl-0.16.0-py3-none-any.whl (385.9 kB)

Uploaded May 12, 2026 Python 3

File details

Details for the file onetl-0.16.0.tar.gz.

File metadata

Download URL: onetl-0.16.0.tar.gz
Upload date: May 12, 2026
Size: 257.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for onetl-0.16.0.tar.gz
Algorithm	Hash digest
SHA256	`d21d081a601b10915206bd9dcf7c6b83ad87fd68dbc94801cf34eb22c5768226`
MD5	`6a99b851dae7b1a814a43b96ca704911`
BLAKE2b-256	`a44da4472b171dcb9761f94cc395188365785ff7b2d3a316b063d8875d5a4efd`
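
A downloaded archive can be checked against the SHA256 digest listed above. This is a minimal sketch using only the standard library; the helper name `sha256_of` and the assumption that the file sits in the current directory are illustrative.

```python
import hashlib
from pathlib import Path

# SHA256 digest of onetl-0.16.0.tar.gz, copied from the table above
EXPECTED_SHA256 = "d21d081a601b10915206bd9dcf7c6b83ad87fd68dbc94801cf34eb22c5768226"


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as file:
        for chunk in iter(lambda: file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


archive = Path("onetl-0.16.0.tar.gz")  # assumed download location
if archive.exists():
    print("OK" if sha256_of(archive) == EXPECTED_SHA256 else "MISMATCH")
```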

Provenance

The following attestation bundles were made for onetl-0.16.0.tar.gz:

Publisher: release.yml on MTSWebServices/onetl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: onetl-0.16.0.tar.gz
- Subject digest: d21d081a601b10915206bd9dcf7c6b83ad87fd68dbc94801cf34eb22c5768226
- Sigstore transparency entry: 1519099806
- Sigstore integration time: May 12, 2026
Source repository:
- Permalink: MTSWebServices/onetl@dfadd8b144a367471ead8788ffeeac27e678b2c5
- Branch / Tag: refs/tags/0.16.0
- Owner: https://github.com/MTSWebServices
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@dfadd8b144a367471ead8788ffeeac27e678b2c5
- Trigger Event: push

File details

Details for the file onetl-0.16.0-py3-none-any.whl.

File metadata

Download URL: onetl-0.16.0-py3-none-any.whl
Upload date: May 12, 2026
Size: 385.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for onetl-0.16.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`faa48c8491591f5a9fb7783f89036205f7185d60305deb2629cf7b7a43d9cc4c`
MD5	`0dd4b5e9c45e677ea6002faa5b0ace81`
BLAKE2b-256	`7ac74ce83ec7bd95c3c2910d6f0fab368229114924987c2678c357882160098b`

Provenance

The following attestation bundles were made for onetl-0.16.0-py3-none-any.whl:

Publisher: release.yml on MTSWebServices/onetl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: onetl-0.16.0-py3-none-any.whl
- Subject digest: faa48c8491591f5a9fb7783f89036205f7185d60305deb2629cf7b7a43d9cc4c
- Sigstore transparency entry: 1519099867
- Sigstore integration time: May 12, 2026
Source repository:
- Permalink: MTSWebServices/onetl@dfadd8b144a367471ead8788ffeeac27e678b2c5
- Branch / Tag: refs/tags/0.16.0
- Owner: https://github.com/MTSWebServices
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@dfadd8b144a367471ead8788ffeeac27e678b2c5
- Trigger Event: push
