Tuplex is a novel big data analytics framework incorporating a Python UDF compiler based on LLVM together with a query compiler featuring whole-stage code generation and optimization.

Tuplex: Blazing Fast Python Data Science

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set. Under the hood, Tuplex is based on data-driven compilation and dual-mode processing, two key techniques that make it possible for Tuplex to provide speed comparable to a pipeline written in hand-optimized C++.
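The idea behind dual-mode processing can be illustrated with a small plain-Python sketch. This is not Tuplex's implementation and uses no Tuplex API: the names infer_normal_case and run_dual_mode are hypothetical, and the "fast path" here is an ordinary Python call standing in for compiled code. It only shows the control flow, assuming the common case is detected from a sample of the input and rows that do not fit it are handled separately.

```python
from collections import Counter


def infer_normal_case(sample):
    """Return the most frequent Python type among the sampled rows."""
    return Counter(type(row) for row in sample).most_common(1)[0][0]


def run_dual_mode(rows, udf, sample_size=100):
    """Apply udf to rows: fast path for the common type, slow path for the rest."""
    if not rows:
        return [], []
    normal_type = infer_normal_case(rows[:sample_size])

    results = {}   # row index -> UDF output
    deferred = []  # (index, row) pairs the fast path could not handle
    for i, row in enumerate(rows):
        if type(row) is normal_type:
            try:
                # Fast path: stand-in for code specialized to normal_type.
                results[i] = udf(row)
                continue
            except Exception:
                pass  # fall through and defer this row to the slow path
        deferred.append((i, row))

    failed = []  # (row, exception name) for rows that also fail on the slow path
    for i, row in deferred:
        try:
            # Slow path: general-purpose fallback for uncommon rows.
            results[i] = udf(row)
        except Exception as exc:
            failed.append((row, type(exc).__name__))

    return [results[i] for i in sorted(results)], failed


out, failed = run_dual_mode([1, 2, None, 4], lambda x: (x, x * x))
print(out)     # [(1, 1), (2, 4), (4, 16)]
print(failed)  # [(None, 'TypeError')]
```

Keeping the common case separate is what allows the fast path to stay simple enough to optimize aggressively, while unusual rows are still processed or reported.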

You can join the discussion in our Gitter community or read more about the background of Tuplex in our SIGMOD'21 paper.

Contributions welcome!

Example

Tuplex can be used in Python interactive mode, in a Jupyter notebook, or by copying the code below to a file. To try it out, run the following example:

from tuplex import *
c = Context()
res = c.parallelize([1, 2, None, 4]).map(lambda x: (x, x * x)).collect()
# the None row fails inside the lambda and is dropped, so this prints [(1, 1), (2, 4), (4, 16)]
print(res)

Quickstart

To try out Tuplex, use the following starter notebooks on Google Colab:

| Name | Link | Description |
|---|---|---|
| 1. Intro to Tuplex | Google Colab | Basic commands to manipulate columns and modify data with user code. |
| 2. Working with Files | Google Colab | Loading and saving files, detecting types. |

More examples can be found here.

Installation

To install Tuplex, you can use a PyPI package for Linux or macOS (Intel), or a Docker container which will launch a Jupyter notebook with Tuplex preinstalled.

Docker

docker run -p 8888:8888 tuplex/tuplex:v0.3.6

PyPI

pip install tuplex
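To check whether the pip install succeeded, one option is to query the installed package metadata from Python. The sketch below uses only the standard library (importlib.metadata, Python 3.8+); the helper name installed_version is hypothetical and not part of Tuplex.

```python
from importlib import metadata


def installed_version(package="tuplex"):
    """Return the installed version string of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(f"tuplex {version}" if version else "tuplex is not installed in this environment")
```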

Building

Tuplex is available for macOS and Linux. The current version has been tested under macOS 10.13+ and Ubuntu 20.04/22.04 LTS. To build Tuplex from source, install the dependencies first and then build the package.

MacOS build from source

To build Tuplex, you first need several other packages, which can be installed via brew. If you want to build Tuplex with AWS support, you need macOS 10.13+. Python 3.9 or earlier requires an older cloudpickle version (1.6.0), whereas Python 3.10+ requires cloudpickle 2.1.0+.

brew install llvm@9 boost boost-python3 aws-sdk-cpp pcre2 antlr4-cpp-runtime googletest gflags yaml-cpp celero protobuf libmagic
# pin for Python 3.9 or earlier; on Python 3.10+ install 'cloudpickle>=2.1.0' instead
python3 -m pip install 'cloudpickle<2.0' numpy
python3 setup.py install --user
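The cloudpickle rule stated above can be written as a small helper that picks the pip requirement for the running interpreter. This is a sketch only: the function name cloudpickle_requirement is hypothetical, and the specifier for older interpreters mirrors the 'cloudpickle<2.0' pin from the command above (the text names 1.6.0 as the version used).

```python
import sys


def cloudpickle_requirement(version_info=sys.version_info):
    """Return a pip requirement string for cloudpickle matching the Python version."""
    if tuple(version_info[:2]) >= (3, 10):
        # Python 3.10+ needs cloudpickle 2.1.0 or newer.
        return "cloudpickle>=2.1.0"
    # Python 3.9 or earlier needs an older cloudpickle release.
    return "cloudpickle<2.0"


# Print the install command for the current interpreter.
print(f"python3 -m pip install '{cloudpickle_requirement()}' numpy")
```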

Ubuntu build from source

To facilitate installing the dependencies on Ubuntu, we provide two scripts (scripts/ubuntu2004/install_reqs.sh for Ubuntu 20.04, or scripts/ubuntu2204/install_reqs.sh for Ubuntu 22.04). To build an up-to-date version of Tuplex on Ubuntu 22.04, run

./scripts/ubuntu2204/install_reqs.sh
python3 -m pip install cloudpickle numpy
python3 setup.py install --user

Customizing the build

Instead of building a pip package, it is often more useful during development to invoke cmake directly. To create a development version of Tuplex and work with it like a regular cmake project, go to the folder tuplex and use the standard cmake workflow to compile the package (rather than the top-level setup.py file):

mkdir build
cd build
cmake ..
# if nproc is unavailable (e.g., on macOS without GNU coreutils), use $(sysctl -n hw.ncpu) instead
make -j$(nproc)

The Python package corresponding to Tuplex can then be found in build/dist/python, with C++ test executables based on googletest in build/dist/bin. If you'd like to use a cmake-compatible IDE like CLion or VSCode, you can simply open the tuplex/ folder and import the CMakeLists.txt contained there.
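One way to try a development build from a Python session is to put the build output directory on sys.path before importing. The helper below is a hypothetical sketch, not part of Tuplex; it assumes the package layout under build/dist/python described above and that importing from that directory works without further setup.

```python
import sys
from pathlib import Path


def add_dev_build_to_path(build_dir="build"):
    """Prepend <build_dir>/dist/python to sys.path if it exists; return the path or None."""
    package_dir = Path(build_dir).resolve() / "dist" / "python"
    if not package_dir.is_dir():
        return None
    if str(package_dir) not in sys.path:
        sys.path.insert(0, str(package_dir))
    return package_dir


# Run from the tuplex/ folder after the cmake build has finished.
if add_dev_build_to_path("build") is None:
    print("no development build found under build/dist/python")
```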

To customize the cmake build, the following options are available to be passed via -D<option>=<value>:

| Option | Values | Description |
|---|---|---|
| CMAKE_BUILD_TYPE | Release (default), Debug, RelWithDebInfo, tsan, asan, ubsan | Select the compile mode. tsan/asan/ubsan correspond to the Google sanitizers. |
| BUILD_WITH_AWS | ON (default), OFF | Build with the AWS SDK or not. On Ubuntu this will build the Lambda executor. |
| BUILD_WITH_ORC | ON, OFF (default) | Build with ORC file format support. |
| BUILD_NATIVE | ON, OFF (default) | Build with -march=native to target the platform architecture. |
| SKIP_AWS_TESTS | ON (default), OFF | Skip AWS tests; helpful when no AWS credentials/AWS Tuplex chain is set up. |
| GENERATE_PDFS | ON, OFF (default) | In Debug mode, output PDF files for ASTs of UDFs, query plans, ... if graphviz is installed (e.g., brew install graphviz). |
| PYTHON3_VERSION | 3.8, ... | Select the Python 3 version to build against, specified as major.minor. To specify the Python executable, use the options provided by cmake. |
| LLVM_ROOT_DIR | e.g. /usr/lib/llvm-9 | Specify which LLVM version to use. |
| BOOST_DIR | e.g. /opt/boost | Specify which Boost version to use. Note that the Python component of Boost has to be built against the Python version used to build Tuplex. |

For example, to create a debug build which outputs PDFs use the following snippet:

cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
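When several options are combined, for instance in a build script, the -D<option>=<value> arguments can be assembled programmatically. The following is a minimal sketch; the helper cmake_command is hypothetical, and mapping Python booleans to ON/OFF is a convenience chosen for this example.

```python
def cmake_command(options, source_dir=".."):
    """Build a cmake argument list from a dict of option names to values."""
    args = ["cmake"]
    for name, value in options.items():
        if isinstance(value, bool):
            # cmake switches are conventionally spelled ON/OFF.
            value = "ON" if value else "OFF"
        args.append(f"-D{name}={value}")
    args.append(source_dir)
    return args


cmd = cmake_command({"CMAKE_BUILD_TYPE": "Debug", "GENERATE_PDFS": True})
print(" ".join(cmd))  # cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
```

The resulting list can be passed to subprocess.run from within the build directory.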

License

Tuplex is available under the Apache 2.0 License. To cite the paper, use:

@inproceedings{10.1145/3448016.3457244,
author = {Spiegelberg, Leonhard and Yesantharao, Rahul and Schwarzkopf, Malte and Kraska, Tim},
title = {Tuplex: Data Science in Python at Native Code Speed},
year = {2021},
isbn = {9781450383431},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3448016.3457244},
doi = {10.1145/3448016.3457244},
booktitle = {Proceedings of the 2021 International Conference on Management of Data},
pages = {1718–1731},
numpages = {14},
location = {Virtual Event, China},
series = {SIGMOD/PODS '21}
}

(c) 2017-2023 Tuplex contributors
