Skip to main content

Tuplex is a novel big data analytics framework incorporating a Python UDF compiler based on LLVM together with a query compiler featuring whole-stage code generation and optimization.

Project description

Tuplex: Blazing Fast Python Data Science

Build Status License Supported python versions Gitter PyPi Downloads

Website Documentation

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set. Under the hood, Tuplex is based on data-driven compilation and dual-mode processing, two key techniques that make it possible for Tuplex to provide speed comparable to a pipeline written in hand-optimized C++.

You can join the discussion on Tuplex on our Gitter community or read up more on the background of Tuplex in our SIGMOD'21 paper.

Contributions welcome!

Contents

Installation

To install Tuplex, you can use a PyPi package for Linux, or a Docker container for MacOS which will launch a jupyter notebook with Tuplex preinstalled.

Docker

docker run -p 8888:8888 tuplex/tuplex

PyPI

pip install tuplex

Building

Tuplex is available for MacOS and Linux. The current version has been tested under MacOS 10.13-10.15 and Ubuntu 18.04 and 20.04 LTS. To install Tuplex, simply install the dependencies first and then build the package.

MacOS build from source

To build Tuplex, you need several other packages first which can be easily installed via brew.

brew install llvm@9 boost boost-python3 aws-sdk-cpp pcre2 antlr4-cpp-runtime googletest gflags yaml-cpp celero protobuf libmagic
python3 -m pip install cloudpickle numpy
python3 setup.py install --user

Ubuntu build from source

To faciliate installing the dependencies for Ubuntu, we do provide two scripts (scripts/ubuntu1804/install_reqs.sh for Ubuntu 18.04, or scripts/ubuntu2004/install_reqs.sh for Ubuntu 20.04). To create an up to date version of Tuplex, simply run

./scripts/ubuntu1804/install_reqs.sh
python3 -m pip install cloudpickle numpy
python3 setup.py install --user

Customizing the build

Besides building a pip package, especially for development it may be more useful to invoke cmake directly. To create a development version of Tuplex and work with it like a regular cmake project, go to the folder tuplex and then use the standard workflow to compile the package via cmake (and not the top-level setup.py file):

mkdir build
cd build
cmake ..
make -j$(nproc)

The python package corresponding to Tuplex can be then found in build/dist/python with C++ test executables based on googletest in build/dist/bin. If you'd like to use a cmake-compatible IDE like CLion or VSCode you can simply open the tuplex/ folder and import the CMakeLists.txt contained there.

To customize the cmake build, the following options are available to be passed via -D<option>=<value>:

option values description
CMAKE_BUILD_TYPE Release (default), Debug, RelWithDebInfo, tsan, asan, ubsan select compile mode. Tsan/Asan/Ubsan correspond to Google Sanitizers.
BUILD_WITH_AWS ON (default), OFF build with AWS SDK or not. On Ubuntu this will build the Lambda executor.
BUILD_WITH_ORC ON, OFF (default) build with ORC file format support.
BUILD_NATIVE ON, OFF (default) build with -march=native to target platform architecture.
SKIP_AWS_TESTS ON (default), OFF skip aws tests, helpful when no AWS credentials/AWS Tuplex chain is setup.
GENERATE_PDFS ON, OFF (default) output in Debug mode PDF files if graphviz is installed (e.g., brew install graphviz) for ASTs of UDFs, query plans, ...
PYTHON3_VERSION 3.6, ... when trying to select a python3 version to build against, use this by specifying major.minor. To specify the python executable, use the options provided by cmake.
LLVM_ROOT_DIR e.g. /usr/lib/llvm-9 specify which LLVM version to use
BOOST_DIR e.g. /opt/boost specify which Boost version to use. Note that the python component of boost has to be built against the python version used to build Tuplex

For example, to create a debug build which outputs PDFs use the following snippet:

cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..

Example

Tuplex can be used in python interactive mode, a jupyter notebook or by copying the below code to a file. To try it out, run the following example:

from tuplex import *
c = Context()
res = c.parallelize([1, 2, None, 4]).map(lambda x: (x, x * x)).collect()
# this prints [(1, 1), (2, 4), (4, 16)]
print(res)

More examples can be found here.

License

Tuplex is available under Apache 2.0 License, to cite the paper use:

@inproceedings{10.1145/3448016.3457244,
author = {Spiegelberg, Leonhard and Yesantharao, Rahul and Schwarzkopf, Malte and Kraska, Tim},
title = {Tuplex: Data Science in Python at Native Code Speed},
year = {2021},
isbn = {9781450383431},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3448016.3457244},
doi = {10.1145/3448016.3457244},
booktitle = {Proceedings of the 2021 International Conference on Management of Data},
pages = {1718–1731},
numpages = {14},
location = {Virtual Event, China},
series = {SIGMOD/PODS '21}
}

(c) 2017-2022 Tuplex contributors

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

tuplex-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.0 MB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

tuplex-0.3.2-cp39-cp39-macosx_10_9_x86_64.whl (16.8 MB view details)

Uploaded CPython 3.9macOS 10.9+ x86-64

tuplex-0.3.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.0 MB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

tuplex-0.3.2-cp38-cp38-macosx_10_9_x86_64.whl (16.8 MB view details)

Uploaded CPython 3.8macOS 10.9+ x86-64

tuplex-0.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.0 MB view details)

Uploaded CPython 3.7mmanylinux: glibc 2.17+ x86-64

tuplex-0.3.2-cp37-cp37m-macosx_10_9_x86_64.whl (16.8 MB view details)

Uploaded CPython 3.7mmacOS 10.9+ x86-64

File details

Details for the file tuplex-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

  • Download URL: tuplex-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  • Upload date:
  • Size: 29.0 MB
  • Tags: CPython 3.9, manylinux: glibc 2.17+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10

File hashes

Hashes for tuplex-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 e4cb1eb16d68db271b6b012871c370fd7567af0e81a070de8b8b3262ba81f728
MD5 b8d936f897a15216d82d00d9400e3447
BLAKE2b-256 872864e7732f0193033192118cf78407ed3819d1b74b654b897e7c1b01dc2c57

See more details on using hashes here.

File details

Details for the file tuplex-0.3.2-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: tuplex-0.3.2-cp39-cp39-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 16.8 MB
  • Tags: CPython 3.9, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10

File hashes

Hashes for tuplex-0.3.2-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 07b08f0b7339c63620479580fc62a78dc8fc685fac8d09224b3b7429fb885301
MD5 1912f37aaa91141980a34de53fa20c2e
BLAKE2b-256 650f827a7b30afa71158a6c2cc6d21176cfe0afeb1e66bc9574d46ddbd120bee

See more details on using hashes here.

File details

Details for the file tuplex-0.3.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

  • Download URL: tuplex-0.3.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  • Upload date:
  • Size: 29.0 MB
  • Tags: CPython 3.8, manylinux: glibc 2.17+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10

File hashes

Hashes for tuplex-0.3.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 821b221283da0d72ad0fa55ba2b16bb16fabe001806081cd357835bb44e58c6c
MD5 da5188a8f32d9c0e7618a1070e790f68
BLAKE2b-256 0e8023c63c00e7e959d3264f79d11257f292636b3c50dfd8724c88f3948961bc

See more details on using hashes here.

File details

Details for the file tuplex-0.3.2-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: tuplex-0.3.2-cp38-cp38-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 16.8 MB
  • Tags: CPython 3.8, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10

File hashes

Hashes for tuplex-0.3.2-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 ef148751626ffe2d370b81c4171779a74be34c95279b61ad655206228c22e6b1
MD5 813630b3606f42e8744baf48079ededf
BLAKE2b-256 2f3e5a522fb3eb431d9959f3929089e8a8ff5095f1dee16685099966a7596e68

See more details on using hashes here.

File details

Details for the file tuplex-0.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

  • Download URL: tuplex-0.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  • Upload date:
  • Size: 29.0 MB
  • Tags: CPython 3.7m, manylinux: glibc 2.17+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10

File hashes

Hashes for tuplex-0.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 7df020738ea244deb99996d2749c254a1a29dc8334048b0d265614d65bca2a8e
MD5 65b986d848de01c0b21edd312f0e7aba
BLAKE2b-256 db02af25b179364d6c7e7dcf635f6835a42da35bbf20010d6e7f2ba66214a154

See more details on using hashes here.

File details

Details for the file tuplex-0.3.2-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: tuplex-0.3.2-cp37-cp37m-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 16.8 MB
  • Tags: CPython 3.7m, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10

File hashes

Hashes for tuplex-0.3.2-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 e5d695a7cd1c9a993787b9fbe29603a26f1dd1f84e09ba076fe3a76b0fd603c1
MD5 35eb06934c75f9db52e11b2841f284ea
BLAKE2b-256 fb884833a7a90e67dda18bd8900c415156a38f20dd9ff9c6e503e468a1353eb4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page