Tuplex is a novel big data analytics framework incorporating a Python UDF compiler based on LLVM together with a query compiler featuring whole-stage code generation and optimization.

Tuplex: Blazing Fast Python Data Science

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set. Under the hood, Tuplex is based on data-driven compilation and dual-mode processing, two key techniques that make it possible for Tuplex to provide speed comparable to a pipeline written in hand-optimized C++.

You can join the discussion on Tuplex in our Gitter community or read more about the background of Tuplex in our SIGMOD'21 paper.

Contributions welcome!

Example

Tuplex can be used in Python interactive mode, in a Jupyter notebook, or by copying the code below to a file. To try it out, run the following example:

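from tuplex import *

# create a Tuplex context
c = Context()
# map each element to (x, x * x); the None element does not appear in the output
res = c.parallelize([1, 2, None, 4]).map(lambda x: (x, x * x)).collect()
# this prints [(1, 1), (2, 4), (4, 16)]
print(res)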
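
The printed result above omits the `None` element, for which `x * x` cannot be evaluated. As a rough mental model only, the following plain-Python sketch reproduces that output by skipping rows whose UDF raises. The helper name `map_dropping_failures` is hypothetical and not part of Tuplex, and the sketch says nothing about how Tuplex handles such rows internally.

```python
def map_dropping_failures(rows, udf):
    """Apply udf to every row and skip rows for which udf raises."""
    results = []
    for row in rows:
        try:
            results.append(udf(row))
        except Exception:
            # In this sketch a failing row is simply left out of the result.
            continue
    return results


res = map_dropping_failures([1, 2, None, 4], lambda x: (x, x * x))
print(res)  # [(1, 1), (2, 4), (4, 16)]
```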

Quickstart

To try out Tuplex, open one of the following starter notebooks in Google Colab:

| Name | Link | Description |
| --- | --- | --- |
| (01) Intro to Tuplex | Google Colab | Basic commands to manipulate columns and modify data with user code. |
| (02) Working with Files | Google Colab | Loading and saving files, detecting types. |

More examples can be found here.

Installation

To install Tuplex, you can use a PyPI package for Linux or macOS (Intel), or a Docker container which will launch a Jupyter notebook with Tuplex preinstalled.

Docker

docker run -p 8888:8888 tuplex/tuplex

PyPI

pip install tuplex
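
Before running `pip install tuplex`, it can help to check whether the current interpreter matches one of the prebuilt wheels. The sketch below is an assumption-laden convenience check, not part of Tuplex. The version and platform sets are copied from the wheel list for release 0.3.3 on this page, the function name `has_prebuilt_wheel` is hypothetical, and the minimum glibc (2.17) and macOS (10.13) versions are not checked.

```python
import platform
import sys

# Assumption: taken from the 0.3.3 wheel list on this page.
WHEEL_PYTHONS = {(3, 7), (3, 8), (3, 9)}
WHEEL_PLATFORMS = {("Linux", "x86_64"), ("Darwin", "x86_64")}


def has_prebuilt_wheel(version=None, system=None, machine=None):
    """Return True if (Python version, OS, CPU) matches a listed wheel."""
    version = tuple(version or sys.version_info[:2])
    system = system or platform.system()
    machine = machine or platform.machine()
    return version in WHEEL_PYTHONS and (system, machine) in WHEEL_PLATFORMS


# Check the interpreter that is running this script.
print(has_prebuilt_wheel())
```

If this prints `False`, the Docker image or a build from source (see Building) are the remaining options described in this README.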

Building

Tuplex is available for macOS and Linux. The current version has been tested under macOS 10.13-12.0 and Ubuntu 18.04/20.04/22.04 LTS. To build Tuplex from source, install the dependencies first and then build the package.

macOS build from source

To build Tuplex, you first need several other packages, which can be installed via Homebrew (brew). If you want to build Tuplex with AWS support, you need macOS 10.13+.

brew install llvm@9 boost boost-python3 aws-sdk-cpp pcre2 antlr4-cpp-runtime googletest gflags yaml-cpp celero protobuf libmagic
python3 -m pip install 'cloudpickle<2.0' numpy
python3 setup.py install --user

Ubuntu build from source

To facilitate installing the dependencies on Ubuntu, we provide two scripts (scripts/ubuntu1804/install_reqs.sh for Ubuntu 18.04, or scripts/ubuntu2004/install_reqs.sh for Ubuntu 20.04). To build an up-to-date version of Tuplex, run:

./scripts/ubuntu1804/install_reqs.sh
python3 -m pip install 'cloudpickle<2.0' numpy
python3 setup.py install --user

Customizing the build

Besides building a pip package, it may be more useful, especially for development, to invoke cmake directly. To create a development version of Tuplex and work with it like a regular cmake project, go to the folder tuplex and use the standard cmake workflow to compile the package (rather than the top-level setup.py file):

mkdir build
cd build
cmake ..
make -j$(nproc)

The Python package corresponding to Tuplex can then be found in build/dist/python, with C++ test executables based on googletest in build/dist/bin. If you'd like to use a cmake-compatible IDE like CLion or VSCode, you can simply open the tuplex/ folder and import the CMakeLists.txt contained there.

To customize the cmake build, the following options are available to be passed via -D<option>=<value>:

| option | values | description |
| --- | --- | --- |
| CMAKE_BUILD_TYPE | Release (default), Debug, RelWithDebInfo, tsan, asan, ubsan | Select the compile mode. Tsan/Asan/Ubsan correspond to Google Sanitizers. |
| BUILD_WITH_AWS | ON (default), OFF | Build with the AWS SDK or not. On Ubuntu this will build the Lambda executor. |
| BUILD_WITH_ORC | ON, OFF (default) | Build with ORC file format support. |
| BUILD_NATIVE | ON, OFF (default) | Build with -march=native to target the platform architecture. |
| SKIP_AWS_TESTS | ON (default), OFF | Skip AWS tests; helpful when no AWS credentials/AWS Tuplex chain is set up. |
| GENERATE_PDFS | ON, OFF (default) | In Debug mode, output PDF files for ASTs of UDFs, query plans, ... if graphviz is installed (e.g., brew install graphviz). |
| PYTHON3_VERSION | 3.6, ... | Select the python3 version to build against by specifying major.minor. To specify the python executable, use the options provided by cmake. |
| LLVM_ROOT_DIR | e.g. /usr/lib/llvm-9 | Specify which LLVM version to use. |
| BOOST_DIR | e.g. /opt/boost | Specify which Boost version to use. Note that the python component of Boost has to be built against the python version used to build Tuplex. |

For example, to create a debug build that outputs PDFs, use the following snippet:

cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
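
When several options are combined, for instance in a build script, the `-D<option>=<value>` arguments can be assembled programmatically. The helper below is a minimal sketch; the function name `cmake_args` is hypothetical and not part of the Tuplex repository. It only formats the command line and does not check that the option names exist.

```python
def cmake_args(options, source_dir=".."):
    """Turn {'OPTION': value} into ['cmake', '-DOPTION=value', ..., source_dir]."""
    args = ["cmake"]
    for name, value in options.items():
        # Map Python booleans to the ON/OFF spelling used in the table above.
        if isinstance(value, bool):
            value = "ON" if value else "OFF"
        args.append(f"-D{name}={value}")
    args.append(source_dir)
    return args


cmd = cmake_args({"CMAKE_BUILD_TYPE": "Debug", "GENERATE_PDFS": True})
print(" ".join(cmd))  # cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
```

The resulting list can be passed to `subprocess.run(cmd, cwd="build", check=True)`, assuming the `build` directory from the steps above already exists.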

License

Tuplex is available under the Apache 2.0 License. To cite the paper, use:

@inproceedings{10.1145/3448016.3457244,
author = {Spiegelberg, Leonhard and Yesantharao, Rahul and Schwarzkopf, Malte and Kraska, Tim},
title = {Tuplex: Data Science in Python at Native Code Speed},
year = {2021},
isbn = {9781450383431},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3448016.3457244},
doi = {10.1145/3448016.3457244},
booktitle = {Proceedings of the 2021 International Conference on Management of Data},
pages = {1718–1731},
numpages = {14},
location = {Virtual Event, China},
series = {SIGMOD/PODS '21}
}

(c) 2017-2022 Tuplex contributors

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

| File | Size | Python | Platform |
| --- | --- | --- | --- |
| tuplex-0.3.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 29.0 MB | CPython 3.9 | manylinux: glibc 2.17+ x86-64 |
| tuplex-0.3.3-cp39-cp39-macosx_10_13_x86_64.whl | 17.2 MB | CPython 3.9 | macOS 10.13+ x86-64 |
| tuplex-0.3.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 29.0 MB | CPython 3.8 | manylinux: glibc 2.17+ x86-64 |
| tuplex-0.3.3-cp38-cp38-macosx_10_13_x86_64.whl | 17.2 MB | CPython 3.8 | macOS 10.13+ x86-64 |
| tuplex-0.3.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 29.0 MB | CPython 3.7m | manylinux: glibc 2.17+ x86-64 |
| tuplex-0.3.3-cp37-cp37m-macosx_10_13_x86_64.whl | 17.2 MB | CPython 3.7m | macOS 10.13+ x86-64 |
