
Tuplex is a novel big data analytics framework incorporating a Python UDF compiler based on LLVM together with a query compiler featuring whole-stage code generation and optimization.


Tuplex: Blazing Fast Python Data Science



Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set. Under the hood, Tuplex is based on data-driven compilation and dual-mode processing, two key techniques that make it possible for Tuplex to provide speed comparable to a pipeline written in hand-optimized C++.
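As a rough intuition for the two techniques named above, the following toy sketch samples the input to find the most common row type (the "normal case"), sends matching rows through a specialized fast path, and routes everything else through a general path. This is plain Python written for illustration only, not Tuplex's implementation; the names `infer_normal_type`, `run_dual_mode`, `fast_udf`, and `general_udf` are hypothetical.

```python
from collections import Counter


def infer_normal_type(sample):
    """Return the most frequent Python type in the sample (the 'normal case')."""
    return Counter(type(v) for v in sample).most_common(1)[0][0]


def run_dual_mode(rows, fast_udf, general_udf, sample_size=100):
    """Process rows with a fast path for the normal case and a general fallback."""
    stats = {"fast": 0, "general": 0}
    if not rows:
        return [], stats

    # Data-driven step: inspect a sample to decide what to specialize for.
    normal_type = infer_normal_type(rows[:sample_size])

    results = []
    for row in rows:
        if type(row) is normal_type:
            # Fast path: assumes the normal-case type, so no conversion is needed.
            results.append(fast_udf(row))
            stats["fast"] += 1
        else:
            # General path: handles rows that deviate from the normal case.
            results.append(general_udf(row))
            stats["general"] += 1
    return results, stats


rows = [1, 2, "3", 4, 5.0]
out, stats = run_dual_mode(
    rows,
    fast_udf=lambda x: x + 1,          # valid only for the normal case (int)
    general_udf=lambda x: int(x) + 1,  # slower, but accepts str and float too
)
print(out)    # [2, 3, 4, 5, 6]
print(stats)  # {'fast': 3, 'general': 2}
```

In this sketch the fast path is just another Python function; the point is only that the common case can be handled by code that makes stronger assumptions than the general case.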

You can join the discussion on Tuplex on our Gitter community or read up more on the background of Tuplex in our SIGMOD'21 paper.

Contributions welcome!


Installation

To install Tuplex, use the PyPI package on Linux or the Docker container on MacOS; the container launches a Jupyter notebook with Tuplex preinstalled.

Docker

docker run -p 8888:8888 tuplex/tuplex

PyPI

pip install tuplex
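After installing, a quick way to confirm that the package is visible to the interpreter you intend to use is to look it up without importing it. The helper below is a small sketch that uses only the Python standard library; `is_installed` is a hypothetical name, not part of Tuplex.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("tuplex available:", is_installed("tuplex"))
```

If this reports False, the install most likely went into a different Python environment than the one running the check.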

Building

Tuplex is available for MacOS and Linux. The current version has been tested under MacOS 10.13-10.15 and Ubuntu 18.04 and 20.04 LTS. To install Tuplex, simply install the dependencies first and then build the package.

MacOS build from source

To build Tuplex, first install its dependencies, which are all available via brew:

brew install llvm@9 boost boost-python3 aws-sdk-cpp pcre2 antlr4-cpp-runtime googletest gflags yaml-cpp celero
python3 -m pip install cloudpickle numpy
python3 setup.py install

Ubuntu build from source

To facilitate installing the dependencies on Ubuntu, we provide two scripts (scripts/ubuntu1804/install_reqs.sh for Ubuntu 18.04, and scripts/ubuntu2004/install_reqs.sh for Ubuntu 20.04). To build an up-to-date version of Tuplex on Ubuntu 18.04, for example, run:

./scripts/ubuntu1804/install_reqs.sh
python3 -m pip install cloudpickle numpy
python3 setup.py install

Customizing the build

Besides building a pip package, you can also invoke cmake directly. To compile the package via cmake, run:

mkdir build
cd build
cmake ..
make -j$(nproc)

The Tuplex Python package can then be found in build/dist/python, and the googletest-based C++ test executables in build/dist/bin.

To customize the cmake build, the following options are available to be passed via -D<option>=<value>:

option | values | description
CMAKE_BUILD_TYPE | Release (default), Debug, RelWithDebInfo, tsan, asan, ubsan | select compile mode. Tsan/Asan/Ubsan correspond to Google Sanitizers.
BUILD_WITH_AWS | ON (default), OFF | build with AWS SDK or not. On Ubuntu this will build the Lambda executor.
GENERATE_PDFS | ON, OFF (default) | output in Debug mode PDF files if graphviz is installed (e.g., brew install graphviz) for ASTs of UDFs, query plans, ...
PYTHON3_VERSION | 3.6, ... | when trying to select a python3 version to build against, use this by specifying major.minor. To specify the python executable, use the options provided by cmake.
LLVM_ROOT_DIR | e.g. /usr/lib/llvm-9 | specify which LLVM version to use
BOOST_DIR | e.g. /opt/boost | specify which Boost version to use. Note that the python component of boost has to be built against the python version used to build Tuplex

For example, to create a debug build which outputs PDFs use the following snippet:

cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
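When several options are combined, for instance in a build script, it can help to assemble the -D<option>=<value> flags programmatically. The following is a minimal sketch; `cmake_command` is a hypothetical helper, not something shipped with Tuplex, and it only formats arguments without running cmake.

```python
def cmake_command(options, source_dir=".."):
    """Build a cmake argument list from a dict of option names to values."""
    args = ["cmake"]
    for name, value in options.items():
        # Map Python booleans onto cmake's ON/OFF convention.
        if isinstance(value, bool):
            value = "ON" if value else "OFF"
        args.append(f"-D{name}={value}")
    args.append(source_dir)
    return args


cmd = cmake_command({"CMAKE_BUILD_TYPE": "Debug", "GENERATE_PDFS": True})
print(" ".join(cmd))  # cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
```

The resulting list can be passed to `subprocess.run(cmd, cwd="build", check=True)` from the repository root once the build directory exists.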

Example

Tuplex can be used in Python interactive mode, in a Jupyter notebook, or by copying the code below to a file. To try it out, run the following example:

from tuplex import *
c = Context()
res = c.parallelize([1, 2, None, 4]).map(lambda x: (x, x * x)).collect()
# this prints [(1, 1), (2, 4), (4, 16)]
print(res)
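Note that the printed result has three tuples although the input has four elements: the lambda cannot be applied to None, so that row does not appear in the output. The plain-Python sketch below reproduces this behavior without Tuplex, assuming that rows whose UDF call raises are simply set aside; `reference_collect` is a hypothetical helper used only for illustration.

```python
def reference_collect(rows, udf):
    """Apply `udf` to each row; keep successful results, set failing rows aside."""
    kept, dropped = [], []
    for row in rows:
        try:
            kept.append(udf(row))
        except Exception as exc:
            # Record the offending row and the exception type instead of aborting.
            dropped.append((row, type(exc).__name__))
    return kept, dropped


kept, dropped = reference_collect([1, 2, None, 4], lambda x: (x, x * x))
print(kept)     # [(1, 1), (2, 4), (4, 16)]
print(dropped)  # [(None, 'TypeError')]
```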

More examples can be found here.

License

Tuplex is available under the Apache 2.0 License. To cite the paper, use:

@inproceedings{10.1145/3448016.3457244,
author = {Spiegelberg, Leonhard and Yesantharao, Rahul and Schwarzkopf, Malte and Kraska, Tim},
title = {Tuplex: Data Science in Python at Native Code Speed},
year = {2021},
isbn = {9781450383431},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3448016.3457244},
doi = {10.1145/3448016.3457244},
booktitle = {Proceedings of the 2021 International Conference on Management of Data},
pages = {1718–1731},
numpages = {14},
location = {Virtual Event, China},
series = {SIGMOD/PODS '21}
}

(c) 2017-2021 Tuplex contributors
