Tuplex is a novel big data analytics framework incorporating a Python UDF compiler based on LLVM together with a query compiler featuring whole-stage code generation and optimization.

Tuplex: Blazing Fast Python Data Science

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex offers Python APIs similar to those of Apache Spark and Dask, but instead of invoking the Python interpreter, it generates optimized LLVM bytecode for the given pipeline and input data set. Under the hood, Tuplex relies on data-driven compilation and dual-mode processing, two key techniques that let it deliver speed comparable to a pipeline written in hand-optimized C++.
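
As a rough intuition for data-driven compilation and dual-mode processing, the toy sketch below samples the input to guess the common-case type, sends matching rows down a "fast" path, and routes everything else through a general fallback. It is plain Python and not Tuplex's implementation: the names `infer_common_type` and `run_dual_mode` are hypothetical, and both paths call the same Python function here, whereas Tuplex generates LLVM code for the pipeline.

```python
from collections import Counter


def infer_common_type(sample):
    """Return the most frequent Python type among the sampled values."""
    return Counter(type(v) for v in sample).most_common(1)[0][0]


def run_dual_mode(rows, udf, sample_size=3):
    """Toy dual-mode execution (hypothetical helper, not the Tuplex API)."""
    common = infer_common_type(rows[:sample_size])

    # Split rows into the common case and everything else, remembering positions.
    fast, slow = [], []
    for index, row in enumerate(rows):
        (fast if type(row) is common else slow).append((index, row))

    out = {}
    # Fast path: stands in for code specialized to the common-case type.
    for index, row in fast:
        out[index] = udf(row)
    # Slow path: general fallback; rows that still fail are dropped.
    for index, row in slow:
        try:
            out[index] = udf(row)
        except Exception:
            pass

    # Restore the original row order.
    return [out[index] for index in sorted(out)]


print(run_dual_mode([1, 2, None, 4, "ab"], lambda x: x * 2))
# [2, 4, 8, 'abab']
```

In this sketch the sample `[1, 2, None]` makes `int` the common case, so the string row still produces a result through the fallback, while the `None` row fails there and is left out.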

You can join the discussion on Tuplex in our Gitter community or read more about the background of Tuplex in our SIGMOD'21 paper.

Contributions welcome!

Example

Tuplex can be used in Python interactive mode, in a Jupyter notebook, or by copying the code below to a file. To try it out, run the following example:

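from tuplex import *
c = Context()
# square each input value; x * x raises a TypeError for the None row
res = c.parallelize([1, 2, None, 4]).map(lambda x: (x, x * x)).collect()
# this prints [(1, 1), (2, 4), (4, 16)]: the failing None row is left out of the result
print(res)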
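
The printed result above omits the `None` row, for which `x * x` raises a `TypeError`. The plain-Python sketch below mimics only that observable behavior so it can be run without Tuplex installed. The helper `map_dropping_failures` is hypothetical and says nothing about how Tuplex handles such rows internally.

```python
def map_dropping_failures(rows, udf):
    """Apply udf to every row; keep successful results, set failing rows aside."""
    results, failed = [], []
    for row in rows:
        try:
            results.append(udf(row))
        except Exception as exc:
            # Record the offending row and the exception type for later inspection.
            failed.append((row, type(exc).__name__))
    return results, failed


results, failed = map_dropping_failures([1, 2, None, 4], lambda x: (x, x * x))
print(results)  # [(1, 1), (2, 4), (4, 16)]
print(failed)   # [(None, 'TypeError')]
```

Keeping the failed rows in a separate list, rather than discarding them silently, makes it easy to check afterwards which inputs did not fit the function.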

Quickstart

To try out Tuplex, open one of the following starter notebooks in Google Colab:

| Name | Link | Description |
| --- | --- | --- |
| (01) Intro to Tuplex | Google Colab | Basic commands to manipulate columns and modify data with user code. |
| (02) Working with Files | Google Colab | Loading and saving files, detecting types. |

More examples can be found here.

Installation

To install Tuplex, you can use a PyPI package for Linux or macOS (Intel), or a Docker container which will launch a Jupyter notebook with Tuplex preinstalled.

Docker

docker run -p 8888:8888 tuplex/tuplex

PyPI

pip install tuplex
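
The 0.3.4 release provides prebuilt wheels for CPython 3.7 to 3.9 on x86-64 Linux (glibc 2.17+) and x86-64 macOS (10.13+). A small pre-install check along the following lines can tell whether `pip install tuplex` is likely to find a matching wheel. The helper `wheel_available` is a hypothetical name, and the check is deliberately coarse: it does not inspect the glibc or macOS version.

```python
import platform
import sys

# Targets of the 0.3.4 wheels: CPython 3.7-3.9 on x86-64 Linux or macOS.
WHEEL_PYTHONS = {(3, 7), (3, 8), (3, 9)}
WHEEL_SYSTEMS = {"Linux", "Darwin"}
WHEEL_MACHINE = "x86_64"


def wheel_available(py_version, system, machine):
    """Return True if a prebuilt 0.3.4 wheel plausibly matches (hypothetical helper)."""
    return (
        tuple(py_version[:2]) in WHEEL_PYTHONS
        and system in WHEEL_SYSTEMS
        and machine == WHEEL_MACHINE
    )


ok = wheel_available(sys.version_info, platform.system(), platform.machine())
print("prebuilt wheel available" if ok else "no prebuilt wheel: use Docker or build from source")
```

If the check fails, the Docker image or a build from source are the remaining options.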

Building

Tuplex is available for macOS and Linux. The current version has been tested under macOS 10.13-12.0 and Ubuntu 18.04/20.04/22.04 LTS. To build Tuplex from source, install the dependencies first and then build the package.

MacOS build from source

To build Tuplex, you first need several other packages, which can be easily installed via brew. If you want to build Tuplex with AWS support, you need macOS 10.13+.

brew install llvm@9 boost boost-python3 aws-sdk-cpp pcre2 antlr4-cpp-runtime googletest gflags yaml-cpp celero protobuf libmagic
python3 -m pip install 'cloudpickle<2.0' numpy
python3 setup.py install --user

Ubuntu build from source

To facilitate installing the dependencies on Ubuntu, we provide two scripts (scripts/ubuntu1804/install_reqs.sh for Ubuntu 18.04, or scripts/ubuntu2004/install_reqs.sh for Ubuntu 20.04). To create an up-to-date version of Tuplex on Ubuntu 18.04, simply run the following (on Ubuntu 20.04, use the ubuntu2004 script instead):

./scripts/ubuntu1804/install_reqs.sh
python3 -m pip install 'cloudpickle<2.0' numpy
python3 setup.py install --user

Customizing the build

Besides building a pip package, it may be more useful, especially for development, to invoke cmake directly. To create a development version of Tuplex and work with it like a regular cmake project, go to the tuplex folder and compile the package with the standard cmake workflow (rather than the top-level setup.py file):

mkdir build
cd build
cmake ..
# nproc is part of GNU coreutils; on macOS, use $(sysctl -n hw.ncpu) instead
make -j$(nproc)

The Python package corresponding to Tuplex can then be found in build/dist/python, with C++ test executables based on googletest in build/dist/bin. If you'd like to use a cmake-compatible IDE like CLion or VSCode, you can simply open the tuplex/ folder and import the CMakeLists.txt contained there.

To customize the cmake build, the following options are available to be passed via -D<option>=<value>:

| option | values | description |
| --- | --- | --- |
| CMAKE_BUILD_TYPE | Release (default), Debug, RelWithDebInfo, tsan, asan, ubsan | select compile mode. Tsan/Asan/Ubsan correspond to Google Sanitizers. |
| BUILD_WITH_AWS | ON (default), OFF | build with AWS SDK or not. On Ubuntu this will build the Lambda executor. |
| BUILD_WITH_ORC | ON, OFF (default) | build with ORC file format support. |
| BUILD_NATIVE | ON, OFF (default) | build with -march=native to target platform architecture. |
| SKIP_AWS_TESTS | ON (default), OFF | skip AWS tests, helpful when no AWS credentials/AWS Tuplex chain is set up. |
| GENERATE_PDFS | ON, OFF (default) | in Debug mode, output PDF files for ASTs of UDFs, query plans, ... if graphviz is installed (e.g., brew install graphviz). |
| PYTHON3_VERSION | 3.6, ... | when trying to select a python3 version to build against, use this by specifying major.minor. To specify the python executable, use the options provided by cmake. |
| LLVM_ROOT_DIR | e.g. /usr/lib/llvm-9 | specify which LLVM version to use. |
| BOOST_DIR | e.g. /opt/boost | specify which Boost version to use. Note that the python component of boost has to be built against the python version used to build Tuplex. |

For example, to create a debug build that outputs PDFs, use the following snippet:

cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
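
When several options are combined, or the build is driven from a script, it can help to assemble the `-D<option>=<value>` flags programmatically. The following is a minimal sketch: `cmake_command` is a hypothetical helper, not part of Tuplex, and it assumes it is invoked from inside the build directory so that `..` points at the source folder.

```python
import shlex  # shlex.join requires Python 3.8+


def cmake_command(options, source_dir=".."):
    """Build the cmake argument list from an {option: value} mapping."""
    flags = [f"-D{name}={value}" for name, value in options.items()]
    return ["cmake", *flags, source_dir]


cmd = cmake_command({"CMAKE_BUILD_TYPE": "Debug", "GENERATE_PDFS": "ON"})
print(shlex.join(cmd))  # cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to run the configuration step without going through a shell.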

License

Tuplex is available under the Apache 2.0 License. To cite the paper, use:

@inproceedings{10.1145/3448016.3457244,
author = {Spiegelberg, Leonhard and Yesantharao, Rahul and Schwarzkopf, Malte and Kraska, Tim},
title = {Tuplex: Data Science in Python at Native Code Speed},
year = {2021},
isbn = {9781450383431},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3448016.3457244},
doi = {10.1145/3448016.3457244},
booktitle = {Proceedings of the 2021 International Conference on Management of Data},
pages = {1718–1731},
numpages = {14},
location = {Virtual Event, China},
series = {SIGMOD/PODS '21}
}

(c) 2017-2022 Tuplex contributors
