
Tuplex is a novel big data analytics framework incorporating a Python UDF compiler based on LLVM together with a query compiler featuring whole-stage code generation and optimization.

Project description

Tuplex: Blazing Fast Python Data Science


Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex offers Python APIs similar to those of Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set. Under the hood, Tuplex is based on data-driven compilation and dual-mode processing, two key techniques that allow Tuplex to provide speed comparable to a pipeline written in hand-optimized C++.
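
To make the idea of dual-mode processing more concrete, here is a toy sketch in plain Python. It is only an illustration of the general idea as we read it (specialize for the common case seen in a data sample, fall back to a slower general path for everything else), not Tuplex's actual implementation. All names in it (infer_common_type, run_dual_mode, square_int, square_any) are hypothetical.

```python
from collections import Counter


def infer_common_type(sample):
    """Return the most frequent Python type in a sample of rows."""
    return Counter(type(row) for row in sample).most_common(1)[0][0]


def run_dual_mode(rows, fast_path, general_path, sample_size=3):
    """Send rows of the common type to fast_path, all others to general_path."""
    common_type = infer_common_type(rows[:sample_size])
    results = []
    fallback_count = 0
    for row in rows:
        if type(row) is common_type:
            results.append(fast_path(row))
        else:
            fallback_count += 1
            results.append(general_path(row))
    return results, fallback_count


def square_int(x):
    # Stands in for code specialized to the common case (here: int).
    return x * x


def square_any(x):
    # Stands in for a slower, more general path for unexpected rows.
    return float(x) ** 2


results, fallbacks = run_dual_mode([1, 2, 3, "4", 5], square_int, square_any)
print(results)    # [1, 4, 9, 16.0, 25]
print(fallbacks)  # 1
```

In this sketch the sample consists of the first three rows, so int becomes the common case and only the string row takes the general path.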

Join the discussion in our Gitter community, or read more about the background of Tuplex in our SIGMOD'21 paper.

Contributions welcome!

Example

Tuplex can be used in Python interactive mode, in a Jupyter notebook, or by copying the code below into a file. To try it out, run the following example:

from tuplex import *
c = Context()
res = c.parallelize([1, 2, None, 4]).map(lambda x: (x, x * x)).collect()
# the None row is missing from the result (None * None raises a TypeError in the lambda),
# so this prints [(1, 1), (2, 4), (4, 16)]
print(res)
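
The result above omits the row for None. As a rough mental model, the observable behavior can be emulated in plain Python as shown below. This is a sketch that assumes failing rows are simply set aside; the helper map_dropping_errors is hypothetical and is not part of the Tuplex API.

```python
def map_dropping_errors(rows, udf):
    """Apply udf to each row, setting aside rows on which it raises."""
    results = []
    failed = []
    for row in rows:
        try:
            results.append(udf(row))
        except Exception as exc:
            # Remember the offending row and the exception type.
            failed.append((row, type(exc).__name__))
    return results, failed


res, failed = map_dropping_errors([1, 2, None, 4], lambda x: (x, x * x))
print(res)     # [(1, 1), (2, 4), (4, 16)]
print(failed)  # [(None, 'TypeError')]
```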

Quickstart

To try out Tuplex, open one of the following starter notebooks in Google Colab:

| Name | Link | Description |
|---|---|---|
| 1. Intro to Tuplex | Google Colab | Basic commands to manipulate columns and modify data with user code. |
| 2. Working with Files | Google Colab | Loading and saving files, detecting types. |

More examples can be found here.

Installation

To install Tuplex, you can use the PyPI package for Linux or macOS (Intel), or a Docker container that launches a Jupyter notebook with Tuplex preinstalled.

Docker

docker run -p 8888:8888 tuplex/tuplex:v0.3.5

PyPI

pip install tuplex
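
Before running pip, it can help to check whether a prebuilt wheel is likely to match the local interpreter. The sketch below assumes the wheel set published for Tuplex 0.3.5 (CPython 3.7 to 3.9 on x86-64 Linux and macOS). It deliberately ignores the minimum glibc and macOS versions, and the helper has_prebuilt_wheel is hypothetical.

```python
import platform
import sys

# Assumption: taken from the wheels published for Tuplex 0.3.5.
WHEEL_PYTHONS = {(3, 7), (3, 8), (3, 9)}
WHEEL_SYSTEMS = {"Linux", "Darwin"}
WHEEL_MACHINES = {"x86_64", "amd64"}


def has_prebuilt_wheel(version, system, machine):
    """Return True if a 0.3.5 wheel plausibly exists for this platform."""
    return (
        tuple(version[:2]) in WHEEL_PYTHONS
        and system in WHEEL_SYSTEMS
        and machine.lower() in WHEEL_MACHINES
    )


if __name__ == "__main__":
    ok = has_prebuilt_wheel(sys.version_info, platform.system(), platform.machine())
    print("prebuilt wheel likely available" if ok else "consider Docker or a source build")
```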

Building

Tuplex is available for macOS and Linux. The current version has been tested under macOS 10.13-12.0 and Ubuntu 18.04/20.04/22.04 LTS. To build Tuplex from source, install the dependencies first and then build the package.

macOS build from source

To build Tuplex, you first need several other packages, which can be installed via Homebrew (brew). Building Tuplex with AWS support requires macOS 10.13+. Python 3.9 or earlier requires an older cloudpickle version (1.6.0), whereas Python 3.10+ requires cloudpickle 2.1.0+.

brew install llvm@9 boost boost-python3 aws-sdk-cpp pcre2 antlr4-cpp-runtime googletest gflags yaml-cpp celero protobuf libmagic
# for Python 3.9 or earlier; on Python 3.10+ install 'cloudpickle>=2.1.0' instead
python3 -m pip install 'cloudpickle<2.0' numpy
python3 setup.py install --user
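
The cloudpickle rule described above can be written down as a small helper that picks the pip requirement for a given interpreter. This is a sketch only: the function name cloudpickle_requirement is hypothetical, and the version specifiers mirror the constraints stated in this section.

```python
import sys


def cloudpickle_requirement(version_info=sys.version_info):
    """Return the pip requirement string for the given Python version."""
    if tuple(version_info[:2]) >= (3, 10):
        # Python 3.10+ requires cloudpickle 2.1.0 or newer.
        return "cloudpickle>=2.1.0"
    # Python 3.9 or earlier requires an older cloudpickle (e.g. 1.6.0).
    return "cloudpickle<2.0"


print(cloudpickle_requirement((3, 9)))   # cloudpickle<2.0
print(cloudpickle_requirement((3, 10)))  # cloudpickle>=2.1.0
```

The returned string can be passed to python3 -m pip install in place of the hard-coded requirement.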

Ubuntu build from source

To facilitate installing the dependencies on Ubuntu, we provide two scripts: scripts/ubuntu1804/install_reqs.sh for Ubuntu 18.04 and scripts/ubuntu2004/install_reqs.sh for Ubuntu 20.04. To build an up-to-date version of Tuplex, run:

./scripts/ubuntu1804/install_reqs.sh
python3 -m pip install 'cloudpickle<2.0' numpy
python3 setup.py install --user

Customizing the build

Besides building a pip package, it may be more useful, especially for development, to invoke cmake directly. To create a development version of Tuplex and work with it like a regular cmake project, go to the folder tuplex and use the standard cmake workflow to compile the package (rather than the top-level setup.py file):

mkdir build
cd build
cmake ..
make -j$(nproc)

The Python package corresponding to Tuplex can then be found in build/dist/python, with C++ test executables based on googletest in build/dist/bin. If you'd like to use a cmake-compatible IDE like CLion or VSCode, you can simply open the tuplex/ folder and import the CMakeLists.txt contained there.

To customize the cmake build, the following options are available to be passed via -D<option>=<value>:

| option | values | description |
|---|---|---|
| CMAKE_BUILD_TYPE | Release (default), Debug, RelWithDebInfo, tsan, asan, ubsan | Select the compile mode. tsan/asan/ubsan correspond to the Google sanitizers. |
| BUILD_WITH_AWS | ON (default), OFF | Build with the AWS SDK or not. On Ubuntu this will build the Lambda executor. |
| BUILD_WITH_ORC | ON, OFF (default) | Build with ORC file format support. |
| BUILD_NATIVE | ON, OFF (default) | Build with -march=native to target the platform architecture. |
| SKIP_AWS_TESTS | ON (default), OFF | Skip AWS tests; helpful when no AWS credentials/AWS Tuplex chain is set up. |
| GENERATE_PDFS | ON, OFF (default) | In Debug mode, output PDF files for ASTs of UDFs, query plans, ... if graphviz is installed (e.g., brew install graphviz). |
| PYTHON3_VERSION | 3.7, ... | Select the python3 version to build against by specifying major.minor. To specify the python executable, use the options provided by cmake. |
| LLVM_ROOT_DIR | e.g. /usr/lib/llvm-9 | Specify which LLVM version to use. |
| BOOST_DIR | e.g. /opt/boost | Specify which Boost version to use. Note that the python component of Boost has to be built against the python version used to build Tuplex. |

For example, to create a debug build which outputs PDFs use the following snippet:

cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
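
If you script your builds, the -D<option>=<value> convention is easy to generate from a dictionary. The following is a minimal sketch; the helper cmake_command is hypothetical and only assembles the argument list, it does not run cmake.

```python
def cmake_command(options, source_dir=".."):
    """Build a cmake argument list from a dict of option names to values."""
    flags = []
    for name, value in options.items():
        # cmake expects ON/OFF for boolean options.
        if isinstance(value, bool):
            value = "ON" if value else "OFF"
        flags.append(f"-D{name}={value}")
    return ["cmake", *flags, source_dir]


cmd = cmake_command({"CMAKE_BUILD_TYPE": "Debug", "GENERATE_PDFS": True})
print(" ".join(cmd))  # cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
```

The resulting list can be handed to subprocess.run(cmd, check=True) from inside the build directory.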

License

Tuplex is available under the Apache 2.0 License. To cite the paper, use:

@inproceedings{10.1145/3448016.3457244,
author = {Spiegelberg, Leonhard and Yesantharao, Rahul and Schwarzkopf, Malte and Kraska, Tim},
title = {Tuplex: Data Science in Python at Native Code Speed},
year = {2021},
isbn = {9781450383431},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3448016.3457244},
doi = {10.1145/3448016.3457244},
booktitle = {Proceedings of the 2021 International Conference on Management of Data},
pages = {1718–1731},
numpages = {14},
location = {Virtual Event, China},
series = {SIGMOD/PODS '21}
}

(c) 2017-2022 Tuplex contributors

Download files

Source distributions

No source distribution files are available for this release.

Built distributions

| File | Size | Python | Platform |
|---|---|---|---|
| tuplex-0.3.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 29.1 MB | CPython 3.9 | manylinux: glibc 2.17+ x86-64 |
| tuplex-0.3.5-cp39-cp39-macosx_12_0_x86_64.whl | 17.3 MB | CPython 3.9 | macOS 12.0+ x86-64 |
| tuplex-0.3.5-cp39-cp39-macosx_11_0_x86_64.whl | 17.2 MB | CPython 3.9 | macOS 11.0+ x86-64 |
| tuplex-0.3.5-cp39-cp39-macosx_10_13_x86_64.whl | 17.2 MB | CPython 3.9 | macOS 10.13+ x86-64 |
| tuplex-0.3.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 29.1 MB | CPython 3.8 | manylinux: glibc 2.17+ x86-64 |
| tuplex-0.3.5-cp38-cp38-macosx_10_13_x86_64.whl | 17.3 MB | CPython 3.8 | macOS 10.13+ x86-64 |
| tuplex-0.3.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 29.1 MB | CPython 3.7m | manylinux: glibc 2.17+ x86-64 |
| tuplex-0.3.5-cp37-cp37m-macosx_10_13_x86_64.whl | 17.3 MB | CPython 3.7m | macOS 10.13+ x86-64 |
