Tuplex is a novel big data analytics framework incorporating a Python UDF compiler based on LLVM together with a query compiler featuring whole-stage code generation and optimization.
Tuplex: Blazing Fast Python Data Science
Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set. Under the hood, Tuplex is based on data-driven compilation and dual-mode processing, two key techniques that make it possible for Tuplex to provide speed comparable to a pipeline written in hand-optimized C++.
You can join the discussion on Tuplex on our Gitter community or read up more on the background of Tuplex in our SIGMOD'21 paper.
Contributions welcome!
Example
Tuplex can be used in Python interactive mode, in a Jupyter notebook, or by copying the code below to a file. To try it out, run the following example:
from tuplex import *
c = Context()
res = c.parallelize([1, 2, None, 4]).map(lambda x: (x, x * x)).collect()
# this prints [(1, 1), (2, 4), (4, 16)]
print(res)
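The input element None does not appear in the printed result. In plain Python, x * x raises a TypeError for None, and the output above suggests that rows whose UDF raises are kept out of the normal result. The following is a minimal plain-Python sketch of that behavior. It does not use Tuplex, the helper name map_dropping_errors is hypothetical, and it is not how Tuplex is implemented:

```python
def map_dropping_errors(rows, udf):
    """Apply udf to each row; set aside rows for which the udf raises."""
    results, failed = [], []
    for row in rows:
        try:
            results.append(udf(row))
        except Exception as exc:
            # Record the offending row and the exception type instead of aborting.
            failed.append((row, type(exc).__name__))
    return results, failed


results, failed = map_dropping_errors([1, 2, None, 4], lambda x: (x, x * x))
print(results)  # [(1, 1), (2, 4), (4, 16)]
print(failed)   # [(None, 'TypeError')]
```

Keeping the failed rows in a separate list, rather than discarding them silently, makes it easy to inspect which inputs did not fit the common case.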
Quickstart
To try out Tuplex, open the following starter notebooks in Google Colab:
Name | Link | Description |
---|---|---|
1. Intro to Tuplex | Google Colab | Basic commands to manipulate columns and modify data with user code. |
2. Working with Files | Google Colab | Loading and saving files, detecting types. |
More examples can be found here.
Installation
To install Tuplex, you can use a PyPI package for Linux or macOS (Intel), or a Docker container which will launch a Jupyter notebook with Tuplex preinstalled.
Docker
docker run -p 8888:8888 tuplex/tuplex:v0.3.5
PyPI
pip install tuplex
Building
Tuplex is available for macOS and Linux. The current version has been tested under macOS 10.13-12.0 and Ubuntu 18.04/20.04/22.04 LTS. To build Tuplex from source, install the dependencies first and then build the package.
MacOS build from source
To build Tuplex, you first need several other packages, which can be installed via brew. If you want to build Tuplex with AWS support, you need macOS 10.13+. Python 3.9 or earlier requires an older cloudpickle version (1.6.0), whereas Python 3.10+ requires cloudpickle 2.1.0+.
brew install llvm@9 boost boost-python3 aws-sdk-cpp pcre2 antlr4-cpp-runtime googletest gflags yaml-cpp celero protobuf libmagic
python3 -m pip install 'cloudpickle<2.0' numpy
python3 setup.py install --user
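The cloudpickle requirement depends on the Python version, and the command above pins the older release. As a rough sketch, the choice could be scripted as follows. The function name cloudpickle_requirement is hypothetical, and the requirement strings are assumptions derived from the version note above:

```python
import sys


def cloudpickle_requirement(version_info=sys.version_info):
    """Return a pip requirement string for cloudpickle for the given Python version."""
    if tuple(version_info[:2]) >= (3, 10):
        # Python 3.10+ needs cloudpickle 2.1.0 or newer.
        return "cloudpickle>=2.1.0"
    # Python 3.9 or earlier needs an older cloudpickle release.
    return "cloudpickle<2.0"


# Prints the requirement for the interpreter running this script.
print(cloudpickle_requirement())
```

The printed string can be passed to python3 -m pip install in place of the hard-coded 'cloudpickle<2.0'.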
Ubuntu build from source
To facilitate installing the dependencies on Ubuntu, we provide two scripts (scripts/ubuntu1804/install_reqs.sh for Ubuntu 18.04, or scripts/ubuntu2004/install_reqs.sh for Ubuntu 20.04). To create an up-to-date version of Tuplex, simply run
./scripts/ubuntu1804/install_reqs.sh
python3 -m pip install 'cloudpickle<2.0' numpy
python3 setup.py install --user
Customizing the build
Besides building a pip package, it may be more useful, especially for development, to invoke cmake directly. To create a development version of Tuplex and work with it like a regular cmake project, go to the folder tuplex and then use the standard workflow to compile the package via cmake (and not the top-level setup.py file):
mkdir build
cd build
cmake ..
make -j$(nproc)
The Python package corresponding to Tuplex can then be found in build/dist/python, with C++ test executables based on googletest in build/dist/bin. If you'd like to use a cmake-compatible IDE like CLion or VSCode, you can simply open the tuplex/ folder and import the CMakeLists.txt contained there.
To customize the cmake build, the following options can be passed via -D<option>=<value>:
option | values | description |
---|---|---|
CMAKE_BUILD_TYPE | Release (default), Debug, RelWithDebInfo, tsan, asan, ubsan | Select the compile mode. tsan/asan/ubsan correspond to Google Sanitizers. |
BUILD_WITH_AWS | ON (default), OFF | Build with the AWS SDK or not. On Ubuntu this will build the Lambda executor. |
BUILD_WITH_ORC | ON, OFF (default) | Build with ORC file format support. |
BUILD_NATIVE | ON, OFF (default) | Build with -march=native to target the platform architecture. |
SKIP_AWS_TESTS | ON (default), OFF | Skip AWS tests; helpful when no AWS credentials/AWS Tuplex chain is set up. |
GENERATE_PDFS | ON, OFF (default) | In Debug mode, output PDF files for ASTs of UDFs, query plans, ... if graphviz is installed (e.g., brew install graphviz). |
PYTHON3_VERSION | 3.7, ... | Select the python3 version to build against by specifying major.minor. To specify the python executable, use the options provided by cmake. |
LLVM_ROOT_DIR | e.g. /usr/lib/llvm-9 | Specify which LLVM version to use. |
BOOST_DIR | e.g. /opt/boost | Specify which Boost version to use. Note that the python component of Boost has to be built against the python version used to build Tuplex. |
For example, to create a debug build which outputs PDFs use the following snippet:
cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
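When several options are combined, it can help to assemble the command line from a mapping of option names to values. The sketch below is illustrative only: cmake_command is a hypothetical helper, not part of Tuplex, and it only builds the command string without running cmake.

```python
import shlex


def cmake_command(options, source_dir=".."):
    """Build a cmake command line that passes each option as -D<option>=<value>."""
    args = ["cmake"]
    args += [f"-D{name}={value}" for name, value in options.items()]
    args.append(source_dir)
    # Quote each argument so values with spaces or shell metacharacters stay intact.
    return " ".join(shlex.quote(arg) for arg in args)


command = cmake_command({"CMAKE_BUILD_TYPE": "Debug", "GENERATE_PDFS": "ON"})
print(command)  # cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
```

The resulting string matches the debug snippet above and can be run from the build directory.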
License
Tuplex is available under the Apache 2.0 License. To cite the paper, use:
@inproceedings{10.1145/3448016.3457244,
author = {Spiegelberg, Leonhard and Yesantharao, Rahul and Schwarzkopf, Malte and Kraska, Tim},
title = {Tuplex: Data Science in Python at Native Code Speed},
year = {2021},
isbn = {9781450383431},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3448016.3457244},
doi = {10.1145/3448016.3457244},
booktitle = {Proceedings of the 2021 International Conference on Management of Data},
pages = {1718–1731},
numpages = {14},
location = {Virtual Event, China},
series = {SIGMOD/PODS '21}
}
(c) 2017-2022 Tuplex contributors