Skip to main content

Python client for the Impala distributed query engine

Project description

# impyla

Python client for the Impala distributed query engine.

### Features

Fully implemented:

* Lightweight, `pip`-installable package for connecting to Impala databases

* Fully [DB API 2.0 (PEP 249)][pep249]-compliant Python client (similar to
sqlite or MySQL clients) supporting Python 2 and Python 3.

* Runs on HiveServer2 and Beeswax; runs with Kerberos

* Converter to [pandas][pandas] `DataFrame`, allowing easy integration into the
Python data stack (including [scikit-learn][sklearn] and

In various phases of maturity:

* SQLAlchemy connector; integration with Blaze

* `BigDataFrame` abstraction for performing `pandas`-style analytics on large
datasets (similar to Spark's RDD abstraction); computation is pushed into the
Impala engine.

* `scikit-learn`-flavored wrapper for [MADlib][madlib]-style prediction,
allowing for large-scale, distributed machine learning (see
[the Impala port of MADlib][madlibport])

* Compiling UDFs written in Python into low-level machine code for execution by
Impala (powered by [Numba][numba]/[LLVM][llvm])

### Dependencies

Required for DB API connectivity:

* `python2.6` or `python2.7`

* `six`

* `thrift>=0.8` (Python package only; no need for code-gen) for Python 2, or
`thriftpy` for Python 3

Required for UDFs:

* `numba<=0.13.4` (which has a few requirements, like LLVM)

* `boost` (because `udf.h` depends on `boost/cstdint.hpp`)

Required for SQLAlchemy integration (and Blaze):

* `sqlalchemy`

Required for `BigDataFrame`:

* `pandas`

Required for Kerberos support:

* `python-sasl` (for Python 3 support, requires laserson/python-sasl@cython)

Required for utilizing automated shipping/registering of code/UDFs/BDFs/etc:

* `hdfs[kerberos]` (a Python client that wraps WebHDFS; kerberos is optional)

For manipulating results as pandas `DataFrame`s, we recommend installing pandas

Generally, we recommend installing all the libraries above; the UDF libraries
will be the most difficult, and are not required if you will not use any Python
UDFs. Interacting with Impala using the `ImpalaContext` will simplify shipping
data and will perform cleanup on temporary data/tables.

This project is installed with `setuptools`.

### Installation

Install the latest release (`0.10.0`) with `pip`:

pip install impyla

For the latest (dev) version, clone the repo:

git clone
cd impyla
make # optional: only for Numba-compiled UDFs; requires LLVM/clang
python install

#### Running the tests

impyla uses the [pytest][pytest] toolchain, and depends on the following environment

# beeswax might work here too
export IMPALA_PORT=21050
export IMPALA_PROTOCOL=hiveserver2
# needed to push data to the cluster
export WEBHDFS_PORT=50070

To run the maximal set of tests, run

py.test --dbapi-compliance path/to/impyla/impala/tests

Leave out the `--dbapi-compliance` option to skip tests for DB API compliance.
Add a `--udf` option to only run local UDF compilation tests.

### Quickstart

Impyla implements the [Python DB API v2.0 (PEP 249)][pep249] database interface
(refer to it for API details):

from impala.dbapi import connect
conn = connect(host='', port=21050)
cursor = conn.cursor()
cursor.execute('SELECT * FROM mytable LIMIT 100')
print cursor.description # prints the result set's schema
results = cursor.fetchall()

**Note**: if connecting to Impala through the *HiveServer2* service, make sure
to set the port to the HiveServer2 port (defaults to 21050 in CM), not Beeswax
(defaults to 21000) which is what the Impala shell uses.

The `Cursor` object also exposes the iterator interface, which is buffered
(controlled by `cursor.arraysize`):

cursor.execute('SELECT * FROM mytable LIMIT 100')
for row in cursor:

You can also get back a pandas DataFrame object

from impala.util import as_pandas
df = as_pandas(cur)
# carry df through scikit-learn, for example


Project details

Release history Release notifications

History Node


History Node


History Node


History Node


History Node


History Node


History Node


History Node


History Node


History Node


History Node


History Node


History Node


History Node


History Node


This version
History Node


History Node


History Node


History Node


History Node


History Node


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Filename, size & hash SHA256 hash help File type Python version Upload date
impyla-0.10.0.tar.gz (173.2 kB) Copy SHA256 hash SHA256 Source None Jun 4, 2015

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging CloudAMQP CloudAMQP RabbitMQ AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page