# impyla

Python client for the Impala distributed query engine.


### Features

Fully supported:

* Lightweight, `pip`-installable package for connecting to Impala databases

* Fully [DB API 2.0 (PEP 249)][pep249]-compliant Python client (similar to
sqlite or MySQL clients)

* Support for HiveServer2 and Beeswax; support for Kerberos

* Converter to [pandas][pandas] `DataFrame`, allowing easy integration into the
Python data stack (including [scikit-learn][sklearn] and
[matplotlib][matplotlib])

In various phases of maturity:

* SQLAlchemy connector; integration with Blaze

* `BigDataFrame` abstraction for performing `pandas`-style analytics on large
datasets (similar to Spark's RDD abstraction); computation is pushed into the
Impala engine.

* `scikit-learn`-flavored wrapper for [MADlib][madlib]-style prediction,
allowing for large-scale, distributed machine learning (see
[the Impala port of MADlib][madlibport])

* Compiling UDFs written in Python into low-level machine code for execution by
Impala (powered by [Numba][numba]/[LLVM][llvm])


### Dependencies

Required for DB API connectivity:

* `python2.6` or `python2.7`

* `six`

* `thrift>=0.8` (Python package only; no need for code-gen)

Required for UDFs:

* `numba<=0.13.4` (which has a few requirements, like LLVM)

* `boost` (because `udf.h` depends on `boost/cstdint.hpp`)

Required for SQLAlchemy integration (and Blaze):

* `sqlalchemy`

Required for `BigDataFrame`:

* `pandas`

Required for automated shipping and registering of code, UDFs, BDFs, etc.:

* `pywebhdfs`

For manipulating results as pandas `DataFrame`s, we recommend installing pandas
regardless.

Generally, we recommend installing all the libraries above; the UDF libraries
will be the most difficult, and are not required if you will not use any Python
UDFs. Interacting with Impala using the `ImpalaContext` will simplify shipping
data and will perform cleanup on temporary data/tables.
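
Since most of the dependencies above are optional, it can help to check which ones are importable before relying on a feature. The following is a minimal sketch, not part of impyla: the `available_features` helper and the feature labels are hypothetical, and the module names are taken from the dependency list above.

```python
import importlib

# Optional feature -> module name, taken from the dependency list above.
# The feature labels are illustrative only.
OPTIONAL_MODULES = {
    'UDFs': 'numba',
    'SQLAlchemy integration': 'sqlalchemy',
    'BigDataFrame': 'pandas',
    'automated shipping': 'pywebhdfs',
}


def available_features(modules=None):
    """Return a dict mapping each feature to True if its module imports.

    Hypothetical helper for illustration; not part of impyla.
    """
    modules = OPTIONAL_MODULES if modules is None else modules
    status = {}
    for feature, module_name in modules.items():
        try:
            importlib.import_module(module_name)
            status[feature] = True
        except ImportError:
            status[feature] = False
    return status


if __name__ == '__main__':
    for feature, ok in sorted(available_features().items()):
        print('%-25s %s' % (feature, 'available' if ok else 'missing'))
```

This only tests that a module imports; it does not check version constraints such as `numba<=0.13.4`.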

This project is installed with `setuptools`.

### Installation

Install the latest release (`0.9.0`) with `pip`:

```bash
pip install impyla
```

For the latest (dev) version, clone the repo:

```bash
git clone https://github.com/cloudera/impyla.git
cd impyla
make # optional: only for Numba-compiled UDFs; requires LLVM/clang
python setup.py install
```

#### Running the tests

impyla uses the [pytest][pytest] toolchain, and depends on the following environment
variables:

```bash
export IMPALA_HOST=your.impalad.com
# beeswax might work here too
export IMPALA_PORT=21050
export IMPALA_PROTOCOL=hiveserver2
# needed to push data to the cluster
export NAMENODE_HOST=bottou01-10g.pa.cloudera.com
export WEBHDFS_PORT=50070
```

To run the maximal set of tests, run:

```bash
py.test --dbapi-compliance path/to/impyla/impala/tests
```

Leave out the `--dbapi-compliance` option to skip tests for DB API compliance.
Add a `--udf` option to only run local UDF compilation tests.
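
For reference, the environment variables listed above could be collected in Python roughly as follows. This is a sketch under stated assumptions: `load_test_config` is a hypothetical helper, not impyla's own test fixture, and the fallback values simply mirror the example values shown above.

```python
import os


def load_test_config(environ=None):
    """Collect the test settings from environment variables.

    Hypothetical helper for illustration; not part of impyla.
    The defaults are assumptions that mirror the example values above.
    """
    env = os.environ if environ is None else environ
    return {
        'host': env['IMPALA_HOST'],  # required: raises KeyError if unset
        'port': int(env.get('IMPALA_PORT', 21050)),
        'protocol': env.get('IMPALA_PROTOCOL', 'hiveserver2'),
        'namenode_host': env.get('NAMENODE_HOST'),  # None if unset
        'webhdfs_port': int(env.get('WEBHDFS_PORT', 50070)),
    }
```

Passing a plain dict as `environ` makes the helper easy to exercise without touching the real environment.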


### Quickstart

Impyla implements the [Python DB API v2.0 (PEP 249)][pep249] database interface
(refer to it for API details):

```python
from impala.dbapi import connect
conn = connect(host='my.host.com', port=21050)
cursor = conn.cursor()
cursor.execute('SELECT * FROM mytable LIMIT 100')
print(cursor.description)  # prints the result set's schema
results = cursor.fetchall()
```

**Note**: when connecting to Impala through the *HiveServer2* service, set the
port to the HiveServer2 port (defaults to 21050 in CM), not the Beeswax port
(defaults to 21000), which is what the Impala shell uses.
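
A small sanity check can catch the port mix-up described in the note before a connection is attempted. The sketch below is illustrative only: `DEFAULT_PORTS` and `looks_mismatched` are hypothetical names, and the port numbers are the defaults quoted in the note, which a given cluster may override.

```python
# Default ports as quoted in the note above; a cluster may be configured
# differently, so treat a mismatch as a hint rather than an error.
DEFAULT_PORTS = {'hiveserver2': 21050, 'beeswax': 21000}


def looks_mismatched(protocol, port):
    """Return True if `port` is the default port of a *different* protocol.

    Hypothetical helper for illustration; not part of impyla.
    """
    return any(
        default == port and name != protocol
        for name, default in DEFAULT_PORTS.items()
    )
```

For example, `looks_mismatched('hiveserver2', 21000)` flags the common mistake of pointing a HiveServer2 client at the Beeswax port.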

The `Cursor` object also supports the iterator interface, which is buffered
(controlled by `cursor.arraysize`):

```python
cursor.execute('SELECT * FROM mytable LIMIT 100')
for row in cursor:
    process(row)
```
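
To illustrate what buffered iteration means in PEP 249 terms, here is a minimal sketch of a generator that yields rows one at a time while fetching them in batches of `cursor.arraysize` via `fetchmany`. It is not impyla's actual implementation; `iter_rows` is a hypothetical helper, and the standard library's `sqlite3` stands in for an Impala connection so the example runs without a cluster.

```python
import sqlite3


def iter_rows(cursor, batch_size=None):
    """Yield rows one at a time while fetching them in batches.

    Hypothetical helper for illustration; not part of impyla.
    """
    size = batch_size or cursor.arraysize
    while True:
        batch = cursor.fetchmany(size)
        if not batch:  # an empty batch means the result set is exhausted
            break
        for row in batch:
            yield row


# sqlite3 stands in for an Impala connection so the sketch runs anywhere.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE mytable (id INTEGER, name TEXT)')
cursor.executemany('INSERT INTO mytable VALUES (?, ?)',
                   [(i, 'row%d' % i) for i in range(10)])

cursor.execute('SELECT * FROM mytable ORDER BY id')
cursor.arraysize = 4  # rows requested per fetchmany() call in this sketch
rows = list(iter_rows(cursor))
```

A larger `arraysize` means fewer fetch calls at the cost of more rows held in memory at once.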

You can also get back a pandas `DataFrame` object:

```python
from impala.util import as_pandas
df = as_pandas(cursor)
# carry df through scikit-learn, for example
```
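
Any such converter needs column names, which PEP 249 exposes as the first element of each entry in `cursor.description`. The sketch below shows that pairing with plain Python structures. It makes no claim about how `as_pandas` is implemented: `rows_with_columns` is a hypothetical helper, and `sqlite3` again stands in for an Impala connection.

```python
import sqlite3


def rows_with_columns(cursor):
    """Return (column_names, rows) for an executed PEP 249 cursor.

    Hypothetical helper for illustration; not part of impyla.
    """
    names = [col[0] for col in cursor.description]
    return names, cursor.fetchall()


# sqlite3 stands in for an Impala connection so the sketch runs anywhere.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE mytable (id INTEGER, score REAL)')
cursor.executemany('INSERT INTO mytable VALUES (?, ?)',
                   [(1, 0.5), (2, 0.75)])

cursor.execute('SELECT id, score FROM mytable ORDER BY id')
names, rows = rows_with_columns(cursor)

# One dict per row, keyed by column name.
records = [dict(zip(names, row)) for row in rows]
# With pandas installed, the same pieces could be passed to
# pandas.DataFrame.from_records(rows, columns=names).
```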


[pep249]: http://legacy.python.org/dev/peps/pep-0249/
[pandas]: http://pandas.pydata.org/
[sklearn]: http://scikit-learn.org/
[matplotlib]: http://matplotlib.org/
[madlib]: http://madlib.net/
[madlibport]: https://github.com/bitfort/madlibport
[numba]: http://numba.pydata.org/
[llvm]: http://llvm.org/
[pytest]: http://pytest.org/latest/
