# impyla
Python client for the Impala distributed query engine.
### Features
Fully supported:
* Lightweight, `pip`-installable package for connecting to Impala databases
* Fully [DB API 2.0 (PEP 249)][pep249]-compliant Python client (similar to
sqlite or MySQL clients)
* Support for HiveServer2 and Beeswax; support for Kerberos
* Converter to [pandas][pandas] `DataFrame`, allowing easy integration into the
Python data stack (including [scikit-learn][sklearn] and
[matplotlib][matplotlib])
In various phases of maturity:
* SQLAlchemy connector; integration with Blaze
* `BigDataFrame` abstraction for performing `pandas`-style analytics on large
datasets (similar to Spark's RDD abstraction); computation is pushed into the
Impala engine.
* `scikit-learn`-flavored wrapper for [MADlib][madlib]-style prediction,
allowing for large-scale, distributed machine learning (see
[the Impala port of MADlib][madlibport])
* Compiling UDFs written in Python into low-level machine code for execution by
Impala (powered by [Numba][numba]/[LLVM][llvm])
### Dependencies
Required for DB API connectivity:
* `python2.6` or `python2.7`
* `six`
* `thrift>=0.8` (Python package only; no need for code-gen)
Required for UDFs:
* `numba<=0.13.4` (which has a few requirements, like LLVM)
* `boost` (because `udf.h` depends on `boost/cstdint.hpp`)
Required for SQLAlchemy integration (and Blaze):
* `sqlalchemy`
Required for `BigDataFrame`:
* `pandas`
Required for automated shipping/registering of code, UDFs, BDFs, etc.:
* `pywebhdfs`
For manipulating results as pandas `DataFrame`s, we recommend installing pandas
regardless.
Generally, we recommend installing all of the libraries above. The UDF dependencies are the most difficult to set up, and they are not required unless you plan to write Python UDFs. Interacting with Impala through the `ImpalaContext` simplifies shipping data to the cluster and cleans up temporary data/tables.
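For a typical setup without the UDF toolchain, the pure-Python dependencies listed above can be installed in one go. A hedged example (package names are exactly those listed; pin versions as appropriate for your environment):
```bash
# DB API connectivity plus SQLAlchemy, BigDataFrame, and WebHDFS support
pip install six "thrift>=0.8" sqlalchemy pandas pywebhdfs
```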
This project is installed with `setuptools`.
### Installation
Install the latest release (`0.9.0`) with `pip`:
```bash
pip install impyla
```
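To pin your environment to that release explicitly:
```bash
pip install impyla==0.9.0
```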
For the latest (dev) version, clone the repo:
```bash
git clone https://github.com/cloudera/impyla.git
cd impyla
make # optional: only for Numba-compiled UDFs; requires LLVM/clang
python setup.py install
```
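Either way, a quick import check (no running cluster required) confirms the package is on your path; it simply exercises the same import used in the Quickstart below:
```bash
python -c "from impala.dbapi import connect"
```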
#### Running the tests
impyla uses the [pytest][pytest] toolchain, and depends on the following environment
variables:
```bash
export IMPALA_HOST=your.impalad.com
# beeswax might work here too
export IMPALA_PORT=21050
export IMPALA_PROTOCOL=hiveserver2
# needed to push data to the cluster
export NAMENODE_HOST=your.namenode.com
export WEBHDFS_PORT=50070
```
To run the maximal set of tests, run:
```bash
py.test --dbapi-compliance path/to/impyla/impala/tests
```
Leave out the `--dbapi-compliance` option to skip tests for DB API compliance.
Add a `--udf` option to only run local UDF compilation tests.
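For example, to run only the local UDF compilation tests (assuming the same test directory as above):
```bash
py.test --udf path/to/impyla/impala/tests
```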
### Quickstart
Impyla implements the [Python DB API v2.0 (PEP 249)][pep249] database interface
(refer to it for API details):
```python
from impala.dbapi import connect
conn = connect(host='my.host.com', port=21050)
cursor = conn.cursor()
cursor.execute('SELECT * FROM mytable LIMIT 100')
print cursor.description # prints the result set's schema
results = cursor.fetchall()
```
**Note**: if connecting to Impala through the *HiveServer2* service, make sure
to use the HiveServer2 port (21050 by default in Cloudera Manager), not the
Beeswax port (21000 by default), which is what the Impala shell uses.
The `Cursor` object also supports the iterator interface, which is buffered
(controlled by `cursor.arraysize`):
```python
cursor.execute('SELECT * FROM mytable LIMIT 100')
for row in cursor:
    process(row)
```
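For example, to buffer larger batches per fetch while iterating (`arraysize` is the standard DB API 2.0 batching attribute; the host and table are the same placeholders as above):
```python
from impala.dbapi import connect

conn = connect(host='my.host.com', port=21050)
cursor = conn.cursor()
cursor.arraysize = 1000  # buffer up to 1000 rows per fetch while iterating
cursor.execute('SELECT * FROM mytable LIMIT 100')
rows = list(cursor)      # drains the iterator using the larger buffer
```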
You can also get back a pandas `DataFrame` object:
```python
from impala.util import as_pandas
df = as_pandas(cursor)
# carry df through scikit-learn, for example
```
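As a minimal sketch of that idea, assuming `mytable` has hypothetical numeric columns `feature1` and `feature2` plus a binary `label` column (adapt the column names to your own schema):
```python
from sklearn.linear_model import LogisticRegression

# fit a simple classifier on the DataFrame pulled back from Impala
X = df[['feature1', 'feature2']].values
y = df['label'].values
model = LogisticRegression().fit(X, y)
print model.score(X, y)  # training accuracy
```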
[pep249]: http://legacy.python.org/dev/peps/pep-0249/
[pandas]: http://pandas.pydata.org/
[sklearn]: http://scikit-learn.org/
[matplotlib]: http://matplotlib.org/
[madlib]: http://madlib.net/
[madlibport]: https://github.com/bitfort/madlibport
[numba]: http://numba.pydata.org/
[llvm]: http://llvm.org/
[pytest]: http://pytest.org/latest/
### Download files

**Source distribution:** `impyla-0.9.0.tar.gz` (145.0 kB)

| Algorithm | Hash digest |
|---|---|
| SHA256 | 8849fecadaa2e5241d708c4a45f0fbae33ff58f549a11b3c74c0531db9229664 |
| MD5 | e4a40f2fc608783c4d9cb06af9c4d5e8 |
| BLAKE2b-256 | 3e952ac86c7e1af4b25111ae3779b559215d3118c5a70cc30de1ed81967a3517 |

**Built distribution:** `impyla-0.9.0-py2.7.egg` (433.5 kB)

| Algorithm | Hash digest |
|---|---|
| SHA256 | f16b6573ef5b53f64d2b3618360dbf41a46a18b0eccd490cd05e5f14a025b940 |
| MD5 | d299eb77b9db0f4eb697ef70ca16c1c5 |
| BLAKE2b-256 | 5b2e29e343d6f3a81188b8e5f3d6b7da0bb397b5f7e83cacac0425eaeacd66cf |