PyStellarDB
A Python interface to StellarDB
PyStellarDB is a Python API for executing Transwarp Extended OpenCypher (TEoC) and Hive queries. It can also generate an RDD object for use in PySpark. It is based on PyHive (https://github.com/dropbox/PyHive) and PySpark (https://github.com/apache/spark/).
PySpark RDD
We hack a way to generate an RDD object using the same method as sc.parallelize(data). This can exhaust memory if the query returns a large amount of data.
If you do need to handle huge result sets, use the following workaround:
If you are querying a graph, refer to Chapter 4.4.5 of the StellarDB manual to save the query result into a temporary table.
If you are querying a SQL table, save your query result into a temporary table.
Find the HDFS path of the temporary table generated in Step 1 or Step 2.
Use API like sc.newAPIHadoopFile() to generate RDD.
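Steps 3 and 4 above can be sketched as a small helper. sc.newAPIHadoopFile() is a real PySpark API, but the input format chosen here (TextInputFormat) is an assumption that only fits text-backed tables; match the classes to the actual storage format of your temporary table.

```python
def load_temp_table_rdd(sc, hdfs_path):
    """Read a temporary table's files from HDFS as an RDD of (offset, line) pairs.

    `sc` is an existing SparkContext and `hdfs_path` is the path found in
    Step 3. TextInputFormat is an assumption; ORC/Parquet tables need a
    different input format class.
    """
    return sc.newAPIHadoopFile(
        hdfs_path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
    )
```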
Usage
PLAIN Mode (No security is configured)
from pystellardb import stellar_hive
conn = stellar_hive.StellarConnection(host="localhost", port=10000, graph_name='pokemon')
cur = conn.cursor()
cur.execute('config query.lang cypher')
cur.execute('use graph pokemon')
cur.execute('match p = (a)-[f]->(b) return a,f,b limit 1')
print(cur.fetchall())
LDAP Mode
from pystellardb import stellar_hive
conn = stellar_hive.StellarConnection(host="localhost", port=10000, username='hive', password='123456', auth='LDAP', graph_name='pokemon')
cur = conn.cursor()
cur.execute('config query.lang cypher')
cur.execute('use graph pokemon')
cur.execute('match p = (a)-[f]->(b) return a,f,b limit 1')
print(cur.fetchall())
Kerberos Mode
# Make sure you have the correct realm information about the KDC server in /etc/krb5.conf
# Make sure you have the correct keytab file in your environment
# Run the kinit command:
# On Linux: kinit -kt FILE_PATH_OF_KEYTAB PRINCIPAL_NAME
# On Mac: kinit -t FILE_PATH_OF_KEYTAB -f PRINCIPAL_NAME
from pystellardb import stellar_hive
conn = stellar_hive.StellarConnection(host="localhost", port=10000, kerberos_service_name='hive', auth='KERBEROS', graph_name='pokemon')
cur = conn.cursor()
cur.execute('config query.lang cypher')
cur.execute('use graph pokemon')
cur.execute('match p = (a)-[f]->(b) return a,f,b limit 1')
print(cur.fetchall())
Execute Hive Query
from pystellardb import stellar_hive
# If the `graph_name` parameter is None, the connection executes a plain Hive query and returns data just as PyHive does
conn = stellar_hive.StellarConnection(host="localhost", port=10000, database='default')
cur = conn.cursor()
cur.execute('SELECT * FROM default.abc limit 10')
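PyHive cursors follow the Python DB-API (PEP 249), so results can be fetched with the standard cursor methods. The helper below is a generic DB-API sketch (not part of PyStellarDB) that turns fetched rows into dictionaries keyed by column name:

```python
def rows_as_dicts(cursor):
    """Zip each fetched row with the column names from cursor.description.

    Per the DB-API, cursor.description is a sequence of 7-item tuples whose
    first element is the column name.
    """
    columns = [desc[0] for desc in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

Call it as `rows_as_dicts(cur)` after `cur.execute(...)` to get a list of dicts instead of plain tuples.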
Execute a Graph Query and convert it to a PySpark RDD object
from pyspark import SparkContext
from pystellardb import stellar_hive
sc = SparkContext("local", "Demo App")
conn = stellar_hive.StellarConnection(host="localhost", port=10000, graph_name='pokemon')
cur = conn.cursor()
cur.execute('config query.lang cypher')
cur.execute('use graph pokemon')
cur.execute('match p = (a)-[f]->(b) return a,f,b limit 10')
rdd = cur.toRDD(sc)
def f(x): print(x)
rdd.map(lambda x: (x[0].toJSON(), x[1].toJSON(), x[2].toJSON())).foreach(f)
# Each row of this query result is a tuple of (VertexObject, EdgeObject, VertexObject)
# Vertex and Edge objects have a toJSON() method which renders the object as a JSON string
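Since toJSON() returns a JSON string, each element can be parsed back into a plain Python structure with the standard json module. The field names below are purely illustrative; the real schema depends on your graph:

```python
import json

def parse_element(json_str):
    """Parse the JSON string produced by a vertex/edge object's toJSON()."""
    return json.loads(json_str)

# Hypothetical vertex JSON; actual keys depend on your graph schema.
vertex = parse_element('{"label": "Pokemon", "properties": {"name": "pikachu"}}')
print(vertex["properties"]["name"])  # pikachu
```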
Execute a Hive Query and convert it to a PySpark RDD object
from pyspark import SparkContext
from pystellardb import stellar_hive
sc = SparkContext("local", "Demo App")
conn = stellar_hive.StellarConnection(host="localhost", port=10000)
cur = conn.cursor()
cur.execute('select * from default_db.default_table limit 10')
rdd = cur.toRDD(sc)
def f(x): print(x)
rdd.foreach(f)
# Each row of this query result is a tuple of (Column, Column, Column)
Dependencies
Required:
Python 2.7 or later, below Python 3.7
System SASL
Different systems require different packages to be installed to enable SASL support. Some examples of how to install the packages on different distributions follow.
Ubuntu:
apt-get install libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit
apt-get install python-dev gcc #Update python and gcc if needed
RHEL/CentOS:
yum install cyrus-sasl-md5 cyrus-sasl-plain cyrus-sasl-gssapi cyrus-sasl-devel
yum install gcc-c++ python-devel.x86_64 #Update python and gcc if needed
# If your Python environment is 3.X, then you may need to compile and reinstall Python
# if pip3 install fails with a message like 'Can't connect to HTTPS URL because the SSL module is not available'
# 1. Download a higher version of openssl, e.g: https://www.openssl.org/source/openssl-1.1.1k.tar.gz
# 2. Install openssl: ./config && make && make install
# 3. Link openssl: echo /usr/local/lib64/ > /etc/ld.so.conf.d/openssl-1.1.1.conf
# 4. Update dynamic lib: ldconfig -v
# 5. Download a Python source package
# 6. vim Modules/Setup, search '_socket socketmodule.c', uncomment
# _socket socketmodule.c
# SSL=/usr/local/ssl
# _ssl _ssl.c \
# -DUSE_SSL -I$(SSL)/include -I$(SSL)/include/openssl \
# -L$(SSL)/lib -lssl -lcrypto
#
# 7. Install Python: ./configure && make && make install
Windows:
# There are 3 ways of installing sasl for Python on Windows
# 1. (recommended) Download a .whl build of sasl from https://www.lfd.uci.edu/~gohlke/pythonlibs/#sasl
# 2. (recommended) If using Anaconda, run conda install sasl.
# 3. Install the Microsoft Visual C++ 9.0/14.0 build tools for Python 2.7/3.x, then pip install sasl (under test).
Notices
If you install pystellardb >= 0.9, it will also install a beeline command into the system. Delete /usr/local/bin/beeline if you don't need it.
Requirements
Install using pip install 'pystellardb[hive]' for the Hive interface.
For Hive, a running HiveServer2 daemon is required.
Windows Kerberos Configuration
If you're connecting to databases using Kerberos authentication from a Windows platform, you'll need to install and configure Kerberos for Windows first. Get it from http://web.mit.edu/kerberos/dist/
After installation, configure the environment variables. Make sure your Kerberos variable is set ahead of the JDK variable (if you have a JDK), because the JDK also ships kinit etc.
Find /etc/krb5.conf on your KDC and copy it to krb5.ini on Windows with some modifications, e.g. (krb5.conf on the KDC):
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
default_realm = DEFAULT
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
allow_weak_crypto = true
udp_preference_limit = 32700
default_ccache_name = FILE:/tmp/krb5cc_%{uid}
[realms]
DEFAULT = {
kdc = host1:1088
kdc = host2:1088
}
Modify it: delete the [logging] section and the default_ccache_name line in [libdefaults]:
[libdefaults]
default_realm = DEFAULT
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
allow_weak_crypto = true
udp_preference_limit = 32700
[realms]
DEFAULT = {
kdc = host1:1088
kdc = host2:1088
}
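The edit above is mechanical (drop the whole [logging] section and the default_ccache_name line), so it can be scripted. A minimal sketch, assuming sections are delimited by lines starting with "[":

```python
def krb5_conf_to_ini(conf_text):
    """Convert a KDC's krb5.conf into Windows krb5.ini content by dropping
    the [logging] section and any default_ccache_name setting."""
    kept = []
    skipping = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Entering a new section: skip it only if it is [logging].
            skipping = stripped == "[logging]"
        if skipping or stripped.startswith("default_ccache_name"):
            continue
        kept.append(line)
    return "\n".join(kept).strip() + "\n"
```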
This is your krb5.ini for Windows Kerberos. Place a copy at each of these 3 locations:
C:\ProgramData\MIT\Kerberos5\krb5.ini
C:\Program Files\MIT\Kerberos\krb5.ini
C:\Windows\krb5.ini
Finally, configure hosts at C:\Windows\System32\drivers\etc\hosts by adding IP mappings for host1 and host2 from the previous example, e.g.:
10.6.6.96 host1
10.6.6.97 host2
Now, you can run kinit in the command line!
Testing
On its way