
Vertica-ML-Python simplifies data exploration, data cleaning and machine learning in Vertica.


Vertica-ML-Python

The documentation is available at:
https://github.com/vertica/Vertica-ML-Python/blob/master/documentation.pdf

Or directly in the Wiki at:
https://github.com/vertica/Vertica-ML-Python/wiki

(c) Copyright [2018] Micro Focus or one of its affiliates. Licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

⚠ If you want to contribute, send an email to badr.ouali@microfocus.com

Vertica-ML-Python is a Python library that exposes scikit-like functionality for conducting data science projects on data stored in Vertica, taking advantage of Vertica’s speed and built-in analytics and machine learning capabilities. It supports the entire data science life cycle, uses a ‘pipeline’ mechanism (called the Virtual Dataframe) to sequence data transformation operations, and offers multiple graphical rendering options.

'Big Data' (terabytes of data) is now one of the main topics in the data science world. Data scientists have become very important to any organisation, and becoming data-driven is mandatory to survive. Vertica is the first real analytic columnar database and is still the fastest on the market. However, SQL alone is not flexible enough to be popular with data scientists.

Python's flexibility is priceless and gives users a very pleasant experience. Its level of abstraction is so high that, as soon as you think of a function, you often find that it already exists. Many data science APIs were created during the last 15 years and were quickly adopted by the data science community (for example, pandas and scikit-learn). However, Python works only in memory, in a single-node process. Even though some well-known, highly distributed programming languages exist to address this challenge, they still work in memory and most of the time cannot process all of the data. Besides, moving the data can become very expensive. Data scientists must also find a way to deploy their data preparation and their models. The process is far from easy and can become very time-consuming.

The idea behind VERTICA ML PYTHON is simple: combine the scalability of Vertica with the flexibility of Python to give the community what it needs, bringing the logic to the data and not the opposite. Version 1.0 is the result of 3 years of new ideas and improvements.
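To make the idea of "bringing the logic to the data" concrete, here is a minimal conceptual sketch of a lazy pipeline: transformations are only recorded in Python and then rendered as a single SQL query that the database would execute. This is not the vertica_ml_python implementation; the LazyFrame class, its methods, and the titanic table are hypothetical names used purely for illustration.

```python
# Conceptual sketch only: NOT how vertica_ml_python is implemented.
class LazyFrame:
    """Records transformations and renders them as one SQL query."""

    def __init__(self, table, columns):
        self.table = table
        # Map each output column to the SQL expression that computes it.
        self.exprs = {column: column for column in columns}
        self.filters = []

    def apply(self, column, template):
        """Wrap the column's current expression, e.g. template='ABS({})'."""
        self.exprs[column] = template.format(self.exprs[column])
        return self

    def filter(self, condition):
        """Record a row filter; nothing is executed yet."""
        self.filters.append(condition)
        return self

    def to_sql(self):
        """Render every recorded step as a single SELECT statement."""
        select = ", ".join(f"{e} AS {c}" for c, e in self.exprs.items())
        sql = f"SELECT {select} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"({f})" for f in self.filters)
        return sql


# Hypothetical table and columns.
frame = LazyFrame("titanic", ["age", "fare"])
frame.apply("age", "COALESCE({}, 30)").apply("fare", "LOG({} + 1)").filter("fare > 0")
query = frame.to_sql()
print(query)
```

No data leaves the database in such a design: only the generated query text travels from Python to the server.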

Main Advantages:

  • easy Data Exploration.
  • easy Data Preparation.
  • easy Data Modeling.
  • easy Model Evaluation.
  • easy Model Deployment.
  • most of what pandas.DataFrame can do, vertica_ml_python.vDataframe can do (and sometimes even more).
  • easy ML model creation and evaluation.
  • many scikit-learn functions and algorithms are available (and scalable!).

All information related to the API can be found at:

https://github.com/vertica/Vertica-ML-Python/

Python Version

vertica-ml-python requires at least:

  • Vertica: 9.1 (with previous versions, some functions and algorithms may not be available)
  • Python: 3.6

Standard Libraries

vertica-ml-python depends only on common Python libraries such as matplotlib and numpy. Other libraries, such as anytree for tree visualization or sqlparse for SQL indentation, can also be used, but they are optional.
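A quick way to see which of these libraries are present in the current environment is sketched below. The split into "required" and "optional" lists only reflects the libraries named above and is an assumption, not the package's full dependency list.

```python
import importlib.util


def available(module_name):
    """Return True if the module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Only the libraries named in this section; the real list may be longer.
required = ["matplotlib", "numpy"]
optional = ["anytree", "sqlparse"]

status = {name: available(name) for name in required + optional}
for name, found in status.items():
    kind = "required" if name in required else "optional"
    print(f"{name} ({kind}): {'found' if found else 'missing'}")
```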

Installation

To install vertica-ml-python, you can use the pip command:

root@ubuntu:~$ pip3 install vertica_ml_python

Alternatively, you can get a copy of the source by cloning the Vertica-ML-Python GitHub project and then install it with:

root@ubuntu:~$ python3 setup.py install

You can also drag and drop the vertica_ml_python folder into the site-packages folder of your Python installation. On macOS, you can find it at:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages

Another way is to import the library directly from the directory where it is located.
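Importing the library from its own location usually means adding the directory that contains the vertica_ml_python folder to sys.path. A minimal sketch follows; the path used here is a hypothetical example and should be replaced with the actual location of your copy.

```python
import sys
from pathlib import Path

# Hypothetical location: the directory that CONTAINS the vertica_ml_python folder.
library_parent = Path.home() / "Vertica-ML-Python"

# Put it at the front of the module search path, once.
if str(library_parent) not in sys.path:
    sys.path.insert(0, str(library_parent))

# After this, `from vertica_ml_python import vDataframe` is resolved against
# that directory, provided the folder really exists there.
```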

You can then import each library element using the usual Python syntax.

# to import the vDataframe
from vertica_ml_python import vDataframe
# to import the Logistic Regression
from vertica_ml_python.learn.linear_model import LogisticRegression

Everything is described in detail in the documentation.

Connection to the Database

This step is unnecessary if vertica-python or pyodbc is already installed and a DSN is configured on your machine. With this configuration, you do not need to create a cursor manually: you can create a vDataframe directly from the DSN (the dsn parameter of the vDataframe).
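For context, a DSN is a named entry in an odbc.ini file that stores the connection settings. The sketch below shows how such an entry could be turned into the keyword arguments vertica_python expects. It is an illustration, not the library's own parsing code: the dsn_to_conn_info helper is hypothetical, and the key names (Servername, UID, PWD, Port, Database) are assumed to follow the usual Vertica ODBC conventions.

```python
import configparser

# Sample odbc.ini content; the values mirror the examples in this document.
SAMPLE_ODBC_INI = """
[VerticaDSN]
Description = Example Vertica DSN
Driver = /Library/Vertica/ODBC/lib/libverticaodbc.dylib
Database = testdb
Servername = 10.211.55.14
UID = dbadmin
PWD = XxX
Port = 5433
"""


def dsn_to_conn_info(ini_text, dsn):
    """Hypothetical helper: map one DSN entry to vertica_python-style arguments."""
    # Disable interpolation so a '%' in a password is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = parser[dsn]  # option names are matched case-insensitively
    return {
        "host": section["servername"],
        "port": int(section["port"]),
        "user": section["uid"],
        "password": section["pwd"],
        "database": section["database"],
    }


conn_info = dsn_to_conn_info(SAMPLE_ODBC_INI, "VerticaDSN")
print(conn_info)
```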

ODBC

To connect to the Vertica database, you can use vertica-python (Vertica's native Python client) or pyodbc (an ODBC connection). Both provide a cursor pointing to the database, which vertica-ml-python uses to create all its objects.

#
# vertica_python
#
import vertica_python

# Connection using all the DSN information
conn_info = {'host': "10.211.55.14", 'port': 5433, 'user': "dbadmin", 'password': "XxX", 'database': "testdb"}
cur = vertica_python.connect(**conn_info).cursor()

# Connection using directly the DSN
from vertica_ml_python.utilities import to_vertica_python_format # This function will parse the odbc.ini file
dsn = "VerticaDSN"
cur = vertica_python.connect(**to_vertica_python_format(dsn)).cursor()

#
# pyodbc
#
import pyodbc

# Connection using all the DSN information
driver = "/Library/Vertica/ODBC/lib/libverticaodbc.dylib"
server = "10.211.55.14"
database = "testdb"
port = "5433"
uid = "dbadmin"
pwd = "XxX"
dsn = ("DRIVER={}; SERVER={}; DATABASE={}; PORT={}; UID={}; PWD={};").format(driver, server, database, port, uid, pwd)
cur = pyodbc.connect(dsn).cursor()

# Connection using directly the DSN
dsn = ("DSN=VerticaDSN")
cur = pyodbc.connect(dsn).cursor()
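Cursors from pyodbc and vertica_python follow the Python DB-API 2.0 style (execute, then fetch). The sketch below shows that pattern with the standard-library sqlite3 module as a stand-in so it runs without a Vertica server; it is not a Vertica connection, and the table t is hypothetical. Parameter placeholder syntax differs between drivers, so check the driver's documentation before reusing the INSERT line.

```python
import sqlite3

# Stand-in database: an in-memory SQLite instance instead of Vertica.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The cursor sends SQL to the database and holds the result set.
cur.execute("CREATE TABLE t (id INTEGER, label TEXT)")
cur.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])

cur.execute("SELECT COUNT(*) FROM t")
row_count = cur.fetchone()[0]  # first column of the first result row
print(row_count)

conn.close()
```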

JDBC

The user can also use a JDBC connection to the Vertica Database.

import jaydebeapi

uid = "dbadmin"
pwd = "XxX"
driver = "/Library/Vertica/JDBC/vertica-jdbc-9.0.1-0.jar" #Path to JDBC Driver
url = 'jdbc:vertica://10.211.55.14:5433/'
name = 'com.vertica.jdbc.Driver'
# jaydebeapi 1.0 and later expect the URL and the credentials as separate arguments
cur = jaydebeapi.connect(name, url, [uid, pwd], driver).cursor()

Download files

Files for vertica-ml-python, version 1.0.post2:

  • vertica_ml_python-1.0.post2-py3-none-any.whl (332.8 kB), file type: Wheel, Python version: py3
  • vertica_ml_python-1.0.post2.tar.gz (71.2 kB), file type: Source
