Skip to main content

turbodbc is a Python DB API 2.0 compatible ODBC driver

Project description

![turbodbc logo](/page/logo.png?raw=true "turbodbc logo")

Turbodbc - Turbocharged database access for data scientists.
============================================================

[![Build Status](https://travis-ci.org/blue-yonder/turbodbc.svg?branch=master)](https://travis-ci.org/blue-yonder/turbodbc)

Turbodbc is a Python module to access relational databases via the Open Database
Connectivity (ODBC) interface. In addition to complying with the Python Database API
Specification 2.0, turbodbc offers built-in NumPy support. Don't wait minutes for your
results, just blink.

Features
--------

* Bulk retrieval of result sets
* Built-in NumPy conversion
* Bulk transfer of query parameters
* Asynchronous I/O for result sets
* Automatic conversion of decimal type to integer, float, and string as
appropriate
* Supported data types for both result sets and parameters:
`int`, `float`, `str`, `bool`, `datetime.date`, `datetime.datetime`
* Also provides a high-level C++11 database driver under the hood
* Tested with Python 2.7, 3.4, 3.5, and 3.6
* Tested on Linux and OSX


Installation
------------

To install turbodbc, please use the following command:

pip install turbodbc

If you want to leverage turbodbc's NumPy support, please make sure to `pip install numpy`
before installing turbodbc, since turbodbc searches for NumPy headers
at installation time to determine whether NumPy support can be provided.

Since turbodbc includes C-extensions, make sure the following prerequisites
are given:

Requirement | Linux (`apt-get install`) | OSX (`brew install`)
----------------------------------|------------------------------|----------------------
C++-compiler with C++11 support | G++-4.8 or higher | clang with OSX 10.9+
Boost library + headers* | `libboost-all-dev` | `boost`
Unixodbc library + headers | `unixodbc-dev` | `unixodbc`
Python headers | `python-dev` | use `pyenv` to install

*) The minimum viable boost setup requires the libraries `variant`, `optional`,
and `datetime`.



Why should I use turbodbc instead of other ODBC modules?
--------------------------------------------------------

Short answer: turbodbc is faster.

Slightly longer answer: turbodbc is faster, _much_ faster if you want to
work with NumPy.

Medium-length answer: I have tested turbodbc and pyodbc (probably the most
popular Python ODBC module) with various databases (Exasol, PostgreSQL, MySQL)
and corresponding ODBC drivers. I found turbodbc to be consistently faster.

For retrieving result sets, I found speedups between 1.5 and 7 retrieving plain
Python objects. For inserting data, I found speedups of up to 100.

Is this completely scientific? Not at all. I have not told you about which
hardware I used, which operating systems, drivers, database versions, network
bandwidth, database layout, SQL queries, what is measured, and how I performed
was measured.

All I can tell you is that if I exchange pyodbc with turbodbc, my benchmarks
took less time, often approaching one (reading) or two (writing) orders of
magnitude. Give it a spin for yourself, and tell me if you liked it.


Smooth. What is the trick?
--------------------------

Turbodbc exploits buffering.

* Turbodbc implements both sending parameters and retrieving result sets using
buffers of multiple rows/parameter sets. This avoids round trips to the ODBC
driver and (depending how well the ODBC driver is written) to the database.
* Multiple buffers are used for asynchronous I/O. This allows to interleave
Python object conversion and direct database interaction (see performance options
below).
* Buffers contain binary representations of data. NumPy arrays contain binary
representations of data. Good thing they are often the same, so instead of
converting we can just copy data.


Supported data types
--------------------

The following data types are supported:

Database type(s) | Python type | NumPy type | Notes
----------------------------------|------------------|---------------|-------
integers, Decimal(<19,0) | `int` | `int64` |
floating point, Decimal(x, >0) | `float` | `float64` |
bit, boolean-like | `bool` | `bool_` |
timestamp, time | `datetime.datetime` | `datetime64[us]` |
date | `datetime.date` | `datetime64[D]` |
strings, VARCHAR, Decimal(>18, 0) | `unicode` | `object_` | _slow, WIP_

NumPy types are not yet supported for parameters.


Basic usage
-----------

Turbodbc follows the specification of the Python database API v2, which you can
find at https://www.python.org/dev/peps/pep-0249/. Here is a short summary,
including the parts not specified.

To establish a connection, use any of the following commands:

>>> from turbodbc import connect
>>> connection = connect(dsn='My data source name as given by odbc.ini')
>>> connection = connect(dsn='my dsn', user='my user has precedence')
>>> connection = connect(dsn='my dsn', username='field names may depend on the driver')

To execute a query, you need a `cursor` object:

>>> cursor = connection.cursor()

Here is how to execute a `SELECT` query:

>>> cursor.execute('SELECT 42')
>>> for row in cursor:
>>> print list(row)

Here is how to execute an `INSERT` query with many parameters:

>>> parameter_sets = [['hi', 42],
['there', 23]]
>>> cursor.executemany('INSERT INTO my_table VALUES (?, ?)',
parameter_sets)


NumPy support
-------------

Here is how to retrieve a result set in the form of NumPy arrays:

>>> cursor.execute("SELECT A, B FROM my_table")
>>> cursor.fetchallnumpy()
OrderedDict([('A', masked_array(data = [42 --],
mask = [False True],
fill_value = 999999)),
('B', masked_array(data = [3.14 2.71],
mask = [False False],
fill_value = 1e+20))])

Please note a few things:

* The return value is an `OrderedDict` of column name/value pairs. The column
order is the same as in your query.
* The column values are `MaskedArray`s. Any `NULL` values you have in your
database will show up as masked entries (`NULL` values in string-like columns
will shop up as `None` objects).
* NumPy support is limited to result sets, experimental, and will probably change
with the next iterations.

Performance options
-------------------

Turbodbc offers some options to tune the performance for your database:

>>> from turbodbc import Megabytes
>>> connect(dsn="my dsn",
read_buffer_size=Megabytes(100),
parameter_sets_to_buffer=5000,
use_async_io=True)

`read_buffer_size` affects how many result set rows are retrieved per batch
of results. Set the attribute to `turbodbc.Megabytes(42)` to have turbodbc determine
the optimal number of rows per batch so that the total buffer amounts to
42 MB. This is recommended for most users and databases. You can also set
the attribute to `turbodbc.Rows(13)` if you would like to fetch results in
batches of 13 rows. By default, turbodbc fetches results in batches of 20 MB.

Similarly, `parameter_sets_to_buffer` changes the number of parameter sets
which are transferred per batch of parameters (e.g., as sent with `executemany()`).
Please note that it is not (yet) possible to use the `Megabytes` and `Rows` classes
here.

Finally, set `use_async_io` to `True` if you would like to benefit from
asynchronous I/O operations (limited to result sets for the time being).
Asynchronous I/O means that while the main thread converts result set rows
retrieved from the database to Python objects, another thread fetches a
new batch of results from the database in the background. This may yield
significant speedups when retrieving and converting are similarly fast
operations.

Asynchronous I/O is experimental and has to fully prove itself yet.
Don't be afraid to give it a try, though.


Development version
-------------------

To use the latest version of turbodbc, you need to follow these steps:

* Get the source code from github
* Check the source build requirements (see below) are installed on your computer
* Create a build directory. Make this your working directory.
* Execute the following command:

cmake -DCMAKE_INSTALL_PREFIX=./dist /path/to/source/directory

This will prepare the build directory for the actual build step.

* Execute the `make` command to build the code.
* You can execute the tests with `ctest`.
* To create a Python source distribution for simple installation, use
the following commands:

make install
cd dist
python setup.py sdist

This will create a `.tar.gz` file in the folder `dist/dist` in your
build directory. This file is self-contained and can be installed by
other users using `pip install`.

Source build requirements
-------------------------

For the development build, you also require the following additional
dependencies:

* CMake


Supported environments
----------------------

* Linux (successfully built on Ubuntu 12, Ubuntu 14, Debian 7, Debian 8)
* OSX (successfully built on Sierra a.k.a. 10.12 and El Capitan a.k.a. 10.11)
* Python 2.7, 3.4, 3.5, 3.6
* More environments probably work as well, but these are the versions that
are regularly tested on Travis or local development machines.


Supported databases
-------------------

Turbodbc uses suites of unit and integration tests to ensure quality.
Every time turbodbc's code is updated on GitHub,
turbodbc is automatically built from scratch and tested with the following databases:

* PostgreSQL
* MySQL

During development, turbodbc is tested with the following database:

* Exasol

Releases will not be made if any (implemented) test fails for any of the databases
listed above. In addition to these well-supported databases, the following databases
are tested on an irregular basis:

* MSSQL with FreeTDS
* MSSQL with Microsoft's official ODBC driver

These database/driver combinations do not yet pass all tests.

There is a good chance that turbodbc will work with other, totally untested databases
as well. There is, however, an equally good chance that you will encounter compatibility
issues. If you encounter one, please take the time to report it so turbodbc can be improved
to work with more real-world databases. Thanks!


I got questions and issues to report!
-------------------------------------

In this case, please use turbodbc's issue tracker on GitHub.


Is there a guided tour through turbodbc's entrails?
---------------------------------------------------

Yes, there is! Check out this blog post on
[the making of turbodbc](http://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/).


Is turbodbc on Twitter?
-----------------------

Yes, it is! Just follow [@turbodbc](https://twitter.com/turbodbc)
for the latest turbodbc talk and news about related technologies.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turbodbc-1.0.4.tar.gz (55.5 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page