# Tentaclio

Unification of data connectors for distributed data tasks.

A Python package grouping a collection of I/O connectors used in the data world, with the aim of providing:
- a boilerplate for developers to expose new connectors (`tentaclio.clients`),
- an interface to access file resources thanks to a unified syntax (`tentaclio.open`),
- and a simplified interface (`tentaclio.protocols`).
## Quickstart

Make sure Homebrew is installed and up to date, then clone the repository:

```sh
$ git clone git@github.com:octoenergy/tentaclio.git
```
### Local installation

Similarly to the consumer-site, the library must be deployed onto a machine running:

- Python3
- a C compiler (either `gcc` via Homebrew, or `xcode` via the App Store)

```sh
$ brew install pyenv
$ brew install pipenv
```

Lock the Python dependencies and build a virtualenv:

```sh
$ make update
```

To refresh the Python dependencies:

```sh
$ make sync
```
## How to use

This is how to use `tentaclio` for your daily data ingestion and storing needs.
### Streams

The universal function for opening streams to load or store data is:

```python
import tentaclio

with tentaclio.open("/path/to/my/file") as reader:
    contents = reader.read()

with tentaclio.open("s3://bucket/file", mode='w') as writer:
    writer.write(contents)
```
Allowed modes are `r`, `w`, `rb`, and `wb`. You can use `t` instead of `b` to indicate text streams, but text is the default.
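As an illustration, a binary read might look like this (the bucket and key here are hypothetical):

```python
import tentaclio

# 'rb' opens a binary stream; 'r' (or the equivalent 'rt') opens a text stream.
with tentaclio.open("s3://bucket/file", mode='rb') as reader:
    raw_bytes = reader.read()
```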
The supported URL protocols are:

- `/local/file`
- `file:///local/file`
- `s3://bucket/file`
- `ftp://path/to/file`
- `sftp://path/to/file`
- `http://host.com/path/to/resource`
- `https://host.com/path/to/resource`
- `postgresql://host/database::table`

The `postgresql` form will allow you to write from a CSV format into a database table with the same column names (:warning: note that the table name goes after `::`).
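As a sketch of that CSV-to-table flow (host, database and table names here are hypothetical):

```python
import tentaclio

# The column names in the CSV header must match the target table's columns;
# the table name goes after "::".
with tentaclio.open("postgresql://localhost/my_database::my_table", mode="w") as writer:
    writer.write("name,value\nfoo,1\nbar,2\n")
```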
You can add credentials to any of these URLs in order to access protected resources.
You can use these readers and writers with pandas functions:

```python
import pandas as pd
import tentaclio

with tentaclio.open("/path/to/my/file") as reader:
    df = pd.read_csv(reader)

[...]

with tentaclio.open("s3://path/to/my/file", mode='w') as writer:
    df.to_parquet(writer)
```
`Reader`s, `Writer`s and their closeable versions can be used anywhere a file-like object is expected; pandas and pickle functions are examples.
### Database access

In order to open database connections, you can use `tentaclio.db` and get instant access to Postgres, SQLite, Athena and MSSQL.

```python
import tentaclio

[...]

query = "select 1"
with tentaclio.db(POSTGRES_TEST_URL) as client:
    result = client.query(query)

[...]
```
The supported db schemes are:

- `postgresql://`
- `sqlite://`
- `awsathena+rest://`
- `mssql://`
### Automatic credentials injection

- Configure credentials by using environment variables prefixed with `TENTACLIO__CONN__` (e.g. `TENTACLIO__CONN__DATA_FTP=sftp://real_user:132ldsf@octoenergy.systems`).

- Open a stream:

```python
with tentaclio.open("sftp://octoenergy.com/file.csv") as reader:
    reader.read()
```

The credentials get injected into the URL.

- Open a db client:

```python
import tentaclio

with tentaclio.db("postgresql://hostname/my_data_base") as client:
    client.query("select 1")
```
Note that `hostname` in the URL to be authenticated is a wildcard that will match any hostname. So `authenticate("http://hostname/file.txt")` will resolve to `http://user:pass@octo.co/file.txt` if the credential for `http://user:pass@octo.co/` exists.
Different components of the URL are set differently:

- Scheme and path will be set from the URL, and null if missing.
- Username, password and hostname will be set from the stored credentials.
- Port will be set from the stored credentials if it exists, otherwise from the URL.
- Query will be set from the URL if it exists, otherwise from the stored credentials (so it can be overridden).
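To illustrate these rules with a sketch (the credential and URLs below are hypothetical), suppose the environment contains `TENTACLIO__CONN__MY_API=http://user:pass@octo.co:8080/`:

```python
import tentaclio

# With TENTACLIO__CONN__MY_API=http://user:pass@octo.co:8080/ set in the
# environment, the URL below resolves to
# http://user:pass@octo.co:8080/file.txt?retries=1 before the request is made:
# scheme, path and query come from the URL, while username, password,
# hostname and port come from the stored credential.
with tentaclio.open("http://hostname/file.txt?retries=1") as reader:
    contents = reader.read()
```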
### Credentials file

You can also set a credentials file that looks like:

```yaml
secrets:
  db_1: postgresql://user1:pass1@myhost.com/database_1
  db_2: postgresql://user2:pass2@otherhost.com/database_2
  ftp_server: ftp://fuser:fpass@ftp.myhost.com
```

And make it accessible to tentaclio by setting the environment variable `TENTACLIO__SECRETS_FILE`. The actual name of each URL is only for traceability and has no effect on the functionality.
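For instance (the path here is hypothetical):

```sh
$ export TENTACLIO__SECRETS_FILE=/path/to/secrets.yaml
```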
## Development

### Testing

Tests run via `py.test`:

```sh
$ make unit
$ make integration
```

:warning: Unit and integration tests require a `.env` file in this directory with the following contents: :warning:

```
POSTGRES_TEST_URL=scheme://username:password@hostname:port/database
```
Linting is taken care of by `flake8` and `mypy`:

```sh
$ make lint
```
### CircleCI

Continuous integration is run on CircleCI, with the following steps:

```sh
$ make circleci
```
## Quick note on protocols

In order to abstract concrete dependencies from the implementation of data-related functions (or any other part of the system), we recommend using Protocols. This allows more flexible injection than subclassing or other more complex approaches. This idea is heavily inspired by how the same thing is done in Go.
### Simple protocol example

Let's suppose that we are going to write a function that loads a CSV file, does some operation, and saves the result.

```python
import pandas as pd

def sum(input_file: str, output_file: str) -> None:
    df = pd.read_csv(input_file, index_col="index")
    transformed_df = _transform(df)  # some operation defined elsewhere
    transformed_df.to_csv(output_file)
```
This has the following caveats:

- The source and destination of the data are bound to files in the local system; we can't support other streams such as S3, `io.StringIO`, or `io.BytesIO`.
- Testing is difficult and cumbersome, as you need actual files to exercise the whole execution path.
Many of pandas' IO functions allow the file argument (i.e. `input_file`) to be a string or a buffer, namely anything that has a `read` method. This is known as a protocol in Python.
Other protocols are `Sized` (any object that has a `__len__` method) or `__getitem__`. Protocols are usually loosely bound to the receiver, and it is the receiver's responsibility to check programmatically that the argument does in fact have that method.
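A minimal sketch of such a programmatic check (the function is made up for illustration):

```python
def describe(container) -> str:
    # The receiver checks for the protocol method instead of requiring a base class.
    if not hasattr(container, "__len__"):
        raise TypeError("expected an object implementing __len__")
    return f"object with {len(container)} elements"
```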
We could refactor this piece of code using the classes from the `io` package to make it more general, as they actually implement the `read` protocol.

```python
import pandas as pd
from io import RawIOBase

def sum(input_file: RawIOBase, output_file: RawIOBase) -> None:  # this won't work, by the way
    df = pd.read_csv(input_file, index_col="index")
    transformed_df = _transform(df)
    transformed_df.to_csv(output_file)
```
Now, the `io` package is a bit of a mess: it defines different classes such as `TextIOBase`, `StringIO`, and `FileIO`, which are similar but incompatible when it comes to typing, due to the differences between strings and bytes.
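For example, a minimal sketch of the typing clash (the function here is hypothetical):

```python
import io
from typing import IO

def process(stream: IO[str]) -> str:
    return stream.read()

process(io.StringIO("data"))    # fine: StringIO is IO[str]
process(io.BytesIO(b"data"))    # mypy error: IO[bytes] is not IO[str],
                                # even though both expose read()
```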
If you want to pass either a `StringIO` or a `FileIO` as an argument to the same typed function, the whole process becomes an ordeal. Not only that: imagine that you want to implement a custom reader for data stored in a db; you will have to add a bunch of useless methods if you inherit from things like `IOBase`, when we are only interested in the `read` method.
In order to have a neater typed function that actually just requires a `read` method, we can use the typing_extensions package to create Protocols.

```python
from abc import abstractmethod

from typing_extensions import Protocol

class Reader(Protocol):
    @abstractmethod
    def read(self, i: int = -1):
        pass

class Writer(Protocol):
    @abstractmethod
    def write(self, content) -> int:
        pass
```
Notice how, cheekily, the methods above are not typed, so that both strings and bytes can be sent in and out.
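Because protocols are structural, a reader backed by a database only needs a matching `read` method to satisfy `Reader`. A sketch of the idea (the class and data are made up):

```python
class DbReader:
    """Hypothetical reader backed by a database query; only read() is needed."""

    def read(self, i: int = -1):
        # Imagine this pulling rows from a database and rendering them as CSV.
        return "index,a,b\n0,1,2\n"

reader: Reader = DbReader()  # type checks: DbReader satisfies Reader structurally
```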
Our function will look something like this:

```python
import pandas as pd

from tentaclio import Reader, Writer

def sum(reader: Reader, writer: Writer) -> None:
    df = pd.read_csv(reader, index_col="index")
    transformed_df = _transform(df)
    transformed_df.to_csv(writer)
```
In the new signature we force our input just to have a `read` method; likewise, the output just needs a `write` method.
Why is this cool?

- Now we can accept anything that fulfils the protocol expected by pandas, while still checking its type.
- When creating new readers, we don't need to implement redundant methods to match any of the `io` base types.
- Testing becomes less cumbersome, as we can pass a `StringIO` rather than an actual file, or create some kind of fake class that has a `read` method (see the sketch below).
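A hypothetical test along those lines, assuming the `sum` function and `_transform` from above are defined:

```python
from io import StringIO

# No real files needed: StringIO satisfies both the Reader and Writer protocols.
reader = StringIO("index,a\n0,1\n1,2\n")
writer = StringIO()

sum(reader, writer)

assert "a" in writer.getvalue()
```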
Caveats:

- The typing of the `pickle.dump` function is not consistent with its documentation and actual implementation, so you'll have to add a `# type: ignore` comment in order to use a `Writer` when calling `dump` (as sketched after this list).
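A minimal sketch of that workaround (the bucket and key are hypothetical):

```python
import pickle

import tentaclio

data = {"answer": 42}
with tentaclio.open("s3://bucket/data.pickle", mode="wb") as writer:
    pickle.dump(data, writer)  # type: ignore
```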
### Pandas functions compatible with our Reader and Writer protocols

Anything that expects a `filepath_or_buffer`. The full list of pandas IO functions is in the pandas documentation, although they are not all fully documented there, e.g. parquet works even though it's not documented.