Airbyte made easy (no UI, no database, no cluster)
Airbyte made simple
Why airbyte_serverless?
At Unytics, we ❤️ Airbyte which provides a catalog of open-source connectors to move your data from any source to your data-warehouse.
Airbyte Open-Source Platform is "batteries included" 🔋.
You'll get a server, workers, database, UI, orchestrator, connectors, secret manager, logs manager, etc. All of this is very well packaged and deployable on Kubernetes. While we believe this is great for most people, we strive for lightweight and simple assets to deploy and maintain. What's more, we ❤️ serverless.
👉 We wanted a simple tool to manage Airbyte connectors, run them locally or deploy them in serverless mode.
Airbyte Open-Source Platform vs airbyte_serverless
💡 Airbyte Serverless is less than Airbyte Open-Source Platform
Airbyte Open-Source Platform | Airbyte Serverless |
---|---|
Deployed on a VM or Kubernetes Cluster | Deployed with Serverless - Each Airbyte source docker image is upgraded with a destination connector from airbyte_serverless - Each upgraded docker image can then be deployed as an isolated Cloud Run Job (or Cloud Run Service) - Cloud Run is natively monitored with metrics, dashboards, logs, error reporting, alerting, etc. - They can be scheduled or triggered upon cloud events |
Has a database | Has NO database - The destination stores the state (the track of where the sync stopped) - The destination stores the logs, which can then be visualized with your preferred BI tool - Connector configurations can be stored in config files and versioned in git |
Has a UI | Has NO UI. Configurations are generated as documented YAML files that one can edit and version |
Has a transform layer. Airbyte loads your data in a raw format but then enables you to perform basic transforms such as replace, upsert, schema normalization | Has NO transform layer. Data is appended in your destination in raw format. We believe less is more. airbyte_serverless is dedicated to doing one thing and doing it well: Extract-Load. It's easier to maintain and to evolve. |
Is scalable with conditions. Scalable if deployed on an autoscaled Kubernetes Cluster and if you are skilled enough. 👉 Check that you are skilled enough with Kubernetes by watching this video 😁. | Is scalable. Each connector is deployed independently of the others. You can have as many as you want. |
Features
- ⚡ A lightweight python wrapper around any Airbyte Source executable.
- ⚡ Destination Connectors (only BigQuery for now - contributions are welcome 🤗) which store logs and states in addition to data. Thus, there is no need for a database any more!
- ⚡ Examples to deploy to serverless compute (only Google Cloud Run for now - contributions are welcome 🤗)
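To illustrate why no separate database is needed, here is a minimal sketch of a destination that keeps data, logs and states side by side. It is not the airbyte_serverless implementation: `InMemoryDestination` is a hypothetical class, and the message shapes assume the Airbyte protocol's `RECORD`, `STATE` and `LOG` message types, one JSON document per line.

```python
import json


class InMemoryDestination:
    """Hypothetical stand-in for a destination; not part of airbyte_serverless."""

    def __init__(self):
        self.records = []  # raw data rows
        self.states = []   # checkpoints used by incremental syncs
        self.logs = []     # connector log lines

    def load(self, messages):
        """Route each Airbyte message (a JSON string) to its own store."""
        for line in messages:
            message = json.loads(line)
            kind = message.get("type")
            if kind == "RECORD":
                self.records.append(message["record"])
            elif kind == "STATE":
                self.states.append(message["state"])
            elif kind == "LOG":
                self.logs.append(message["log"])
            # Other message types are ignored in this sketch.

    def get_state(self):
        """Return the most recent state, or None if nothing was synced yet."""
        return self.states[-1] if self.states else None


# Made-up messages, shaped like the output of an Airbyte source.
messages = [
    '{"type": "LOG", "log": {"level": "INFO", "message": "starting"}}',
    '{"type": "RECORD", "record": {"stream": "surveys", "data": {"id": 1}, "emitted_at": 1}}',
    '{"type": "STATE", "state": {"data": {"surveys": {"cursor": 1}}}}',
]
in_memory_destination = InMemoryDestination()
in_memory_destination.load(messages)
```

Because the state lands next to the data, the next run can ask the destination where the previous one stopped.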
Getting Started
1. Install
pip install airbyte-serverless
2. Create an Airbyte Source from a public docker image
⚠️ For this to work you need to have Docker installed on your machine
Write the following code in a file getting_started.py (change surveymonkey with the source you want). Then, you can run it with python getting_started.py. If it does not work, please raise an issue. You can list public airbyte source docker images here
from airbyte_serverless.sources import DockerAirbyteSource
docker_image = 'airbyte/source-surveymonkey:latest'
source = DockerAirbyteSource(docker_image)
If you don't have docker (or don't want to use it)
It is also possible to clone the Airbyte repo and install a python source connector:
- Clone the repo
- Go to the directory of the connector:
cd airbyte-integrations/connectors/source-surveymonkey
- Install the python connector
pip install -r requirements.txt
- Create here a file getting_started.py but with the following content:
from airbyte_serverless.sources import AirbyteSource
airbyte_source_executable = 'python main.py'
source = AirbyteSource(airbyte_source_executable)
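For intuition about what "a wrapper around an executable" can mean, the sketch below runs a command line and parses the JSON lines it prints. This is an assumption about the general approach, not the actual airbyte_serverless code: `run_source` is a hypothetical helper, and the demo uses a tiny fake script instead of a real connector (real Airbyte sources accept commands such as `spec`, `check`, `discover` and `read`).

```python
import json
import os
import shlex
import subprocess
import sys
import tempfile


def run_source(executable, command, *args):
    """Run `<executable> <command> <args...>` and yield each JSON message printed."""
    cmd = shlex.split(executable) + [command, *args]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as process:
        for line in process.stdout:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Executables may print non-JSON lines; this sketch skips them.
                continue


# Fake "connector" for the demo: echoes the received command as a LOG message.
FAKE_SOURCE = (
    "import json, sys\n"
    "print('plain text line, not JSON')\n"
    "print(json.dumps({'type': 'LOG', 'log': {'level': 'INFO', "
    "'message': 'received command: ' + sys.argv[1]}}))\n"
)

with tempfile.TemporaryDirectory() as folder:
    script = os.path.join(folder, "main.py")
    with open(script, "w") as script_file:
        script_file.write(FAKE_SOURCE)
    fake_executable = f"{shlex.quote(sys.executable)} {shlex.quote(script)}"
    spec_messages = list(run_source(fake_executable, "spec"))
```

The same function works whether the executable string is `python main.py` or a `docker run ...` command line, which is why a single wrapper can serve both setups.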
3. Update config for your Airbyte Source
Your Airbyte Source needs some config to be able to connect. Show a pre-filled config for your connector with:
print(source.config)
Copy the content, edit it and update the variable:
source.config = '''
YOUR UPDATED CONFIG
'''
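Since the config usually contains credentials, one option is to keep placeholders in the edited text and fill them from environment variables at run time. The sketch below uses only the standard library. The field names (`access_token`, `start_date`) and the variable `SURVEYMONKEY_ACCESS_TOKEN` are illustrative assumptions, not the real config of any connector, and it assumes the config is plain text such as YAML.

```python
import os
from string import Template

# Hypothetical config text with a ${PLACEHOLDER} instead of the secret itself.
CONFIG_TEMPLATE = """
access_token: ${SURVEYMONKEY_ACCESS_TOKEN}
start_date: "2023-01-01T00:00:00Z"
"""


def render_config(template_text, environ=None):
    """Fill ${NAME} placeholders from a mapping (os.environ by default)."""
    environ = os.environ if environ is None else environ
    # substitute() raises KeyError when a placeholder has no value,
    # so a missing secret fails early instead of producing a broken config.
    return Template(template_text).substitute(environ)


# A dummy value is passed explicitly here; in practice it would come from os.environ.
rendered_config = render_config(
    CONFIG_TEMPLATE, {"SURVEYMONKEY_ACCESS_TOKEN": "dummy-token"}
)
# The result could then be assigned: source.config = rendered_config
```

This keeps the versioned config file free of secrets while the value assigned to the source stays a plain string.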
4. Check your config
print(source.connection_status)
5. Update configured_catalog for your Airbyte Source
The source catalog lists the available streams (think entities) that the source is able to retrieve. The configured_catalog specifies which streams to extract and how. Show the default configured_catalog with:
print(source.configured_catalog)
If needed, copy the content, edit it and update the variable:
source.configured_catalog = {
...YOUR UPDATED CONFIG
}
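If the catalog follows the Airbyte protocol's configured catalog shape (a `streams` list whose entries wrap a `stream` with a `name`, plus a `sync_mode` and a `destination_sync_mode`), trimming it to a few streams can be scripted rather than edited by hand. This is a sketch under that assumption: `select_streams` is a hypothetical helper and the stream names are made up.

```python
import copy


def select_streams(configured_catalog, wanted, sync_mode=None):
    """Return a new catalog keeping only the streams whose name is in `wanted`."""
    selected = []
    for entry in configured_catalog["streams"]:
        if entry["stream"]["name"] not in wanted:
            continue
        entry = copy.deepcopy(entry)  # leave the original catalog untouched
        if sync_mode is not None:
            # Only valid if the stream supports this mode; not checked here.
            entry["sync_mode"] = sync_mode
        selected.append(entry)
    return {"streams": selected}


# Made-up catalog with two streams.
example_catalog = {
    "streams": [
        {
            "stream": {"name": "surveys"},
            "sync_mode": "full_refresh",
            "destination_sync_mode": "append",
        },
        {
            "stream": {"name": "survey_responses"},
            "sync_mode": "full_refresh",
            "destination_sync_mode": "append",
        },
    ]
}
trimmed_catalog = select_streams(example_catalog, {"surveys"}, sync_mode="incremental")
```

The result is a plain dict, so it could be assigned to `source.configured_catalog` in the same way as a hand-edited one.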
6. Test the retrieval of one data record
print(source.first_record)
7. Create a destination and run Extract-Load
from airbyte_serverless.destinations import BigQueryDestination
destination = BigQueryDestination(dataset='YOUR-PROJECT.YOUR_DATASET')
data = source.extract()
destination.load(data)
8. Run Extract-Load from where you stopped
The state keeps track of where the latest extract-load ended (for incremental extract-load).
To start from this state, run:
state = destination.get_state()
data = source.extract(state=state)
destination.load(data)
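The effect of passing the state back can be seen on a toy example. The classes below are hypothetical stand-ins (they only mimic the `extract`, `load` and `get_state` calls used above) and the cursor is a simple increasing `id`, which is an assumption made for the demo.

```python
class FakeSource:
    """Hypothetical source over a list of rows with an increasing `id`."""

    def __init__(self, rows):
        self.rows = rows

    def extract(self, state=None):
        # Resume after the last id recorded in the state (0 means "from scratch").
        cursor = (state or {}).get("last_id", 0)
        for row in self.rows:
            if row["id"] > cursor:
                yield {"type": "RECORD", "record": row}
                cursor = row["id"]
        # Emit a checkpoint so the next run knows where this one stopped.
        yield {"type": "STATE", "state": {"last_id": cursor}}


class FakeDestination:
    """Hypothetical destination storing records and states in memory."""

    def __init__(self):
        self.records = []
        self.states = []

    def load(self, messages):
        for message in messages:
            if message["type"] == "RECORD":
                self.records.append(message["record"])
            elif message["type"] == "STATE":
                self.states.append(message["state"])

    def get_state(self):
        return self.states[-1] if self.states else None


rows = [{"id": 1}, {"id": 2}]
fake_source = FakeSource(rows)
fake_destination = FakeDestination()

# First run: no state yet, so everything is extracted.
fake_destination.load(fake_source.extract(state=fake_destination.get_state()))
first_run_count = len(fake_destination.records)

# A new row appears; the second run only picks up what is new.
rows.append({"id": 3})
fake_destination.load(fake_source.extract(state=fake_destination.get_state()))
```

Without the `state` argument, the second run would extract all three rows again and append duplicates.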
End to End Example
from airbyte_serverless.sources import DockerAirbyteSource
from airbyte_serverless.destinations import BigQueryDestination
docker_image = 'airbyte/source-surveymonkey:latest'
config = 'YOUR CONFIG'
configured_catalog = {YOUR CONFIGURED CATALOG}
source = DockerAirbyteSource(docker_image, config=config, configured_catalog=configured_catalog)
destination = BigQueryDestination(dataset='YOUR-PROJECT.YOUR_DATASET')
state = destination.get_state()
data = source.extract(state=state)
destination.load(data)
Deploy
To deploy to a Cloud Run job, edit the Dockerfile to pick the Airbyte source you like, then run:
Limitations
- BigQuery Destination connector only works in append mode
- Data at destination is in raw format. No data parsing is done.
We believe, like Airbyte, that it is a good thing to decouple data moving and data transformation. To shape your data you may want to use a tool such as dbt. Thus, we follow the EL-T philosophy.
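One consequence of append mode is that the same entity can appear several times in the raw data, once per sync that returned it. Keeping only the latest version is a typical downstream transformation. The sketch below shows the logic in Python for illustration only (in practice it would more likely be a SQL model in a tool such as dbt); the column names `id` and `emitted_at` are assumptions about the raw rows.

```python
def latest_per_key(raw_rows, key="id", version="emitted_at"):
    """Keep, for each key, the row with the highest version value."""
    latest = {}
    for row in raw_rows:
        current = latest.get(row[key])
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return list(latest.values())


# Made-up raw rows: entity 1 was loaded twice by two successive syncs.
raw_rows = [
    {"id": 1, "title": "draft", "emitted_at": 100},
    {"id": 2, "title": "other", "emitted_at": 100},
    {"id": 1, "title": "final", "emitted_at": 200},
]
current_rows = latest_per_key(raw_rows)
```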
Credits
The generation of the sample connector configuration in yaml is heavily inspired by the code of the octavia CLI developed by Airbyte.
Contribute
Any contribution is more than welcome 🤗!
- Add a ⭐ on the repo to show your support
- Raise an issue to report a bug or suggest improvements
- Open a PR! Below are some suggestions of work to be done:
  - improve secrets management
  - implement a CLI
  - manage configurations as yaml files
  - implement the get_logs method of BigQueryDestination
  - add a new destination connector (Cloud Storage?)
  - add more serverless deployment examples
  - implement optional post-processing (replace, upsert data at destination instead of append?)