llama-index readers airbyte_cdk integration
Airbyte CDK Loader
The Airbyte CDK Loader is a shim for sources created using the Airbyte Python CDK. It allows you to load data from any Airbyte source into LlamaIndex.
Installation
- Install llama_hub:
pip install llama_hub
- Install airbyte-cdk:
pip install airbyte-cdk
- Install a source via git (or implement your own):
pip install "git+https://github.com/airbytehq/airbyte.git@master#egg=source_github&subdirectory=airbyte-integrations/connectors/source-github"
Usage
Implement and import your own source. You can find lots of resources for how to achieve this on the Airbyte documentation page.
Here's an example usage of the AirbyteCDKReader.
from llama_hub.airbyte_cdk import AirbyteCDKReader

# This is just an example; you can use any source here. This one is installed
# from the Airbyte GitHub repo with the pip command shown under Installation.
from source_github.source import SourceGithub

github_config = {
    # ...
}
reader = AirbyteCDKReader(source_class=SourceGithub, config=github_config)
documents = reader.load_data(stream_name="issues")
By default, all fields are stored as metadata in the documents, and the text is set to the JSON representation of all the fields. To construct the document text yourself, pass a record_handler to the reader:
from llama_index import Document


def handle_record(record, id):
    # Use the record's title as the text and keep every field as metadata.
    return Document(
        doc_id=id, text=record.data["title"], extra_info=record.data
    )


reader = AirbyteCDKReader(
    source_class=SourceGithub,
    config=github_config,
    record_handler=handle_record,
)
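As a rough sketch of what a record handler might do beyond picking a single field, the following joins several fields into the document text. FakeRecord and FakeDocument are hypothetical stand-ins so the snippet runs without LlamaIndex or Airbyte installed, and the field names title and body are assumptions about the stream's schema. In real use the handler would return a LlamaIndex Document as shown above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class FakeRecord:
    """Hypothetical stand-in for an Airbyte record; only `.data` is used."""
    data: Dict[str, Any]


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LlamaIndex Document."""
    doc_id: str
    text: str
    extra_info: Dict[str, Any] = field(default_factory=dict)


def handle_record(record, id):
    # Join the fields that should end up in the text; skip missing or empty ones.
    parts = [record.data.get("title"), record.data.get("body")]
    text = "\n\n".join(p for p in parts if p)
    return FakeDocument(doc_id=id, text=text, extra_info=record.data)


doc = handle_record(
    FakeRecord({"title": "Crash on start", "body": None, "number": 7}),
    "issue-7",
)
```

Using .get() rather than indexing keeps the handler from raising when a record lacks one of the fields, at the cost of silently producing shorter text.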
Lazy loads
The reader.load_data method collects all documents and returns them as a list. With a large number of documents, this can exhaust memory. reader.lazy_load_data returns an iterator instead, which can be consumed document by document without keeping all documents in memory.
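One way to consume such an iterator is in fixed-size batches, so that only one batch is held in memory at a time. The sketch below is not part of the loader: in_batches is a hypothetical helper, and fake_lazy_load is a stand-in generator used so the snippet runs on its own. With a real reader you would pass the iterator returned by reader.lazy_load_data instead.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items, pulling from `items` lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_lazy_load(n: int) -> Iterator[str]:
    # Hypothetical stand-in for the iterator a lazy loader would return.
    for i in range(n):
        yield f"doc-{i}"


# Each batch could be indexed or written out before the next one is fetched.
batch_sizes = [len(batch) for batch in in_batches(fake_lazy_load(7), 3)]
```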
Incremental loads
If a stream supports it, this loader can be used to load data incrementally (only returning documents that weren't loaded last time or got updated in the meantime):
reader = AirbyteCDKReader(source_class=SourceGithub, config=github_config)
documents = reader.load_data(stream_name="issues")
current_state = reader.last_state # can be pickled away or stored otherwise
updated_documents = reader.load_data(
    stream_name="issues", state=current_state
)  # only loads documents that were updated since last time
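For incremental loads to work across separate runs, the state has to be persisted somewhere between them. The following is a minimal sketch of doing that with pickle, as the comment above suggests. save_state and load_state are hypothetical helpers, the file name is arbitrary, and the state value is made up for the round trip; a real one would come from reader.last_state.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional


def save_state(path: Path, state: Any) -> None:
    """Write the state to a temp file first, then rename it into place."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    with tmp.open("wb") as f:
        pickle.dump(state, f)
    tmp.replace(path)


def load_state(path: Path) -> Optional[Any]:
    """Return the saved state, or None if nothing has been saved yet."""
    if not path.exists():
        return None
    with path.open("rb") as f:
        return pickle.load(f)


# Round trip with a made-up state value.
state_file = Path(tempfile.mkdtemp()) / "issues_state.pkl"
first_run_state = load_state(state_file)
save_state(state_file, {"issues": {"updated_at": "2024-01-01T00:00:00Z"}})
restored = load_state(state_file)
```

Writing to a temporary file and renaming it means an interrupted run is less likely to leave a half-written state file behind. Only unpickle state files you wrote yourself, since pickle can execute arbitrary code on load.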
This loader is designed to be used as a way to load data into LlamaIndex and/or subsequently used as a Tool in a LangChain Agent. See here for examples.