
llama-index readers airbyte_cdk integration


Airbyte CDK Loader

The Airbyte CDK Loader is a shim for sources created using the Airbyte Python CDK. It allows you to load data from any Airbyte source into LlamaIndex.

Installation

  • Install llama_hub: pip install llama_hub
  • Install airbyte-cdk: pip install airbyte-cdk
  • Install a source via git (or implement your own): pip install 'git+https://github.com/airbytehq/airbyte.git@master#egg=source_github&subdirectory=airbyte-integrations/connectors/source-github' (the quotes keep your shell from interpreting the & and # characters in the URL)

Usage

Implement and import your own source. You can find resources on how to do this in the Airbyte documentation.

Here's an example usage of the AirbyteCdkReader.

from llama_hub.airbyte_cdk import AirbyteCDKReader

# SourceGithub is just an example; any Airbyte source works. This one is
# installed from the Airbyte GitHub repo via the pip command shown above.
from source_github.source import SourceGithub


github_config = {
    # ...
}
reader = AirbyteCDKReader(source_class=SourceGithub, config=github_config)
documents = reader.load_data(stream_name="issues")

By default, all fields are stored as metadata in the documents, and the text is set to the JSON representation of all the fields. To construct the document text yourself, pass a record_handler to the reader:

from llama_index import Document


def handle_record(record, id):
    return Document(
        doc_id=id, text=record.data["title"], extra_info=record.data
    )


reader = AirbyteCDKReader(
    source_class=SourceGithub,
    config=github_config,
    record_handler=handle_record,
)

Lazy loads

The reader.load_data endpoint collects all documents and returns them as a list. If there are a large number of documents, this can exhaust memory. Using reader.lazy_load_data instead returns an iterator that can be consumed document by document, without keeping all documents in memory at once.
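As a sketch (assuming the same SourceGithub source and github_config as above; index_document is a hypothetical downstream consumer):

```python
from llama_hub.airbyte_cdk import AirbyteCDKReader
from source_github.source import SourceGithub

github_config = {
    # ...
}
reader = AirbyteCDKReader(source_class=SourceGithub, config=github_config)

# lazy_load_data yields documents one at a time, so each one can be
# processed and released before the next is fetched.
for document in reader.lazy_load_data(stream_name="issues"):
    index_document(document)  # hypothetical downstream handler
```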

Incremental loads

If a stream supports it, this loader can be used to load data incrementally (only returning documents that weren't loaded last time or were updated in the meantime):

reader = AirbyteCDKReader(source_class=SourceGithub, config=github_config)
documents = reader.load_data(stream_name="issues")
current_state = reader.last_state  # can be pickled away or stored otherwise

updated_documents = reader.load_data(
    stream_name="issues", state=current_state
)  # only loads documents that were updated since last time

This loader is designed to be used as a way to load data into LlamaIndex and/or subsequently used as a Tool in a LangChain Agent. See here for examples.
