Enables Versatile Data Kit (VDK) to integrate with various data sources by providing a unified interface for data ingestion and management.

data-sources

The data-sources project is a plugin for the Versatile Data Kit (VDK). It aims to simplify data ingestion from multiple sources by offering a single, unified API. Whether you're dealing with databases, REST APIs, or other forms of data, this project allows you to manage them all in a consistent manner. This is crucial for building scalable and maintainable data pipelines.

Usage

pip install vdk-data-sources

Concepts

Data Source

A Data Source is a central component responsible for establishing and managing a connection to a specific set of data. It interacts with a given configuration and maintains a stateful relationship with the data it accesses. This stateful relationship can include information such as authentication tokens, data markers, or any other form of metadata that helps manage the data connection. The Data Source exposes various data streams through which data can be read.
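
As a rough illustration of this idea, the sketch below models a data source as an object that holds a configuration, connects with an optional previously saved state, and exposes named streams. All names here (SketchDataSource, connect, disconnect, streams, the "tables" config key) are hypothetical and chosen for this example; they are not the plugin's actual classes or method signatures.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchDataSource:
    """Hypothetical model of a data source; not the plugin's real interface."""

    config: Dict[str, Any]
    state: Dict[str, Any] = field(default_factory=dict)
    connected: bool = False

    def connect(self, saved_state: Optional[Dict[str, Any]] = None) -> None:
        # Restore any state persisted by a previous run (tokens, markers, ...).
        if saved_state:
            self.state.update(saved_state)
        self.connected = True

    def disconnect(self) -> None:
        self.connected = False

    def streams(self) -> List[str]:
        # One stream name per configured table; requires an open connection.
        if not self.connected:
            raise RuntimeError("connect() must be called before streams()")
        return list(self.config.get("tables", []))


source = SketchDataSource(config={"tables": ["users", "orders"]})
source.connect(saved_state={"token": "abc"})
print(source.streams())
source.disconnect()
```

The state is kept on the source object so that a later run can pick up where an earlier one stopped.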

Data Source Stream

A Data Source Stream is an abstraction over a subset of data in the Data Source. It can be thought of as a channel through which data flows. Each Data Source Stream has a unique name to identify it and includes methods to read data from the stream. Streams can be ingested in parallel.

Examples:

In a database (like PostgreSQL), each table could be a separate stream.
In a message broker like Apache Kafka, each topic within Kafka acts as a distinct Data Source Stream.
In a REST API, the data source is the HTTP base URL (http://xxx.com). Each endpoint could be a separate data stream (http://xxx.com/users, http://xxx.com/admins).

Reading from the stream yields a sequence of Data Source Payloads.
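
Because each stream is independent, streams lend themselves to parallel reads. The following is a minimal sketch of that idea using a thread pool from the Python standard library. SketchStream, read, and ingest_all are hypothetical names for illustration and do not reflect how the plugin schedules ingestion internally.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class SketchStream:
    """Hypothetical stream: a named channel over a subset of the data."""

    name: str
    rows: List[dict]

    def read(self) -> Iterator[dict]:
        # Yield one record at a time, as a real stream would yield payloads.
        yield from self.rows


def ingest_all(streams: List[SketchStream]) -> Dict[str, List[dict]]:
    # Drain each stream in its own worker thread and collect results by name.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {s.name: pool.submit(lambda s=s: list(s.read())) for s in streams}
        return {name: future.result() for name, future in futures.items()}


# Two "tables" of a database modeled as two streams.
streams = [
    SketchStream("users", [{"id": 1}, {"id": 2}]),
    SketchStream("orders", [{"id": 10}]),
]
result = ingest_all(streams)
print(result)
```

Keying the results by stream name relies on the names being unique, which matches the requirement stated above.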

Data Source Payload

The Data Source Payload is a data structure that encapsulates the actual data along with its metadata. The payload consists of the following main parts:

Data: the core data that needs to be ingested (e.g. in a database, the table content).
Metadata: a dictionary containing additional contextual information about the data (for example timestamps, environment-specific metadata, etc.).
State: the state of the data source stream as of this payload. For example, in the case of incremental ingestion from a database table, it would contain the value of an incremental key column (e.g. the updated_time column in the table), which can be used to restart or continue the ingestion later.
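
To make the role of State concrete, here is a small sketch of incremental reading: each payload carries the updated_time of its row, and a later run passes the last saved state back in to skip rows that were already ingested. SketchPayload and read_incrementally are hypothetical names, and the in-memory list stands in for a database table.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class SketchPayload:
    """Hypothetical payload: data plus metadata plus resumable state."""

    data: Dict[str, Any]
    metadata: Dict[str, Any] = field(default_factory=dict)
    state: Dict[str, Any] = field(default_factory=dict)


def read_incrementally(
    table: List[Dict[str, Any]], last_state: Optional[Dict[str, Any]] = None
) -> Iterator[SketchPayload]:
    # Resume after the last seen value of the incremental key column.
    marker = (last_state or {}).get("updated_time", 0)
    for row in sorted(table, key=lambda r: r["updated_time"]):
        if row["updated_time"] > marker:
            yield SketchPayload(
                data=row,
                metadata={"source_table": "users"},
                state={"updated_time": row["updated_time"]},
            )


table = [
    {"id": 1, "updated_time": 100},
    {"id": 2, "updated_time": 200},
    {"id": 3, "updated_time": 300},
]

# First run sees only the first two rows; the third arrives later.
first_run = list(read_incrementally(table[:2]))
saved_state = first_run[-1].state  # state of the last payload read
second_run = list(read_incrementally(table, saved_state))
print([p.data["id"] for p in second_run])
```

Only the row with id 3 is read in the second run, because the saved state marks everything up to updated_time 200 as already ingested.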

Configuration

(vdk config-help is a useful command to browse all config options of your vdk installation.)

Example

To build your own data source, you can use this data source as an example or reference.

To register the source, use the vdk_data_sources_register hook.

Then you can use it in a data job like this:

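# Import paths assume the vdk-data-sources package layout; verify against your installed version.
from vdk.api.job_input import IJobInput
from vdk.plugin.data_sources.mapping.data_flow import DataFlowInput
from vdk.plugin.data_sources.mapping.definitions import DataFlowMappingDefinition, DestinationDefinition, SourceDefinition

def run(job_input: IJobInput):
    # Read from the built-in auto-generated source and write to an in-memory destination.
    source = SourceDefinition(id="auto", name="auto-generated-data", config={})
    destination = DestinationDefinition(id="auto-dest", method="memory")

    with DataFlowInput(job_input) as flow_input:
        flow_input.start(DataFlowMappingDefinition(source, destination))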

Alternatively, define the same flow in a config.toml file and load it in the data job:

[sources.auto]
name="auto-generated-data"
config={}
[destinations.auto-dest]
method="memory"
[[flows]]
from="auto"
to="auto-dest"

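from vdk.api.job_input import IJobInput
from vdk.plugin.data_sources.mapping import toml_parser  # module path assumed; verify for your version
from vdk.plugin.data_sources.mapping.data_flow import DataFlowInput

def run(job_input: IJobInput):
    with DataFlowInput(job_input) as flow_input:
        flow_input.start_flows(toml_parser.load_config("config.toml"))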

Build and testing

pip install -r requirements.txt
pip install -e .
pytest

In the VDK repo, the ../build-plugin.sh script can also be used.

Note about the CI/CD:

.plugin-ci.yaml is needed only for plugins that are part of the Versatile Data Kit Plugin repo.

The CI/CD is separated into two stages, a build stage and a release stage. The build stage is made up of a few jobs, all of which inherit from the same job configuration and differ only in the Python version they use (3.7, 3.8, 3.9 and 3.10). They run according to rules, which are ordered so that changes to a plugin's directory trigger that plugin's CI, but changes to a different plugin do not.
