
Command-line compiler for DataForge Core projects

Project description


DataForge helps data analysts and engineers build and extend data solutions by leveraging modern software engineering principles.


Understanding DataForge

DataForge enables writing inline functions using single-column SQL expressions rather than CTEs, procedural scripts, or set-based models.

For an overview of the underlying concepts, check out this introduction blog.

Each function:

  • is pure, with no side effects
  • returns a single column
  • is composable with other functions
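
For illustration, here is a hedged sketch of the difference in plain Spark SQL, using a hypothetical raw_orders table rather than actual DataForge project syntax:

    -- Set-based style: derived values buried in a hand-written CTE pipeline.
    WITH taxed AS (
        SELECT order_id, amount, amount * tax_rate AS tax
        FROM raw_orders
    )
    SELECT order_id, amount + tax AS total
    FROM taxed;

    -- Function style: each derived value is a pure, single-column expression;
    -- the compiler composes them into one query.
    SELECT
        order_id,
        amount * tax_rate AS tax,
        amount + amount * tax_rate AS total   -- composes the tax expression
    FROM raw_orders;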

By applying software engineering principles such as function purity and composability, DataForge keeps projects easy to modify and extend, even with thousands of integrated pipelines.

Explore the Core CLI or learn more about how Core powers DataForge Cloud.

Requirements

DataForge Core is a code framework and command-line tool for developing transformation functions and compiling them into executable Spark SQL.

To run the CLI you will need:

  • Java 8 or higher
  • A PostgreSQL v14+ server with a dedicated empty database
    • Check out our friends over at Tembo
  • Python version 3.12+

The CLI also includes an integration to run the compiled code in Databricks. As the configuration step below implies, to support this you will need:

  • A Databricks workspace
  • A Databricks SQL Warehouse
  • A personal access token
  • A Unity Catalog catalog and schema to run against

Installation and Quickstart

  • Open a new command line window

  • Validate Java and Python are installed correctly:

    > java --version
    openjdk 21.0.3 2024-04-16 LTS
    
    > python --version
    Python 3.12.3
    
  • Install DataForge Core by running:

    > pip install dataforge-core
    Collecting dataforge-core...
    Installing collected packages: dataforge-core
    Successfully installed dataforge-core...
    
  • Validate installation:

    > dataforge --version
    dataforge-core 1.0.0
    
  • Configure connections and credentials for Postgres and, optionally, Databricks:

    > dataforge --configure
    Enter postgres connection string: postgresql://<user>:<password>@<postgres-server-url>:5432/postgres
    Do you want to configure Databricks SQL Warehouse connection (y/n)? y
    Enter Server hostname: <workspace-url>.cloud.databricks.com
    Enter HTTP path: /sql/1.0/warehouses/<warehouse-guid>
    Enter access token: <token-guid>
    Enter catalog name: <unity_catalog_name>
    Enter schema name: <schema_in_catalog_name>
    Connecting to Databricks SQL Warehouse <workspace-url>.cloud.databricks.com
    Databricks connection validated successfully
    Profile saved in C:\Users...
    
  • Navigate to an empty folder and initialize project structure and sample files:

    > dataforge --init
    Initialized project in C:\Users...
    
  • Deploy DataForge structures to Postgres (a verification query appears after this quickstart):

    > dataforge --seed
    All objects in schema(s) log,meta in postgres database will be deleted. Do you want to continue (y/n)? y
    Initializing database..
    Database initialized
    
  • Build the sample project (see the SQL sketch after this quickstart):

    > dataforge --build
    Validating project path C:\Users...
    Started import with id 1
    Importing project files...
    <list of files>
    Files parsed
    Loading objects...
    Objects loaded
    Expressions validated
    Generated 8 source queries
    Generated 1 output queries
    Generated run.sql
    Import completed successfully
    
  • Execute in Databricks:

    > dataforge --run
    Connecting to Databricks SQL Warehouse <workspace-url>.cloud.databricks.com
    Executing query
    Execution completed successfully
    
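If you want to double-check the seed step, the log and meta schemas it creates can be verified with a standard Postgres query (plain information_schema SQL, nothing DataForge-specific):

    SELECT schema_name
    FROM information_schema.schemata
    WHERE schema_name IN ('log', 'meta');
    -- Both rows should appear after a successful `dataforge --seed`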

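The run.sql produced by the build step is plain Spark SQL. As a rough, hypothetical sketch (the real queries are derived from your project's source and rule definitions), a generated source query composes the rule expressions along these lines:

    -- Hypothetical shape only; actual generated SQL depends on the project.
    CREATE OR REPLACE TEMPORARY VIEW source_orders AS
    SELECT
        order_id,
        amount,
        amount * tax_rate AS tax   -- compiled single-column rule expression
    FROM raw_orders;
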
Commands

-h, --help                    Display this help message and exit
-v, --version                 Display the installed DataForge version
-c, --configure               Connect to Postgres database and optionally Databricks SQL Warehouse
-s, --seed                    Deploy tables and scripts to Postgres database
-i, --init [Project Path]     Initialize project folder structure with sample code
-b, --build [Project Path]    Compile code, store results in Postgres, and generate target SQL files
-r, --run [Project Path]      Run compiled project on Databricks SQL Warehouse
-p, --profile [Profile Path]  Update path of stored credentials profile file




Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dataforge_core-1.2.0.tar.gz (65.1 kB)


Built Distribution

dataforge_core-1.2.0-py3-none-any.whl (68.6 kB)


File details

Details for the file dataforge_core-1.2.0.tar.gz.

File metadata

  • Download URL: dataforge_core-1.2.0.tar.gz
  • Upload date:
  • Size: 65.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for dataforge_core-1.2.0.tar.gz
Algorithm    Hash digest
SHA256       c7d4eba83644b575657c57a4e632c47da9af4b088db357954068f759a0634d36
MD5          0ca55d84fb9e7f8f96a6573307a1eff8
BLAKE2b-256  f947c4287b5a522f58551f487d800472d37804786579d972b2437b7883f7d31c

See the PyPI documentation for more details on using hashes.

Provenance

The following attestation bundles were made for dataforge_core-1.2.0.tar.gz:

Publisher: python-publish.yml on dataforgelabs/dataforge-core


File details

Details for the file dataforge_core-1.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for dataforge_core-1.2.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       383402d3c1acd7a7084725b7a088ecf309263d08a677dd4f2b04a5e4f151ff20
MD5          71904b9df2adefc36294ec44dabecc6f
BLAKE2b-256  2a482d9477ef5e3bca51b07b418ada4250b7737bc961eb3676f97a1b8e51b726

See the PyPI documentation for more details on using hashes.

Provenance

The following attestation bundles were made for dataforge_core-1.2.0-py3-none-any.whl:

Publisher: python-publish.yml on dataforgelabs/dataforge-core

