
Spark SQL framework for Databricks jobs

Project description

sparqlin

sparqlin is a Spark SQL framework designed to simplify job creation and management in Databricks environments. It builds on Spark SQL and PySpark to provide a streamlined development experience.

The framework was specifically created to empower data analysts who may not have deep development skills. It provides a streamlined approach to adopting standard software development life cycles, enabling analysts to focus on working with data without the need to master complex programming paradigms. By leveraging familiar tools like SQL scripts and YAML files, the framework simplifies tasks such as data configuration, transformation, and testing.

This enables teams to:

  • Bridge the gap between data analysis and software engineering.
  • Enhance collaboration and maintain clear development processes.
  • Encourage reusable and maintainable data workflows, all while adhering to best practices.

Features

  • Simplifies the creation of Spark SQL jobs for Databricks.
  • Flexible integration with PySpark and Spark SQL.
  • YAML-based configuration for job definitions.
  • Built-in support for testing through pytest.
  • Integrates with tools such as GitPython and provides system monitoring via psutil.
  • Works with Databricks Asset Bundles.

Installation

You can install sparqlin directly from PyPI using pip:

pip install sparqlin
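
In a Databricks notebook, the package can also be installed on the attached cluster with the notebook-scoped pip magic (standard Databricks functionality, not specific to sparqlin):

%pip install sparqlin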

Requirements

sparqlin requires Python 3.11 or higher (the version shipped with Databricks Runtime 15.4 LTS). Ensure you have the following dependencies installed:

  • pyspark>=3.5.0
  • pytest
  • pyyaml
  • psutil
  • gitpython

Getting Started

Example Usage of the ETL Framework

To use sparqlin for creating and running Spark SQL jobs in Databricks, follow these steps:

  1. Initialize a Project: Start by creating a structure for your project, including the YAML configuration files that define your Databricks jobs.
  2. Create Spark SQL Transformations: Add a sql/ directory containing the .sql files with the queries that will run on Databricks.
  3. Load and Run Jobs: Use the framework's entry point to parse the configuration and execute the jobs in Databricks.

Here is a typical layout of an analytical project:

databricks_default_project/
├── README.md
├── sql/
│   ├── query1.sql
│   ├── query2.sql
│   └── ...
├── tests/
│   ├── __init__.py
│   ├── test_query1.py
│   ├── test_query2.py
│   └── ...
├── resources/
│   └── databricks_default_project.job.yml
└── databricks.yml
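
The databricks.yml at the project root is the Databricks Asset Bundle definition that pulls in the job resources. It is not shown in this README; a minimal sketch, assuming the layout produced by the standard bundle template (the workspace host is a placeholder), might look like this:

# databricks.yml (illustrative sketch)
bundle:
  name: databricks_default_project

include:
  - resources/*.yml

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://<your-workspace>.cloud.databricks.com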

Below is an example configuration of jobs and tasks (databricks_default_project.job.yml):

# An example job configuration for databricks_default_project.
resources:
  jobs:
    databricks_default_project_job:
      trigger:
        # Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger
        periodic:
          interval: 1
          unit: DAYS
      email_notifications:
        on_failure:
          - some_email@some_domain.com
      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
            autoscale:
              min_workers: 1
              max_workers: 1
      tasks:
        - task_key: query1_sparqlin
          # existing_cluster_id: ${var.cluster_id} # Existing cluster can be used instead
          job_cluster_key: job_cluster
          libraries:
            - pypi:
                package: sparqlin==0.1.11          # Install your package via PyPI (or custom repository)
            # - whl: ../dist/*.whl                 # Alternatively, you can upload the sparqlin wheel into Volume
          python_wheel_task:
            package_name: "sparqlin"               # Package name as defined in your setup.py or pyproject.toml
            entry_point: "sparqlin"                # Entry point defined in the package
            named_parameters:
              sql-query-path: "${workspace.root_path}/files/sql/query1.sql"
              table-name: "sparkdev.default.taxi_top_five"
        - task_key: query2_sparqlin
          depends_on:
            - task_key: query1_sparqlin
          # existing_cluster_id: ${var.cluster_id} # Existing cluster can be used instead
          job_cluster_key: job_cluster
          libraries:
            - pypi:
                package: sparqlin==0.1.11          # Install your package via PyPI (or custom repository)
            # - whl: ../dist/*.whl                 # Alternatively, you can upload the sparqlin wheel into Volume
          python_wheel_task:
            package_name: "sparqlin"               # Package name as defined in your setup.py or pyproject.toml
            entry_point: "sparqlin"                # Entry point defined in the package
            named_parameters:
              sql-query-path: "${workspace.root_path}/files/sql/query2.sql"
              table-name: "sparkdev.default.taxi_count"
        # You can also add further `tasks` entries that use other frameworks or task types that Databricks supports
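
Each python_wheel_task above invokes the package's console entry point named sparqlin. For orientation, such an entry point is declared in the package's setup.py (or the equivalent [project.scripts] table in pyproject.toml). The sketch below is illustrative only; the module path sparqlin.main:main is an assumption, not necessarily the package's actual layout.

# setup.py (illustrative sketch)
from setuptools import find_packages, setup

setup(
    name="sparqlin",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # Maps the entry_point "sparqlin" used by python_wheel_task to a callable in the package
            "sparqlin = sparqlin.main:main",  # hypothetical module path
        ],
    },
)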

Example Spark SQL transformation (sql/query1.sql)

SELECT * FROM samples.nyctaxi.trips LIMIT 5;
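
The job configuration above also references a second transformation, sql/query2.sql, whose result is written to sparkdev.default.taxi_count. Its contents are not shown here; a plausible sketch, assuming it simply counts the sample taxi trips, could be:

SELECT COUNT(*) AS trip_count FROM samples.nyctaxi.trips;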

Example Usage of the Testing Framework

Test Parameterized Dataset Paths

This example tests loading datasets into Spark DataFrames from YAML configuration files. It uses pytest fixtures to dynamically provide the datasets_path.

import pytest

from sparqlin.testing.helpers import get_spark_dataframe


@pytest.mark.parametrize("datasets_path", ["tests/testing/datasets_test/datasets.yml"], indirect=True)
def test_base_test_config(spark_session, datasets_path):
    # Load test tables as DataFrames
    test_table_df = get_spark_dataframe(spark_session, datasets_path, "testdb.test_table")
    second_table_df = get_spark_dataframe(spark_session, datasets_path, "testdb.second_table")

    # Validate record counts
    assert test_table_df.count() == 3
    assert second_table_df.count() == 2

Configuring and Testing Temporary Hive Tables

This example demonstrates how to use BaseTestConfig to register tables as temporary datasets in Spark and perform SQL operations.

from sparqlin.testing.base_test_config import BaseTestConfig


def test_hive_table_operations(hive_data_yaml, tmp_path_factory):
    # hive_data_yaml is a project-defined fixture returning the datasets YAML path and a temporary directory
    datasets_file, tmp_path = hive_data_yaml

    # Initialize BaseTestConfig
    config = BaseTestConfig(tmp_path_factory)

    # Set datasets location
    config.DATASETS_LOCATION = datasets_file

    # Create Spark session
    spark = config.create_spark_session()

    # Register tables from YAML file
    config.register_tables(spark)

    # Verify table registration
    test_table_df = spark.sql("SELECT * FROM testdb.test_table")
    second_table_df = spark.sql("SELECT * FROM testdb.second_table")
    assert test_table_df.count() == 3

    # Perform join operation
    joined_df = test_table_df.join(second_table_df, test_table_df.id == second_table_df.id)
    joined_results = joined_df.select("name", "value").collect()
    assert len(joined_results) == 2
    assert any(row.name == "Alice" and row.value == 100 for row in joined_results)
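
The same pattern can be used to exercise an actual transformation from the sql/ directory. The sketch below is illustrative: it assumes a hypothetical query file, aggregate_test_table.sql, that only references tables registered from the datasets YAML.

from pathlib import Path

from sparqlin.testing.base_test_config import BaseTestConfig


def test_sql_transformation(hive_data_yaml, tmp_path_factory):
    datasets_file, _ = hive_data_yaml

    # Spin up a local Spark session and register the test tables from the YAML file
    config = BaseTestConfig(tmp_path_factory)
    config.DATASETS_LOCATION = datasets_file
    spark = config.create_spark_session()
    config.register_tables(spark)

    # Read a project SQL file and run it against the registered test tables
    query = Path("sql/aggregate_test_table.sql").read_text().strip().rstrip(";")
    result_df = spark.sql(query)

    assert result_df.count() > 0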

Development Setup

To contribute or set up a local development environment for sparqlin, follow these steps:

  1. Clone the repository:
    git clone https://gitlab.com/rokorolev/sparqlin.git
    cd sparqlin
    
  2. Install dependencies:
    pip install -r requirements.txt
    
  3. Run the tests: The framework uses pytest for testing. You can run the test suite as follows:
    pytest
    

Build the Package

  1. Install build tools:
    pip install setuptools wheel
  2. Build the package:
    rm -rf build dist *.egg-info
    python setup.py sdist bdist_wheel
    

Upload the Package to PyPI

  1. Install Twine:
    pip install twine
  2. Generate an API token for your PyPI account.
  3. Upload the package:
    twine upload dist/*

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributions

Contributions are welcome! Feel free to fork the repository, create a feature branch, and submit a merge request. Please ensure proper test coverage for new functionality.

Issues

If you encounter a bug or have a feature request, please open an issue on the project's GitLab repository.

Author

Developed and maintained by Roman Korolev.

Download files

Download the file for your platform.

Source Distribution

sparqlin-0.1.11.tar.gz (17.1 kB)

Built Distribution

sparqlin-0.1.11-py3-none-any.whl (16.8 kB)

File details

Details for the file sparqlin-0.1.11.tar.gz.

File metadata

  • Download URL: sparqlin-0.1.11.tar.gz
  • Upload date:
  • Size: 17.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.11.9

File hashes

Hashes for sparqlin-0.1.11.tar.gz:

  • SHA256: dc7f92be07927a272c2405f79381d2d76c384311220ceedbc0f4766a8744b327
  • MD5: bd0fd37689db1d72c24a2ccbababf414
  • BLAKE2b-256: 15f2557e28de10a95e83894d244168923200f6d0bb31be9a85ae73b986f525bb


File details

Details for the file sparqlin-0.1.11-py3-none-any.whl.

File metadata

  • Download URL: sparqlin-0.1.11-py3-none-any.whl
  • Upload date:
  • Size: 16.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.11.9

File hashes

Hashes for sparqlin-0.1.11-py3-none-any.whl:

  • SHA256: 875dcf98931ea7cb2b8de391af8981281b98af3910faf59654c2fa40436ea713
  • MD5: cfc9a737042fcddc2127f84f1b67b549
  • BLAKE2b-256: a8b95478eb1d272d5e5be19850fb007ac797b2fb1d6d8f4c3046055da4148a78

