
Encapsulating Apache Spark for Easy Usage


Xursparks - XAIL's Apache Spark Framework

Overview

Welcome to the Xurpas AI Lab (XAIL) department's Apache Spark Framework. This framework is specifically designed to help XAIL developers implement Extract, Transform, Load (ETL) processes seamlessly and uniformly. Additionally, it includes integration capabilities with the Data Management and Configuration Tool (DMCT) to streamline your data workflows.

Table of Contents

  1. Introduction
  2. Prerequisites
  3. Installation
  4. Usage
  5. Best Practices
  6. Contributing
  7. Support
  8. License

Introduction

This framework aims to provide a robust, standardized approach for XAIL developers to handle ETL processes using Apache Spark. Building on it helps keep your data pipelines efficient, maintainable, and easy to integrate with DMCT.

Prerequisites

Before you begin, ensure you have met the following requirements:

  • Apache Spark 3.0 or higher
  • Python 3.10 or higher
  • Access to the DMCT tool and relevant API keys
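
The version requirements above can be checked from Python before a job is submitted. The following is a minimal sketch, not part of Xursparks: the helper names `parse_version` and `check_prerequisites` are hypothetical, and it assumes Spark is provided through the `pyspark` pip package (a cluster-provided Spark installation would need a different check).

```python
import sys
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple of ints."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_prerequisites(min_python=(3, 10), min_spark=(3, 0)):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    # Python interpreter version.
    if sys.version_info[:2] < min_python:
        found = sys.version.split()[0]
        problems.append(
            f"Python {min_python[0]}.{min_python[1]} or higher required, found {found}"
        )

    # Spark version, assuming Spark comes from the pyspark pip package.
    try:
        spark_version = metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        problems.append("pyspark is not installed in this environment")
    else:
        if parse_version(spark_version)[:2] < min_spark:
            problems.append(
                f"Spark {min_spark[0]}.{min_spark[1]} or higher required, "
                f"found {spark_version}"
            )
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Prerequisite not met:", problem)
```

Access to DMCT and its API keys cannot be verified this way and still has to be confirmed separately.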

Installation

To use the framework, follow these steps:

  1. Install xursparks in your Python environment:
pip install xursparks
  2. Check that it is properly installed:
pip list

Usage

Setting Up Your Spark Application

To start using the framework, create an ETL job as follows:

import xursparks

xursparks.initialize(args)

ETL Process Implementation

The framework provides predefined templates and utility functions to facilitate your ETL processes.

sourceTables = xursparks.getSourceTables()
sourceDataStorage = sourceTables.get("scheduled_manhours_ELE")
processDate = xursparks.getProcessDate()
sourceDataset = xursparks.loadSourceTable(dataStorage=sourceDataStorage,
                                          processDate=processDate)

Integration with DMCT

To integrate with the DMCT tool, ensure you have the required configurations set up in your application.properties file:

[default]
usage.logs=<usage logs>
global.config=<dmct global config api>
job.context=<dmct job context api>
api.token=<dmct api token>
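
Before submitting a job, it can help to confirm that the properties file contains all four keys. The sketch below uses Python's standard `configparser` for that. It is an illustration only: how Xursparks itself reads the file is not documented here, the function name `load_dmct_settings` is hypothetical, and the sample values are placeholders rather than real DMCT endpoints.

```python
import configparser

# Placeholder content in the same layout as application.properties.
SAMPLE = """\
[default]
usage.logs=https://dmct.example.invalid/usage-logs
global.config=https://dmct.example.invalid/global-config
job.context=https://dmct.example.invalid/job-context
api.token=example-token
"""

REQUIRED_KEYS = ("usage.logs", "global.config", "job.context", "api.token")


def load_dmct_settings(text, section="default"):
    """Parse properties text and return the section as a dict, or raise if incomplete."""
    # interpolation=None keeps any '%' characters in tokens or URLs literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(section):
        raise KeyError(f"section [{section}] not found")
    settings = dict(parser.items(section))
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        raise ValueError(f"missing or empty keys: {', '.join(missing)}")
    return settings


if __name__ == "__main__":
    settings = load_dmct_settings(SAMPLE)
    print(sorted(settings))
```

To check a real file, read its text with `open(path).read()` and pass it to the function. Note that `configparser` treats the lowercase `[default]` header as an ordinary section, distinct from its special uppercase `DEFAULT` section.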

Best Practices

  • Always validate your data at each stage of the ETL process.
  • Leverage Spark's built-in functions and avoid excessive use of UDFs (User Defined Functions) for better performance.
  • Ensure proper error handling and logging to facilitate debugging.
  • Keep your ETL jobs modular and maintainable by adhering to the single responsibility principle.
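
The validation, logging, and modularity points above can be sketched in a few lines. To keep the example runnable without a cluster, plain Python lists of dictionaries stand in for Spark DataFrames. The function names and the `candidate_id` and `hours` fields are hypothetical and are not part of Xursparks.

```python
import logging

logger = logging.getLogger("etl_sketch")


def extract():
    """Extract step: return raw records (hard-coded here for illustration)."""
    return [
        {"candidate_id": 1, "hours": "8"},
        {"candidate_id": 2, "hours": "7.5"},
        {"candidate_id": None, "hours": "6"},
    ]


def validate(records, required=("candidate_id", "hours")):
    """Validation step: split records into those with all required fields and the rest."""
    valid, rejected = [], []
    for record in records:
        if all(record.get(field) is not None for field in required):
            valid.append(record)
        else:
            rejected.append(record)
    if rejected:
        logger.warning("rejected %d of %d records", len(rejected), len(records))
    return valid, rejected


def transform(records):
    """Transform step: convert the hours field from text to float."""
    return [{**record, "hours": float(record["hours"])} for record in records]


def run_job():
    """Run the steps in order, logging and re-raising any failure."""
    try:
        raw = extract()
        valid, rejected = validate(raw)
        result = transform(valid)
    except Exception:
        logger.exception("ETL job failed")
        raise
    logger.info("loaded %d records, rejected %d", len(result), len(rejected))
    return result, rejected


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    run_job()
```

Each function has one responsibility, so a step can be tested or replaced on its own. In a real job the same shape would apply, with each step taking and returning DataFrames.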

Contributing

We welcome contributions to improve this framework. Please refer to the CONTRIBUTING.md file for guidelines on how to get started.

Support

If you encounter any issues or have questions, please reach out to the XAIL support team at support@xail.com.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.


Running Xursparks Job

  • SPARK-SUBMIT
spark-submit XurSparkSMain.py \
--master=local[*] \
--client-id=trami-data-folder \
--target-table=talentsolutions.candidate_reports \
--process-date=2023-05-24 \
--properties-file=job-application.properties \
--switch=1
  • Hadoop Sir Andy Setup
python AiLabsCandidatesDatamart.py \
--master=local[*] \
--deploy-mode=cluster \
--client-id=trami-data-folder \
--target-table=ailabs.candidates_transformed \
--process-date=2023-11-15 \
--properties-file=job-application.properties \
--switch=1
  • Hadoop
spark-submit \
--name AiLabsCandidatesDatamart \
--master yarn \
--jars aws-java-sdk-bundle-1.12.262.jar,hadoop-aws-3.3.4.jar \
--conf spark.yarn.dist.files=job-application.properties \
AiLabsCandidatesDatamart.py \
--keytab=hive.keytab \
--principal=hive/hdfscluster.local@HDFSCLUSTER.LOCAL \
--master=yarn \
--deploy-mode=cluster \
--client-id=trami-data-folder \
--target-table=ailabs.candidates_transformed \
--process-date=2023-11-16 \
--properties-file=job-application.properties \
--switch=1
  • Hadoop 3.3.2
spark-submit \
--name AiLabsCandidatesDatamart \
--master yarn \
--keytab hive.keytab \
--principal hive/hdfscluster.local@HDFSCLUSTER.LOCAL \
--jars aws-java-sdk-bundle-1.12.262.jar,hadoop-aws-3.3.4.jar,hive-jdbc-3.1.3.jar \
--conf spark.yarn.dist.files=job-application.properties \
AiLabsCandidatesDatamart.py \
--keytab=hive.keytab \
--principal=hive/hdfscluster.local@HDFSCLUSTER.LOCAL \
--master=yarn \
--deploy-mode=client \
--client-id=trami-data-folder \
--target-table=ailabs.candidates_transformed \
--process-date=2023-11-17 \
--properties-file=job-application.properties \
--switch=1
  • Hadoop testhdfs 3.3.2
spark-submit \
--name HdfsTest \
--master yarn \
--deploy-mode client \
--keytab hive.keytab \
--principal hive/hdfscluster.local@HDFSCLUSTER.LOCAL \
--jars aws-java-sdk-bundle-1.12.262.jar,hadoop-aws-3.3.4.jar \
--conf spark.yarn.dist.files=job-application.properties \
--driver-memory 4g \
--executor-memory 4g \
--executor-cores 2 \
HdfsTest.py \
--keytab=hive.keytab \
--principal=hive/hdfscluster.local@HDFSCLUSTER.LOCAL \
--master=yarn \
--deploy-mode=cluster \
--client-id=trami-data-folder \
--target-table=ailabs.candidates_transformed \
--process-date=2023-11-16 \
--properties-file=job-application.properties \
--switch=1
  • Hadoop
spark-submit \
--name AiLabsCandidatesDatamart \
--master yarn \
--jars aws-java-sdk-bundle-1.12.262.jar,hadoop-aws-3.3.4.jar,hive-jdbc-3.1.3.jar \
--conf spark.yarn.dist.files=job-application.properties \
AiLabsCandidatesDatamart.py \
--master=yarn \
--deploy-mode=client \
--client-id=trami-data-folder \
--target-table=ailabs.candidates_transformed \
--process-date=2023-11-19 \
--properties-file=job-application.properties \
--switch=1
  • Hadoop Employees
spark-submit \
--name AiLabsEmployeeDatamart \
--master yarn \
--keytab hive.keytab \
--principal hive/hdfscluster.local@HDFSCLUSTER.LOCAL \
--jars aws-java-sdk-bundle-1.12.262.jar,hadoop-aws-3.3.4.jar,hive-jdbc-3.1.3.jar,spark-excel_2.12-3.5.0_0.20.1.jar \
--conf spark.yarn.dist.files=job-application.properties \
AiLabsEmployeeDatamart.py \
--keytab=hive.keytab \
--principal=hive/hdfscluster.local@HDFSCLUSTER.LOCAL \
--master=yarn \
--deploy-mode=client \
--client-id=trami-data-folder \
--target-table=ailab.employees \
--process-date=2023-11-30 \
--properties-file=job-application.properties \
--switch=1
  • Hadoop Candidates
spark-submit \
--name AiLabsHdfsDatamart \
--master yarn \
--keytab hive.keytab \
--principal hive/hdfscluster.local@HDFSCLUSTER.LOCAL \
--jars aws-java-sdk-bundle-1.12.262.jar,hadoop-aws-3.3.4.jar,hive-jdbc-3.1.3.jar,spark-excel_2.12-3.5.0_0.20.1.jar \
--conf spark.yarn.dist.files=job-application.properties \
AiLabsHdfsDatamart.py \
--keytab=hive.keytab \
--principal=hive/hdfscluster.local@HDFSCLUSTER.LOCAL \
--master=yarn \
--deploy-mode=client \
--client-id=trami-data-folder \
--target-table=ailab.candidates_transformed_hdfs \
--process-date=2023-11-19 \
--properties-file=job-application.properties \
--switch=1
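
In the commands above, everything after the `.py` script name is passed to the job itself rather than interpreted by `spark-submit`. As a rough illustration of what those job arguments look like once parsed, the sketch below uses the standard `argparse` module. It is not how `xursparks.initialize` is known to work: the option names come from the examples above, while the types, defaults, and the integer reading of `--switch` are assumptions.

```python
import argparse
from datetime import date


def build_parser():
    """Build a hypothetical parser for the job arguments shown in the examples."""
    parser = argparse.ArgumentParser(description="Sketch of Xursparks-style job arguments")
    parser.add_argument("--master", default="local[*]")
    parser.add_argument("--deploy-mode", choices=("client", "cluster"), default="client")
    parser.add_argument("--client-id", required=True)
    parser.add_argument("--target-table", required=True)
    # ISO dates such as 2023-05-24 are converted to datetime.date objects.
    parser.add_argument("--process-date", type=date.fromisoformat, required=True)
    parser.add_argument("--properties-file", required=True)
    # The meaning of --switch is not documented; it is treated as an integer here.
    parser.add_argument("--switch", type=int, default=1)
    parser.add_argument("--keytab")
    parser.add_argument("--principal")
    return parser


if __name__ == "__main__":
    example = [
        "--master=local[*]",
        "--client-id=trami-data-folder",
        "--target-table=talentsolutions.candidate_reports",
        "--process-date=2023-05-24",
        "--properties-file=job-application.properties",
        "--switch=1",
    ]
    print(build_parser().parse_args(example))
```

Parsing the process date up front means a malformed value fails immediately with a usage error instead of surfacing later inside the job.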
