Skip to main content

A Python Package for interacting with Cloudera Data Engineering Clusters

Project description

cdepy Package

cdepy is a package for interacting with Cludera Data Engineering Virtual Clusters.

You can find out more about Cloudera Data Engineering in the Cloudera Documentation.

Usage

You can install this package using

pip install cdepy

Features

  • CDE Resources: create resources of type Files and Python-Environment
  • CDE Jobs: create jobs of type Airflow and Spark
  • Job Observability: monitor job status

Examples

BASICS

from cdepy import cdeconnection
from cdepy import cdejob
from cdepy import cdemanager
from cdepy import cderesource
from cdepy import utils

Establish Connection to CDE Virtual Cluster

JOBS_API_URL = "https://<YOUR-CLUSTER>.cloudera.site/dex/api/v1"
WORKLOAD_USER = "<Your-CDP-Workload-User>"
WORKLOAD_PASSWORD = "<Your-CDP-Workload-Password>"

myCdeConnection = cdeconnection.CdeConnection(JOBS_API_URL, WORKLOAD_USER, WORKLOAD_PASSWORD)

myCdeConnection.setToken()

Create CDE Files Resource Definition

CDE_RESOURCE_NAME = "myFilesCdeResource"
myCdeFilesResource = cderesource.CdeFilesResource(CDE_RESOURCE_NAME)
myCdeFilesResourceDefinition = myCdeFilesResource.createResourceDefinition()

Create a CDE Spark Job Definition

CDE_JOB_NAME = "myCdeSparkJob"
APPLICATION_FILE_NAME = "pysparksql.py"

myCdeSparkJob = cdejob.CdeSparkJob(myCdeConnection)
myCdeSparkJobDefinition = myCdeSparkJob.createJobDefinition(CDE_JOB_NAME, CDE_RESOURCE_NAME, APPLICATION_FILE_NAME, executorMemory="2g", executorCores=2)

Create Resource and Job in CDE Cluster

LOCAL_FILE_PATH = "examples"
LOCAL_FILE_NAME = "pysparksql.py"

myCdeClusterManager = cdemanager.CdeClusterManager(myCdeConnection)


myCdeClusterManager.createResource(myCdeFilesResourceDefinition)
myCdeClusterManager.uploadFileToResource(CDE_RESOURCE_NAME, LOCAL_FILE_PATH, LOCAL_FILE_NAME)

myCdeClusterManager.createJob(myCdeSparkJobDefinition)

Describe Job

myCdeClusterManager.describeJob(CDE_JOB_NAME)

Run Job with Default Configurations

myCdeClusterManager.runJob(CDE_JOB_NAME)

Update Runtime Configurations

overrideParams = {"spark": {"executorMemory": "4g"}}
myCdeClusterManager.runJob(CDE_JOB_NAME, SPARK_OVERRIDES=overrideParams)

Validate Job Runs

jobRuns = myCdeClusterManager.listJobRuns()
json.loads(jobRuns)

Download Spark Event Logs

JOB_RUN_ID = "1"
logTypes = myCdeClusterManager.showAvailableLogTypes(JOB_RUN_ID)
json.loads(logTypes)

LOGS_TYPE = "driver/event"
sparkEventLogs = myCdeClusterManager.downloadJobRunLogs(JOB_RUN_ID, LOGS_TYPE)

sparkEventLogsClean = utils.sparkEventLogParser(sparkEventLogs)

print(sparkEventLogsClean)

Delete Job and Validate Deletion

CDE_JOB_NAME = "myCdeSparkJob"

myCdeClusterManager.deleteJob(CDE_JOB_NAME)

myCdeClusterManager.listJobs()

Describe Cluster Meta

myCdeClusterManager.describeResource(CDE_RESOURCE_NAME)

Remove Files from Files Resource

RESOURCE_FILE_NAME = "pysparksql.py"
myCdeClusterManager.removeFileFromResource(CDE_RESOURCE_NAME, RESOURCE_FILE_NAME)

Upload File to Resource

myCdeClusterManager.uploadFileToResource(CDE_RESOURCE_NAME, LOCAL_FILE_PATH, LOCAL_FILE_NAME)

Download File from Resource

myPySparkScript = myCdeClusterManager.downloadFileFromResource(CDE_RESOURCE_NAME, RESOURCE_FILE_NAME)

from pprint import pprint
pprint(myPySparkScript)

Pause Single Job

myCdeClusterManager.pauseSingleJob(CDE_JOB_NAME)

Delete Resource

CDE_RESOURCE_NAME = "myFilesCdeResource"

myCdeClusterManager.deleteResource(CDE_RESOURCE_NAME)

CDE AIRFLOW PYTHON ENVIRONMENTS

NB: There is only one Airflow Python Environment per CDE Virtual Cluster.

from cdepy import cdeconnection
from cdepy import cdeairflowpython
import os
import json

Connect via CdeConnection Object

JOBS_API_URL = "<myJobsAPIurl>"
WORKLOAD_USER = "<myusername>"
WORKLOAD_PASSWORD = "<mypwd>"

myCdeConnection = cdeconnection.CdeConnection(JOBS_API_URL, WORKLOAD_USER, WORKLOAD_PASSWORD)

myCdeConnection.setToken()

Use CdeAirflowPythonEnv object to manage Airflow Python Environments

myAirflowPythonEnvManager = cdeairflowpython.CdeAirflowPythonEnv(myCdeConnection)

Create a Maintenance Session in order to perform any Airflow Python Environments related actions

myAirflowPythonEnvManager.createMaintenanceSession()

First Create a pip repository

myAirflowPythonEnvManager.createPipRepository()

Check on Status of Maintenance Session

myAirflowPythonEnvManager.checkAirflowPythonEnvStatus()
###### STATUS SHOULD BE {"status":"pip-repos-defined"}

Load requirements.txt file

pathToRequirementsTxt = "/examples/requirements.txt"
myAirflowPythonEnvManager.buildAirflowPythonEnv(pathToRequirementsTxt)
###### requirements.txt file must be customized

myAirflowPythonEnvManager.checkAirflowPythonEnvStatus()
###### RESPONSE STATUS SHOULD BE {"status":"building"}
###### AFTER 2 MINUTES REPEAT THE REQUEST. RESPONSE STATUS SHOULD EVENTUALLY BE {"status":"built"}

Validate status of Python environment

myAirflowPythonEnvManager.getAirflowPythonEnvironmentDetails()

Explore Maintenace Session logs

myAirflowPythonEnvManager.viewMaintenanceSessionLogs()

Activate the Python environment

myAirflowPythonEnvManager.activateAirflowPythonEnv()

Check on Python environment build status

myAirflowPythonEnvManager.checkAirflowPythonEnvStatus()
###### AT FIRST RESPONSE STATUS SHOULD BE {"status":"activating"}
###### AFTER A COUPLE OF MINUTES THE MAINTENANCE SESSION WILL AUTOMATICALLY END. THIS MEANS THE AIRFLOW PYTHON ENV HAS ACTIVATED.

Optional: Create a new session and then delete the Python environment

myAirflowPythonEnvManager.deleteAirflowPythonEnv()

Optional: End the Maintenance Session once you have deleted the Python environment

myAirflowPythonEnvManager.deleteMaintenanceSession()

CDE REPOSITORIES

from cdepy import cdeconnection
from cdepy import cderepositories
import os
import json

JOBS_API_URL = "<myJobsAPIurl>"
WORKLOAD_USER = "<myusername>"
WORKLOAD_PASSWORD = "<mypwd>"

Connect via CdeConnection Object

myCdeConnection = cdeconnection.CdeConnection(JOBS_API_URL, WORKLOAD_USER, WORKLOAD_PASSWORD)
myCdeConnection.setToken()

Instantiate Repository Manager

myRepoManager = cderepositories.CdeRepositoryManager(myCdeConnection)

Provide example git repository information. This repository is available for tests.

repoName = "exampleGitRepository"
repoPath = "https://github.com/pdefusco/cde_git_repo.git"

Create CDE Repository from Git Repository

myRepoManager.createRepository(repoName, repoPath, repoBranch="main")

Show available CDE repositories

myRepoManager.listRepositories()

Show CDE Repository Metadata

myRepoManager.describeRepository(repoName)

Download file from CDE Repository

filePath = "simple-pyspark-sql.py"
myRepoManager.downloadFileFromRepo(repoName, filePath)

Sync CDE Repo with Origin

myRepoManager.syncRepository(repoName)

Delete CDE Repository

myRepoManager.deleteRepository(repoName)

Validate CDE Repository Deletion

myRepoManager.listRepositories()

Create a Basic Credential with Git Credentials

myCdeCredentialsManager = cdecredentials.CdeCredentialsManager(myCdeConnection)

credentialName = "myGitCredential"
credentialUsername = "<git-username>"
credentialPassword = "<git-token>"

myCdeCredentials = myCdeCredentialsManager.createBasicCredential(credentialName, credentialUsername, credentialPassword)

myCdeCredentialsManager.listCredentials()

Create a Repository with the Credential

myRepoManager = cderepositories.CdeRepositoryManager(myCdeConnection)

repoName = "examplePrivateRepository"
repoPath = "https://github.com/pdefusco/private_git_repo.git"

myRepoManager.createRepository(repoName, repoPath, repoCredentials=credentialName, repoBranch="main")

myRepoManager.listRepositories()

myRepoManager.describeRepository(repoName)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cdepy-0.1.20.tar.gz (14.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

cdepy-0.1.20-py3-none-any.whl (14.6 kB view details)

Uploaded Python 3

File details

Details for the file cdepy-0.1.20.tar.gz.

File metadata

  • Download URL: cdepy-0.1.20.tar.gz
  • Upload date:
  • Size: 14.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.13.0

File hashes

Hashes for cdepy-0.1.20.tar.gz
Algorithm Hash digest
SHA256 6e8a80f24ec6a7ab414057ff0df8b995236f78011ba74029b5385fb2bdd4c651
MD5 6b84b2787926e4d99c67f2ac36d826f4
BLAKE2b-256 bcade64f6350ff8352cdc6fd88ae7f86e25262d5289f81dc9af1da1a52579858

See more details on using hashes here.

File details

Details for the file cdepy-0.1.20-py3-none-any.whl.

File metadata

  • Download URL: cdepy-0.1.20-py3-none-any.whl
  • Upload date:
  • Size: 14.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.13.0

File hashes

Hashes for cdepy-0.1.20-py3-none-any.whl
Algorithm Hash digest
SHA256 5af53692a274551206860fdd8603404778f123368d4fc0f1705098f44d2f4b71
MD5 d745204294f1e7a79fae1965b7e5a99a
BLAKE2b-256 a3b292703ea5a782ca42473626b304f6b4fabc910c516fbc1dda96e4dbe0c169

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page