
eren

This library contains Hive helper methods for PySpark users.

This README is a good way to learn about the Hive metastore for PySpark, even if you're not going to use this library directly.


Hive metastore background

The Hive metastore contains information on tables, where the data is located, and the schema of the data.

Registering tables in the Hive metastore is a great way to build nice Spark workflows for an organization. You can set up your Spark cluster to automatically register all your organization's tables, so users can spin up clusters and immediately start querying the data with SQL syntax.

Suppose you have a table called important_customers and a database analyst who would like to query this table to generate a weekly report for the accounting department. If the important_customers table is registered in the Hive metastore, the analyst can spin up a Spark cluster and run their queries without having to worry about where the underlying data is stored or what its schema is.

How this library helps

This library contains helper methods that make it easier to manage your Hive metastore. For example, you can use this library to easily determine whether a Hive table is managed, unmanaged, or not registered at all.

You can also use this library to register Hive tables that are not currently in the metastore.

This library makes common Hive tasks easy for PySpark programmers.

Hive managed vs unmanaged vs non-registered tables

Hive managed tables are registered in the Hive metastore and have paths that are managed by Hive. Here's an example of how to create a managed table.


This table stores its data in the Hive warehouse directory (the location configured by spark.sql.warehouse.dir).

Hive Parquet vs Delta Lake tables

Working with Hadoop File System from Python

Eren also provides a top-level interface for working with any implementation of HadoopFileSystem (including S3, HDFS/DBFS, and the local file system).

from eren import fs

# spark: SparkSession

hdfs = fs.HadoopFileSystem(spark, "hdfs:///data")  # returns an instance for access to HDFS
s3fs = fs.HadoopFileSystem(spark, "s3a://bucket/prefix")  # returns an instance for access to S3
local_fs = fs.HadoopFileSystem(spark, "file:///home/my-home")  # returns an instance for access to the local file system

Glob Files

You can list files and directories on a remote file system:

list_of_files = hdfs.glob("hdfs:///raw/table/**/*.csv") # list all CSV files recursively inside a folder on HDFS

Read/Write UTF-8 strings

You can read and write UTF-8 strings from/to a remote file system. But be careful: this method is extremely slow compared to regular spark.read, so do not use it for reading or writing datasets!

import json

hdfs.write_utf8("hdfs:///data/results/results.json", json.dumps({"key1": 1, "key2": 3}), mode="w")
json.loads(hdfs.read_utf8("hdfs:///data/results/results.json"))

Community

Blogs
