Explore data files with pyspark

Spark File Explorer

While developing Spark applications, I found myself creating a growing number of data files.

CSVs are fine but what about JSON and complex PARQUET files?

To open and explore a file, I used Excel for CSV files and text editors with plugins for JSON files, but there was nothing handy for viewing PARQUETs. Even formatted JSONs were not always readable. And what about viewing schemas?

Each time, I had to use Spark and write a simple app. That was not a problem in itself, but it was tedious and boring.
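For context, such a throwaway script might look like the sketch below. The helper names `guess_format` and `peek` and the extension mapping are illustrative assumptions, not part of pyspark-explorer; the Spark calls themselves (`SparkSession.builder`, `spark.read.format(...).load(...)`, `printSchema`, `show`) are standard PySpark.

```python
import os

# Illustrative mapping from file extension to a Spark data source name.
_FORMATS = {".csv": "csv", ".json": "json", ".parquet": "parquet"}


def guess_format(path: str) -> str:
    """Pick a Spark reader format from the file extension (assumed convention)."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in _FORMATS:
        raise ValueError(f"unsupported file type: {ext!r}")
    return _FORMATS[ext]


def peek(path: str, rows: int = 5) -> None:
    """Print the schema and the first few rows of a data file."""
    # Imported lazily so guess_format stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("peek").getOrCreate()
    df = spark.read.format(guess_format(path)).load(path)
    df.printSchema()
    df.show(rows, truncate=False)
```

Calling `peek("file:///home/myuser/datafiles/base_path/some_file.parquet")` (a hypothetical file name) would print the schema and a few rows. It works, but repeating this for every file is exactly the tedious part.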

Why not a database?

Well, for tabular data the problem is already solved - just use your preferred database. Quite often we can load text files, or even parquets, directly into the database.

So what's the big deal?

Hierarchical data sets

Unfortunately, the files I often deal with have a hierarchical structure. They cannot simply be visualized as tables, because some fields themselves contain tables of other structures. Each of these structures is a table in its own right, but how do you load and explore such embedded tables in a database?
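To make "tables inside fields" concrete, here is a small sketch that walks a schema in the dict form PySpark produces with `df.schema.jsonValue()` and lists every field as a dotted path. The function `field_paths` and the sample `order_schema` are made up for illustration; this is not how pyspark-explorer itself renders schemas.

```python
def field_paths(schema: dict, prefix: str = "") -> list[str]:
    """List 'path: type' entries for every field in a Spark-style schema dict."""
    paths = []
    for field in schema["fields"]:
        name = prefix + field["name"]
        dtype = field["type"]
        suffix = ""
        # Unwrap (possibly nested) arrays, marking each level with [].
        while isinstance(dtype, dict) and dtype["type"] == "array":
            suffix += "[]"
            dtype = dtype["elementType"]
        if isinstance(dtype, dict) and dtype["type"] == "struct":
            # An embedded structure: record it, then descend into its fields.
            paths.append(f"{name}{suffix}: struct")
            paths.extend(field_paths(dtype, f"{name}{suffix}."))
        else:
            # Simple types are plain strings; other complex types (e.g. map)
            # are reported by their type name only.
            label = dtype["type"] if isinstance(dtype, dict) else dtype
            paths.append(f"{name}{suffix}: {label}")
    return paths


# Hypothetical schema: each order carries an embedded table of items.
order_schema = {
    "type": "struct",
    "fields": [
        {"name": "order_id", "type": "long", "nullable": False, "metadata": {}},
        {
            "name": "items",
            "type": {
                "type": "array",
                "containsNull": True,
                "elementType": {
                    "type": "struct",
                    "fields": [
                        {"name": "sku", "type": "string", "nullable": True, "metadata": {}},
                        {"name": "qty", "type": "integer", "nullable": True, "metadata": {}},
                    ],
                },
            },
            "nullable": True,
            "metadata": {},
        },
    ],
}

for line in field_paths(order_schema):
    print(line)
```

Here `items` is a table of its own (columns `sku` and `qty`) living inside each order row, which is the shape a flat database table cannot show directly.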

For Spark files use... Spark!

Hold on - since I generate files using Apache Spark, why can't I use it to explore them? I can easily handle complex structures and file types using built-in features. So all I need is to build a user interface to display directories, files and their contents.

Why console?

I use Kubernetes in the production environment, and I develop Spark applications locally or in a VM. In all these environments I would like to have one tool to rule them all.

I like console tools a lot; they enforce a certain simplicity. They can run locally or over an SSH connection on a remote cluster. Sounds perfect. All I needed was a console UI library, so I wouldn't have to reinvent the wheel.

Textual

What a great project Textual is!

Years ago I used curses, but Textual is far superior to what I used back then. It packs many features into simple, friendly components. Highly recommended.

Usage

Install the package with pip:

pip install pyspark-explorer

Run:

pyspark-explorer

I recommend that you provide a base path. For local files that could be for example:

# Linux
pyspark-explorer file:///home/myuser/datafiles/base_path
# Windows
pyspark-explorer file:///c:/datafiles/base_path
# Remote hdfs cluster
pyspark-explorer hdfs://somecluster/datafiles/base_path

The default path is /, which represents the local root filesystem and works fine even on Windows thanks to the way Spark resolves paths.
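If the URI forms above look unfamiliar, the sketch below breaks them into scheme, host and path using Python's standard `urllib.parse.urlsplit`. The helper `describe_base_path` is hypothetical and only illustrates the structure of the example paths; it is not how pyspark-explorer or Spark parses them.

```python
from urllib.parse import urlsplit


def describe_base_path(uri: str) -> dict:
    """Split a base path into scheme, host and path (illustration only)."""
    parts = urlsplit(uri)
    return {"scheme": parts.scheme, "host": parts.netloc, "path": parts.path}


for uri in (
    "file:///home/myuser/datafiles/base_path",
    "file:///c:/datafiles/base_path",
    "hdfs://somecluster/datafiles/base_path",
    "/",
):
    print(uri, "->", describe_base_path(uri))
```

The `file` examples have an empty host, which is why three slashes follow the scheme, while the `hdfs` example names the cluster as the host. The bare `/` has no scheme at all, so the filesystem is left for Spark to decide.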
