Explore data files with pyspark

These details have not been verified by PyPI

Project links

Project description

Spark File Explorer

When developing spark applications I came across the growing number of data files that I create.

pe03

pe04

CSVs are fine but what about JSON and complex PARQUET files?

To open and explore a file I used Excel to view CSV files, text editors with plugins to view JSON files, but there was nothing handy to view PARQUETs. Event formatted JSONs were not always readable. What about viewing schemas?

Each time I had to use spark and write simple apps which was not a problem itself but was tedious and boring.

Why not a database?

Well, for tabular data there problems is already solved - just use your preferred database. Quite often we can load text files or even parquets directly to the database.

So what's the big deal?

Hierarchical data sets

Unfortunately the files I often deal with have hierarchical structure. They cannot be simply visualized as tables or rather some fields contain tables of other structures. Each of these structures is a table itself but how to load and explore such embedded tables in a database?

For Spark files use... Spark!

Hold on - since I generate files using Apache Spark, why can't I use it to explore them? I can easily handle complex structures and file types using built-in features. So all I need is to build a use interface to display directories, files and their contents.

Why console?

I use Kubernetes in production environment, I develop Spark applications locally or in VM. In all environments I would like to have one tool to rule them all.

I like console tools a lot, they require some sort of simplicity. They can run locally or over SSH connection on the remote cluster. Sounds perfect. All I needed was a console UI library, so I wouldn't have to reinvent the wheel.

Textual

What a great project textual is!

Years ago I used curses but textual is so superior to what I used back then. It has so many features packed in a friendly form of simple to use components. Highly recommended.

Usage

Install package with pip:

pip install pyspark-explorer

Run:

pyspark-explorer

You may wish to provide a base path upfront. It can be changed at any time (press o for Options).

For local files that could be for example:

# Linux
pyspark-explorer file:///home/myuser/datafiles/base_path
# Windows
pyspark-explorer file:/c:/datafiles/base_path

For remote location:

# Remote hdfs cluster
pyspark-explorer hdfs://somecluster/datafiles/base_path

Default path is set to /, which represents local root filesystem and works fine even in Windows thanks to Spark logics.

Configuration files are saved to your home directory (.pyspark-explorer subdirectory). These are json files so you are free to edit them.

Spark limitations

Note that you will not be able to open any JSON file - only those with correct structure can be viewed. If you try to open a file which has an unacceptable structure, Spark will throw an error, e.g.:

Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().

or e.g.

[COLUMN_ALREADY_EXISTS] The column `event` already exists. Consider to choose another name or rename the existing column.

or e.g.

'NoneType' object has no attribute '__fields__'

etc.

You can find the log file in your home directory (.pyspark-explorer subdirectory).

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.5

Mar 4, 2026

0.2.4

Feb 13, 2025

This version

0.2.3

Feb 13, 2025

0.2.2

Jan 6, 2025

0.2.1

Jan 5, 2025

0.2.0

Jan 5, 2025

0.1.0

Jan 3, 2025

0.0.16

Jan 3, 2025

0.0.15

Jan 2, 2025

0.0.13

Jan 2, 2025

0.0.12

Jan 2, 2025

0.0.11

Dec 30, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyspark_explorer-0.2.3.tar.gz (19.0 kB view details)

Uploaded Feb 13, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pyspark_explorer-0.2.3-py3-none-any.whl (17.3 kB view details)

Uploaded Feb 13, 2025 Python 3

File details

Details for the file pyspark_explorer-0.2.3.tar.gz.

File metadata

Download URL: pyspark_explorer-0.2.3.tar.gz
Upload date: Feb 13, 2025
Size: 19.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.11.9

File hashes

Hashes for pyspark_explorer-0.2.3.tar.gz
Algorithm	Hash digest
SHA256	`91345f463b8637bd2dd21a4333f86125429b5aad35a991951a9073588ffa0035`
MD5	`0d85869ff7037137c1ec75e64a18264c`
BLAKE2b-256	`608d0f016b0b1a66ff774ad690a9cbfb62ae3afb292be2d7a7a0a8ede4817eee`

See more details on using hashes here.

File details

Details for the file pyspark_explorer-0.2.3-py3-none-any.whl.

File metadata

Download URL: pyspark_explorer-0.2.3-py3-none-any.whl
Upload date: Feb 13, 2025
Size: 17.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.11.9

File hashes

Hashes for pyspark_explorer-0.2.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`de0dc63b27a660f62234136b7aaababc5fa6e21908644b1dfd8d5e9f888b78e0`
MD5	`869c45b508746704be8259bd51aaa9c7`
BLAKE2b-256	`3222f819af1d0f1b93b300a3b6de5b1761be642814234c0bb189fbb2fc7a5e2c`

See more details on using hashes here.

pyspark-explorer 0.2.3

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Spark File Explorer

CSVs are fine but what about JSON and complex PARQUET files?

Why not a database?

Hierarchical data sets

For Spark files use... Spark!

Why console?

Textual

Usage

Spark limitations

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes