Compare Parquet file schemas across different filesystems
schemadiff
schemadiff is a niche package for situations where a large number of files on a filesystem are expected to share an identical schema but do not. This can present a challenge when working with distributed computing systems like Apache Spark or Google BigQuery, as unexpected schema differences can disrupt data loading and processing.
Consider a scenario where you are processing thousands of files, and a subset of them have schemas that are almost, but not exactly, identical. This can lead to errors such as:

- BigQuery: Error while reading data, error message: Parquet column '<COLUMN_NAME>' has type INT32 which does not match the target cpp_type DOUBLE File: gs://bucket/file.parquet
- Spark: Error: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
schemadiff addresses these issues by reading only file metadata to efficiently identify the files with inconsistent schemas.
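To illustrate how such a mismatch can arise in the first place, here is a minimal sketch (assuming pyarrow is installed; the file names and column names are hypothetical) that writes two Parquet files whose price column has a different physical type in each:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# File 1: "price" is stored as a 32-bit integer.
pq.write_table(
    pa.table({"id": [1, 2], "price": pa.array([10, 20], type=pa.int32())}),
    "part-0001.parquet",
)

# File 2: the same column arrives as a double.
pq.write_table(
    pa.table({"id": [3, 4], "price": pa.array([10.5, 20.5], type=pa.float64())}),
    "part-0002.parquet",
)

# Loading both files as one dataset now fails or silently coerces,
# depending on the engine -- exactly the situation schemadiff flags.
```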
Installation
Install the package with pip:
```bash
pip install schemadiffed  # schemadiff taken :p
```
Usage
The package can be used as a Python library or as a command-line tool.
Python Library
Here's an example of using schemadiff to group files by their schema:
```python
import os

from schemadiff import compare_schemas

# Point the package at a service-account key so it can read gs:// paths.
os.environ['GOOGLE_CLOUD_CREDENTIALS'] = 'key.json'

# Group the Parquet files in the directory by schema and write a JSON report.
grouped_files = compare_schemas('path/to/parquet_files', report_path='/desired/path/to/report.json')
```
In this example, compare_schemas groups the Parquet files in the directory path/to/parquet_files by their schema. It saves the results to report.json and also returns the grouped files as a list for potential downstream use.
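As a sketch of that downstream use (the exact shape of the return value may differ; this assumes each group is a collection of file paths sharing one schema), you might summarise the groups like this:

```python
# Hypothetical follow-up to the call above; assumes `grouped_files` is a
# list of groups, each holding the paths of files that share one schema.
for i, group in enumerate(grouped_files):
    print(f"Schema group {i}: {len(group)} file(s)")
```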
Command-Line Interface
schemadiff can also be used as a command-line tool. After installation, the compare-schemas command is available in your shell:

```bash
python schemadiff --dir_path 'gs://<bucket>/yellow/*_2020*.parquet' --fs_type 'gcs' --report_path 'report.json' --return_type 'as_list'
```
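For local files, an invocation along the same lines might look like the following sketch. The flag names are taken from the example above, but the 'local' value for --fs_type and the paths are assumptions:

```bash
# Hypothetical local run; 'local' as the --fs_type value is an assumption.
python schemadiff --dir_path 'data/parquet/*.parquet' --fs_type 'local' --report_path 'report.json' --return_type 'as_list'
```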
Features
- Efficient processing by reading the metadata of Parquet files.
- Supports local, GCS, and S3 filesystems (you must be authenticated to your cloud provider first).
- Supports wildcard characters for flexible file selection.
Hashes for schemadiffed-0.1.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 7927879d4c6e177d773c7871eb5ee78ce8507f64a66e5b13dcc7ca4bee454e17
MD5 | 33f73a8d915eb8e1756f7fd13379b677
BLAKE2b-256 | 7178fa649da16959c4d38df202760f4da99f88abca41b3bbf52fc5bf9274532d