schemadiff
Compare Parquet file schemas across different filesystems.
schemadiff is a niche package designed for situations where a large number of files on a filesystem are expected to have identical schemas, but they don't. This can present a challenge when working with distributed computing systems such as Apache Spark or Google BigQuery, as unexpected schema differences can disrupt data loading and processing.
Consider a scenario where you are processing thousands of files and a subset of them have schemas that are almost, but not exactly, identical. This can lead to errors such as:
- BigQuery:
Error while reading data, error message: Parquet column '<COLUMN_NAME>' has type INT32 which does not match the target cpp_type DOUBLE File: gs://bucket/file.parquet
- Spark:
Error: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
schemadiff addresses these issues by efficiently identifying the files with inconsistent schemas, reading only the file metadata rather than the data itself.
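For context, this kind of check can also be done by hand with pyarrow, which likewise reads just the Parquet footer; schemadiff automates it across many files. The file paths below are made up for illustration:

import pyarrow.parquet as pq

# Hypothetical file paths; only the footers (metadata) are read, not the column data.
paths = ['data/part-0001.parquet', 'data/part-0002.parquet']
schemas = {path: pq.read_schema(path) for path in paths}

# Flag every file whose schema differs from the first file's schema.
reference = schemas[paths[0]]
for path, schema in schemas.items():
    if not schema.equals(reference):
        print(f'{path} deviates from {paths[0]}')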
Installation
Install the package with pip:
pip install schemadiffed # schemadiff taken :p
Usage
The package can be used as a Python library or as a command-line tool.
Python Library
Here's an example of using schemadiff to group files by their schema:
import os
from schemadiff import compare_schemas

# Credentials for Google Cloud Storage; only needed when the files live on GCS.
os.environ['GOOGLE_CLOUD_CREDENTIALS'] = 'key.json'

# Group all Parquet files under the directory by schema and write a JSON report.
grouped_files = compare_schemas('path/to/parquet_files', report_path='/desired/path/to/report.json')
In this example, compare_schemas groups the Parquet files in the directory path/to/parquet_files by their schema. It saves the results to report.json and also returns the grouped files as a list for potential downstream use.
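For files on cloud storage, a sketch along the following lines should work, assuming compare_schemas accepts the same fs_type and return_type options that the command-line interface below exposes; the bucket path is made up:

from schemadiff import compare_schemas

# Hypothetical GCS example; fs_type and return_type are assumed to mirror the
# CLI flags of the same names.
grouped_files = compare_schemas(
    'gs://my-bucket/yellow/*_2020*.parquet',
    fs_type='gcs',
    report_path='report.json',
    return_type='as_list',
)
print(grouped_files)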
Command-Line Interface
schemadiff can also be used as a command-line tool. After installation, the compare-schemas command is available in your shell:
python schemadiff --dir_path 'gs://<bucket>/yellow/*_2020*.parquet' --fs_type 'gcs' --report_path 'report.json' --return_type 'as_list'
Features
- Efficient processing by reading the metadata of Parquet files.
- Supports local, GCS, and S3 filesystems (you must be authenticated to your cloud service first).
- Supports wildcard characters for flexible file selection (see the sketch after this list).
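As an illustration of what wildcard selection over a cloud filesystem looks like (not necessarily how schemadiff implements it), fsspec can expand the same kind of pattern used in the CLI example above; the bucket name is made up:

import fsspec
import pyarrow.parquet as pq

# Assumes you are already authenticated to GCS; the bucket is hypothetical.
fs = fsspec.filesystem('gcs')
paths = fs.glob('my-bucket/yellow/*_2020*.parquet')

# Read only the schema (footer) of each matched file.
for path in paths:
    with fs.open(path, 'rb') as f:
        print(path, pq.read_schema(f))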
Download files
Source Distribution: schemadiffed-0.1.0.1.tar.gz
Built Distribution: schemadiffed-0.1.0.1-py3-none-any.whl
File details
Details for the file schemadiffed-0.1.0.1.tar.gz.
File metadata
- Download URL: schemadiffed-0.1.0.1.tar.gz
- Upload date:
- Size: 6.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.10.6 Linux/5.15.0-1041-azure
File hashes
Algorithm | Hash digest
---|---
SHA256 | c7d61f4023061d29a7b68bdf581e99797aa201f23f9da4cd1b0873481212fb73
MD5 | 6623bcb7c6d618e532576e62c4f79c34
BLAKE2b-256 | ede3b393eec39ef2af06d4dbb83e647d1186acb6074799794c0abae8b990fbde
File details
Details for the file schemadiffed-0.1.0.1-py3-none-any.whl.
File metadata
- Download URL: schemadiffed-0.1.0.1-py3-none-any.whl
- Upload date:
- Size: 8.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.10.6 Linux/5.15.0-1041-azure
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7927879d4c6e177d773c7871eb5ee78ce8507f64a66e5b13dcc7ca4bee454e17
MD5 | 33f73a8d915eb8e1756f7fd13379b677
BLAKE2b-256 | 7178fa649da16959c4d38df202760f4da99f88abca41b3bbf52fc5bf9274532d