Compare Parquet file schemas across different filesystems

schemadiff

schemadiff is a niche package for situations where a large number of files on a filesystem are expected to have identical schemas, but do not. This can be a problem when working with distributed computing systems such as Apache Spark or Google BigQuery, where unexpected schema differences can disrupt data loading and processing.

Consider a scenario where you are processing thousands of files, and a subset of them have schemas that are nearly, but not exactly, identical to the rest. This can lead to errors such as:

  • BigQuery: Error while reading data, error message: Parquet column '<COLUMN_NAME>' has type INT32 which does not match the target cpp_type DOUBLE File: gs://bucket/file.parquet
  • Spark: Error: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary

schemadiff addresses these issues by reading file metadata to efficiently identify the files with inconsistent schemas.
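To illustrate why working from metadata is cheap: an unencrypted Parquet file keeps its schema in a footer at the end of the file, followed by a 4-byte little-endian footer length and the magic bytes PAR1. The sketch below is not schemadiff's own code; the helper name `parquet_footer_length` is hypothetical. It reads only the trailing 8 bytes to find out how large the footer is.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"


def parquet_footer_length(path: str) -> int:
    """Return the size in bytes of a Parquet file's metadata footer.

    Only the last 8 bytes of the file are read: a 4-byte little-endian
    footer length followed by the 4-byte magic marker.
    """
    if os.path.getsize(path) < 8:
        raise ValueError(f"{path} is too small to be a Parquet file")
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError(f"{path} does not end with the Parquet magic bytes")
    (length,) = struct.unpack("<I", tail[:4])
    return length
```

In practice a library such as pyarrow parses the footer for you (for example via `pyarrow.parquet.read_schema`), so the column data never has to be loaded just to inspect a schema.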

Installation

Install the package with pip:

pip install schemadiffed # schemadiff taken :p

Usage

The package can be used as a Python library or as a command-line tool.

Python Library

Here's an example of using schemadiff to group files by their schema:

import os
from schemadiff import compare_schemas

os.environ['GOOGLE_CLOUD_CREDENTIALS'] = 'key.json'
grouped_files = compare_schemas('path/to/parquet_files', report_path='/desired/path/to/report.json')

In this example, compare_schemas groups the Parquet files in the directory path/to/parquet_files by their schema. It saves the results to report.json and also returns the grouped files as a list for potential downstream use.
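The grouping idea itself can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the package's implementation: each schema is represented as a tuple of (column name, type name) pairs, and `group_files_by_schema` is a hypothetical helper. The structure of the report that compare_schemas writes may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed representation: a schema is an ordered tuple of (column, type) pairs.
Schema = Tuple[Tuple[str, str], ...]


def group_files_by_schema(schemas: Dict[str, Schema]) -> List[List[str]]:
    """Group file paths whose schemas are exactly equal."""
    groups = defaultdict(list)
    for path, schema in schemas.items():
        groups[schema].append(path)
    # Largest group first, so the outliers end up at the tail of the list.
    return sorted(groups.values(), key=len, reverse=True)


example = {
    "a.parquet": (("id", "int64"), ("fare", "double")),
    "b.parquet": (("id", "int64"), ("fare", "double")),
    "c.parquet": (("id", "int64"), ("fare", "int32")),  # the odd one out
}
print(group_files_by_schema(example))
# [['a.parquet', 'b.parquet'], ['c.parquet']]
```

Sorting by group size puts the majority schema first, which makes the handful of mismatched files easy to spot.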

Command-Line Interface

schemadiff can also be used as a command-line tool. After installation, the command compare-schemas is available in your shell:

python schemadiff  --dir_path 'gs://<bucket>/yellow/*_2020*.parquet' --fs_type 'gcs' --report_path 'report.json' --return_type 'as_list'

Features

  • Efficient processing by reading the metadata of Parquet files.
  • Supports local, GCS, and S3 filesystems (you must first be authenticated with your cloud service).
  • Supports wildcard characters for flexible file selection.
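As an illustration of wildcard selection, the pattern from the command-line example can be tried against file names with Python's standard fnmatch module. The file names below are invented for the example, and how schemadiff resolves wildcards internally is not stated in this description.

```python
from fnmatch import fnmatch

# Pattern taken from the command-line example; the file names are made up.
pattern = "*_2020*.parquet"
names = [
    "yellow_tripdata_2020-01.parquet",
    "yellow_tripdata_2020-02.parquet",
    "yellow_tripdata_2019-12.parquet",
    "yellow_tripdata_2020-01.csv",
]

# Keep only the names that match the wildcard pattern.
selected = [name for name in names if fnmatch(name, pattern)]
print(selected)
# ['yellow_tripdata_2020-01.parquet', 'yellow_tripdata_2020-02.parquet']
```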
