
Datadock is a PySpark-based data interoperability library. It automatically detects schemas from heterogeneous files (CSV, JSON, Parquet), groups them by structural similarity, and performs standardized batch reads. It is designed for pipelines that handle non-uniform, large-scale data, and it supports integration and reuse in distributed environments.


Datadock

Datadock is a Python library built on top of PySpark, designed to simplify data interoperability between files of different formats and schemas in modern data engineering pipelines.

It automatically detects schemas from CSV, JSON, and Parquet files, groups structurally similar files, and reads all files in a group into a single Spark DataFrame, even when the dataset is highly heterogeneous.

✨ Key Features

  • 🚀 Automatic parsing of multiple file formats: .csv, .json, .parquet
  • 🧠 Schema-based file grouping by structural similarity
  • 📊 Auto-selection of dominant schemas
  • 🛠️ Unified read across similar files into a single PySpark DataFrame
  • 🔍 Schema insight for diagnostics and inspection
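To make the idea of schema-based grouping concrete, here is a minimal sketch in plain Python. It is not Datadock's implementation: the function name `group_by_schema`, the exact-match rule (same set of column names, ignoring order and case), and the ordering of group IDs are all assumptions made for illustration. The library's notion of "structural similarity" may be looser than this.

```python
from collections import defaultdict


def group_by_schema(file_columns):
    """Group file names by an order-insensitive column signature.

    file_columns: dict mapping file name -> list of column names.
    Returns a dict mapping a 1-based group ID -> sorted list of file names,
    with the widest schema (most columns) numbered first.
    NOTE: hypothetical helper, not part of the Datadock API.
    """
    groups = defaultdict(list)
    for name, columns in file_columns.items():
        # Column order and case are ignored when comparing schemas.
        signature = frozenset(c.lower() for c in columns)
        groups[signature].append(name)

    # Widest schema first; ties broken by column names for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[0]), sorted(kv[0])))
    return {i: sorted(files) for i, (_, files) in enumerate(ordered, start=1)}


# Made-up column lists for three files.
example = {
    "sales_2020.csv": ["id", "amount", "date"],
    "sales_2021.csv": ["date", "id", "amount"],
    "products.json": ["sku", "name"],
}
groups = group_by_schema(example)
print(groups)
```

In this sketch the two sales files land in the same group because they share the same columns in a different order, while `products.json` forms its own group.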

🔧 Installation

pip install datadock

🗂️ Expected Input Structure

Place your data files (CSV, JSON or Parquet) inside a single folder. The library will automatically detect supported files and organize them by schema similarity.

/data/input/
├── sales_2020.csv
├── sales_2021.csv
├── products.json
├── archive.parquet
└── log.parquet
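Before pointing Datadock at a folder, it can be useful to check which files would count as supported. The sketch below uses only the standard library. The helper `list_supported_files` is hypothetical, and the non-recursive, case-insensitive extension match is an assumption about how detection might behave, not a description of the library's internals.

```python
import tempfile
from pathlib import Path

# Extensions listed as supported in this README.
SUPPORTED_EXTENSIONS = {".csv", ".json", ".parquet"}


def list_supported_files(folder):
    """Return the sorted names of files in `folder` with a supported extension.

    Only the top level of the folder is inspected (assumption).
    """
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


# Demo on a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["sales_2020.csv", "products.json", "log.parquet", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = list_supported_files(tmp)

print(found)
```

The `notes.txt` file is skipped because its extension is not in the supported set.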

🧪 Usage Example

from datadock import scan_schema, get_schema_info, read_data

path = "/path/to/your/data"

# Logs schema groups detected
scan_schema(path)

# Retrieves schema metadata
info = get_schema_info(path)
print(info)

# Loads all files from schema group 1
df = read_data(path, schema_id=1, logs=True)
df.show()

📌 Public API

scan_schema

Logs the identified schema groups found in the specified folder.

get_schema_info

Returns a list of dictionaries containing:

  • schema_id: ID of the schema group
  • file_count: number of files in the group
  • column_count: number of columns in the schema
  • files: list of file names in the group
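Because the return value is a plain list of dictionaries, it can be inspected with ordinary Python. The sketch below works on a hand-written sample shaped like the fields listed above. The values are made up, and the helpers `find_schema_id` and `summarize` are hypothetical, not part of Datadock.

```python
# Hypothetical sample shaped like the documented get_schema_info result.
info = [
    {"schema_id": 1, "file_count": 2, "column_count": 5,
     "files": ["sales_2020.csv", "sales_2021.csv"]},
    {"schema_id": 2, "file_count": 1, "column_count": 3,
     "files": ["products.json"]},
    {"schema_id": 3, "file_count": 2, "column_count": 4,
     "files": ["archive.parquet", "log.parquet"]},
]


def find_schema_id(info, file_name):
    """Return the schema_id of the group containing file_name, or None."""
    for group in info:
        if file_name in group["files"]:
            return group["schema_id"]
    return None


def summarize(info):
    """Return one human-readable line per schema group."""
    return [
        f"schema {g['schema_id']}: {g['file_count']} file(s), "
        f"{g['column_count']} column(s)"
        for g in info
    ]


for line in summarize(info):
    print(line)
print(find_schema_id(info, "log.parquet"))
```

A lookup like `find_schema_id` is one way to decide which `schema_id` to pass to `read_data` when you care about a specific file.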

read_data

Reads and merges all files that share the same schema.
If schema_id is not specified, the group with the most columns will be selected.
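The default selection rule can be reproduced on the metadata returned by `get_schema_info`, which helps when you want to know in advance which group an unqualified `read_data` call would pick. This is a sketch under assumptions: `default_schema_id` is a hypothetical helper, the sample data is made up, and the tie-breaking behavior (first group wins) is a property of Python's `max`, not something the README documents.

```python
def default_schema_id(info):
    """Return the schema_id of the group with the most columns.

    info: list of dicts with at least 'schema_id' and 'column_count' keys.
    On a tie, the first group in the list wins (assumption).
    """
    if not info:
        raise ValueError("no schema groups available")
    return max(info, key=lambda g: g["column_count"])["schema_id"]


# Hypothetical metadata for three schema groups.
sample_info = [
    {"schema_id": 1, "file_count": 3, "column_count": 4},
    {"schema_id": 2, "file_count": 1, "column_count": 7},
    {"schema_id": 3, "file_count": 2, "column_count": 5},
]

print(default_schema_id(sample_info))
```

Note that under this rule the group with the most columns is chosen even if another group contains more files, so passing `schema_id` explicitly is the safer choice when the two differ.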

✅ Requirements

  • Python 3.10+
  • PySpark

📚 Motivation

In real-world data engineering workflows, it's common to deal with files that represent the same data domain but have slight structural variations, such as missing columns, different column orders, or evolving schemas.
Datadock automates the process of grouping, inspecting, and reading these files reliably, allowing you to build pipelines that are schema-aware, scalable, and format-agnostic.

📄 License

This project is licensed under the MIT License.
