Datadock is a PySpark-based data interoperability library. It automatically detects schemas from heterogeneous files (CSV, JSON, Parquet), groups them by structural similarity, and performs standardized batch reads. It is designed for pipelines that handle large volumes of non-uniform data and need to integrate and reuse that data in distributed environments.

Datadock

Datadock is a Python library built on top of PySpark, designed to simplify data interoperability between files of different formats and schemas in modern data engineering pipelines.

It automatically detects schemas from CSV, JSON and Parquet files, groups structurally similar files, and allows standardized reading of all grouped files into a single Spark DataFrame — even in highly heterogeneous datasets.

✨ Key Features

  • 🚀 Automatic parsing of multiple file formats: .csv, .json, .parquet
  • 🧠 Schema-based file grouping by structural similarity
  • 📊 Auto-selection of dominant schemas
  • 🛠️ Unified read across similar files into a single PySpark DataFrame
  • 🔍 Schema insight for diagnostics and inspection

🔧 Installation

pip install datadock

🗂️ Expected Input Structure

Place your data files (CSV, JSON or Parquet) inside a folder. The library will automatically detect supported files and organize them by schema similarity.

/data/input/
├── sales_2020.csv
├── sales_2021.csv
├── products.json
├── archive.parquet
└── log.parquet
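
As a rough illustration of what "detecting supported files" in a folder like the one above can look like, here is a minimal sketch using only the standard library. The helper name `find_supported_files` and the extension-based check are assumptions made for illustration; this page does not document how Datadock performs detection internally.

```python
from pathlib import Path
import tempfile

# Extensions taken from the formats this README lists as supported.
SUPPORTED_EXTENSIONS = {".csv", ".json", ".parquet"}


def find_supported_files(folder, recursive=False):
    """Return sorted paths of files whose extension is in SUPPORTED_EXTENSIONS.

    Hypothetical helper, not part of the datadock API.
    """
    root = Path(folder)
    # rglob descends into subdirectories, glob stays at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


# Build a throwaway folder with a mix of supported and unsupported files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["sales_2020.csv", "products.json", "log.parquet", "notes.txt"]:
        (Path(tmp) / name).touch()
    (Path(tmp) / "nested").mkdir()
    (Path(tmp) / "nested" / "sales_2021.csv").touch()

    flat = [p.name for p in find_supported_files(tmp)]
    deep = [p.name for p in find_supported_files(tmp, recursive=True)]

print(flat)  # top-level supported files only
print(deep)  # also includes files from subdirectories
```

The `recursive` flag here mirrors the parameter of the same name in the public API, but only in spirit: the sketch shows the general idea of a flat versus recursive scan.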

🧪 Usage Example

from datadock import scan_schema, get_schema_info, read_data

path = "/path/to/your/data"

# Logs schema groups detected and returns their metadata
scan_schema(path)

# Retrieves schema metadata programmatically
info = get_schema_info(path)
print(info)

# Loads all files from schema group 1 into a single DataFrame
df = read_data(path, schema_id=1, logs=True)
df.show()

📌 Public API

scan_schema(path, min_similarity=0.8, recursive=False)

Logs the identified schema groups found in the specified folder and returns their metadata (same structure as get_schema_info).

Parameter       Type   Default   Description
path            str    required  Folder containing data files
min_similarity  float  0.8       Minimum Jaccard similarity (0–1) to group files together
recursive       bool   False     Whether to scan subdirectories recursively
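
To make the `min_similarity` threshold concrete, the sketch below computes the Jaccard similarity of two sets of column names: the size of their intersection divided by the size of their union. The column names are hypothetical, and comparing names only is an assumption; this page does not say whether Datadock also takes column types into account or how it clusters files once similarities are known.

```python
def jaccard_similarity(columns_a, columns_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of column names."""
    a, b = set(columns_a), set(columns_b)
    if not a and not b:
        # Two empty schemas are treated as identical by convention.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical column lists, for illustration only.
sales_2020 = ["order_id", "customer_id", "amount", "currency", "date"]
sales_2021 = ["order_id", "customer_id", "amount", "currency", "date", "channel"]
products = ["product_id", "name", "price"]

MIN_SIMILARITY = 0.8  # same value as the documented default

for label, other in [("sales_2021", sales_2021), ("products", products)]:
    score = jaccard_similarity(sales_2020, other)
    verdict = "same group" if score >= MIN_SIMILARITY else "different group"
    print(f"sales_2020 vs {label}: {score:.2f} -> {verdict}")
```

Here the two sales files share 5 of 6 distinct columns (5/6, about 0.83), so they clear the 0.8 threshold, while the product file shares none and scores 0. Lowering `min_similarity` makes grouping more permissive; raising it toward 1 requires near-identical column sets.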

get_schema_info(path, min_similarity=0.8, recursive=False)

Returns a list of dictionaries containing:

  • schema_id: ID of the schema group
  • file_count: number of files in the group
  • column_count: number of columns in the schema
  • files: list of file names in the group

Parameter       Type   Default   Description
path            str    required  Folder containing data files
min_similarity  float  0.8       Minimum Jaccard similarity (0–1) to group files together
recursive       bool   False     Whether to scan subdirectories recursively
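
Because the return value is a plain list of dictionaries, it can be inspected with ordinary Python. The sketch below works on a hand-written sample shaped like the documented structure, not on real output, and picks the group with the most files. The helpers `dominant_group` and `find_group_of` are hypothetical and not part of datadock.

```python
# Hypothetical sample with the documented keys; real output comes from get_schema_info(path).
info = [
    {"schema_id": 1, "file_count": 2, "column_count": 5,
     "files": ["sales_2020.csv", "sales_2021.csv"]},
    {"schema_id": 2, "file_count": 1, "column_count": 3,
     "files": ["products.json"]},
    {"schema_id": 3, "file_count": 1, "column_count": 7,
     "files": ["archive.parquet"]},
]


def dominant_group(groups):
    """Return the group with the most files; ties go to the lowest schema_id."""
    return min(groups, key=lambda g: (-g["file_count"], g["schema_id"]))


def find_group_of(groups, file_name):
    """Return the schema_id of the group containing file_name, or None if absent."""
    for group in groups:
        if file_name in group["files"]:
            return group["schema_id"]
    return None


best = dominant_group(info)
print(best["schema_id"], best["files"])
print(find_group_of(info, "products.json"))
```

An ID chosen this way can then be passed as `schema_id` to `read_data`. Whether Datadock already orders its group IDs by size is not stated on this page, so selecting the group explicitly avoids relying on that.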

read_data(path, schema_id=None, logs=False, spark=None, min_similarity=0.8, recursive=False)

Reads and merges all files that share the same schema into a single Spark DataFrame. If schema_id is not specified, it defaults to schema group 1 (the first group detected).

Parameter       Type          Default   Description
path            str           required  Folder containing data files
schema_id       int           None      ID of the schema group to read; None selects group 1
logs            bool          False     Whether to print detailed logs during loading
spark           SparkSession  None      Active SparkSession to use; one is created if not provided
min_similarity  float         0.8       Minimum Jaccard similarity (0–1) to group files together
recursive       bool          False     Whether to scan subdirectories recursively

✅ Requirements

  • Python 3.10+
  • PySpark

📚 Motivation

In real-world data engineering workflows, it's common to deal with files that represent the same data domain but have slight structural variations — such as missing columns, different orders, or evolving schemas. Datadock automates the process of grouping, inspecting, and reading these files reliably, allowing you to build pipelines that are schema-aware, scalable, and format-agnostic.

📄 License

This project is licensed under the MIT License.
