Datadock is a PySpark-based data interoperability library. It automatically detects schemas from heterogeneous files (CSV, JSON, Parquet), groups them by structural similarity, and performs standardized batch reads. It is designed for pipelines that handle large-scale, non-uniform data, and it supports robust integration and reuse in distributed environments.
Datadock
Datadock is a Python library built on top of PySpark, designed to simplify data interoperability between files of different formats and schemas in modern data engineering pipelines.
It automatically detects schemas from CSV, JSON and Parquet files, groups structurally similar files, and allows standardized reading of all grouped files into a single Spark DataFrame — even in highly heterogeneous datasets.
✨ Key Features
- 🚀 Automatic parsing of multiple file formats: .csv, .json, .parquet
- 🧠 Schema-based file grouping by structural similarity
- 📊 Auto-selection of dominant schemas
- 🛠️ Unified read across similar files into a single PySpark DataFrame
- 🔍 Schema insight for diagnostics and inspection
🔧 Installation
pip install datadock
🗂️ Expected Input Structure
Place your data files (CSV, JSON or Parquet) inside a folder. The library will automatically detect supported files and organize them by schema similarity.
/data/input/
├── sales_2020.csv
├── sales_2021.csv
├── products.json
├── archive.parquet
├── log.parquet
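
To make the detection step concrete, here is a minimal sketch of how supported files in such a folder could be discovered. It is not Datadock's implementation: `find_supported_files` and `SUPPORTED_EXTENSIONS` are hypothetical names, and matching on the file extension alone (case-insensitively) is an assumption.

```python
from pathlib import Path

# Extensions the README lists as supported (matching by suffix is an assumption).
SUPPORTED_EXTENSIONS = {".csv", ".json", ".parquet"}


def find_supported_files(folder, recursive=False):
    """Return supported data files under `folder`, sorted by path.

    Hypothetical helper for illustration; not part of the Datadock API.
    """
    root = Path(folder)
    # rglob walks subdirectories, iterdir only lists the top level.
    candidates = root.rglob("*") if recursive else root.iterdir()
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

With the folder above, `find_supported_files("/data/input/")` would return the five listed files and skip anything with another extension.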
🧪 Usage Example
from datadock import scan_schema, get_schema_info, read_data
path = "/path/to/your/data"
# Logs schema groups detected and returns their metadata
scan_schema(path)
# Retrieves schema metadata programmatically
info = get_schema_info(path)
print(info)
# Loads all files from schema group 1 into a single DataFrame
df = read_data(path, schema_id=1, logs=True)
df.show()
📌 Public API
scan_schema(path, min_similarity=0.8, recursive=False)
Logs the identified schema groups found in the specified folder and returns their metadata (same structure as get_schema_info).
| Parameter | Type | Default | Description |
|---|---|---|---|
| path | str | — | Folder containing data files |
| min_similarity | float | 0.8 | Minimum Jaccard similarity (0–1) to group files together |
| recursive | bool | False | Whether to scan subdirectories recursively |
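
The `min_similarity` threshold is easier to reason about with a worked example. The sketch below computes the Jaccard similarity between two sets of column names and groups files greedily against that threshold. The function names are hypothetical, and both the greedy strategy and the choice to compare column names only (ignoring types) are assumptions; the README does not describe how Datadock groups files internally.

```python
def jaccard_similarity(columns_a, columns_b):
    """Jaccard similarity of two column collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(columns_a), set(columns_b)
    if not a and not b:
        return 1.0  # two empty schemas are treated as identical
    return len(a & b) / len(a | b)


def group_by_similarity(file_columns, min_similarity=0.8):
    """Greedy grouping (assumed strategy): a file joins the first group whose
    reference columns are similar enough, otherwise it starts a new group."""
    groups = []  # list of (reference_columns, [file names])
    for name, columns in file_columns.items():
        for reference, members in groups:
            if jaccard_similarity(reference, columns) >= min_similarity:
                members.append(name)
                break
        else:
            groups.append((set(columns), [name]))
    return [members for _, members in groups]


# Illustrative column sets; the real files' columns are not given in the README.
file_columns = {
    "sales_2020.csv": ["id", "date", "amount", "region", "store"],
    "sales_2021.csv": ["id", "date", "amount", "region", "store", "channel"],
    "products.json": ["sku", "name", "price"],
}

# 5 shared columns out of 6 distinct ones gives 5/6 ≈ 0.83, which clears 0.8.
print(group_by_similarity(file_columns))
```

Under these assumptions the two sales files land in one group and `products.json` in another. Raising `min_similarity` to 0.9 would split the sales files, because 5/6 falls below the threshold.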
get_schema_info(path, min_similarity=0.8, recursive=False)
Returns a list of dictionaries containing:
- schema_id: ID of the schema group
- file_count: number of files in the group
- column_count: number of columns in the schema
- files: list of file names in the group
| Parameter | Type | Default | Description |
|---|---|---|---|
| path | str | — | Folder containing data files |
| min_similarity | float | 0.8 | Minimum Jaccard similarity (0–1) to group files together |
| recursive | bool | False | Whether to scan subdirectories recursively |
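
Because the return value is a plain list of dictionaries, it can be inspected with ordinary Python. The sketch below picks the group with the most files from a hand-written sample shaped like the documented structure. The sample values and the `dominant_schema` helper are hypothetical, and ranking by file count is an assumption about what "dominant" means here.

```python
# Hypothetical sample using the keys documented for get_schema_info.
info = [
    {"schema_id": 1, "file_count": 2, "column_count": 5,
     "files": ["sales_2020.csv", "sales_2021.csv"]},
    {"schema_id": 2, "file_count": 1, "column_count": 3,
     "files": ["products.json"]},
]


def dominant_schema(groups):
    """Return the group with the most files, breaking ties by column count."""
    return max(groups, key=lambda g: (g["file_count"], g["column_count"]))


best = dominant_schema(info)
print(best["schema_id"], best["files"])
```

The resulting `schema_id` could then be passed to `read_data` to load that group.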
read_data(path, schema_id=None, logs=False, spark=None, min_similarity=0.8, recursive=False)
Reads and merges all files that share the same schema into a single Spark DataFrame.
If schema_id is not specified, read_data reads schema group 1 (the first group detected).
| Parameter | Type | Default | Description |
|---|---|---|---|
| path | str | — | Folder containing data files |
| schema_id | int | None | ID of the schema group to read. Defaults to 1 |
| logs | bool | False | Whether to print detailed logs during loading |
| spark | SparkSession | None | Active SparkSession to use. Creates one if not provided |
| min_similarity | float | 0.8 | Minimum Jaccard similarity (0–1) to group files together |
| recursive | bool | False | Whether to scan subdirectories recursively |
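
With `min_similarity` below 1.0, files in one group may not have identical columns. The README does not say how `read_data` handles columns that some files lack, so the following is only a conceptual sketch in plain Python of one common approach: align every record to the union of columns and fill the gaps with `None`. The `align_records` helper is hypothetical. In PySpark, `DataFrame.unionByName(other, allowMissingColumns=True)` offers comparable behavior for DataFrames.

```python
def align_records(batches):
    """Merge batches of dict records onto the union of their keys.

    Columns keep first-seen order; values missing from a record become None.
    Hypothetical helper for illustration; not part of the Datadock API.
    """
    columns = []
    for batch in batches:
        for record in batch:
            for key in record:
                if key not in columns:
                    columns.append(key)
    merged = [
        {col: record.get(col) for col in columns}
        for batch in batches
        for record in batch
    ]
    return columns, merged


# Illustrative records standing in for two files of the same schema group.
rows_2020 = [{"id": 1, "amount": 10.0}]
rows_2021 = [{"id": 2, "amount": 12.5, "channel": "web"}]

columns, merged = align_records([rows_2020, rows_2021])
print(columns)
print(merged)
```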
✅ Requirements
- Python 3.10+
- PySpark
📚 Motivation
In real-world data engineering workflows, it's common to deal with files that represent the same data domain but have slight structural variations — such as missing columns, different orders, or evolving schemas. Datadock automates the process of grouping, inspecting, and reading these files reliably, allowing you to build pipelines that are schema-aware, scalable, and format-agnostic.
📄 License
This project is licensed under the MIT License.