Datadock is a PySpark-based data interoperability library. It automatically detects schemas from heterogeneous files (CSV, JSON, Parquet), groups them by structural similarity, and performs standardized batch reads. It is designed for pipelines that handle large-scale, non-uniform data, and it supports robust integration and reuse in distributed environments.

Datadock

Datadock is a Python library built on top of PySpark, designed to simplify data interoperability between files of different formats and schemas in modern data engineering pipelines.

It automatically detects schemas from CSV, JSON and Parquet files, groups structurally similar files, and allows standardized reading of all grouped files into a single Spark DataFrame — even in highly heterogeneous datasets.

🆕 Datadock now supports Databricks!

You can now use Datadock directly inside your Databricks notebooks and jobs. Pass any DBFS path (prefixed with dbfs:/, dbfs://, or /dbfs/) and Datadock normalizes it automatically; no extra configuration is needed.

from datadock import read_data

df = read_data("dbfs:/mnt/bronze/sales")
df.show()

See the Databricks Support section for full details on supported path formats and what's coming next.

✨ Key Features

🚀 Automatic parsing of multiple file formats: .csv, .json, .parquet
🧠 Schema-based file grouping by structural similarity
📊 Auto-selection of dominant schemas
🛠️ Unified read across similar files into a single PySpark DataFrame
🔍 Schema insight for diagnostics and inspection
☁️ Databricks DBFS support — works natively with dbfs:/ and /dbfs/ paths

🔧 Installation

pip install datadock

🗂️ Expected Input Structure

Place your data files (CSV, JSON or Parquet) inside a folder. The library will automatically detect supported files and organize them by schema similarity.

/data/input/
├── sales_2020.csv
├── sales_2021.csv
├── products.json
├── archive.parquet
├── log.parquet
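
As a rough illustration of the detection step, the sketch below lists the supported files in a folder by extension. The helper `list_supported_files` is hypothetical and is not part of Datadock's API; only the three extensions come from this README, and the library's actual discovery logic may differ.

```python
from pathlib import Path

# Extensions named in this README; the helper itself is a hypothetical sketch.
SUPPORTED_EXTENSIONS = {".csv", ".json", ".parquet"}


def list_supported_files(folder: str, recursive: bool = False) -> list[Path]:
    """Return the supported data files under `folder`, sorted by path."""
    root = Path(folder)
    # rglob descends into subdirectories; glob stays at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

With the folder shown above, `list_supported_files("/data/input/")` would return the five listed files and skip anything with another extension.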

🧪 Usage Example

from datadock import scan_schema, get_schema_info, read_data

path = "/path/to/your/data"

# Logs schema groups detected and returns their metadata
scan_schema(path)

# Retrieves schema metadata programmatically
info = get_schema_info(path)
print(info)

# Loads all files from schema group 1 into a single DataFrame
df = read_data(path, schema_id=1, logs=True)
df.show()

📌 Public API

`scan_schema(path, min_similarity=0.8, recursive=False)`

Logs the identified schema groups found in the specified folder and returns their metadata (same structure as get_schema_info).

Parameter	Type	Default	Description
`path`	`str`	—	Folder containing data files
`min_similarity`	`float`	`0.8`	Minimum Jaccard similarity (0–1) to group files together
`recursive`	`bool`	`False`	Whether to scan subdirectories recursively
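
To make the `min_similarity` threshold concrete, here is a minimal sketch of Jaccard similarity over column names, followed by a simple greedy grouping. The functions `jaccard_similarity` and `group_by_similarity`, the greedy strategy, and the sample column names are all assumptions made for illustration; the README states only that grouping uses a minimum Jaccard similarity, not how groups are formed.

```python
def jaccard_similarity(columns_a: set[str], columns_b: set[str]) -> float:
    """Size of the intersection divided by size of the union of two column sets."""
    union = columns_a | columns_b
    if not union:
        return 1.0  # convention chosen here: two empty schemas count as identical
    return len(columns_a & columns_b) / len(union)


def group_by_similarity(
    schemas: dict[str, set[str]], min_similarity: float = 0.8
) -> list[list[str]]:
    """Greedy grouping: a file joins the first group whose reference schema is similar enough."""
    groups: list[tuple[set[str], list[str]]] = []
    for name, columns in schemas.items():
        for reference, members in groups:
            if jaccard_similarity(reference, columns) >= min_similarity:
                members.append(name)
                break
        else:
            # No existing group matched, so this file starts a new one.
            groups.append((columns, [name]))
    return [members for _, members in groups]


# Made-up column sets for illustration only.
schemas = {
    "sales_2020.csv": {"id", "date", "amount", "store", "region"},
    "sales_2021.csv": {"id", "date", "amount", "store", "region", "channel"},
    "products.json": {"sku", "name", "price"},
}
groups = group_by_similarity(schemas)
```

Here the two sales files share 5 of 6 distinct columns (similarity of about 0.83), so they fall into one group at the default threshold, while `products.json` forms its own group.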

`get_schema_info(path, min_similarity=0.8, recursive=False)`

Returns a list of dictionaries containing:

schema_id: ID of the schema group
file_count: number of files in the group
column_count: number of columns in the schema
files: list of file names in the group

Parameter	Type	Default	Description
`path`	`str`	—	Folder containing data files
`min_similarity`	`float`	`0.8`	Minimum Jaccard similarity (0–1) to group files together
`recursive`	`bool`	`False`	Whether to scan subdirectories recursively
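
The returned list can be inspected with ordinary Python. The sketch below uses hand-written sample data in the shape described above (the values are invented) and a hypothetical helper, `largest_group`, that picks the group with the most files. Treating the largest group as the "dominant" schema is an assumption, not documented behavior.

```python
# Sample data in the documented shape; the values are invented for illustration.
info = [
    {"schema_id": 1, "file_count": 2, "column_count": 5,
     "files": ["sales_2020.csv", "sales_2021.csv"]},
    {"schema_id": 2, "file_count": 1, "column_count": 3,
     "files": ["products.json"]},
    {"schema_id": 3, "file_count": 2, "column_count": 4,
     "files": ["archive.parquet", "log.parquet"]},
]


def largest_group(groups: list[dict]) -> dict:
    """Hypothetical helper: return the group with the most files (first one wins ties)."""
    return max(groups, key=lambda group: group["file_count"])


chosen = largest_group(info)
summary = f"schema {chosen['schema_id']}: {chosen['file_count']} files, {chosen['column_count']} columns"
```

The selected `schema_id` could then be passed to `read_data(path, schema_id=chosen["schema_id"])`.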

`read_data(path, schema_id=None, logs=False, spark=None, min_similarity=0.8, recursive=False)`

Reads and merges all files that share the same schema into a single Spark DataFrame. If schema_id is not specified, it defaults to schema group 1 (the first group detected).

Parameter	Type	Default	Description
`path`	`str`	—	Folder containing data files
`schema_id`	`int`	`None`	ID of the schema group to read. When `None`, group `1` is read
`logs`	`bool`	`False`	Whether to print detailed logs during loading
`spark`	`SparkSession`	`None`	Active SparkSession to use. A new session is created if none is provided
`min_similarity`	`float`	`0.8`	Minimum Jaccard similarity (0–1) to group files together
`recursive`	`bool`	`False`	Whether to scan subdirectories recursively
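
Merging files whose schemas are similar but not identical means aligning rows by column name. The pure-Python sketch below illustrates that idea only; `align_rows` is a hypothetical helper, and the README does not say how Datadock handles columns that are missing from some files. In PySpark itself, `DataFrame.unionByName(other, allowMissingColumns=True)` (Spark 3.1+) performs a comparable name-based union.

```python
def align_rows(batches: list[list[dict]]) -> tuple[list[str], list[dict]]:
    """Merge row batches by column name, filling absent columns with None."""
    # Collect column names in first-seen order across all batches.
    columns: list[str] = []
    for batch in batches:
        for row in batch:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Rebuild every row with the full column list; missing values become None.
    merged = [
        {name: row.get(name) for name in columns}
        for batch in batches
        for row in batch
    ]
    return columns, merged


# Invented rows standing in for two files of the same schema group.
batch_2020 = [{"id": 1, "amount": 10.0}]
batch_2021 = [{"amount": 12.5, "id": 2, "channel": "web"}]
columns, rows = align_rows([batch_2020, batch_2021])
```

Column order in the second batch differs from the first, and `channel` is missing in 2020; after alignment both rows share the columns `id`, `amount`, `channel`.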

☁️ Databricks Support

Datadock supports reading files from the Databricks File System (DBFS) via the local FUSE mount that Databricks exposes on every cluster.

The following path formats are accepted and automatically normalized:

# All three are equivalent and work out of the box
read_data("dbfs:/mnt/bronze/sales", spark=spark)
read_data("dbfs://mnt/bronze/sales", spark=spark)
read_data("/dbfs/mnt/bronze/sales", spark=spark)
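
One way such normalization could work is to map the two `dbfs:` prefixes onto the `/dbfs` FUSE mount and leave every other path untouched. The function `normalize_dbfs_path` below is a hypothetical sketch under that assumption; it is not Datadock's actual implementation.

```python
def normalize_dbfs_path(path: str) -> str:
    """Map dbfs:/ and dbfs:// paths onto the /dbfs FUSE mount; return other paths unchanged."""
    # Check the longer prefix first, because "dbfs://x" also starts with "dbfs:/".
    for prefix in ("dbfs://", "dbfs:/"):
        if path.startswith(prefix):
            return "/dbfs/" + path[len(prefix):].lstrip("/")
    return path


normalized = [
    normalize_dbfs_path(p)
    for p in (
        "dbfs:/mnt/bronze/sales",
        "dbfs://mnt/bronze/sales",
        "/dbfs/mnt/bronze/sales",
    )
]
```

Under this sketch all three spellings resolve to the same local path, `/dbfs/mnt/bronze/sales`.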

Supported path formats:

Path format	Example	Status
DBFS native prefix	`dbfs:/mnt/...`	✅ Available
DBFS double-slash prefix	`dbfs://mnt/...`	✅ Available
DBFS FUSE mount	`/dbfs/mnt/...`	✅ Available
Local filesystem	`/local/path/...`	✅ Available
Azure Data Lake (ADLS)	`abfss://container@...`	🔜 Coming soon
AWS S3	`s3://bucket/...`	🔜 Coming soon
Google Cloud Storage	`gs://bucket/...`	🔜 Coming soon
Azure Blob Storage	`wasbs://container@...`	🔜 Coming soon

Support for cloud storage paths (ADLS, S3, GCS, Azure Blob) is planned for upcoming releases.

✅ Requirements

Python 3.10+
PySpark

📚 Motivation

In real-world data engineering workflows, it's common to deal with files that represent the same data domain but have slight structural variations — such as missing columns, different orders, or evolving schemas. Datadock automates the process of grouping, inspecting, and reading these files reliably, allowing you to build pipelines that are schema-aware, scalable, and format-agnostic.

📄 License

This project is licensed under the MIT License.

Release history

Version	Release date
0.2.1 (this version)	Mar 27, 2026
0.2.0	Mar 27, 2026
0.1.4	Mar 27, 2026
0.1.3	Feb 20, 2026
0.1.2	Jun 16, 2025
