A Python package for Azure Data Lake Storage Gen2 and Microsoft Fabric Lakehouse that enables format detection and schema retrieval for Iceberg, Delta, and Parquet formats. It also identifies partition columns for Parquet datasets.
DataLakeSurfer
A unified toolkit for detecting, exploring, and validating data lake formats across Azure Data Lake Storage Gen2 and Microsoft Fabric Lakehouse.
DataLakeSurfer lets you:
- Detect whether a directory is Iceberg, Delta, or Parquet.
- Retrieve normalized schema for supported formats.
- Detect partition columns for Parquet datasets.
- Validate configuration inputs with Pydantic models.
- Work seamlessly with both ADLS Gen2 and Fabric Lakehouse.
📑 Table of Contents
- Installation
- Quick Start
- Core Concepts
- Usage Examples
- Supported Formats
- Development & Testing
- License
🚀 Installation
pip install datalakesurfer
Requirements:
- Python 3.9+
- pyarrowfs-adlgen2==0.2.5
- adlfs==2023.12.0
- azure-identity==1.17.1
- azure-storage-file-datalake==12.17.0
- numpy==1.26.4
- pandas==1.5.3
- cryptography==43.0.1
- duckdb==1.1.1
- deltalake==0.20.2
- pydantic==2.11.7
⚡ Quick Start
Generate Token
from azure.identity import DefaultAzureCredential, ClientSecretCredential
from azure.core.credentials import AccessToken

class CustomTokenCredential:
    """Acquire an access token for Azure Storage via the default credential chain."""

    def __init__(self):
        self.credential = DefaultAzureCredential()

    def get_token(self):
        token_response = self.credential.get_token("https://storage.azure.com/.default")
        return (token_response.token, token_response.expires_on)

class CustomTokenCredentialSecondary:
    """Wrap an existing token/expiry pair as an azure-core style credential."""

    def __init__(self, token, expires_on):
        self.token = token
        self.expires_on = expires_on

    def get_token(self, *scopes, **kwargs):
        return AccessToken(self.token, self.expires_on)

token, expires_on = CustomTokenCredential().get_token()
print(token)
print(expires_on)
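The unused ClientSecretCredential import above hints at service-principal authentication; here is a minimal sketch of that variant, assuming you supply your own tenant ID, client ID, and client secret (the placeholder values are hypothetical):
from azure.identity import ClientSecretCredential

# Hypothetical placeholders -- replace with your service principal's values.
credential = ClientSecretCredential(tenant_id="<tenant-id>",
                                    client_id="<client-id>",
                                    client_secret="<client-secret>")
token_response = credential.get_token("https://storage.azure.com/.default")
token, expires_on = token_response.token, token_response.expires_on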
from datalakesurfer.format_detector import FormatDetector
f = FormatDetector(account_name="adlssynapseeus01",
                   file_system_name="datacontainer",
                   directory_path="salesorder",
                   token=token,
                   expires_on=expires_on).detect_format()
print(f)
# → {'status': 'success', 'format': 'delta'}
🧠 Core Concepts
- Format Detectors: Identify the dataset format (iceberg, delta, or parquet).
- Schema Retrievers: Extract a normalized schema regardless of source format.
- Partition Detectors: Identify partitioning and list partition columns.
- Models: All incoming parameters are validated by Pydantic for correctness; see the sketch after this list.
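The validation models ship inside the package; the snippet below is only a sketch of the idea, assuming fields shaped like the constructor arguments used throughout the examples (the class name and validator rule are hypothetical, not the package's actual model):
from pydantic import BaseModel, field_validator

# Hypothetical sketch of Pydantic-based input validation; the real model in
# datalakesurfer may differ.
class ConnectionParams(BaseModel):
    account_name: str
    file_system_name: str
    directory_path: str
    token: str
    expires_on: int

    @field_validator("directory_path")
    @classmethod
    def strip_leading_slash(cls, v: str) -> str:
        # Hypothetical normalization rule, for illustration only.
        return v.lstrip("/")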
📚 Usage Examples
1. Detect Format (ADLS)
from datalakesurfer.format_detector import FormatDetector
f = FormatDetector(account_name="adlssynapseeus01",
                   file_system_name="iceberg-container",
                   directory_path="customerdw/mydb/product",
                   token=token,
                   expires_on=expires_on).detect_format()
print(f)
# {'status': 'success', 'format': 'iceberg'}
from datalakesurfer.format_detector import FormatDetector
f = FormatDetector(account_name="adlssynapseeus01",
                   file_system_name="datacontainer",
                   directory_path="SaleParquetData",
                   token=token,
                   expires_on=expires_on).detect_format()
print(f)
# {'status': 'success', 'format': 'parquet'}
2. Detect Format (Fabric)
from datalakesurfer.fabric_format_detector import FabricFormatDetector
f = FabricFormatDetector(account_name="onelake",
                         file_system_name="devworkspace",
                         directory_path="devlakehouse.Lakehouse/Files/salesorder",
                         token=token,
                         expires_on=expires_on).detect_format()
print(f)
# Resolved OneLake path: abfss://devworkspace@onelake.dfs.fabric.microsoft.com/devlakehouse.Lakehouse/Files/salesorder
# {'status': 'success', 'format': 'delta'}
3. Retrieve Schema
For Parquet/Delta/Iceberg in ADLS:
from datalakesurfer.schemas.delta_schema import DeltaSchemaRetriever
s = DeltaSchemaRetriever(account_name="adlssynapseeus01",
                         file_system_name="datacontainer",
                         directory_path="salesorder",
                         token=token,
                         expires_on=expires_on).get_schema()
print(s)
#{'status': 'success', 'schema': [{'column_name': 'SalesId', 'dtype': 'string'}, {'column_name': 'ProductName', 'dtype': 'string'}, {'column_name': 'SalesDateTime', 'dtype': 'timestamp'}, {'column_name': 'SalesAmount', 'dtype': 'int'}, {'column_name': 'EventProcessingTime', 'dtype': 'timestamp'}]}
from datalakesurfer.schemas.iceberg_schema import IcebergSchemaRetriever
s = IcebergSchemaRetriever(account_name="adlssynapseeus01",
                           file_system_name="iceberg-container",
                           directory_path="customerdw/mydb/product",
                           token=token,
                           expires_on=expires_on).get_schema()
print(s)
#{'status': 'success', 'schema': [{'column_name': 'productId', 'dtype': 'string'}, {'column_name': 'productName', 'dtype': 'string'}, {'column_name': 'productDescription', 'dtype': 'string'}, {'column_name': 'productPrice', 'dtype': 'double'}, {'column_name': 'dataModified', 'dtype': 'timestamp'}]}
from datalakesurfer.schemas.parquet_schema import ParquetSchemaRetriever
s = ParquetSchemaRetriever(account_name="adlssynapseeus01",
                           file_system_name="datacontainer",
                           directory_path="SaleParquetData",
                           token=token,
                           expires_on=expires_on).get_schema()
print(s)
#{'status': 'success', 'schema': [{'column_name': 'SalesId', 'dtype': 'string'}, {'column_name': 'ProductName', 'dtype': 'string'}, {'column_name': 'SalesDateTime', 'dtype': 'timestamp'}, {'column_name': 'SalesAmount', 'dtype': 'int'}, {'column_name': 'EventProcessingTime', 'dtype': 'timestamp'}]}
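Putting detection and retrieval together, you can dispatch on the detected format. A minimal sketch using the classes shown above (the mapping and helper function are ours, not part of the package):
from datalakesurfer.format_detector import FormatDetector
from datalakesurfer.schemas.delta_schema import DeltaSchemaRetriever
from datalakesurfer.schemas.iceberg_schema import IcebergSchemaRetriever
from datalakesurfer.schemas.parquet_schema import ParquetSchemaRetriever

# Hypothetical helper: route the detected format to the matching retriever.
RETRIEVERS = {
    "delta": DeltaSchemaRetriever,
    "iceberg": IcebergSchemaRetriever,
    "parquet": ParquetSchemaRetriever,
}

def get_schema_for(**kwargs):
    result = FormatDetector(**kwargs).detect_format()
    if result.get("status") != "success":
        return result
    return RETRIEVERS[result["format"]](**kwargs).get_schema()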
4. Detect Partitions
ADLS:
from datalakesurfer.schemas.parquet_schema import ParquetSchemaRetriever
f = ParquetSchemaRetriever(account_name="adlssynapseeus01",
                           file_system_name="datacontainer",
                           directory_path="LargeSaleParquetData2Billion",
                           token=token,
                           expires_on=expires_on).detect_partitions()
print(f)
#{'status': 'success', 'isPartitioned': True, 'partition_columns': [{'column_name': 'ProductName', 'dtype': 'string'}, {'column_name': 'SalesAmount', 'dtype': 'int32'}]}
from datalakesurfer.schemas.parquet_schema import ParquetSchemaRetriever
f = ParquetSchemaRetriever(account_name="adlssynapseeus01",
                           file_system_name="datacontainer",
                           directory_path="SaleParquetData",
                           token=token,
                           expires_on=expires_on).detect_partitions()
print(f)
#{'status': 'success', 'isPartitioned': False, 'partition_columns': []}
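For context, partition detection typically keys off Hive-style layouts, where column=value path segments encode partition columns. The following self-contained illustration uses pyarrow (already a dependency) on a local temporary directory; it is not part of datalakesurfer's API:
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

root = tempfile.mkdtemp()
# Write a small dataset partitioned by ProductName, which produces
# ProductName=<value>/ subdirectories on disk.
table = pa.table({"SalesId": ["s1", "s2"],
                  "ProductName": ["A", "B"],
                  "SalesAmount": [10, 20]})
pq.write_to_dataset(table, root, partition_cols=["ProductName"])

# Reading back with hive partitioning recovers ProductName as a column --
# the kind of structure detect_partitions() reports.
dataset = ds.dataset(root, partitioning="hive")
print(dataset.schema)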
Fabric (format detection, schema retrieval, and partition detection):
from datalakesurfer.fabric_format_detector import FabricFormatDetector
f = FabricFormatDetector(account_name="onelake",
                         file_system_name="devworkspace",
                         directory_path="devlakehouse.Lakehouse/Files/Product",
                         token=token,
                         expires_on=expires_on).detect_format()
print(f)
#{'status': 'success', 'format': 'parquet'}
from datalakesurfer.fabric_format_detector import FabricFormatDetector
f = FabricFormatDetector(account_name="onelake",
                         file_system_name="devworkspace",
                         directory_path="devlakehouse.Lakehouse/Files/FlightTripData",
                         token=token,
                         expires_on=expires_on).detect_format()
print(f)
#{'status': 'success', 'format': 'iceberg'}
from datalakesurfer.schemas.fabric_delta_schema import FabricDeltaSchemaRetriever
s = FabricDeltaSchemaRetriever(account_name="onelake",
                               file_system_name="devworkspace",
                               directory_path="devlakehouse.Lakehouse/Files/salesorder",
                               token=token,
                               expires_on=expires_on).get_schema()
print(s)
#{'status': 'success', 'schema': [{'column_name': 'SalesId', 'dtype': 'string'}, {'column_name': 'ProductName', 'dtype': 'string'}, {'column_name': 'SalesDateTime', 'dtype': 'timestamp'}, {'column_name': 'SalesAmount', 'dtype': 'int'}, {'column_name': 'EventProcessingTime', 'dtype': 'timestamp'}]}
from datalakesurfer.schemas.fabric_iceberg_schema import FabricIcebergSchemaRetriever
s = FabricIcebergSchemaRetriever(account_name="onelake",
                                 file_system_name="devworkspace",
                                 directory_path="devlakehouse.Lakehouse/Files/FlightTripData",
                                 token=token,
                                 expires_on=expires_on).get_schema()
print(s)
#{'status': 'success', 'schema': [{'column_name': 'flight_id', 'dtype': 'int'}, {'column_name': 'airline_code', 'dtype': 'int'}, {'column_name': 'passenger_count', 'dtype': 'int'}, {'column_name': 'flight_duration_minutes', 'dtype': 'bigint'}, {'column_name': 'baggage_weight', 'dtype': 'float'}, {'column_name': 'ticket_price', 'dtype': 'double'}, {'column_name': 'tax_amount', 'dtype': 'decimal(38,18)', 'typeproperties': {'precision': 10, 'scale': 2}}, {'column_name': 'destination', 'dtype': 'string'}, {'column_name': 'flight_data', 'dtype': 'string'}, {'column_name': 'is_international', 'dtype': 'boolean'}, {'column_name': 'departure_timestamp', 'dtype': 'timestamp'}, {'column_name': 'arrival_timestamp_ntz', 'dtype': 'timestamp'}, {'column_name': 'flight_date', 'dtype': 'date'}]}
from datalakesurfer.schemas.fabric_parquet_schema import FabricParquetSchemaRetriever
s = FabricParquetSchemaRetriever(account_name="onelake",
                                 file_system_name="devworkspace",
                                 directory_path="devlakehouse.Lakehouse/Files/Product",
                                 token=token,
                                 expires_on=expires_on).get_schema()
print(s)
# {'status': 'success', 'schema': [{'column_name': 'name', 'dtype': 'string'}, {'column_name': 'age', 'dtype': 'long'}]}
from datalakesurfer.schemas.fabric_parquet_schema import FabricParquetSchemaRetriever
s = FabricParquetSchemaRetriever(account_name="onelake",
                                 file_system_name="devworkspace",
                                 directory_path="devlakehouse.Lakehouse/Files/Product",
                                 token=token,
                                 expires_on=expires_on).detect_partitions()
print(s)
#{'status': 'success', 'isPartitioned': False, 'partition_columns': []}
from datalakesurfer.schemas.fabric_parquet_schema import FabricParquetSchemaRetriever
f = FabricParquetSchemaRetriever(account_name="onelake",
                                 file_system_name="devworkspace",
                                 directory_path="devlakehouse.Lakehouse/Files/Customer",
                                 token=token,
                                 expires_on=expires_on).detect_partitions()
print(f)
#{'status': 'success', 'isPartitioned': True, 'partition_columns': [{'column_name': 'city', 'dtype': 'string'}]}
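The same detect-then-dispatch pattern works on Fabric with the Fabric-specific classes. A minimal sketch (the mapping dict is ours, not part of the package):
from datalakesurfer.fabric_format_detector import FabricFormatDetector
from datalakesurfer.schemas.fabric_delta_schema import FabricDeltaSchemaRetriever
from datalakesurfer.schemas.fabric_iceberg_schema import FabricIcebergSchemaRetriever
from datalakesurfer.schemas.fabric_parquet_schema import FabricParquetSchemaRetriever

# Hypothetical mapping from detected format to the matching Fabric retriever.
FABRIC_RETRIEVERS = {
    "delta": FabricDeltaSchemaRetriever,
    "iceberg": FabricIcebergSchemaRetriever,
    "parquet": FabricParquetSchemaRetriever,
}

kwargs = dict(account_name="onelake",
              file_system_name="devworkspace",
              directory_path="devlakehouse.Lakehouse/Files/salesorder",
              token=token,
              expires_on=expires_on)
result = FabricFormatDetector(**kwargs).detect_format()
if result.get("status") == "success":
    print(FABRIC_RETRIEVERS[result["format"]](**kwargs).get_schema())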
📦 Supported Formats
| Format | ADLS Gen2 | Fabric OneLake |
|---|---|---|
| Parquet | ✅ | ✅ |
| Delta | ✅ | ✅ |
| Iceberg | ✅ | ✅ |
🛠️ Development & Testing
Run tests:
pytest
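Because the detectors need live storage credentials, an integration test might gate on environment variables. A hedged sketch (all variable names here are hypothetical):
import os

import pytest
from datalakesurfer.format_detector import FormatDetector

@pytest.mark.skipif("DLS_TOKEN" not in os.environ, reason="requires ADLS credentials")
def test_detect_format_reports_a_known_format():
    # Hypothetical environment variables supply the connection parameters.
    result = FormatDetector(account_name=os.environ["DLS_ACCOUNT"],
                            file_system_name=os.environ["DLS_FILESYSTEM"],
                            directory_path=os.environ["DLS_PATH"],
                            token=os.environ["DLS_TOKEN"],
                            expires_on=int(os.environ["DLS_EXPIRES_ON"])).detect_format()
    assert result["status"] == "success"
    assert result["format"] in {"iceberg", "delta", "parquet"}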
📄 License
MIT License – See LICENSE for details.