
A Python package for Azure Data Lake Storage Gen2 (abfss://), Microsoft Fabric Lakehouse (abfss://), Google Cloud Storage (gs://bucket), and AWS S3 (s3://bucket) that provides format detection and schema retrieval for Iceberg, Delta, and Parquet. It also identifies partition columns for Parquet datasets.


DataLakeSurfer

A unified toolkit for detecting, exploring, and validating data lake formats across Azure Data Lake Storage Gen2, Microsoft Fabric Lakehouse, Google Cloud Storage (GCS), and AWS S3.

DataLakeSurfer lets you:

  • Detect whether a directory is Iceberg, Delta, or Parquet.
  • Retrieve normalized schema for supported formats.
  • Detect partition columns for Parquet datasets.
  • Query Delta tables through SQL.
  • Validate configuration inputs with Pydantic models.
  • Work seamlessly with Azure ADLS Gen2, Fabric Lakehouse, AWS S3, and GCS.

📑 Table of Contents

  1. Installation
  2. Quick Start
  3. Core Concepts
  4. Usage Examples
  5. Supported Formats
  6. Development & Testing
  7. License

🚀 Installation

pip install datalakesurfer

Requirements:

  • Python 3.9+
  • pyarrowfs-adlgen2==0.2.5
  • adlfs==2024.12.0
  • azure-identity==1.17.1
  • azure-storage-file-datalake==12.21.0
  • numpy==1.26.4
  • pandas==1.5.3
  • cryptography==43.0.1
  • duckdb==1.3.2
  • deltalake==1.1.4
  • pydantic==2.11.7
  • gcsfs==2025.7.0
  • s3fs==2025.7.0

⚡ Quick Start

Generate Token

from azure.identity import DefaultAzureCredential
from azure.core.credentials import AccessToken

class CustomTokenCredential:
    """Acquires a storage-scoped access token via DefaultAzureCredential."""

    def __init__(self):
        self.credential = DefaultAzureCredential()

    def get_token(self):
        token_response = self.credential.get_token("https://storage.azure.com/.default")
        return (token_response.token, token_response.expires_on)

class CustomTokenCredentialSecondary:
    """Wraps an already-acquired token so it satisfies the azure-core
    credential protocol (get_token returning an AccessToken)."""

    def __init__(self, token, expires_on):
        self.token = token
        self.expires_on = expires_on

    def get_token(self, *scopes, **kwargs):
        return AccessToken(self.token, self.expires_on)

token, expires_on = CustomTokenCredential().get_token()
from datalakesurfer.format_detector import FormatDetector
f = FormatDetector(account_name="adlssynapseeus01",
    file_system_name="datacontainer",
    directory_path="salesorder",
    token=token,
    expires_on=expires_on).detect_format()
print(f)

# → {'status': 'success', 'format': 'delta'}
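The tuple returned by `get_token()` carries `expires_on` as a Unix timestamp, so a small helper can decide when to request a fresh token before calling the detectors again. This is a generic illustration; the helper name and the 300-second skew are my own choices, not part of DataLakeSurfer:

```python
import time

def token_needs_refresh(expires_on: int, skew_seconds: int = 300) -> bool:
    """Return True when the token expires within `skew_seconds`.

    `expires_on` is the Unix timestamp returned by get_token()
    in the Quick Start above.
    """
    return expires_on - time.time() < skew_seconds

# A token that expired an hour ago clearly needs a refresh:
print(token_needs_refresh(int(time.time()) - 3600))  # True
# A token valid for another hour does not:
print(token_needs_refresh(int(time.time()) + 3600))  # False
```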

🧠 Core Concepts

  • Format Detectors: Identify dataset format (iceberg, delta, parquet).
  • Schema Retrievers: Extract a normalized schema regardless of source format.
  • Partition Detectors: Identify partitioning and list partition columns.
  • Models: All incoming parameters validated by Pydantic for correctness.
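The detection step can be pictured with a short, self-contained sketch of how the three formats are commonly told apart on disk: Delta tables keep a `_delta_log/` directory, Iceberg tables keep `*.metadata.json` files under `metadata/`, and plain Parquet datasets contain only `.parquet` files. This illustrates the concept against a local directory and is not DataLakeSurfer's actual implementation:

```python
from pathlib import Path

def sketch_detect_format(directory: str) -> dict:
    """Illustrative format detection by marker files (local paths only)."""
    root = Path(directory)
    # Delta writes its transaction log under _delta_log/.
    if (root / "_delta_log").is_dir():
        return {"status": "success", "format": "delta"}
    # Iceberg keeps versioned table metadata as JSON under metadata/.
    if (root / "metadata").is_dir() and any((root / "metadata").glob("*.metadata.json")):
        return {"status": "success", "format": "iceberg"}
    # Otherwise, any .parquet file below marks a plain Parquet dataset.
    if any(root.rglob("*.parquet")):
        return {"status": "success", "format": "parquet"}
    return {"status": "error", "message": "unknown format"}
```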

📚 Usage Examples

1. Detect Format (ADLS)

from datalakesurfer.format_detector import FormatDetector
f = FormatDetector(account_name="adlssynapseeus01",
    file_system_name="iceberg-container",
    directory_path="customerdw/mydb/product",
    token=token,
    expires_on=expires_on).detect_format()
print(f)
# {'status': 'success', 'format': 'iceberg'}

from datalakesurfer.format_detector import FormatDetector
f = FormatDetector(account_name="adlssynapseeus01",
    file_system_name="datacontainer",
    directory_path="SaleParquetData",
    token=token,
    expires_on=expires_on).detect_format()
print(f)
# {'status': 'success', 'format': 'parquet'}

2. Detect Format (Fabric)

from datalakesurfer.fabric_format_detector import FabricFormatDetector
f = FabricFormatDetector(account_name="onelake",
    file_system_name="devworkspace",
    directory_path="devlakehouse.Lakehouse/Files/salesorder",
    token=token,
    expires_on=expires_on).detect_format()
print(f)

#abfss://devworkspace@onelake.dfs.fabric.microsoft.com/devlakehouse.Lakehouse/Files/salesorder
# {'status': 'success', 'format': 'delta'}

3. Detect Format (GCS)

import json
with open("/Workspace/Users/<>/fresh-replica-393006-0398d554d3fa.json", "r") as f:
    service_account_dict = json.load(f)

from datalakesurfer.gcs_format_detector import GCSFormatDetector
f = GCSFormatDetector(service_account_info=service_account_dict,
    file_system_name="storagebucket0001",
    directory_path="Country").detect_format()
print(f)
# {'status': 'success', 'format': 'delta'}

f = GCSFormatDetector(service_account_info=service_account_dict,
    file_system_name="storagebucket0001",
    directory_path="ParquetDataSource").detect_format()
print(f)
# {'status': 'success', 'format': 'parquet'}

f = GCSFormatDetector(service_account_info=service_account_dict,
    file_system_name="storagebucket0001",
    directory_path="customerdw/mydb/product").detect_format()
print(f)
# {'status': 'success', 'format': 'iceberg'}

4. Retrieve Schema (ADLS)

For Parquet/Delta/Iceberg in ADLS:

from datalakesurfer.schemas.delta_schema import DeltaSchemaRetriever
s = DeltaSchemaRetriever(account_name="adlssynapseeus01",
    file_system_name="datacontainer",
    directory_path="salesorder",
    token=token,
    expires_on=expires_on).get_schema()
print(s)
#{'status': 'success', 'schema': [{'column_name': 'SalesId', 'dtype': 'string'}, {'column_name': 'ProductName', 'dtype': 'string'}, {'column_name': 'SalesDateTime', 'dtype': 'timestamp'}, {'column_name': 'SalesAmount', 'dtype': 'int'}, {'column_name': 'EventProcessingTime', 'dtype': 'timestamp'}]}

from datalakesurfer.schemas.iceberg_schema import IcebergSchemaRetriever
s = IcebergSchemaRetriever(account_name="adlssynapseeus01",
    file_system_name="iceberg-container",
    directory_path="customerdw/mydb/product",
    token=token,
    expires_on=expires_on).get_schema()
print(s)
#{'status': 'success', 'schema': [{'column_name': 'productId', 'dtype': 'string'}, {'column_name': 'productName', 'dtype': 'string'}, {'column_name': 'productDescription', 'dtype': 'string'}, {'column_name': 'productPrice', 'dtype': 'double'}, {'column_name': 'dataModified', 'dtype': 'timestamp'}]}

from datalakesurfer.schemas.parquet_schema import ParquetSchemaRetriever
s = ParquetSchemaRetriever(account_name="adlssynapseeus01",
    file_system_name="datacontainer",
    directory_path="SaleParquetData",
    token=token,
    expires_on=expires_on).get_schema()
print(s)

#{'status': 'success', 'schema': [{'column_name': 'SalesId', 'dtype': 'string'}, {'column_name': 'ProductName', 'dtype': 'string'}, {'column_name': 'SalesDateTime', 'dtype': 'timestamp'}, {'column_name': 'SalesAmount', 'dtype': 'int'}, {'column_name': 'EventProcessingTime', 'dtype': 'timestamp'}]}

5. Retrieve Schema (GCS)

For Delta/Iceberg/Parquet in GCS:

from datalakesurfer.schemas.gcs_delta_schema import GCSDeltaSchemaRetriever
s = GCSDeltaSchemaRetriever(service_account_info=service_account_dict,
    file_system_name="storagebucket0001",
    directory_path="Country").get_schema()
print(s)
# [{'column_name': 'SalesId', 'dtype': 'string'}, ...]

from datalakesurfer.schemas.gcs_iceberg_schema import GCSIcebergSchemaRetriever
s = GCSIcebergSchemaRetriever(service_account_info=service_account_dict,
    file_system_name="storagebucket0001",
    directory_path="customerdw/mydb/product").get_schema()
print(s)
# [{'column_name': 'productId', 'dtype': 'string'}, ...]

from datalakesurfer.schemas.gcs_parquet_schema import GCSParquetSchemaRetriever
s = GCSParquetSchemaRetriever(service_account_info=service_account_dict,
    file_system_name="storagebucket0001",
    directory_path="ParquetDataSource").get_schema()
print(s)
# [{'column_name': 'SalesId', 'dtype': 'string'}, ...]
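Whatever the source format, the retrievers above return the schema in one normalized shape: a list of `{'column_name': ..., 'dtype': ...}` dicts. A hypothetical sketch of such a normalization step (the mapping table here is my own illustration, not the package's actual type mapping):

```python
# Hypothetical Arrow-type -> normalized-dtype table, for illustration only.
_TYPE_MAP = {
    "int32": "int",
    "int64": "bigint",
    "float": "float",
    "double": "double",
    "string": "string",
    "timestamp[us]": "timestamp",
    "date32[day]": "date",
    "bool": "boolean",
}

def normalize_schema(fields):
    """Map (name, arrow_type) pairs to the normalized schema shape
    used in the examples above; unknown types pass through unchanged."""
    return [
        {"column_name": name, "dtype": _TYPE_MAP.get(dtype, dtype)}
        for name, dtype in fields
    ]

print(normalize_schema([("SalesId", "string"), ("SalesAmount", "int32")]))
# [{'column_name': 'SalesId', 'dtype': 'string'}, {'column_name': 'SalesAmount', 'dtype': 'int'}]
```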

6. Detect Partitions (ADLS & Fabric)

ADLS:

from datalakesurfer.schemas.parquet_schema import ParquetSchemaRetriever
f = ParquetSchemaRetriever(account_name="adlssynapseeus01",
    file_system_name="datacontainer",
    directory_path="LargeSaleParquetData2Billion",
    token=token,
    expires_on=expires_on).detect_partitions()
print(f)
#{'status': 'success', 'isPartitioned': True, 'partition_columns': [{'column_name': 'ProductName', 'dtype': 'string'}, {'column_name': 'SalesAmount', 'dtype': 'int32'}]}

from datalakesurfer.schemas.parquet_schema import ParquetSchemaRetriever
f = ParquetSchemaRetriever(account_name="adlssynapseeus01",
    file_system_name="datacontainer",
    directory_path="SaleParquetData",
    token=token,
    expires_on=expires_on).detect_partitions()
print(f)
#{'status': 'success', 'isPartitioned': False, 'partition_columns': []}

Fabric:

from datalakesurfer.fabric_format_detector import FabricFormatDetector
f = FabricFormatDetector(account_name="onelake",
    file_system_name="devworkspace",
    directory_path="devlakehouse.Lakehouse/Files/Product",
    token=token,
    expires_on=expires_on).detect_format()
print(f)
#{'status': 'success', 'format': 'parquet'}

from datalakesurfer.fabric_format_detector import FabricFormatDetector
f = FabricFormatDetector(account_name="onelake",
    file_system_name="devworkspace",
    directory_path="devlakehouse.Lakehouse/Files/FlightTripData",
    token=token,
    expires_on=expires_on).detect_format()
print(f)
#{'status': 'success', 'format': 'iceberg'}

from datalakesurfer.schemas.fabric_delta_schema import FabricDeltaSchemaRetriever
s = FabricDeltaSchemaRetriever(account_name="onelake",
    file_system_name="devworkspace",
    directory_path="devlakehouse.Lakehouse/Files/salesorder",
    token=token,
    expires_on=expires_on).get_schema()
print(s)
#{'status': 'success', 'schema': [{'column_name': 'SalesId', 'dtype': 'string'}, {'column_name': 'ProductName', 'dtype': 'string'}, {'column_name': 'SalesDateTime', 'dtype': 'timestamp'}, {'column_name': 'SalesAmount', 'dtype': 'int'}, {'column_name': 'EventProcessingTime', 'dtype': 'timestamp'}]}

from datalakesurfer.schemas.fabric_iceberg_schema import FabricIcebergSchemaRetriever
s = FabricIcebergSchemaRetriever(account_name="onelake",
    file_system_name="devworkspace",
    directory_path="devlakehouse.Lakehouse/Files/FlightTripData",
    token=token,
    expires_on=expires_on).get_schema()
print(s)
#{'status': 'success', 'schema': [{'column_name': 'flight_id', 'dtype': 'int'}, {'column_name': 'airline_code', 'dtype': 'int'}, {'column_name': 'passenger_count', 'dtype': 'int'}, {'column_name': 'flight_duration_minutes', 'dtype': 'bigint'}, {'column_name': 'baggage_weight', 'dtype': 'float'}, {'column_name': 'ticket_price', 'dtype': 'double'}, {'column_name': 'tax_amount', 'dtype': 'decimal(38,18)', 'typeproperties': {'precision': 10, 'scale': 2}}, {'column_name': 'destination', 'dtype': 'string'}, {'column_name': 'flight_data', 'dtype': 'string'}, {'column_name': 'is_international', 'dtype': 'boolean'}, {'column_name': 'departure_timestamp', 'dtype': 'timestamp'}, {'column_name': 'arrival_timestamp_ntz', 'dtype': 'timestamp'}, {'column_name': 'flight_date', 'dtype': 'date'}]}

from datalakesurfer.schemas.fabric_parquet_schema import FabricParquetSchemaRetriever
s = FabricParquetSchemaRetriever(account_name="onelake",
    file_system_name="devworkspace",
    directory_path="devlakehouse.Lakehouse/Files/Product",
    token=token,
    expires_on=expires_on).get_schema()
print(s)

# {'status': 'success', 'schema': [{'column_name': 'name', 'dtype': 'string'}, {'column_name': 'age', 'dtype': 'long'}]}

from datalakesurfer.schemas.fabric_parquet_schema import FabricParquetSchemaRetriever
s = FabricParquetSchemaRetriever(account_name="onelake",
    file_system_name="devworkspace",
    directory_path="devlakehouse.Lakehouse/Files/Product",
    token=token,
    expires_on=expires_on).detect_partitions()
print(s)

#{'status': 'success', 'isPartitioned': False, 'partition_columns': []}

from datalakesurfer.schemas.fabric_parquet_schema import FabricParquetSchemaRetriever
f = FabricParquetSchemaRetriever(account_name="onelake",
    file_system_name="devworkspace",
    directory_path="devlakehouse.Lakehouse/Files/Customer",
    token=token,
    expires_on=expires_on).detect_partitions()
print(f)

#{'status': 'success', 'isPartitioned': True, 'partition_columns': [{'column_name': 'city', 'dtype': 'string'}]}

7. Detect Partitions (GCS)

from datalakesurfer.schemas.gcs_parquet_schema import GCSParquetSchemaRetriever
f = GCSParquetSchemaRetriever(service_account_info=service_account_dict,
    file_system_name="storagebucket0001",
    directory_path="ParquetDataSource").detect_partitions()
print(f)
# {'status': 'success', 'isPartitioned': True, 'partition_columns': [{'column_name': 'ProductName', 'dtype': 'string'}, ...]}

# For a directory that is not partitioned, the same call returns:
# {'status': 'success', 'isPartitioned': False, 'partition_columns': []}
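Partition detection for Parquet datasets generally relies on Hive-style `column=value` directory names in the file paths. A self-contained sketch of that idea (illustration only; the real retrievers also report each column's dtype, which a path alone cannot supply):

```python
from pathlib import PurePosixPath

def sketch_partition_columns(file_paths):
    """Infer Hive-style partition columns (col=value path segments)
    from a list of Parquet file paths. Illustration only."""
    columns = []
    for path in file_paths:
        for segment in PurePosixPath(path).parts[:-1]:  # skip the file name
            if "=" in segment:
                name = segment.split("=", 1)[0]
                if name not in columns:
                    columns.append(name)
    return {
        "status": "success",
        "isPartitioned": bool(columns),
        "partition_columns": [{"column_name": c} for c in columns],
    }

print(sketch_partition_columns([
    "ProductName=Widget/SalesAmount=10/part-0.parquet",
    "ProductName=Gadget/SalesAmount=20/part-1.parquet",
]))
# {'status': 'success', 'isPartitioned': True,
#  'partition_columns': [{'column_name': 'ProductName'}, {'column_name': 'SalesAmount'}]}
```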

8. AWS S3 Usage

Set your AWS credentials:

AWS_ACCESS_KEY_ID = "<YOUR_AWS_ACCESS_KEY_ID>"
AWS_SECRET_ACCESS_KEY = "<YOUR_AWS_SECRET_ACCESS_KEY>"
AWS_REGION = "us-east-1"
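Hardcoded secrets are easy to leak into version control; reading them from the environment keeps the same variable names the S3 examples below use. A small sketch using only the standard library (the placeholder fallbacks exist only to keep the snippet self-contained):

```python
import os

# Prefer environment variables over hardcoded secrets; the fallback
# placeholders only make the snippet self-contained.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID", "<YOUR_AWS_ACCESS_KEY_ID>")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY", "<YOUR_AWS_SECRET_ACCESS_KEY>")
AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")
```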

Detect Format (S3)

from datalakesurfer.s3_format_detector import S3FormatDetector
f = S3FormatDetector(aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    aws_region=AWS_REGION,
    file_system_name="devs3bucket001",
    directory_path="Country").detect_format()
print(f)
# {'status': 'success', 'format': 'delta'}

f = S3FormatDetector(aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    aws_region=AWS_REGION,
    file_system_name="devs3bucket001",
    directory_path="ParquetDataSource").detect_format()
print(f)
# {'status': 'success', 'format': 'parquet'}

f = S3FormatDetector(aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    aws_region=AWS_REGION,
    file_system_name="devs3bucket001",
    directory_path="customerdw/mydb/product").detect_format()
print(f)
# {'status': 'success', 'format': 'iceberg'}

Retrieve Schema (S3)

from datalakesurfer.schemas.s3_delta_schema import S3DeltaSchemaRetriever
s = S3DeltaSchemaRetriever(aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    aws_region=AWS_REGION,
    file_system_name="devs3bucket001",
    directory_path="Country").get_schema()
print(s)
# [{'column_name': 'SalesId', 'dtype': 'string'}, ...]

from datalakesurfer.schemas.s3_iceberg_schema import S3IcebergSchemaRetriever
s = S3IcebergSchemaRetriever(aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    aws_region=AWS_REGION,
    file_system_name="devs3bucket001",
    directory_path="customerdw/mydb/product").get_schema()
print(s)
# [{'column_name': 'productId', 'dtype': 'string'}, ...]

from datalakesurfer.schemas.s3_parquet_schema import S3ParquetSchemaRetriever
s = S3ParquetSchemaRetriever(aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    aws_region=AWS_REGION,
    file_system_name="devs3bucket001",
    directory_path="ParquetDataSource").get_schema()
print(s)
# [{'column_name': 'SalesId', 'dtype': 'string'}, ...]

Detect Partitions (S3 Parquet)

from datalakesurfer.schemas.s3_parquet_schema import S3ParquetSchemaRetriever
f = S3ParquetSchemaRetriever(aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    aws_region=AWS_REGION,
    file_system_name="devs3bucket001",
    directory_path="ParquetDataSource").detect_partitions()
print(f)
# {'status': 'success', 'isPartitioned': True, 'partition_columns': [{'column_name': 'ProductName', 'dtype': 'string'}, ...]}

# For a directory that is not partitioned, the same call returns:
# {'status': 'success', 'isPartitioned': False, 'partition_columns': []}

Query Delta Through SQL (GCS / S3 / ADLS Gen2 / Microsoft Fabric)

# GCS
from datalakesurfer.query.gcs_delta_query import GCSDeltaQueryRetriever
tables = {
    "sales": "storagebucket0001/SalesOrder",
    "country": "storagebucket0001/Country"
}
query = "SELECT * FROM sales CROSS JOIN country"
retriever = GCSDeltaQueryRetriever(service_account_info=service_account_dict)
result_df = retriever.query(tables=tables, query=query)
print(result_df)

# S3
from datalakesurfer.query.s3_delta_query import S3DeltaQueryRetriever
tables = {
    "sales": "devs3bucket001/SalesOrder",
    "country": "devs3bucket001/Country"
}
query = "SELECT * FROM sales CROSS JOIN country"
retriever = S3DeltaQueryRetriever(
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    aws_region=AWS_REGION
)
result_df = retriever.query(tables=tables, query=query)
print(result_df)

# Microsoft Fabric
from datalakesurfer.query.fabric_delta_query import FabricDeltaQueryRetriever
tables = {
    "sales": "devlakehouse.Lakehouse/Files/salesorder",
    "salesorder": "devlakehouse.Lakehouse/Files/salesorder"
}
query = "SELECT * FROM sales UNION ALL SELECT * FROM salesorder"
retriever = FabricDeltaQueryRetriever(
    account_name="onelake",
    file_system_name="devworkspace",
    token=token,
    expires_on=expires_on
)
result_df = retriever.query(tables=tables, query=query)
print(result_df)

# ADLS Gen2
from datalakesurfer.query.adls_delta_query import ADLSDeltaQueryRetriever
tables = {
    "sales": "salesorder"
}
query = "SELECT * FROM sales"
retriever = ADLSDeltaQueryRetriever(
    account_name="adlssynapseeus01",
    file_system_name="datacontainer",
    token=token,
    expires_on=expires_on
)
result_df = retriever.query(tables=tables, query=query)
print(result_df)
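Because the keys of the `tables` dict become table names inside the SQL text, it is worth validating them as plain identifiers before handing them to any of the query retrievers. A hypothetical guard, not part of the package:

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def validate_table_names(tables):
    """Reject table aliases that are not plain SQL identifiers, so a
    malformed key cannot break or inject into the generated query."""
    for name in tables:
        if not _IDENT.match(name):
            raise ValueError(f"invalid table name: {name!r}")
    return tables

validate_table_names({"sales": "devs3bucket001/SalesOrder"})   # passes
# validate_table_names({"sales; DROP": "..."})                 # raises ValueError
```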

📦 Supported Formats

Format    ADLS Gen2 (Azure)   Fabric OneLake (Azure)   GCS (GCP)   S3 (AWS)
Parquet   ✅                  ✅                       ✅          ✅
Delta     ✅                  ✅                       ✅          ✅
Iceberg   ✅                  ✅                       ✅          ✅

🛠 Development & Testing

Run tests:

pytest

📄 License

MIT License – See LICENSE for details.


