ADF-style data ingestion for Databricks — pick a source (Volume, ADLS, S3, Database, REST API), fill a few fields, run
DashIngest — Databricks Library
Part of the Dashlibs suite — Databricks libraries built for business users.
Data ingestion in the style of Azure Data Factory (ADF): pick a source kind,
fill in a few plain fields, and run. There are no hand-written abfss:// URIs
or JDBC connection strings.
Installation
%pip install dash-ingest
Quick Start
import dashingest
dashingest.launch() # Opens interactive UI in your Databricks notebook
Or drive it directly from code:
from dashingest import ADLSSource, IngestTarget, run_ingestion
source = ADLSSource(storage_account="myacct", container="raw", path="sales/2024.csv")
target = IngestTarget(table="main.bronze.sales", write_mode="merge", merge_keys=["order_id"])
result = run_ingestion(source, target)
result.display()
Sources
| Kind | What you provide |
|---|---|
| Databricks Volume | catalog, schema, volume, path |
| ADLS Gen2 | storage account, container, path |
| Amazon S3 | bucket, path |
| DBFS | path |
| Database (JDBC) | engine (postgres/mysql/sqlserver/oracle/snowflake), host, database, table or query |
| REST API | URL, optional JSON path to the records |
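For orientation, the plain fields in the table stand in for URIs you would otherwise write by hand. The sketch below is not DashIngest's internals: the helper names are hypothetical, and only the URI shapes themselves (ABFS, S3, Unity Catalog Volume paths, JDBC URLs and their default ports) are standard.

```python
from typing import Optional

# Hypothetical helpers, not part of dashingest: they only show the standard
# URI shapes that the plain source fields replace.


def adls_uri(storage_account: str, container: str, path: str) -> str:
    """ADLS Gen2 URI in the form the ABFS driver expects."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path.lstrip('/')}"


def s3_uri(bucket: str, path: str) -> str:
    """S3 object URI."""
    return f"s3://{bucket}/{path.lstrip('/')}"


def volume_path(catalog: str, schema: str, volume: str, path: str) -> str:
    """Unity Catalog Volume path."""
    return f"/Volumes/{catalog}/{schema}/{volume}/{path.lstrip('/')}"


# Well-known default ports for three of the listed engines.
_DEFAULT_PORTS = {"postgresql": 5432, "mysql": 3306, "sqlserver": 1433}


def jdbc_url(engine: str, host: str, database: str, port: Optional[int] = None) -> str:
    """JDBC URL for PostgreSQL, MySQL, or SQL Server."""
    port = port or _DEFAULT_PORTS[engine]
    if engine == "sqlserver":
        # SQL Server puts the database in a property, not in the URL path.
        return f"jdbc:sqlserver://{host}:{port};databaseName={database}"
    return f"jdbc:{engine}://{host}:{port}/{database}"
```

The differences between engines (SQL Server's `;databaseName=` property versus a path segment elsewhere) are the kind of detail the field-based sources are meant to hide.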
File format (csv/json/parquet/excel/avro/orc/text) is inferred from the path's extension if not set explicitly — most ingestions need zero format options.
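Extension-based inference can be pictured roughly as follows. This is a minimal sketch, not the library's code: `infer_format` and the extension map are hypothetical, and which extensions DashIngest actually recognises is an assumption.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed extension map; the format names come from the list above, the
# extensions themselves are a guess at what a reader would accept.
_EXTENSION_TO_FORMAT = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".xlsx": "excel",
    ".xls": "excel",
    ".avro": "avro",
    ".orc": "orc",
    ".txt": "text",
}


def infer_format(path: str, explicit: Optional[str] = None) -> str:
    """Return the explicit format if given, else look up the path's extension."""
    if explicit:
        return explicit
    suffix = PurePosixPath(path).suffix.lower()
    if suffix not in _EXTENSION_TO_FORMAT:
        raise ValueError(f"cannot infer format from {path!r}; set it explicitly")
    return _EXTENSION_TO_FORMAT[suffix]
```

An explicit format always wins, so paths without a usable extension (for example a directory of part files) still work when the format is set by hand.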
File format readers
Each format has its own options dataclass with real per-format defaults rather than a generic options dict. Excel gets the most coverage: vanilla Spark has no native Excel reader, and a file path alone does not say which sheet to read, which row holds the header, or whether the workbook is password-protected:
from dashingest import ExcelReaderOptions, VolumeSource
source = VolumeSource(
    catalog="main", schema_name="bronze", volume="landing",
    path="regional_sales.xlsx",
    reader_options=ExcelReaderOptions(
        sheet_name="Q1 Actuals",
        header_row=2,                # skips two title/banner rows above the header
        workbook_password="secret",  # optional
    ),
)
Set sheet_names=["Jan", "Feb", "Mar"] instead of sheet_name to read and
stack several same-shaped sheets into one DataFrame — the common "one tab
per month" spreadsheet layout.
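To make the two options concrete, here is a small stand-alone sketch of what a header offset and sheet stacking mean, using plain lists instead of a real workbook. The function names are hypothetical, and treating `header_row` as zero-based is an assumption inferred from the "skips two rows" comment above.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]
Sheet = Sequence[Sequence[object]]


def rows_from_sheet(raw_rows: Sheet, header_row: int = 0) -> List[Row]:
    """Use raw_rows[header_row] as the header; rows above it are ignored."""
    header = [str(cell) for cell in raw_rows[header_row]]
    return [dict(zip(header, row)) for row in raw_rows[header_row + 1:]]


def stack_sheets(workbook: Dict[str, Sheet], sheet_names: Sequence[str],
                 header_row: int = 0) -> List[Row]:
    """Concatenate same-shaped sheets in the order given."""
    stacked: List[Row] = []
    expected_columns = None
    for name in sheet_names:
        columns = [str(cell) for cell in workbook[name][header_row]]
        if expected_columns is None:
            expected_columns = columns
        elif columns != expected_columns:
            raise ValueError(f"sheet {name!r} has different columns: {columns}")
        stacked.extend(rows_from_sheet(workbook[name], header_row))
    return stacked


# Two monthly tabs, each with two banner rows above the real header.
workbook = {
    "Jan": [["Regional sales"], ["Confidential"], ["region", "amount"],
            ["EMEA", 10], ["APAC", 7]],
    "Feb": [["Regional sales"], ["Confidential"], ["region", "amount"],
            ["EMEA", 12]],
}
stacked = stack_sheets(workbook, ["Jan", "Feb"], header_row=2)
```

The column check reflects the "same-shaped" requirement: stacking only makes sense when every tab shares one header.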
CsvReaderOptions (delimiter, quote/escape chars, encoding, null markers,
date/timestamp formats, parse mode), JsonReaderOptions,
ParquetReaderOptions/OrcReaderOptions (schema merging), and
TextReaderOptions are also available — pass any of them via
reader_options= on a source.
Write modes
append · overwrite · merge (upsert into Delta by merge_keys, with
schema evolution where the runtime supports it).
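A merge keyed on `merge_keys` corresponds to a Delta `MERGE INTO` statement. The sketch below builds such a statement as a string so the upsert logic is visible; `build_merge_sql` and the `incoming` view name are hypothetical, and whether DashIngest issues SQL or uses the Delta Python API is not stated here.

```python
from typing import Sequence


def build_merge_sql(target_table: str, source_view: str,
                    merge_keys: Sequence[str]) -> str:
    """Build a Delta upsert: update rows whose keys match, insert the rest."""
    if not merge_keys:
        raise ValueError("merge requires at least one merge key")
    on_clause = " AND ".join(f"t.`{key}` = s.`{key}`" for key in merge_keys)
    return (
        f"MERGE INTO {target_table} AS t\n"
        f"USING {source_view} AS s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


sql = build_merge_sql("main.bronze.sales", "incoming", ["order_id"])
```

Rows in the source that match an existing `order_id` overwrite the target row; all others are inserted. The keys should identify at most one source row per target row, since Delta rejects a merge in which several source rows match the same target row.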
Test Connection & Preview
Both the UI and the API let you check a source before committing to a full run — the same pattern ADF's linked-service "Test Connection" and dataset "Data preview" use:
from dashingest import test_connection, preview
test_connection(source).display() # reachability/credentials check, no data read
preview(source, limit=10) # pandas DataFrame of the first N rows
test_connection runs a lightweight check per source kind: SELECT 1 for
databases, an HTTP request for REST APIs, a filesystem existence check for
Volumes/ADLS/S3/DBFS (no dbutils needed — it uses Spark's Hadoop
filesystem API directly, so it works the same way across all of them).
Advanced database & REST options
DatabaseSource supports SSL, JDBC fetch size, parallel reads (split a
large table by partition_column across num_partitions), and a raw
connection_properties escape hatch:
from dashingest import DatabaseSource
source = DatabaseSource(
    engine="postgresql", host="db.internal", database="analytics",
    table="events", user="svc", password="...",
    ssl=True, num_partitions=8, partition_column="id",
    lower_bound=0, upper_bound=10_000_000,
)
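The four partitioning fields mirror the options of Spark's JDBC reader, which splits the range into strides and issues one query per partition. The sketch below approximates that split; `partition_predicates` is a hypothetical name, and Spark's own stride arithmetic differs in detail.

```python
from typing import List


def partition_predicates(column: str, lower_bound: int, upper_bound: int,
                         num_partitions: int) -> List[str]:
    """Approximate the WHERE clause each parallel JDBC read would use."""
    if num_partitions < 2:
        raise ValueError("need at least two partitions to split a read")
    stride = (upper_bound - lower_bound) // num_partitions
    if stride < 1:
        raise ValueError("bounds are too close together for this many partitions")
    predicates = []
    for i in range(num_partitions):
        start = lower_bound + i * stride
        end = start + stride
        if i == 0:
            # First partition is open below and also picks up NULLs.
            predicates.append(f"{column} < {end} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open above.
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {end}")
    return predicates


predicates = partition_predicates("id", 0, 10_000_000, 8)
```

In Spark, the bounds only set the stride and do not filter rows: values outside the range still arrive, in the first or last partition. Bounds that are far off the real minimum and maximum therefore produce skewed partitions rather than missing data.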
RestApiSource supports auth (bearer / api_key / basic) and
pagination (page_param or cursor-based, up to max_pages):
from dashingest import RestApiSource
source = RestApiSource(
    url="https://api.example.com/records",
    auth_type="bearer", bearer_token="...",
    pagination="cursor", cursor_json_path="meta.next_cursor", max_pages=50,
)
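Cursor-based pagination with a page cap can be sketched as a loop that follows the cursor until it is empty or `max_pages` is reached. This is an illustration under assumptions, not the library's implementation: `resolve_json_path`, `fetch_all`, and `records_json_path` are hypothetical names, the dotted-path syntax is inferred from `"meta.next_cursor"`, and a dictionary of fake pages stands in for real HTTP calls.

```python
from typing import Any, Callable, Dict, List, Optional


def resolve_json_path(payload: Any, dotted_path: str) -> Any:
    """Follow a dotted path such as "meta.next_cursor"; None if a key is missing."""
    current = payload
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def fetch_all(fetch_page: Callable[[Optional[str]], Dict[str, Any]],
              records_json_path: str, cursor_json_path: str,
              max_pages: int) -> List[Any]:
    """Collect records page by page until the cursor runs out or the cap is hit."""
    records: List[Any] = []
    cursor: Optional[str] = None  # None requests the first page
    for _ in range(max_pages):
        page = fetch_page(cursor)
        records.extend(resolve_json_path(page, records_json_path) or [])
        cursor = resolve_json_path(page, cursor_json_path)
        if not cursor:
            break
    return records


# Fake three-page API keyed by cursor value.
_PAGES = {
    None: {"data": [1, 2], "meta": {"next_cursor": "a"}},
    "a": {"data": [3], "meta": {"next_cursor": "b"}},
    "b": {"data": [4, 5], "meta": {"next_cursor": None}},
}
all_records = fetch_all(_PAGES.__getitem__, "data", "meta.next_cursor", max_pages=50)
capped = fetch_all(_PAGES.__getitem__, "data", "meta.next_cursor", max_pages=2)
```

The cap matters for APIs that never return an empty cursor: without it, a misbehaving endpoint would keep the ingestion looping indefinitely.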
Part of Dashlibs
| Library | Purpose |
|---|---|
| dash-dq | Data Quality |
| dash-synthetic | Synthetic Data Generation |
| dash-ml | ML Lifecycle Management |
| dash-ingest | Data Ingestion |
| dash-gov | Data Governance |
| dash-ontology | Ontology & Lineage for AI |
License
Apache 2.0