Tapes all your disparate data sources together into a single, queryable DuckDB catalog.
Project description
🦆 duck-tape
Tapes all your disparate data sources together into a single, queryable DuckDB catalog.
duck-tape is a powerful Python command-line utility that discovers data from a wide array of local files, remote URIs, and databases, and creates a unified, high-performance DuckDB database file from them. It acts as a "data catalog builder," making it easy to query all your data in one place using SQL.
It's designed to be the perfect companion for tools like the Hasura DuckDB Native Data Connector (NDC), providing a pre-built, queryable database file that's always ready for introspection.
What it Does
duck-tape uses a sophisticated, performance-first architecture. It automatically makes a smart choice for each data source based on a configurable "sweet spot" to balance performance, latency, and storage costs.
- Direct View (for Small or Very Large Files): For simple formats like CSV and JSON that are outside a configurable size range, it creates a direct VIEW.
- Small files: Prioritizes low data latency and avoids unnecessary processing.
- Very large files: Avoids the high storage cost of duplicating massive datasets.
- Convert-to-Parquet (for the "Sweet Spot" and Complex Files): For files that are in the ideal size range for performance gains, or for complex formats (Excel, PDF, etc.), it uses DuckDB or Pandas to parse them and transform them into the highly efficient Parquet format. It then creates a VIEW pointing to this intermediate Parquet file. This ensures maximum query performance where it matters most.
This hybrid approach ensures that all queries are fast and efficient while remaining flexible and cost-effective. The final .duckdb catalog file contains only VIEWs, keeping it incredibly small and portable.
Features
- Advanced Configuration: Uses a layered configuration system (CLI > Environment Variables > YAML File > Defaults).
- S3 Integration: Natively reads sources from and writes intermediate files to S3 buckets.
- Smart Strategy Selection: Automatically chooses the best processing strategy (direct view vs. convert-to-parquet) based on file type and a configurable size range.
- Smart Table Discovery:
- For Excel: Intelligently finds and extracts multiple, distinct data tables from a single sheet, even if they are offset.
- For XML: Traverses the document tree to find and extract nested arrays of data, handling complex, hierarchical files automatically.
- Multi-Format Support: Ingests data from a huge variety of sources:
- CSV, JSON, Parquet, Feather/Arrow
- Excel (including all sheets)
- PDF (Optional): Extracts tables if dependencies are installed.
- HTML (scrapes tables from local files or live URLs)
- Markdown (extracts tables)
- XML, YAML (including multi-document files)
- Avro, ORC
- Direct Database Connection: Connects directly to SQLite databases and creates views for all tables.
- Live Reloading (Watch Mode): An optional --watch mode automatically monitors local files and polls remote sources (HTTP/S, S3), updating the DuckDB catalog in real-time.
Setup & Installation
1. Python Libraries
You can install all necessary Python packages from the requirements.txt file:
pip install -r requirements.txt
2. Optional: Enabling PDF Support
PDF table extraction requires extra dependencies. If you need to process PDFs, follow these steps:
-
Install the Python library: Uncomment camelot-py[cv] in your requirements.txt file, or run:
pip install "camelot-py[cv]" -
Install System Dependencies: camelot requires Ghostscript.
-
On Debian/Ubuntu:
sudo apt-get update && sudo apt-get install -y ghostscript python3-tk -
On macOS (with Homebrew):
brew install ghostscript
-
If you run duck-tape and these dependencies are not met, it will print a warning and safely skip any PDF files it finds.
Configuration
duck-tape uses a layered configuration system. Settings are loaded in the following order of priority:
- Command-Line Arguments (e.g., --config)
- Environment Variables (e.g., DUCKTAPE_OUTPUT_DB_FILE)
- YAML Configuration File (e.g., config.yaml)
- Script Defaults
Example config.yaml
Create a config.yaml file to define your settings.
# List of sources to process. Can be local globs, URLs, or DB connection strings.
sources:
- 'data/**/*'
- 's3://my-data-bucket/raw_files/sales.csv'
- 'https://www.w3schools.com/html/html\_tables.asp'
# Path for the final DuckDB catalog file.
output_db_file: 'catalog.db'
# Directory to store intermediate Parquet files. Can be a local path or an S3 URI.
intermediate_dir: 's3://my-data-bucket/ducktape_cache'
# Debounce delay in seconds for the local file watcher.
debounce_seconds: 2.0
# How often (in seconds) to poll remote sources (HTTP/S, S3) in watch mode.
# Set to 0 to disable remote polling.
polling_interval_seconds: 60
# "Sweet Spot" size range (in MB) for Parquet conversion.
convert_to_parquet_min_mb: 50
convert_to_parquet_max_mb: 1024
# S3-specific settings. Credentials should be set via environment variables.
s3:
region: 'us-east-1'
endpoint: null # Optional: for S3-compatible storage like MinIO
url_style: 'vhost' # or 'path'
# Settings for Smart Table Discovery in Excel files.
excel_table_discovery:
min_rows: 3
min_cols: 2
# Settings for Smart Table Discovery in XML files.
xml_table_discovery:
min_records: 3 # Min repeating elements to be considered a table
max_depth: 5 # How deep to search in the XML tree
Environment Variables
| Environment Variable | Overrides Setting | Format |
|---|---|---|
| DUCKTAPE_SOURCES | sources | Comma-separated string |
| DUCKTAPE_OUTPUT_DB_FILE | output_db_file | String (file path) |
| DUCKTAPE_INTERMEDIATE_DIR | intermediate_dir | String (directory path or S3 URI) |
| DUCKTAPE_POLLING_INTERVAL_SECONDS | polling_interval_seconds | Integer |
| DUCKTAPE_S3_REGION | s3.region | String |
| AWS_ACCESS_KEY_ID | S3 Authentication | AWS Access Key |
| AWS_SECRET_ACCESS_KEY | S3 Authentication | AWS Secret Key |
| AWS_SESSION_TOKEN | S3 Authentication (Optional) | AWS Session Token |
Usage
duck-tape is a command-line tool with two primary modes.
1. One-Time Build (Default)
This performs a full, clean build of the DuckDB catalog.
# Use the default config.yaml
python duck_tape.py --build
# Specify a different config file
python duck_tape.py --config prod_config.yaml --build
2. Watch Mode
For local development or in production, you can run the watcher. It will perform an initial build and then continue running, monitoring local source directories and polling remote sources, updating the DuckDB catalog in real-time.
python duck_tape.py --watch
# Use a different config file in watch mode
python duck_tape.py --config dev_config.yaml --watch
Limitations and Considerations
While duck-tape is powerful, it's important to understand its behavior in large-scale scenarios.
- Large Number of Files: The initial build process scans all files to generate the catalog. If your sources contain tens of thousands of files, this initial scan can be time-consuming. The watcher, however, will only process changes incrementally after the initial build.
- Very Large Individual Files: When a file is converted to Parquet, it must be fully read into memory by Pandas or DuckDB. For extremely large files (e.g., 50GB+), this can be memory-intensive. This is why the CONVERT_TO_PARQUET_MAX_MB setting is useful for avoiding this on massive files.
- Remote Polling: The watcher polls remote sources on a timer. It is not event-driven, so changes will only be detected after the polling_interval_seconds has elapsed. It also cannot detect when a remote file is deleted; a manual --build is required to clean up views for deleted remote sources.
- Complex Formats (PDF/HTML/XML): Table extraction from formats designed for presentation or complex hierarchies is not always perfect. duck-tape uses powerful libraries and smart heuristics to do its best, but results can vary depending on the structure of the source file. Always validate the resulting schema for these types of sources.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file duck_tape_cat-1.0.0.tar.gz.
File metadata
- Download URL: duck_tape_cat-1.0.0.tar.gz
- Upload date:
- Size: 17.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.18
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9490d1f502a0dc8d1f3b67fe401660bcf872cf42f36034863b58d5f938a24647
|
|
| MD5 |
b467d65340d24cbae2a8a868d3f0efec
|
|
| BLAKE2b-256 |
ea735ef9b3ed5b0e8d5746729d44b3c0023c0344788cadc5cf2a053cab3dac1a
|
File details
Details for the file duck_tape_cat-1.0.0-py3-none-any.whl.
File metadata
- Download URL: duck_tape_cat-1.0.0-py3-none-any.whl
- Upload date:
- Size: 17.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.18
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0eb1f1db6a9da81e3828a1b30ba685067798c6461092ea34e36514b7765d6d5c
|
|
| MD5 |
426f916d32ba7ccd6d4683cfef131195
|
|
| BLAKE2b-256 |
f4456292d0b0e720bde1b1f343d037fe1e8ad0d52045d26c4c2548a9e84ce356
|