
🦆 duck-tape

Tapes all your disparate data sources together into a single, queryable DuckDB catalog.

duck-tape is a powerful Python command-line utility that discovers data from a wide array of local files, remote URIs, and databases, and creates a unified, high-performance DuckDB database file from them. It acts as a "data catalog builder," making it easy to query all your data in one place using SQL.

It's designed to be the perfect companion for tools like the Hasura DuckDB Native Data Connector (NDC), providing a pre-built, queryable database file that's always ready for introspection.

What it Does

duck-tape uses a performance-first, hybrid architecture. For each data source it automatically picks a processing strategy based on a configurable "sweet spot" size range, balancing query performance, data latency, and storage cost.

  1. Direct View (for Small or Very Large Files): For simple formats like CSV and JSON that are outside a configurable size range, it creates a direct VIEW.
    • Small files: Prioritizes low data latency and avoids unnecessary processing.
    • Very large files: Avoids the high storage cost of duplicating massive datasets.
  2. Convert-to-Parquet (for the "Sweet Spot" and Complex Files): For files that are in the ideal size range for performance gains, or for complex formats (Excel, PDF, etc.), it uses DuckDB or Pandas to parse them and transform them into the highly efficient Parquet format. It then creates a VIEW pointing to this intermediate Parquet file. This ensures maximum query performance where it matters most.

This hybrid approach ensures that all queries are fast and efficient while remaining flexible and cost-effective. The final .duckdb catalog file contains only VIEWs, keeping it incredibly small and portable.
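The decision logic described above can be sketched in a few lines. This is an illustration, not duck-tape's actual code: the function name and format sets are ours, and the thresholds mirror the convert_to_parquet_min_mb / convert_to_parquet_max_mb settings documented in the Configuration section.

```python
# Illustrative sketch of duck-tape's strategy selection (names are hypothetical).
COMPLEX_FORMATS = {".xlsx", ".pdf", ".xml", ".html", ".md"}
SIMPLE_FORMATS = {".csv", ".json"}

CONVERT_MIN_MB = 50    # below this: direct VIEW (low latency, no extra processing)
CONVERT_MAX_MB = 1024  # above this: direct VIEW (avoid duplicating massive data)

def choose_strategy(extension: str, size_mb: float) -> str:
    """Return 'direct_view' or 'convert_to_parquet' for one source file."""
    if extension in COMPLEX_FORMATS:
        # Complex formats always need parsing, so they are always converted.
        return "convert_to_parquet"
    if extension in SIMPLE_FORMATS and CONVERT_MIN_MB <= size_mb <= CONVERT_MAX_MB:
        # The "sweet spot": large enough for Parquet to pay off, small enough to copy.
        return "convert_to_parquet"
    return "direct_view"

print(choose_strategy(".csv", 5))     # direct_view
print(choose_strategy(".csv", 200))   # convert_to_parquet
print(choose_strategy(".xlsx", 1))    # convert_to_parquet
```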

Features

  • Advanced Configuration: Uses a layered configuration system (CLI > Environment Variables > YAML File > Defaults).
  • S3 Integration: Natively reads sources from and writes intermediate files to S3 buckets.
  • Smart Strategy Selection: Automatically chooses the best processing strategy (direct view vs. convert-to-parquet) based on file type and a configurable size range.
  • Smart Table Discovery:
    • For Excel: Intelligently finds and extracts multiple, distinct data tables from a single sheet, even if they are offset.
    • For XML: Traverses the document tree to find and extract nested arrays of data, handling complex, hierarchical files automatically.
  • Multi-Format Support: Ingests data from a huge variety of sources:
    • CSV, JSON, Parquet, Feather/Arrow
    • Excel (including all sheets)
    • PDF (optional: extracts tables when the extra dependencies below are installed)
    • HTML (scrapes tables from local files or live URLs)
    • Markdown (extracts tables)
    • XML, YAML (including multi-document files)
    • Avro, ORC
  • Direct Database Connection: Connects directly to SQLite databases and creates views for all tables.
  • Live Reloading (Watch Mode): An optional --watch mode automatically monitors local files and polls remote sources (HTTP/S, S3), updating the DuckDB catalog in real-time.
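Since the final catalog contains only VIEWs, each source ultimately boils down to a single CREATE VIEW statement over a DuckDB reader function. A minimal sketch of the two kinds of DDL — the helper names are hypothetical, though read_csv_auto and read_parquet are real DuckDB table functions, and duck-tape derives view names from the source files:

```python
# Illustrative DDL builders for the two strategies.
def direct_view_sql(view_name: str, csv_path: str) -> str:
    """Direct view: query the source file in place."""
    return (f"CREATE OR REPLACE VIEW {view_name} AS "
            f"SELECT * FROM read_csv_auto('{csv_path}')")

def parquet_view_sql(view_name: str, parquet_path: str) -> str:
    """Convert-to-Parquet: the view points at the intermediate Parquet file."""
    return (f"CREATE OR REPLACE VIEW {view_name} AS "
            f"SELECT * FROM read_parquet('{parquet_path}')")

print(direct_view_sql("sales", "data/sales.csv"))
```

Either way, the .duckdb file stores only these small view definitions, which is why it stays tiny and portable.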

Setup & Installation

1. Python Libraries

You can install all necessary Python packages from the requirements.txt file:

pip install -r requirements.txt

2. Optional: Enabling PDF Support

PDF table extraction requires extra dependencies. If you need to process PDFs, follow these steps:

  • Install the Python library: Uncomment camelot-py[cv] in your requirements.txt file, or run:
    pip install "camelot-py[cv]"

  • Install System Dependencies: camelot requires Ghostscript.

    • On Debian/Ubuntu:
      sudo apt-get update && sudo apt-get install -y ghostscript python3-tk

    • On macOS (with Homebrew):
      brew install ghostscript

If you run duck-tape and these dependencies are not met, it will print a warning and safely skip any PDF files it finds.
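This graceful degradation is the standard optional-import guard. A minimal sketch of how such a check could look — the names here are hypothetical, and duck-tape's internals may differ:

```python
# Optional-dependency guard: try the import once, remember the result.
try:
    import camelot  # optional: pip install "camelot-py[cv]"
    PDF_SUPPORT = True
except ImportError:
    PDF_SUPPORT = False

def should_skip(path: str) -> bool:
    """Skip PDF sources when the optional dependencies are unavailable."""
    if path.lower().endswith(".pdf") and not PDF_SUPPORT:
        print(f"Warning: skipping {path} (PDF support not installed)")
        return True
    return False
```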

Configuration

duck-tape uses a layered configuration system. Settings are loaded in the following order of priority:

  1. Command-Line Arguments (e.g., --config)
  2. Environment Variables (e.g., DUCKTAPE_OUTPUT_DB_FILE)
  3. YAML Configuration File (e.g., config.yaml)
  4. Script Defaults
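The layered lookup amounts to checking each layer in priority order and returning the first hit. A sketch, not duck-tape's actual code — the keys and defaults are taken from this README, the function name is ours:

```python
import os

# Priority: CLI args > environment variables > YAML file > defaults.
DEFAULTS = {"output_db_file": "catalog.db", "polling_interval_seconds": 60}

def resolve(key: str, cli_args: dict, yaml_cfg: dict):
    """Return the effective value for one setting."""
    if cli_args.get(key) is not None:
        return cli_args[key]
    env_value = os.environ.get("DUCKTAPE_" + key.upper())
    if env_value is not None:
        return env_value
    if key in yaml_cfg:
        return yaml_cfg[key]
    return DEFAULTS.get(key)

# Example: the env var beats the YAML file, but a CLI flag would beat both.
os.environ["DUCKTAPE_OUTPUT_DB_FILE"] = "env.db"
print(resolve("output_db_file", {}, {"output_db_file": "yaml.db"}))  # env.db
```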

Example config.yaml

Create a config.yaml file to define your settings.

# List of sources to process. Can be local globs, URLs, or DB connection strings.
sources:
- 'data/**/*'
- 's3://my-data-bucket/raw_files/sales.csv'
- 'https://www.w3schools.com/html/html_tables.asp'

# Path for the final DuckDB catalog file.
output_db_file: 'catalog.db'

# Directory to store intermediate Parquet files. Can be a local path or an S3 URI.
intermediate_dir: 's3://my-data-bucket/ducktape_cache'

# Debounce delay in seconds for the local file watcher.
debounce_seconds: 2.0

# How often (in seconds) to poll remote sources (HTTP/S, S3) in watch mode.
# Set to 0 to disable remote polling.
polling_interval_seconds: 60

# "Sweet Spot" size range (in MB) for Parquet conversion.
convert_to_parquet_min_mb: 50
convert_to_parquet_max_mb: 1024

# S3-specific settings. Credentials should be set via environment variables.
s3:
  region: 'us-east-1'
  endpoint: null # Optional: for S3-compatible storage like MinIO
  url_style: 'vhost' # or 'path'

# Settings for Smart Table Discovery in Excel files.
excel_table_discovery:
  min_rows: 3
  min_cols: 2

# Settings for Smart Table Discovery in XML files.
xml_table_discovery:
  min_records: 3 # Min repeating elements to be considered a table
  max_depth: 5   # How deep to search in the XML tree

Environment Variables

Environment Variable                 Overrides Setting          Format
DUCKTAPE_SOURCES                     sources                    Comma-separated string
DUCKTAPE_OUTPUT_DB_FILE              output_db_file             String (file path)
DUCKTAPE_INTERMEDIATE_DIR            intermediate_dir           String (directory path or S3 URI)
DUCKTAPE_POLLING_INTERVAL_SECONDS    polling_interval_seconds   Integer
DUCKTAPE_S3_REGION                   s3.region                  String
AWS_ACCESS_KEY_ID                    S3 authentication          AWS access key
AWS_SECRET_ACCESS_KEY                S3 authentication          AWS secret key
AWS_SESSION_TOKEN                    S3 authentication          AWS session token (optional)

Usage

duck-tape is a command-line tool with two primary modes.

1. One-Time Build (Default)

This performs a full, clean build of the DuckDB catalog.

# Use the default config.yaml
python duck_tape.py --build

# Specify a different config file
python duck_tape.py --config prod_config.yaml --build

2. Watch Mode

For local development, or for a long-running production deployment, run the watcher. It performs an initial build and then keeps running: it monitors local source directories, polls remote sources (HTTP/S, S3), and updates the DuckDB catalog as changes are detected.

python duck_tape.py --watch

# Use a different config file in watch mode
python duck_tape.py --config dev_config.yaml --watch

Limitations and Considerations

While duck-tape is powerful, it's important to understand its behavior in large-scale scenarios.

  • Large Number of Files: The initial build process scans all files to generate the catalog. If your sources contain tens of thousands of files, this initial scan can be time-consuming. The watcher, however, will only process changes incrementally after the initial build.
  • Very Large Individual Files: When a file is converted to Parquet, it must be fully read into memory by Pandas or DuckDB. For extremely large files (e.g., 50 GB+), this can be memory-intensive; this is why the convert_to_parquet_max_mb setting exists, so massive files fall back to a direct view instead.
  • Remote Polling: The watcher polls remote sources on a timer. It is not event-driven, so changes will only be detected after the polling_interval_seconds has elapsed. It also cannot detect when a remote file is deleted; a manual --build is required to clean up views for deleted remote sources.
  • Complex Formats (PDF/HTML/XML): Table extraction from formats designed for presentation or complex hierarchies is not always perfect. duck-tape uses powerful libraries and smart heuristics to do its best, but results can vary depending on the structure of the source file. Always validate the resulting schema for these types of sources.
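The remote-polling limitation is easier to see in code. A stdlib-only, hypothetical sketch of one poll cycle — change detection by fingerprint comparison, here using local mtimes, where duck-tape's real watcher would compare HTTP/S or S3 metadata:

```python
import os

def poll_once(paths, seen):
    """Return the paths that changed since the last call, updating `seen`."""
    changed = []
    for path in paths:
        if not os.path.exists(path):
            continue  # deletions go unnoticed, hence the manual --build cleanup
        fingerprint = os.path.getmtime(path)
        if seen.get(path) != fingerprint:
            seen[path] = fingerprint
            changed.append(path)
    return changed
```

A deleted source simply stops appearing in the changed list, so its stale view survives until the next full --build, as noted above.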
