
🦆 duck-tape

Tapes all your disparate data sources together into a single, queryable DuckDB catalog.

duck-tape is a powerful Python command-line utility that discovers data from a wide array of local files, remote URIs, and databases, and creates a unified, high-performance DuckDB database file from them. It acts as a "data catalog builder," making it easy to query all your data in one place using SQL.

It's designed to be the perfect companion for tools like the Hasura DuckDB Native Data Connector (NDC), providing a pre-built, queryable database file that's always ready for introspection.

What it Does

duck-tape uses a performance-first, hybrid architecture. For each data source it automatically picks a processing strategy based on a configurable "sweet spot" size range, balancing query performance, data latency, and storage cost.

  1. Direct View (for Small or Very Large Files): For simple formats like CSV and JSON that are outside a configurable size range, it creates a direct VIEW.
    • Small files: Prioritizes low data latency and avoids unnecessary processing.
    • Very large files: Avoids the high storage cost of duplicating massive datasets.
  2. Convert-to-Parquet (for the "Sweet Spot" and Complex Files): For files that are in the ideal size range for performance gains, or for complex formats (Excel, PDF, etc.), it uses DuckDB or Pandas to parse them and transform them into the highly efficient Parquet format. It then creates a VIEW pointing to this intermediate Parquet file. This ensures maximum query performance where it matters most.

This hybrid approach ensures that all queries are fast and efficient while remaining flexible and cost-effective. The final .duckdb catalog file contains only VIEWs, keeping it incredibly small and portable.
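The decision logic described above can be sketched in a few lines. This is an illustration, not duck-tape's actual code: the function name and format sets are ours, and the thresholds mirror the convert_to_parquet_min_mb / convert_to_parquet_max_mb settings documented in the Configuration section.

```python
# Illustrative sketch of duck-tape's strategy selection (names are hypothetical).
COMPLEX_FORMATS = {".xlsx", ".pdf", ".xml", ".html", ".md"}
SIMPLE_FORMATS = {".csv", ".json"}

CONVERT_MIN_MB = 50    # below this: direct VIEW (low latency, no extra processing)
CONVERT_MAX_MB = 1024  # above this: direct VIEW (avoid duplicating massive data)

def choose_strategy(extension: str, size_mb: float) -> str:
    """Return 'direct_view' or 'convert_to_parquet' for one source file."""
    if extension in COMPLEX_FORMATS:
        # Complex formats always need parsing, so they are always converted.
        return "convert_to_parquet"
    if extension in SIMPLE_FORMATS and CONVERT_MIN_MB <= size_mb <= CONVERT_MAX_MB:
        # The "sweet spot": large enough for Parquet to pay off, small enough to copy.
        return "convert_to_parquet"
    return "direct_view"

print(choose_strategy(".csv", 5))     # direct_view
print(choose_strategy(".csv", 200))   # convert_to_parquet
print(choose_strategy(".xlsx", 1))    # convert_to_parquet
```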

Features

  • Advanced Configuration: Uses a layered configuration system (CLI > Environment Variables > YAML File > Defaults).
  • S3 Integration: Natively reads sources from and writes intermediate files to S3 buckets.
  • Smart Strategy Selection: Automatically chooses the best processing strategy (direct view vs. convert-to-parquet) based on file type and a configurable size range.
  • Smart Table Discovery:
    • For Excel: Intelligently finds and extracts multiple, distinct data tables from a single sheet, even if they are offset.
    • For XML: Traverses the document tree to find and extract nested arrays of data, handling complex, hierarchical files automatically.
  • Multi-Format Support: Ingests data from a huge variety of sources:
    • CSV, JSON, Parquet, Feather/Arrow
    • Excel (including all sheets)
    • PDF (optional: extracts tables when the extra dependencies below are installed)
    • HTML (scrapes tables from local files or live URLs)
    • Markdown (extracts tables)
    • XML, YAML (including multi-document files)
    • Avro, ORC
  • Direct Database Connection: Connects directly to SQLite databases and creates views for all tables.
  • Live Reloading (Watch Mode): An optional --watch mode automatically monitors local files and polls remote sources (HTTP/S, S3), updating the DuckDB catalog in real-time.
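Since the final catalog contains only VIEWs, each source ultimately boils down to a single CREATE VIEW statement over a DuckDB reader function. A minimal sketch of the two kinds of DDL — the helper names are hypothetical, though read_csv_auto and read_parquet are real DuckDB table functions, and duck-tape derives view names from the source files:

```python
# Illustrative DDL builders for the two strategies.
def direct_view_sql(view_name: str, csv_path: str) -> str:
    """Direct view: query the source file in place."""
    return (f"CREATE OR REPLACE VIEW {view_name} AS "
            f"SELECT * FROM read_csv_auto('{csv_path}')")

def parquet_view_sql(view_name: str, parquet_path: str) -> str:
    """Convert-to-Parquet: the view points at the intermediate Parquet file."""
    return (f"CREATE OR REPLACE VIEW {view_name} AS "
            f"SELECT * FROM read_parquet('{parquet_path}')")

print(direct_view_sql("sales", "data/sales.csv"))
```

Either way, the .duckdb file stores only these small view definitions, which is why it stays tiny and portable.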

Setup & Installation

1. Python Libraries

You can install all necessary Python packages from the requirements.txt file:

pip install -r requirements.txt

2. Optional: Enabling PDF Support

PDF table extraction requires extra dependencies. If you need to process PDFs, follow these steps:

  • Install the Python library: Uncomment camelot-py[cv] in your requirements.txt file, or run:
    pip install "camelot-py[cv]"

  • Install System Dependencies: camelot requires Ghostscript.

    • On Debian/Ubuntu:
      sudo apt-get update && sudo apt-get install -y ghostscript python3-tk

    • On macOS (with Homebrew):
      brew install ghostscript

If you run duck-tape and these dependencies are not met, it will print a warning and safely skip any PDF files it finds.
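This graceful degradation is the standard optional-import guard. A minimal sketch of how such a check could look — the names here are hypothetical, and duck-tape's internals may differ:

```python
# Optional-dependency guard: try the import once, remember the result.
try:
    import camelot  # optional: pip install "camelot-py[cv]"
    PDF_SUPPORT = True
except ImportError:
    PDF_SUPPORT = False

def should_skip(path: str) -> bool:
    """Skip PDF sources when the optional dependencies are unavailable."""
    if path.lower().endswith(".pdf") and not PDF_SUPPORT:
        print(f"Warning: skipping {path} (PDF support not installed)")
        return True
    return False
```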

Configuration

duck-tape uses a layered configuration system. Settings are loaded in the following order of priority:

  1. Command-Line Arguments (e.g., --config)
  2. Environment Variables (e.g., DUCKTAPE_OUTPUT_DB_FILE)
  3. YAML Configuration File (e.g., config.yaml)
  4. Script Defaults
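The layered lookup amounts to checking each layer in priority order and returning the first hit. A sketch, not duck-tape's actual code — the keys and defaults are taken from this README, the function name is ours:

```python
import os

# Priority: CLI args > environment variables > YAML file > defaults.
DEFAULTS = {"output_db_file": "catalog.db", "polling_interval_seconds": 60}

def resolve(key: str, cli_args: dict, yaml_cfg: dict):
    """Return the effective value for one setting."""
    if cli_args.get(key) is not None:
        return cli_args[key]
    env_value = os.environ.get("DUCKTAPE_" + key.upper())
    if env_value is not None:
        return env_value
    if key in yaml_cfg:
        return yaml_cfg[key]
    return DEFAULTS.get(key)

# Example: the env var beats the YAML file, but a CLI flag would beat both.
os.environ["DUCKTAPE_OUTPUT_DB_FILE"] = "env.db"
print(resolve("output_db_file", {}, {"output_db_file": "yaml.db"}))  # env.db
```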

Example config.yaml

Create a config.yaml file to define your settings.

# List of sources to process. Can be local globs, URLs, or DB connection strings.
sources:
- 'data/**/*'
- 's3://my-data-bucket/raw_files/sales.csv'
- 'https://www.w3schools.com/html/html_tables.asp'

# Path for the final DuckDB catalog file.
output_db_file: 'catalog.db'

# Directory to store intermediate Parquet files. Can be a local path or an S3 URI.
intermediate_dir: 's3://my-data-bucket/ducktape_cache'

# Debounce delay in seconds for the local file watcher.
debounce_seconds: 2.0

# How often (in seconds) to poll remote sources (HTTP/S, S3) in watch mode.
# Set to 0 to disable remote polling.
polling_interval_seconds: 60

# "Sweet Spot" size range (in MB) for Parquet conversion.
convert_to_parquet_min_mb: 50
convert_to_parquet_max_mb: 1024

# S3-specific settings. Credentials should be set via environment variables.
s3:
  region: 'us-east-1'
  endpoint: null # Optional: for S3-compatible storage like MinIO
  url_style: 'vhost' # or 'path'

# Settings for Smart Table Discovery in Excel files.
excel_table_discovery:
  min_rows: 3
  min_cols: 2

# Settings for Smart Table Discovery in XML files.
xml_table_discovery:
  min_records: 3 # Min repeating elements to be considered a table
  max_depth: 5   # How deep to search in the XML tree

Environment Variables

Environment Variable                 Overrides Setting          Format
DUCKTAPE_SOURCES                     sources                    Comma-separated string
DUCKTAPE_OUTPUT_DB_FILE              output_db_file             String (file path)
DUCKTAPE_INTERMEDIATE_DIR            intermediate_dir           String (directory path or S3 URI)
DUCKTAPE_POLLING_INTERVAL_SECONDS    polling_interval_seconds   Integer
DUCKTAPE_S3_REGION                   s3.region                  String
AWS_ACCESS_KEY_ID                    S3 authentication          AWS access key
AWS_SECRET_ACCESS_KEY                S3 authentication          AWS secret key
AWS_SESSION_TOKEN                    S3 authentication          AWS session token (optional)

Usage

duck-tape is a command-line tool with two primary modes.

1. One-Time Build (Default)

This performs a full, clean build of the DuckDB catalog.

# Use the default config.yaml
python duck_tape.py --build

# Specify a different config file
python duck_tape.py --config prod_config.yaml --build

2. Watch Mode

For local development, or for a long-running production deployment, run the watcher. It performs an initial build and then keeps running: it monitors local source directories, polls remote sources (HTTP/S, S3), and updates the DuckDB catalog as changes are detected.

python duck_tape.py --watch

# Use a different config file in watch mode
python duck_tape.py --config dev_config.yaml --watch

Limitations and Considerations

While duck-tape is powerful, it's important to understand its behavior in large-scale scenarios.

  • Large Number of Files: The initial build process scans all files to generate the catalog. If your sources contain tens of thousands of files, this initial scan can be time-consuming. The watcher, however, will only process changes incrementally after the initial build.
  • Very Large Individual Files: When a file is converted to Parquet, it must be fully read into memory by Pandas or DuckDB. For extremely large files (e.g., 50 GB+), this can be memory-intensive; this is why the convert_to_parquet_max_mb setting exists, so massive files fall back to a direct view instead.
  • Remote Polling: The watcher polls remote sources on a timer. It is not event-driven, so changes will only be detected after the polling_interval_seconds has elapsed. It also cannot detect when a remote file is deleted; a manual --build is required to clean up views for deleted remote sources.
  • Complex Formats (PDF/HTML/XML): Table extraction from formats designed for presentation or complex hierarchies is not always perfect. duck-tape uses powerful libraries and smart heuristics to do its best, but results can vary depending on the structure of the source file. Always validate the resulting schema for these types of sources.
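The remote-polling limitation is easier to see in code. A stdlib-only, hypothetical sketch of one poll cycle — change detection by fingerprint comparison, here using local mtimes, where duck-tape's real watcher would compare HTTP/S or S3 metadata:

```python
import os

def poll_once(paths, seen):
    """Return the paths that changed since the last call, updating `seen`."""
    changed = []
    for path in paths:
        if not os.path.exists(path):
            continue  # deletions go unnoticed, hence the manual --build cleanup
        fingerprint = os.path.getmtime(path)
        if seen.get(path) != fingerprint:
            seen[path] = fingerprint
            changed.append(path)
    return changed
```

A deleted source simply stops appearing in the changed list, so its stale view survives until the next full --build, as noted above.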
