
A library for packaging together data + documentation into an agent-friendly duckdb artifact.

DataGent

🎩
🧐 🦆

A small library for creating efficient file-specific agents / RAG systems with duckdb.

Overview

DataGent packages together:

  • Loading data files and data documentation into a duckdb database with pre-built vector and full-text indices on the data dictionary's contents.
  • Simple interfaces for chunking + embedding documents and loading data, allowing the user to customize how the duckdb artifact is created.
    • Out-of-the-box chunking: Semchunk
    • Out-of-the-box embedding: Bedrock
  • Hybrid BM25 / HNSW retrieval on the generated database.
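To make the hybrid retrieval idea concrete, here is a minimal, standard-library-only sketch of combining a keyword (BM25) ranking with a vector (cosine) ranking. It is not DataGent's API: every function name, the toy data dictionary entries, and the hand-made 3-d "embeddings" are hypothetical, and reciprocal rank fusion is an assumption, since this description does not say how the two rankings are combined. In the packaged artifact the two rankings would presumably come from the database's full-text and HNSW indices rather than from Python loops.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized doc against the query terms."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of docs that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def best_first(scores):
    """Doc indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def rrf(rankings, k=60):
    """Reciprocal rank fusion (assumed here, not confirmed by the source)."""
    fused = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + position)
    return sorted(fused, key=fused.get, reverse=True)


def hybrid_search(query_tokens, query_vec, docs, vecs):
    """Fuse the keyword ranking and the vector ranking into one list."""
    keyword_rank = best_first(bm25_scores(query_tokens, docs))
    vector_rank = best_first([cosine(query_vec, v) for v in vecs])
    return rrf([keyword_rank, vector_rank])


# Hypothetical data dictionary entries, pre-tokenized, with toy embeddings.
DOCS = [
    ["patient_id", "unique", "identifier", "for", "each", "patient"],
    ["admit_date", "date", "the", "patient", "was", "admitted"],
    ["los", "length", "of", "stay", "in", "days"],
]
VECS = [[1.0, 0.0, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 1.0]]

top = hybrid_search(["date", "admitted"], [0.0, 1.0, 0.0], DOCS, VECS)
print(DOCS[top[0]])  # the admit_date entry ranks first in both lists
```

Rank fusion only needs each retriever's ordering, not comparable scores, which is why it is a common choice when mixing BM25 scores with vector similarities.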

The project is currently in a "proof-of-concept/playing around" phase. In my mind, though, it could help with a real problem: existing semantic layers are often tightly coupled to vendors like Databricks or Snowflake, which increases vendor lock-in and ties users to Spark workloads that are often overkill for the size of the data in question.

TODO:

  • Support vector similarity metrics other than cosine similarity
  • Implement a set of tools to allow an agent to work with the artifact
  • Convenience functionality to auto-load from fs (/httpfs)?
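As background for the first TODO item, the sketch below shows why the choice of similarity metric can matter: the same query can prefer different neighbours under cosine similarity, Euclidean (L2) distance, and inner product. The function names and vectors are illustrative assumptions, not part of DataGent.

```python
import math


def cosine_similarity(a, b):
    """Higher is more similar; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def l2_distance(a, b):
    """Lower is more similar; sensitive to vector length."""
    return math.dist(a, b)


def inner_product(a, b):
    """Higher is more similar; rewards long vectors."""
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 1.0]
long_vec = [5.0, 5.0]   # same direction as the query, much longer
near_vec = [1.0, 0.8]   # slightly different direction, similar length

# Cosine and inner product prefer long_vec; L2 distance prefers near_vec.
for name, vec in [("long_vec", long_vec), ("near_vec", near_vec)]:
    print(
        name,
        round(cosine_similarity(query, vec), 4),
        round(l2_distance(query, vec), 4),
        round(inner_product(query, vec), 4),
    )
```

When embeddings are normalized to unit length the three metrics produce the same ordering, so the difference only shows up for unnormalized vectors.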

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

data_gentry-0.1.0-py3-none-any.whl (12.6 kB)

Uploaded Python 3

File details

Details for the file data_gentry-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: data_gentry-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 12.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.16

File hashes

Hashes for data_gentry-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f97c59922825f60d204cd682ac05370f2b362dc04f500d2f853aef0219f7afba
MD5 9fc8a2a9aa335dd726b38218f07abb98
BLAKE2b-256 0aeb80a3361fe4c30d3ccfb917d1b8346d8cc93f3ddeaaf5c51049d73e7f2851
