A high-performance CLI tool to convert local data science workspaces into LLM-ready context.

High-performance codebase-to-prompt orchestration for Data Science workflows and data-heavy projects.

data2prompt is a CLI tool designed to bridge the gap between local data-heavy projects and Large Language Model (LLM) context windows. Unlike generic code packagers, it produces an intelligent, token-aware representation of a project's structure and content, with output optimized for LLM attention mechanisms.

📝 Important Note

Data2Prompt is purpose-built for data-heavy projects (.csv, .sql, .xlsx, .ipynb), not large pure-code repositories. It intelligently samples and truncates data files to prevent context window explosion while preserving semantic structure.

🎯 Why Data2Prompt?

Generic code-to-prompt tools choke on data files—they either skip them entirely or dump raw CSVs that waste 90% of your context window. Data2Prompt solves this with intelligent sampling, schema extraction, and LLM-optimized formatting specifically designed for data science workflows.

✨ Core Features

  • Smart Jupyter Parsing: Intelligently extracts code, markdown, and text outputs from .ipynb files while stripping heavy Base64 images and raw HTML to preserve context.
  • Multi-Format Sampling: Sampling strategies for CSV, SQL, and Excel files that preserve schema and data context, significantly reducing data size while keeping the context the LLM needs.
  • Aggressive Truncation: Long lines are truncated to neutralize line injections and avoid exploding the context window. Tabular data that is still too large after sampling is truncated to a fixed limit, as is any raw text file of an unhandled type that is too large.
  • Defensive Processing: Automatic binary detection. A file is treated as binary if a null byte appears in its first 1024 bytes.
  • Optimized LLM Attention: The default output format is Markdown with a well-structured schema. An XML option uses XML-style tags to enhance LLM anchoring for complex analysis and large context windows.
  • Token-Aware Output: Estimates token counts in real time with tiktoken (o200k_base) to help prompts fit target LLMs (Claude 3.5, GPT-4o, Gemini 1.5), and supports offline token counting via regex.
  • Professional TUI: A high-fidelity terminal interface built with Rich, featuring a Matrix-style startup animation and interactive, scrollable reports on Windows.
  • Dynamic Markdown Wrapping: Uses intelligent backtick depth to ensure robust nesting of code blocks in the final output.
  • Gitignore Aware: Respects .gitignore rules by default; pass the --no-gitignore CLI argument to turn this off.
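Two of the features above are small enough to sketch directly: the null-byte binary check and the dynamic backtick wrapping. The code below is an illustrative approximation, not the project's source; the function names `looks_binary` and `wrap_in_fence` are hypothetical, and only the 1024-byte probe size comes from the feature description.

```python
import re


def looks_binary(data: bytes, probe_size: int = 1024) -> bool:
    """Return True if a null byte occurs within the first `probe_size` bytes."""
    return b"\x00" in data[:probe_size]


def wrap_in_fence(content: str, lang: str = "") -> str:
    """Wrap `content` in a backtick fence longer than any backtick run inside it.

    A fence must be at least three backticks; making it one longer than the
    longest run in the content keeps nested code blocks from closing it early.
    """
    longest = max((len(run) for run in re.findall(r"`+", content)), default=0)
    fence = "`" * max(3, longest + 1)
    return f"{fence}{lang}\n{content}\n{fence}"
```

With this approach, a Markdown file that already contains three-backtick fences would be wrapped in a four-backtick fence, so the outer block stays intact in the generated prompt.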

🏗️ Architecture & Engineering Standards

This project is a portfolio-grade implementation of the Modular Functional Orchestration (MFO) pattern, reflecting senior-level engineering maturity:

  • Registry & Strategy Patterns: Uses a ParserRegistry for extensible file handling and an OutputGenerator strategy for multiple formats (Markdown, XML).
  • Centralized Configuration: All core logic, magic numbers, and default ignore lists reside in src/data2prompt/constants.py.
  • Strict Type Hinting: Fully typed function signatures (PEP 484) across all modules.
  • UI Encapsulation: All terminal feedback is handled by a dedicated UIHandler, ensuring a clean separation between logic and presentation.
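The registry pattern mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: the README confirms only that a `ParserRegistry` exists, so the `register` decorator, the `parse` method, the `passthrough` fallback, and the example `.sql` parser are all hypothetical.

```python
from pathlib import Path
from typing import Callable

Parser = Callable[[str], str]


def passthrough(text: str) -> str:
    """Fallback for extensions without a registered parser (hypothetical)."""
    return text


class ParserRegistry:
    """Maps file extensions to parser functions (illustrative API only)."""

    def __init__(self) -> None:
        self._parsers: dict[str, Parser] = {}

    def register(self, *extensions: str) -> Callable[[Parser], Parser]:
        """Decorator that registers a parser for one or more extensions."""
        def decorator(func: Parser) -> Parser:
            for ext in extensions:
                self._parsers[ext.lower()] = func
            return func
        return decorator

    def parse(self, path: str, text: str) -> str:
        """Dispatch to the parser registered for the file's extension."""
        parser = self._parsers.get(Path(path).suffix.lower(), passthrough)
        return parser(text)


registry = ParserRegistry()


@registry.register(".sql")
def parse_sql(text: str) -> str:
    # Example parser: drop blank lines to save tokens.
    return "\n".join(line for line in text.splitlines() if line.strip())
```

Adding support for a new file type then means registering one more function, without touching the dispatch logic.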

For a deep dive into the system design, see the Architecture Documentation.

🚀 Quick Start

Installation

Ensure you have Python 3.10+ installed.

# Clone the repository
git clone https://github.com/arianmokhtariha/data2prompt.git
cd data2prompt

# Install normally
pip install .

# Install in editable mode
pip install -e .

# pipx is recommended over pip for easier virtual environment handling

Usage

Run data2prompt in your project root to generate a structured prompt:

# Basic usage (defaults to markdown output)
data2prompt

# Custom output with xml format and specific sampling
data2prompt --output my_analysis --format xml --csv-sample-size 50 --ignore-folders venv .pytest_cache

CLI Arguments

| Argument | Description | Default |
| --- | --- | --- |
| -o, --output | Base name of the generated file | PROMPT |
| -f, --format | Output format (xml or markdown) | markdown |
| -s, --csv-sample-size | Number of random rows to sample from CSVs | 15 |
| --max-lines | Max lines of text output per notebook cell | 40 |
| --max-file-size | Max file size in KB to read entirely | 70 |

See the CLI Reference for a full list of arguments.
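For readers who want to see how the documented flags and defaults fit together, here is a minimal argparse sketch. It is an assumption about how such an interface could be declared, not the project's actual parser; the function name `build_parser` and the help strings are illustrative, and flags not listed in the table above are omitted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags with their documented defaults (sketch)."""
    parser = argparse.ArgumentParser(prog="data2prompt")
    parser.add_argument("-o", "--output", default="PROMPT",
                        help="Base name of the generated file")
    parser.add_argument("-f", "--format", choices=["xml", "markdown"],
                        default="markdown", help="Output format")
    parser.add_argument("-s", "--csv-sample-size", type=int, default=15,
                        help="Number of random rows to sample from CSVs")
    parser.add_argument("--max-lines", type=int, default=40,
                        help="Max lines of text output per notebook cell")
    parser.add_argument("--max-file-size", type=int, default=70,
                        help="Max file size in KB to read entirely")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--format", "xml", "--csv-sample-size", "50"])
```

Unspecified options keep their defaults, so `args.output` is `PROMPT` and `args.max_lines` is `40` in this example.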

📚 Documentation

Explore the detailed documentation for more information:

🛠️ Developer Setup

To contribute or run tests:

# Quote the extras so shells such as zsh do not expand the brackets
pip install -e ".[dev]"
pytest

🌟 Show Your Support

If Data2Prompt saves you token costs or speeds up your workflow, consider:

  • ⭐ Starring the repo
  • 🐛 Reporting issues or suggesting features
  • 🔀 Contributing parsers for new file types

Built with precision for the modern AI-assisted development workflow.
