dfcontext

Generate optimal LLM context from pandas DataFrames within a token budget.

Why?

You have a 100K-row DataFrame. Your LLM has a context window.

  • df.to_string() gives you millions of tokens
  • df.head() gives you 5 rows with no statistical context

dfcontext sits between the two: a column-type-aware summary of the DataFrame that fits within your token budget. No LLM calls required.

Install

pip install dfcontext

Optional dependencies for accurate token counting and YAML output:

pip install dfcontext[all]       # tiktoken + pyyaml
pip install dfcontext[tiktoken]  # accurate token counting only
pip install dfcontext[yaml]      # YAML format output only
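
To see which optional extras are importable in the current environment, a small standard-library probe is enough. This is a sketch and not part of dfcontext: the helper name `optional_extras` is hypothetical, and it assumes the extras provide the importable modules `tiktoken` and `yaml` (the import name of the pyyaml package).

```python
from importlib.util import find_spec


def optional_extras() -> dict[str, bool]:
    """Report whether the optional modules can be imported (hypothetical helper)."""
    # pyyaml is imported as "yaml"; tiktoken is imported under its own name.
    modules = {"tiktoken": "tiktoken", "yaml": "yaml"}
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


if __name__ == "__main__":
    for extra, installed in optional_extras().items():
        print(f"{extra}: {'installed' if installed else 'missing'}")
```

Running the probe before requesting YAML output or exact token counts makes a missing extra visible up front.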

Quick Start

import pandas as pd
from dfcontext import to_context

df = pd.read_csv("sales.csv")  # 100K rows
ctx = to_context(df, token_budget=2000)
print(ctx)

Output:

## Dataset overview
- 100,000 rows × 5 columns

## Schema
| Column | Type | Non-null |
|--------|------|----------|
| region | object | 100% |
| sales | float64 | 100% |
| quantity | int64 | 100% |
| date | datetime64[ns] | 100% |
| is_return | bool | 100% |

## Column statistics
### region (categorical, 4 unique)
Top values: East (28.0%), West (25.8%), North (23.2%), South (23.0%)

### sales (numeric)
Range: 4.64 — 8,172.45 | Mean: 1,010.55 | Std: 1,030.04
Distribution: [█▃▁▁▁▁▁▁]

### date (datetime)
Range: 2024-01-01 — 2024-02-11 | Granularity: hourly

### is_return (boolean)
True: 6.0% | False: 94.0%

## Sample rows (diverse selection)
| region | sales | quantity | date | is_return |
|---|---|---|---|---|
| East | 4.64 | 32 | 2024-01-14 | False |
| South | 697.55 | 50 | 2024-01-15 | False |
| West | 8172.45 | 68 | 2024-01-02 | False |
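
The `Distribution:` line in the sample output is a compact histogram drawn with block characters. How dfcontext computes it is not documented here; the sketch below shows one plausible way to build such a line, assuming eight equal-width bins scaled to the tallest bin. The name `sparkline` is hypothetical.

```python
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values: list[float], bins: int = 8) -> str:
    """Bucket values into equal-width bins and draw one block per bin."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[idx] += 1
    peak = max(counts)
    # Scale each count to one of the eight block heights; empty bins get the lowest block.
    return "".join(BLOCKS[round(c / peak * (len(BLOCKS) - 1))] for c in counts)


if __name__ == "__main__":
    skewed = [1.0] * 50 + [2.0] * 10 + [100.0]
    print(f"Distribution: [{sparkline(skewed)}]")
```

A heavily right-skewed column, like `sales` above, puts nearly all of its mass in the first bin, which is why the first block is tall and the rest are flat.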

Features

  • Column-type-aware analysis — different strategies for numeric, categorical, text, datetime, and boolean columns
  • Token budget management — output always fits within your specified token limit
  • Query hints — tell it what you're analyzing, and it prioritizes relevant columns
  • Multiple formats — Markdown, plain text, or YAML output
  • Zero LLM dependency — pure data processing, works with any LLM provider
  • Fast — handles 100K rows in under a second
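
The token budget idea can be illustrated without the library. The sketch below is not dfcontext's implementation: it assumes a rough estimate of four characters per token and a greedy pass that keeps sections in priority order, skipping any that would push the total over the budget. Both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token (an assumption)."""
    return (len(text) + 3) // 4


def fit_to_budget(sections: list[tuple[str, str]], token_budget: int) -> str:
    """Keep sections in priority order, skipping any that would exceed the budget."""
    kept, used = [], 0
    for title, body in sections:
        block = f"{title}\n{body}"
        # Charge for the blank-line separator too, so the joined text stays within budget.
        cost = estimate_tokens(block + "\n\n")
        if used + cost > token_budget:
            continue  # this section does not fit; a later, smaller one still might
        kept.append(block)
        used += cost
    return "\n\n".join(kept)


if __name__ == "__main__":
    sections = [
        ("## Schema", "x" * 40),
        ("## Column statistics", "y" * 400),
        ("## Sample rows", "z" * 20),
    ]
    print(fit_to_budget(sections, token_budget=30))
```

A greedy pass is the simplest policy that guarantees the limit; a real allocator could instead shrink each section to a share of the budget rather than dropping it whole.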

Advanced Usage

Query Hints

Provide a hint to allocate more token budget to relevant columns:

ctx = to_context(df, token_budget=2000, hint="regional sales trends")
# "region" and "sales" columns get more detailed analysis
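
One way to picture hint-driven prioritization is as a weighted split of the budget. The sketch below is an assumption about the general approach, not dfcontext's algorithm: it matches hint words to column-name words by shared prefix (so "regional" matches `region`) and gives matching columns a larger share. The function name, the `boost` factor, and the three-character minimum are all hypothetical.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercase alphanumeric words; underscores and spaces act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _related(a: str, b: str) -> bool:
    """True when one word is a prefix of the other; very short words are ignored."""
    return min(len(a), len(b)) >= 3 and (a.startswith(b) or b.startswith(a))


def allocate_budget(columns: list[str], hint: str, total: int, boost: float = 3.0) -> dict[str, int]:
    """Split `total` tokens across columns, weighting hint matches by `boost`."""
    hint_words = _words(hint)
    weights = {
        col: boost if any(_related(c, h) for c in _words(col) for h in hint_words) else 1.0
        for col in columns
    }
    weight_sum = sum(weights.values())
    # Round down so the shares never add up to more than `total`.
    return {col: int(total * w / weight_sum) for col, w in weights.items()}


if __name__ == "__main__":
    cols = ["region", "sales", "quantity", "date", "is_return"]
    print(allocate_budget(cols, "regional sales trends", total=900))
```

Prefix matching is crude and will miss synonyms; it is used here only to keep the example dependency-free.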

Output Formats

ctx_md = to_context(df, format="markdown")   # default
ctx_plain = to_context(df, format="plain")   # no markdown syntax
ctx_yaml = to_context(df, format="yaml")     # requires pyyaml

Configuration Object

For full control, use ContextConfig:

from dfcontext import ContextConfig, to_context

config = ContextConfig(
    token_budget=3000,
    format="markdown",
    hint="churn analysis",
    include_schema=True,
    include_stats=True,
    include_samples=True,
    max_sample_rows=5,
)
ctx = to_context(df, config=config)
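
Because the API reference describes ContextConfig as a dataclass, variants of a base configuration can be derived with the standard library's `dataclasses.replace`. The sketch below uses a hypothetical stand-in class, `DemoConfig`, so that it runs without dfcontext installed. Its field names mirror the example above, but the default values are placeholders, not the library's defaults.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class DemoConfig:
    """Hypothetical stand-in mirroring the ContextConfig fields shown above."""

    token_budget: int = 2000  # placeholder default
    format: str = "markdown"
    hint: Optional[str] = None
    include_schema: bool = True
    include_stats: bool = True
    include_samples: bool = True
    max_sample_rows: int = 5


base = DemoConfig(token_budget=3000, hint="churn analysis")
# replace() returns a copy with the named fields changed; `base` is left untouched.
compact = replace(base, token_budget=500, include_samples=False)

if __name__ == "__main__":
    print(base)
    print(compact)
```

Keeping one base configuration and deriving smaller variants makes it easy to compare outputs at several budgets for the same hint.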

Column Analysis

Get structured analysis results as Python objects:

from dfcontext import analyze_columns

summaries = analyze_columns(df)
for name, s in summaries.items():
    print(f"{name}: {s.column_type}, {s.unique_count} unique")
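
To make the idea of a column type concrete, the sketch below classifies columns with plain pandas dtype checks. It illustrates the general technique only; the rules dfcontext applies are not documented here. The function name and the "at most half the values are distinct" threshold separating categorical from text columns are assumptions.

```python
import pandas as pd
from pandas.api import types as ptypes


def classify_column(series: pd.Series) -> str:
    """Map a column to a coarse type label (illustrative rules, not dfcontext's)."""
    # Check bool before numeric: pandas treats bool as a numeric dtype.
    if ptypes.is_bool_dtype(series):
        return "boolean"
    if ptypes.is_numeric_dtype(series):
        return "numeric"
    if ptypes.is_datetime64_any_dtype(series):
        return "datetime"
    # Assumed heuristic: few distinct values means categorical, otherwise free text.
    if series.nunique() <= max(1, len(series) // 2):
        return "categorical"
    return "text"


df = pd.DataFrame(
    {
        "region": ["East", "West", "East", "West"],
        "sales": [4.64, 697.55, 8172.45, 120.0],
        "date": pd.to_datetime(["2024-01-14", "2024-01-15", "2024-01-02", "2024-01-03"]),
        "is_return": [False, False, True, False],
        "note": ["late delivery", "gift wrap", "bulk order", "repeat buyer"],
    }
)
labels = {name: classify_column(df[name]) for name in df.columns}

if __name__ == "__main__":
    print(labels)
```

The order of the checks matters: testing for numeric first would label boolean columns as numeric.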

Token Counting

from dfcontext import count_tokens

tokens = count_tokens("some text")
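
Since tiktoken is an optional dependency, a counter of this kind needs some fallback when it is absent. The sketch below shows one common pattern and is not dfcontext's code: it uses tiktoken's `cl100k_base` encoding when it loads, and otherwise estimates about four characters per token. The function name and the choice of encoding are assumptions.

```python
def count_tokens_fallback(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken when available, else use a character heuristic."""
    try:
        import tiktoken

        return len(tiktoken.get_encoding(encoding_name).encode(text))
    except Exception:
        # tiktoken is missing, could not load its vocabulary, or rejected the text:
        # fall back to roughly four characters per token.
        return (len(text) + 3) // 4


if __name__ == "__main__":
    print(count_tokens_fallback("some text"))
```

The heuristic is only approximate, so a budget checked this way is best treated as a soft limit with some headroom.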

Use with Claude

import anthropic
import pandas as pd
from dfcontext import to_context

df = pd.read_csv("sales.csv")
ctx = to_context(df, token_budget=2000, hint="sales trends")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"{ctx}\n\nWhat are the key sales trends?",
    }],
)
print(response.content[0].text)  # text of the first content block in the reply

API Reference

| Function | Description |
|----------|-------------|
| to_context(df, ...) | Generate LLM context string from a DataFrame |
| analyze_columns(df) | Get structured column analysis results |
| count_tokens(text) | Count tokens in text |

| Class | Description |
|-------|-------------|
| ContextConfig | Configuration dataclass for to_context() |
| ColumnSummary | Structured result from column analysis |
| BudgetPlan | Token budget allocation plan |

Examples

See the examples/ directory for runnable scripts.

License

MIT
