dfcontext

Generate optimal LLM context from pandas DataFrames within a token budget.

Why?

You have a 100K-row DataFrame. Your LLM has a context window.

  • df.to_string() gives you millions of tokens
  • df.head() gives you 5 rows with no statistical context

dfcontext sits between the two: it produces a column-type-aware summary of the DataFrame that fits within your token budget. No LLM calls required.

Install

pip install dfcontext

Optional dependencies for accurate token counting and YAML output:

pip install dfcontext[all]       # tiktoken + pyyaml
pip install dfcontext[tiktoken]  # accurate token counting only
pip install dfcontext[yaml]      # YAML format output only
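
To see which optional extras are importable in the current environment, a small standard-library check is enough. This is a minimal sketch; `available_extras` is a hypothetical helper, not part of dfcontext, and it only relies on the fact that the pyyaml package imports as `yaml`.

```python
from importlib.util import find_spec

# Extra name -> importable module name (pyyaml is imported as "yaml").
OPTIONAL_MODULES = {"tiktoken": "tiktoken", "yaml": "yaml"}


def available_extras() -> dict[str, bool]:
    """Report which optional modules can be imported in this environment."""
    return {
        extra: find_spec(module) is not None
        for extra, module in OPTIONAL_MODULES.items()
    }


for extra, present in available_extras().items():
    print(f"{extra}: {'installed' if present else 'missing'}")
```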

Quick Start

import pandas as pd
from dfcontext import to_context

df = pd.read_csv("sales.csv")  # 100K rows
ctx = to_context(df, token_budget=2000)
print(ctx)

Output:

## Dataset overview
- 100,000 rows × 5 columns

## Schema
| Column | Type | Non-null |
|--------|------|----------|
| region | object | 100% |
| sales | float64 | 100% |
| quantity | int64 | 100% |
| date | datetime64[ns] | 100% |
| is_return | bool | 100% |

## Column statistics
### region (categorical, 4 unique)
Top values: East (28.0%), West (25.8%), North (23.2%), South (23.0%)

### sales (numeric)
Range: 4.64 — 8,172.45 | Mean: 1,010.55 | Std: 1,030.04
Distribution: [█▃▁▁▁▁▁▁]

### date (datetime)
Range: 2024-01-01 — 2024-02-11 | Granularity: hourly

### is_return (boolean)
True: 6.0% | False: 94.0%

## Sample rows (diverse selection)
| region | sales | quantity | date | is_return |
|---|---|---|---|---|
| East | 4.64 | 32 | 2024-01-14 | False |
| South | 697.55 | 50 | 2024-01-15 | False |
| West | 8172.45 | 68 | 2024-01-02 | False |
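
The `Distribution: [█▃▁▁▁▁▁▁]` line in the output above appears to be an eight-bin histogram drawn with block characters. How dfcontext actually computes it is not documented here, so the following is only a sketch of the general technique; `sparkline` and its equal-width binning are assumptions for illustration.

```python
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values, bins=8):
    """Render a histogram of `values` as one block character per bin."""
    values = list(values)
    if not values:
        return "[]"
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal
    counts = [0] * bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    # Scale each count to one of the block heights; the fullest bin gets "█".
    chars = [BLOCKS[round(c / peak * (len(BLOCKS) - 1))] for c in counts]
    return "[" + "".join(chars) + "]"


print(sparkline([1] * 8 + [100]))
```

A right-skewed column such as `sales` puts most values in the first bins, which is why the sample output shows one tall bar followed by a flat tail.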

Features

  • Column-type-aware analysis — different strategies for numeric, categorical, text, datetime, and boolean columns
  • Token budget management — output always fits within your specified token limit
  • Query hints — tell it what you're analyzing, and it prioritizes relevant columns
  • Multiple formats — Markdown, plain text, or YAML output
  • Zero LLM dependency — pure data processing, works with any LLM provider
  • Fast — handles 100K rows in under a second
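
To make the token budget idea concrete, here is one way a budget could be enforced: add sections in priority order and skip any that would overflow. This is a sketch under stated assumptions, not dfcontext's actual algorithm; `estimate_tokens` (a rough 4-characters-per-token heuristic) and `fit_to_budget` are hypothetical names.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (a heuristic, not a tokenizer)."""
    return math.ceil(len(text) / 4)


def fit_to_budget(sections: list[tuple[str, str]], token_budget: int) -> str:
    """Keep sections in priority order, skipping any that would exceed the budget."""
    kept, used = [], 0
    for title, body in sections:
        block = f"## {title}\n{body}\n\n"
        cost = estimate_tokens(block)
        if used + cost > token_budget:
            continue  # skip this section; a smaller one later may still fit
        kept.append(block)
        used += cost
    return "".join(kept)


sections = [("Schema", "a" * 40), ("Stats", "b" * 400), ("Samples", "c" * 40)]
out = fit_to_budget(sections, token_budget=30)
print(estimate_tokens(out), "tokens used")
```

Greedy skipping keeps the example short; a real implementation might instead shrink a section (fewer sample rows, fewer top values) before dropping it.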

Advanced Usage

Query Hints

Provide a hint to allocate more token budget to relevant columns:

ctx = to_context(df, token_budget=2000, hint="regional sales trends")
# "region" and "sales" columns get more detailed analysis
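
How the hint is matched to columns is not described here. As an illustration only, the sketch below boosts columns whose names share a word stem with the hint and then normalizes the weights into budget shares. `hint_weights`, the prefix matching rule, and the `boost` factor are all assumptions, not dfcontext internals.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercase alphanumeric words, so "is_return" yields {"is", "return"}."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _related(a: str, b: str) -> bool:
    """Crude match: one word is a prefix of the other ("region" vs "regional")."""
    if min(len(a), len(b)) < 3:
        return False  # ignore very short words such as "is"
    return a.startswith(b) or b.startswith(a)


def hint_weights(columns: list[str], hint: str, boost: float = 3.0) -> dict[str, float]:
    """Return each column's share of the budget; hinted columns weigh `boost` times more."""
    hint_words = _words(hint)
    raw = {
        col: boost if any(_related(c, h) for c in _words(col) for h in hint_words) else 1.0
        for col in columns
    }
    total = sum(raw.values())
    return {col: weight / total for col, weight in raw.items()}


columns = ["region", "sales", "quantity", "date", "is_return"]
for col, share in hint_weights(columns, "regional sales trends").items():
    print(f"{col}: {share:.0%}")
```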

Output Formats

ctx_md = to_context(df, format="markdown")   # default
ctx_plain = to_context(df, format="plain")   # no markdown syntax
ctx_yaml = to_context(df, format="yaml")     # requires pyyaml

Configuration Object

For full control, use ContextConfig:

from dfcontext import ContextConfig, to_context

config = ContextConfig(
    token_budget=3000,
    format="markdown",
    hint="churn analysis",
    include_schema=True,
    include_stats=True,
    include_samples=True,
    max_sample_rows=5,
)
ctx = to_context(df, config=config)
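
The API reference describes ContextConfig as a dataclass. If it is a standard `dataclasses` dataclass (an assumption), `dataclasses.replace` is a convenient way to derive variants of one base configuration. The sketch below uses a stand-in class, `DemoConfig`, so that it runs without dfcontext installed; its field names are copied from the example above, but its defaults are invented for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class DemoConfig:
    """Stand-in for ContextConfig: same field names as the example, invented defaults."""
    token_budget: int = 2000
    format: str = "markdown"
    hint: Optional[str] = None
    include_schema: bool = True
    include_stats: bool = True
    include_samples: bool = True
    max_sample_rows: int = 5


base = DemoConfig(token_budget=3000, hint="churn analysis")

# Derive a smaller variant without mutating the base configuration.
compact = replace(base, token_budget=1000, include_samples=False)

print(base)
print(compact)
```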

Column Analysis

Get structured analysis results as Python objects:

from dfcontext import analyze_columns

summaries = analyze_columns(df)
for name, s in summaries.items():
    print(f"{name}: {s.column_type}, {s.unique_count} unique")

Token Counting

from dfcontext import count_tokens

# Counts are most accurate with the optional tiktoken extra installed
tokens = count_tokens("some text")
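
For a sense of what "accurate" versus approximate counting means, here is a minimal sketch of a counter that uses tiktoken when it is available and falls back to a character heuristic otherwise. It is not dfcontext's implementation: `estimate_tokens`, the `cl100k_base` encoding choice, and the 4-characters-per-token fallback are assumptions.

```python
import math


def estimate_tokens(text: str) -> int:
    """Count tokens with tiktoken if possible, else assume ~4 characters per token."""
    try:
        import tiktoken

        encoding = tiktoken.get_encoding("cl100k_base")  # assumed encoding
        return len(encoding.encode(text))
    except Exception:
        # tiktoken is not installed, or its encoding data could not be loaded.
        return math.ceil(len(text) / 4)


print(estimate_tokens("some text"))
```

The exact count depends on the tokenizer of the model you send the context to, so treat any local count as an estimate and leave some headroom in the budget.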

Use with Claude

import anthropic
import pandas as pd
from dfcontext import to_context

df = pd.read_csv("sales.csv")
ctx = to_context(df, token_budget=2000, hint="sales trends")

# Reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"{ctx}\n\nWhat are the key sales trends?",
    }],
)
print(response.content[0].text)

API Reference

| Function | Description |
|----------|-------------|
| to_context(df, ...) | Generate LLM context string from a DataFrame |
| analyze_columns(df) | Get structured column analysis results |
| count_tokens(text) | Count tokens in text |

| Class | Description |
|-------|-------------|
| ContextConfig | Configuration dataclass for to_context() |
| ColumnSummary | Structured result from column analysis |
| BudgetPlan | Token budget allocation plan |

License

MIT
