dfcontext

Generate optimal LLM context from pandas DataFrames within a token budget.

Why?

You have a 100K-row DataFrame. Your LLM has a context window.

  • df.to_string() gives you millions of tokens
  • df.head() gives you 5 rows with no statistical context

dfcontext gives you the sweet spot — intelligent, column-type-aware summarization that fits within your token budget. No LLM calls required.

Install

pip install dfcontext

Optional dependencies for accurate token counting and YAML output:

pip install dfcontext[all]       # tiktoken + pyyaml
pip install dfcontext[tiktoken]  # accurate token counting only
pip install dfcontext[yaml]      # YAML format output only

Quick Start

import pandas as pd
from dfcontext import to_context

df = pd.read_csv("sales.csv")  # 100K rows
ctx = to_context(df, token_budget=2000)
print(ctx)

Output:

## Dataset overview
- 100,000 rows × 5 columns

## Schema
| Column | Type | Non-null |
|--------|------|----------|
| region | object | 100% |
| sales | float64 | 100% |
| quantity | int64 | 100% |
| date | datetime64[ns] | 100% |
| is_return | bool | 100% |

## Column statistics
### region (categorical, 4 unique)
Top values: East (28.0%), West (25.8%), North (23.2%), South (23.0%)

### sales (numeric)
Range: 4.64 — 8,172.45 | Mean: 1,010.55 | Std: 1,030.04
Distribution: [█▃▁▁▁▁▁▁]

### date (datetime)
Range: 2024-01-01 — 2024-02-11 | Granularity: hourly

### is_return (boolean)
True: 6.0% | False: 94.0%

## Sample rows (diverse selection)
| region | sales | quantity | date | is_return |
|---|---|---|---|---|
| East | 4.64 | 32 | 2024-01-14 | False |
| South | 697.55 | 50 | 2024-01-15 | False |
| West | 8172.45 | 68 | 2024-01-02 | False |
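
The Distribution line in the output above is a compact text histogram. How dfcontext builds it is not documented here; the following is a minimal sketch of one way to render such a sketch, assuming equal-width bins and eight block characters. The function name sparkline and the sample data are illustrative only.

```python
BLOCKS = "▁▂▃▄▅▆▇█"  # eight block heights, lowest to highest


def sparkline(values, bins=8):
    """Render an equal-width histogram of `values` as block characters."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: everything falls into the first bin.
        return BLOCKS[-1] + BLOCKS[0] * (bins - 1)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    # Scale each count to a block index between 0 and 7.
    return "".join(BLOCKS[round(c / peak * (len(BLOCKS) - 1))] for c in counts)


# Right-skewed sample data (made up for this sketch).
sales = [5, 8, 12, 20, 25, 30, 40, 55, 70, 90, 150, 400, 800]
sketch = sparkline(sales)
print(f"Distribution: [{sketch}]")
```

A right-skewed column produces a tall first block followed by short ones, which is the shape shown for sales above.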

Features

  • Column-type-aware analysis — different strategies for numeric, categorical, text, datetime, and boolean columns
  • Token budget management — output always fits within your specified token limit
  • Adaptive detail — higher budgets produce richer stats (percentiles, skewness, outlier rates)
  • Query hints — tell it what you're analyzing, and it prioritizes relevant columns
  • Correlation detection — find relationships between numeric columns
  • Outlier indicators — flag columns with potential outliers (IQR method)
  • Multiple formats — Markdown, plain text, or YAML output
  • Zero LLM dependency — pure data processing, works with any LLM provider
  • Fast — handles 100K rows in under a second
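
The outlier indicator in the list above is described as using the IQR method. As a rough illustration of that method (not dfcontext's actual code), the sketch below counts values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR using only the standard library. The 1.5 multiplier is the conventional choice and is an assumption here, as is the function name iqr_outlier_rate.

```python
import statistics


def iqr_outlier_rate(values, k=1.5):
    """Return the fraction of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return len(outliers) / len(values)


# Nine typical values and one extreme value (made-up data).
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]
rate = iqr_outlier_rate(data)
print(f"outliers: {rate * 100:.1f}%")
```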

Advanced Usage

Query Hints

Provide a hint to allocate more token budget to relevant columns:

ctx = to_context(df, token_budget=2000, hint="regional sales trends")
# "region" and "sales" columns get more detailed analysis
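
How the hint is matched to columns is not described here. As a purely hypothetical sketch of the idea, the code below splits a token budget across columns and gives a larger share to columns whose names overlap with words in the hint. The function allocate_budget, the substring matching rule, and the boost factor are all assumptions made for illustration.

```python
def allocate_budget(columns, token_budget, hint="", boost=3.0):
    """Split `token_budget` across columns, weighting hint-matched columns higher."""
    # Ignore very short words so that "a" or "of" cannot match a column name.
    hint_words = {w.lower() for w in hint.split() if len(w) >= 3}
    weights = {}
    for col in columns:
        name = col.lower()
        # Substring match in either direction, so "region" matches "regional".
        matched = any(name in w or w in name for w in hint_words)
        weights[col] = boost if matched else 1.0
    total = sum(weights.values())
    # Round down so the shares never exceed the overall budget.
    return {col: int(token_budget * w / total) for col, w in weights.items()}


plan = allocate_budget(
    ["region", "sales", "quantity", "date", "is_return"],
    token_budget=2000,
    hint="regional sales trends",
)
for col, tokens in plan.items():
    print(f"{col}: {tokens} tokens")
```

With this weighting, region and sales each receive three times the share of the other columns.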

Output Formats

ctx_md = to_context(df, format="markdown")   # default
ctx_plain = to_context(df, format="plain")   # no markdown syntax
ctx_yaml = to_context(df, format="yaml")     # requires pyyaml

Configuration Object

For full control, use ContextConfig:

from dfcontext import ContextConfig, to_context

config = ContextConfig(
    token_budget=3000,
    format="markdown",
    hint="churn analysis",
    include_schema=True,
    include_stats=True,
    include_samples=True,
    max_sample_rows=5,
)
ctx = to_context(df, config=config)

Correlation Detection

Find relationships between numeric columns:

ctx = to_context(df, token_budget=2000, include_correlations=True)
# Output includes: "sales ↔ quantity: r=+0.823 (strong positive)"
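
To make the r value in that output line concrete, here is a small standard-library sketch that computes a Pearson correlation coefficient and formats it in the same style. The strength thresholds (0.7 and 0.4) and the helper names are assumptions; the cutoffs dfcontext uses are not stated here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def describe(r):
    """Format r with a sign and a coarse strength label (assumed thresholds)."""
    strength = "strong" if abs(r) >= 0.7 else "moderate" if abs(r) >= 0.4 else "weak"
    direction = "positive" if r >= 0 else "negative"
    return f"r={r:+.3f} ({strength} {direction})"


# Made-up values in which quantity rises with sales.
sales = [120.0, 340.0, 560.0, 610.0, 900.0]
quantity = [3, 8, 11, 15, 20]
r = pearson_r(sales, quantity)
line = f"sales ↔ quantity: {describe(r)}"
print(line)
```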

Column Analysis

Get structured analysis results as Python objects:

from dfcontext import ColumnSummary, analyze_columns

summaries = analyze_columns(df)
for name, s in summaries.items():
    print(f"{name}: {s.column_type}, {s.unique_count} unique")
    if s.distribution_sketch:
        print(f"  histogram: [{s.distribution_sketch}]")
    if "outlier_rate" in s.stats:
        print(f"  outliers: {s.stats['outlier_rate'] * 100:.1f}%")

ColumnSummary fields: name, dtype, column_type, non_null_rate, unique_count, stats (dict), sample_values (list), distribution_sketch (str | None).
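
For readers who prefer to see the field list as a type, the sketch below mirrors it as a dataclass. It is a stand-in named ColumnSummarySketch, not the library's class: the field names come from the list above, while the annotations for name, dtype, column_type, non_null_rate, and unique_count, the defaults, and the 0 to 1 scale for non_null_rate are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ColumnSummarySketch:
    """Illustrative mirror of the documented ColumnSummary fields."""

    name: str
    dtype: str
    column_type: str
    non_null_rate: float  # assumed to be a fraction between 0 and 1
    unique_count: int
    stats: dict[str, Any] = field(default_factory=dict)
    sample_values: list[Any] = field(default_factory=list)
    distribution_sketch: str | None = None  # None when no histogram applies


# Values taken from the "region" column in the Quick Start output.
summary = ColumnSummarySketch(
    name="region",
    dtype="object",
    column_type="categorical",
    non_null_rate=1.0,
    unique_count=4,
    sample_values=["East", "West", "North", "South"],
)
print(f"{summary.name}: {summary.column_type}, {summary.unique_count} unique")
```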

Token Counting

from dfcontext import count_tokens

tokens = count_tokens("some text")
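
The Install section notes that tiktoken is needed for accurate counts, which suggests the default count is an estimate. The estimation method is not described here; one common rule of thumb is roughly four characters per token for English text. The sketch below applies that heuristic to check a context string against a budget. Both function names and the four-character ratio are assumptions, not dfcontext behavior.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate: character count over an assumed average token length."""
    return math.ceil(len(text) / chars_per_token)


def fits_budget(text, token_budget):
    """Return True when the estimated token count is within the budget."""
    return estimate_tokens(text) <= token_budget


ctx = "## Dataset overview\n- 100,000 rows × 5 columns\n"
print(f"estimated tokens: {estimate_tokens(ctx)}")
print(f"fits in 2000 tokens: {fits_budget(ctx, 2000)}")
```

An estimate like this can be off by a wide margin for code, numbers, or non-English text, so a real tokenizer is the safer choice when the budget is tight.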

Use with Claude

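import anthropic
import pandas as pd
from dfcontext import to_context

# Summarize the DataFrame into a prompt-sized context string
df = pd.read_csv("sales.csv")
ctx = to_context(df, token_budget=2000, hint="sales trends")

# The client reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"{ctx}\n\nWhat are the key sales trends?",
    }],
)
print(response.content[0].text)  # text of the first content block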

API Reference

| Function | Description |
|----------|-------------|
| to_context(df, ...) | Generate LLM context string from a DataFrame |
| analyze_columns(df) | Get structured column analysis results |
| count_tokens(text) | Count tokens in text |

| Class | Description |
|-------|-------------|
| ContextConfig | Configuration dataclass for to_context() |
| ColumnSummary | Structured result from column analysis |
| BudgetPlan | Token budget allocation plan |

Examples

See the examples/ directory for runnable scripts.

License

MIT
