Generate optimal LLM context from pandas DataFrames within a token budget

Maintainers

sserada

dfcontext

Generate optimal LLM context from pandas DataFrames within a token budget.

Why?

You have a 100K-row DataFrame. Your LLM has a limited context window.

df.to_string() gives you millions of tokens
df.head() gives you 5 rows with no statistical context

dfcontext gives you the sweet spot — intelligent, column-type-aware summarization that fits within your token budget. No LLM calls required.

Install

pip install dfcontext

Optional dependencies for accurate token counting and YAML output:

pip install "dfcontext[all]"       # tiktoken + pyyaml
pip install "dfcontext[tiktoken]"  # accurate token counting only
pip install "dfcontext[yaml]"      # YAML format output only
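To see which optional packages are importable in the current environment before relying on them, a check along these lines may help. It is a minimal sketch using only the standard library; the helper name `available_extras` is hypothetical and not part of dfcontext. It assumes the usual import names `tiktoken` and `yaml` (PyYAML).

```python
from importlib.util import find_spec


def available_extras() -> dict[str, bool]:
    """Report which optional packages can be imported (hypothetical helper)."""
    # PyYAML is installed as "pyyaml" but imported as "yaml".
    modules = {"tiktoken": "tiktoken", "yaml": "yaml"}
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


if __name__ == "__main__":
    for extra, installed in available_extras().items():
        print(f"{extra}: {'installed' if installed else 'missing'}")
```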

Quick Start

import pandas as pd
from dfcontext import to_context

df = pd.read_csv("sales.csv")  # 100K rows
ctx = to_context(df, token_budget=2000)
print(ctx)

Output:

## Dataset overview
- 100,000 rows × 5 columns

## Schema
| Column | Type | Non-null |
|--------|------|----------|
| region | object | 100% |
| sales | float64 | 100% |
| quantity | int64 | 100% |
| date | datetime64[ns] | 100% |
| is_return | bool | 100% |

## Column statistics
### region (categorical, 4 unique)
Top values: East (28.0%), West (25.8%), North (23.2%), South (23.0%)

### sales (numeric)
Range: 4.64 — 8,172.45 | Mean: 1,010.55 | Std: 1,030.04
Distribution: [█▃▁▁▁▁▁▁]

### date (datetime)
Range: 2024-01-01 — 2024-02-11 | Granularity: hourly

### is_return (boolean)
True: 6.0% | False: 94.0%

## Sample rows (diverse selection)
| region | sales | quantity | date | is_return |
|---|---|---|---|---|
| East | 4.64 | 32 | 2024-01-14 | False |
| South | 697.55 | 50 | 2024-01-15 | False |
| West | 8172.45 | 68 | 2024-01-02 | False |
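The `Distribution` line in the sample output reads like a small histogram drawn with block characters. One way such a sketch could be produced is shown below, assuming eight equal-width bins scaled to the tallest bin. The function name `sparkline` and these choices are illustrative assumptions, not dfcontext's actual implementation.

```python
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values: list[float], bins: int = 8) -> str:
    """Render an equal-width histogram as a row of block characters."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: everything falls in the first bin.
        return BLOCKS[-1] + BLOCKS[0] * (bins - 1)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    # Scale each count to one of the eight block heights.
    return "".join(BLOCKS[round(c / peak * (len(BLOCKS) - 1))] for c in counts)
```

Note that in this sketch an empty bin and a nearly empty bin both render as the lowest block, so the picture shows shape rather than exact counts.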

Features

Column-type-aware analysis — different strategies for numeric, categorical, text, datetime, and boolean columns
Token budget management — output always fits within your specified token limit
Adaptive detail — higher budgets produce richer stats (percentiles, skewness, outlier rates)
Query hints — tell it what you're analyzing, and it prioritizes relevant columns
Correlation detection — find relationships between numeric columns
Outlier indicators — flag columns with potential outliers (IQR method)
Multiple formats — Markdown, plain text, or YAML output
Zero LLM dependency — pure data processing, works with any LLM provider
Fast — handles 100K rows in under a second
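For the outlier indicator, the IQR method conventionally flags values outside the range from Q1 - 1.5 × IQR to Q3 + 1.5 × IQR. A minimal standard-library sketch follows. The 1.5 multiplier and the quartile interpolation method are common defaults assumed here, and dfcontext's own thresholds may differ; `iqr_outlier_rate` is a hypothetical name.

```python
from statistics import quantiles


def iqr_outlier_rate(values: list[float], k: float = 1.5) -> float:
    """Fraction of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(values) < 2:
        return 0.0
    # method="inclusive" treats the data as the full population.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = sum(1 for v in values if v < lower or v > upper)
    return outliers / len(values)
```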

Advanced Usage

Query Hints

Provide a hint to allocate more token budget to relevant columns:

ctx = to_context(df, token_budget=2000, hint="regional sales trends")
# "region" and "sales" columns get more detailed analysis
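How a hint might translate into per-column budgets can be sketched as follows. This is purely illustrative: the prefix-based keyword matching, the `boost` factor, and the function name `allocate_budget` are assumptions made for the example and do not describe dfcontext's internal budget planner.

```python
def allocate_budget(
    columns: list[str], total: int, hint: str = "", boost: float = 2.0
) -> dict[str, int]:
    """Split a token budget across columns, favoring names mentioned in the hint."""
    if not columns:
        return {}
    hint_words = set(hint.lower().split())
    weights = {}
    for col in columns:
        name = col.lower()
        # Crude relevance test: a hint word and the column name share a prefix
        # relation, so "regional" matches the column "region".
        relevant = any(w.startswith(name) or name.startswith(w) for w in hint_words)
        weights[col] = boost if relevant else 1.0
    scale = total / sum(weights.values())
    return {col: int(w * scale) for col, w in weights.items()}


plan = allocate_budget(
    ["region", "sales", "quantity", "date", "is_return"],
    total=1400,
    hint="regional sales trends",
)
print(plan)
```

With the weights above, `region` and `sales` each receive twice the share of the other three columns.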

Output Formats

ctx_md = to_context(df, format="markdown")   # default
ctx_plain = to_context(df, format="plain")   # no markdown syntax
ctx_yaml = to_context(df, format="yaml")     # requires pyyaml

Configuration Object

For full control, use ContextConfig:

from dfcontext import ContextConfig, to_context

config = ContextConfig(
    token_budget=3000,
    format="markdown",
    hint="churn analysis",
    include_schema=True,
    include_stats=True,
    include_samples=True,
    max_sample_rows=5,
)
ctx = to_context(df, config=config)

Correlation Detection

Find relationships between numeric columns:

ctx = to_context(df, token_budget=2000, include_correlations=True)
# Output includes: "sales ↔ quantity: r=+0.823 (strong positive)"
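The `r` value in that line is the kind of number a Pearson correlation produces. A standard-library sketch of computing it and attaching a strength label is below. The cut-offs (0.7 for strong, 0.4 for moderate) and the helper name `describe_correlation` are assumptions for illustration; dfcontext's own labels may be derived differently.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def describe_correlation(a: str, b: str, r: float) -> str:
    """Format r with a rough strength label (thresholds are assumed)."""
    size = abs(r)
    strength = "strong" if size >= 0.7 else "moderate" if size >= 0.4 else "weak"
    direction = "positive" if r >= 0 else "negative"
    return f"{a} ↔ {b}: r={r:+.3f} ({strength} {direction})"
```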

Column Analysis

Get structured analysis results as Python objects:

from dfcontext import ColumnSummary, analyze_columns

summaries = analyze_columns(df)
for name, s in summaries.items():
    print(f"{name}: {s.column_type}, {s.unique_count} unique")
    if s.distribution_sketch:
        print(f"  histogram: [{s.distribution_sketch}]")
    if "outlier_rate" in s.stats:
        print(f"  outliers: {s.stats['outlier_rate'] * 100:.1f}%")

ColumnSummary fields: name, dtype, column_type, non_null_rate, unique_count, stats (dict), sample_values (list), distribution_sketch (str | None).

Token Counting

from dfcontext import count_tokens

tokens = count_tokens("some text")
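To illustrate how a token counter can enforce a budget, the sketch below keeps sections in priority order and skips any that would push the total over the limit. It uses a rough four-characters-per-token estimate instead of a real tokenizer. That ratio, and the names `estimate_tokens` and `fit_sections`, are assumptions for this example rather than dfcontext behavior.

```python
from math import ceil


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about four characters per token (a heuristic)."""
    return ceil(len(text) / 4)


def fit_sections(sections: list[str], budget: int) -> list[str]:
    """Keep sections in priority order, skipping any that would exceed the budget."""
    kept, used = [], 0
    for section in sections:
        cost = estimate_tokens(section)
        if used + cost <= budget:
            kept.append(section)
            used += cost
    return kept
```

Swapping `estimate_tokens` for dfcontext's `count_tokens` should work the same way, assuming it returns an integer count for a string.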

Use with Claude

import anthropic
import pandas as pd
from dfcontext import to_context

df = pd.read_csv("sales.csv")
ctx = to_context(df, token_budget=2000, hint="sales trends")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"{ctx}\n\nWhat are the key sales trends?",
    }],
)
print(response.content[0].text)  # text of the first content block

API Reference

| Function | Description |
|----------|-------------|
| `to_context(df, ...)` | Generate LLM context string from a DataFrame |
| `analyze_columns(df)` | Get structured column analysis results |
| `count_tokens(text)` | Count tokens in text |

| Class | Description |
|-------|-------------|
| `ContextConfig` | Configuration dataclass for `to_context()` |
| `ColumnSummary` | Structured result from column analysis |
| `BudgetPlan` | Token budget allocation plan |

Examples

See the examples/ directory for runnable scripts:

with_claude.py — Analyze a DataFrame with Anthropic Claude
with_openai.py — Analyze a DataFrame with OpenAI GPT
compare_dataframes.py — Year-over-year comparison
budget_tuning.py — See how budget affects output (no API key needed)
mcp_server.py — Build an MCP tool that summarizes CSV files

License

MIT

Release history

0.2.0 (this version): Mar 16, 2026
0.1.1: Mar 15, 2026
0.1.0: Mar 15, 2026

Download files

Source Distribution

dfcontext-0.2.0.tar.gz (17.4 kB)

Uploaded Mar 16, 2026 Source

Built Distribution

dfcontext-0.2.0-py3-none-any.whl (26.5 kB)

Uploaded Mar 16, 2026 Python 3

File details

Details for the file dfcontext-0.2.0.tar.gz.

File metadata

Download URL: dfcontext-0.2.0.tar.gz
Upload date: Mar 16, 2026
Size: 17.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dfcontext-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`3d9f48ad8728bb7d68146cec62e49b6305c480fee7bae51184669969ac71e2d0`
MD5	`b5d095824a34c1d68deda1da14856b5e`
BLAKE2b-256	`331fb77535a1c3ae2dbeb307e01d7d33748493d7ddccda3382dde0b812e91926`
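A downloaded file can be checked against the SHA256 digest listed above with the standard library. The sketch below assumes the archive sits in the current directory; the helper name `sha256_of` is illustrative.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for dfcontext-0.2.0.tar.gz
EXPECTED_SHA256 = "3d9f48ad8728bb7d68146cec62e49b6305c480fee7bae51184669969ac71e2d0"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = Path("dfcontext-0.2.0.tar.gz")
    if archive.exists():
        ok = sha256_of(archive) == EXPECTED_SHA256
        print("hash matches" if ok else "hash MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```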

Provenance

The following attestation bundles were made for dfcontext-0.2.0.tar.gz:

Publisher: ci.yml on sserada/dfcontext

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dfcontext-0.2.0.tar.gz
- Subject digest: 3d9f48ad8728bb7d68146cec62e49b6305c480fee7bae51184669969ac71e2d0
- Sigstore transparency entry: 1109429022
- Sigstore integration time: Mar 16, 2026
Source repository:
- Permalink: sserada/dfcontext@455567a4552639778ff8bad14eac56200108a52b
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/sserada
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@455567a4552639778ff8bad14eac56200108a52b
- Trigger Event: push

File details

Details for the file dfcontext-0.2.0-py3-none-any.whl.

File metadata

Download URL: dfcontext-0.2.0-py3-none-any.whl
Upload date: Mar 16, 2026
Size: 26.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dfcontext-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`38d9dc394ef099c22c1a364fbef51229ed1e0e97e11481aa491a8148af53c263`
MD5	`9e0bdd740fbd6fb38215695341416739`
BLAKE2b-256	`5cad45781d888cca3308c54744e029bab3f28d72f7f8c2ad1cf96d39cad48937`

Provenance

The following attestation bundles were made for dfcontext-0.2.0-py3-none-any.whl:

Publisher: ci.yml on sserada/dfcontext

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dfcontext-0.2.0-py3-none-any.whl
- Subject digest: 38d9dc394ef099c22c1a364fbef51229ed1e0e97e11481aa491a8148af53c263
- Sigstore transparency entry: 1109429024
- Sigstore integration time: Mar 16, 2026
Source repository:
- Permalink: sserada/dfcontext@455567a4552639778ff8bad14eac56200108a52b
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/sserada
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@455567a4552639778ff8bad14eac56200108a52b
- Trigger Event: push
