dfcontext
Generate optimal LLM context from pandas DataFrames within a token budget.
Why?
You have a 100K-row DataFrame. Your LLM has a context window.
- df.to_string() gives you millions of tokens
- df.head() gives you 5 rows with no statistical context
dfcontext gives you the sweet spot — intelligent, column-type-aware summarization that fits within your token budget. No LLM calls required.
Install
pip install dfcontext
Optional dependencies for accurate token counting and YAML output:
pip install dfcontext[all] # tiktoken + pyyaml
pip install dfcontext[tiktoken] # accurate token counting only
pip install dfcontext[yaml] # YAML format output only
Quick Start
import pandas as pd
from dfcontext import to_context
df = pd.read_csv("sales.csv") # 100K rows
ctx = to_context(df, token_budget=2000)
print(ctx)
Output:
## Dataset overview
- 100,000 rows × 5 columns
## Schema
| Column | Type | Non-null |
|--------|------|----------|
| region | object | 100% |
| sales | float64 | 100% |
| quantity | int64 | 100% |
| date | datetime64[ns] | 100% |
| is_return | bool | 100% |
## Column statistics
### region (categorical, 4 unique)
Top values: East (28.0%), West (25.8%), North (23.2%), South (23.0%)
### sales (numeric)
Range: 4.64 — 8,172.45 | Mean: 1,010.55 | Std: 1,030.04
Distribution: [█▃▁▁▁▁▁▁]
### date (datetime)
Range: 2024-01-01 — 2024-02-11 | Granularity: hourly
### is_return (boolean)
True: 6.0% | False: 94.0%
## Sample rows (diverse selection)
| region | sales | quantity | date | is_return |
|---|---|---|---|---|
| East | 4.64 | 32 | 2024-01-14 | False |
| South | 697.55 | 50 | 2024-01-15 | False |
| West | 8172.45 | 68 | 2024-01-02 | False |
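The `Distribution` line in the output above packs a histogram into a handful of block characters. How dfcontext builds that line is not documented here, so the following is only a minimal sketch of one way to do it, assuming equal-width bins scaled against the fullest bin. The function name `sparkline` and its `bins` parameter are hypothetical.

```python
BLOCKS = "▁▂▃▄▅▆▇█"  # eight levels, lowest to highest


def sparkline(values, bins=8):
    """Render values as a one-line histogram of block characters."""
    if not values:
        return "[]"
    lo, hi = min(values), max(values)
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Constant data (width == 0) lands in the first bin;
        # the maximum value is clamped into the last bin.
        idx = 0 if width == 0 else min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    # Scale each bin against the fullest one; empty bins show the lowest block.
    chars = [BLOCKS[(c * (len(BLOCKS) - 1)) // peak] for c in counts]
    return "[" + "".join(chars) + "]"
```

A heavily right-skewed column, like `sales` in the sample, would put most of its mass in the first bin, which is why a tall first block followed by low ones is a plausible shape.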
Features
- Column-type-aware analysis — different strategies for numeric, categorical, text, datetime, and boolean columns
- Token budget management — output always fits within your specified token limit
- Query hints — tell it what you're analyzing, and it prioritizes relevant columns
- Multiple formats — Markdown, plain text, or YAML output
- Zero LLM dependency — pure data processing, works with any LLM provider
- Fast — handles 100K rows in under a second
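Token budget management implies deciding which sections to keep when not everything fits. The strategy dfcontext uses is not described in this README; as an illustration only, here is a sketch of greedy, priority-ordered selection. Both `rough_tokens` (about four characters per token) and `fit_sections` are hypothetical helpers, not part of the library.

```python
def rough_tokens(text):
    """Crude estimate: about four characters per token (an assumption)."""
    return (len(text) + 3) // 4


def fit_sections(sections, token_budget):
    """Keep sections in priority order while they fit in the budget.

    `sections` is a list of (priority, text) pairs; a lower number means
    more important. A section that does not fit is skipped, not truncated.
    """
    kept, used = [], 0
    for _priority, text in sorted(sections, key=lambda s: s[0]):
        cost = rough_tokens(text)
        if used + cost <= token_budget:
            kept.append(text)
            used += cost
    return kept, used
```

Skipping rather than truncating keeps every emitted section complete, at the cost of sometimes leaving part of the budget unused.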
Advanced Usage
Query Hints
Provide a hint to allocate more token budget to relevant columns:
ctx = to_context(df, token_budget=2000, hint="regional sales trends")
# "region" and "sales" columns get more detailed analysis
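How a hint is mapped to columns is not spelled out here. One simple possibility is word overlap between the hint and the column names, which would explain why "regional sales trends" favors `region` and `sales`. The sketch below assumes that approach; `hint_weights`, `boost`, and the prefix rule are hypothetical and may differ from what dfcontext actually does.

```python
import re


def _words(text):
    """Split text into lowercase alphanumeric words ("is_return" -> is, return)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _related(a, b):
    """True when one word is a prefix of the other (e.g. region / regional)."""
    short, long_ = sorted((a, b), key=len)
    return len(short) >= 3 and long_.startswith(short)


def hint_weights(columns, hint, boost=2.0):
    """Give a larger weight to columns whose name relates to a hint word."""
    hint_words = _words(hint)
    weights = {}
    for col in columns:
        match = any(_related(c, h) for c in _words(col) for h in hint_words)
        weights[col] = boost if match else 1.0
    return weights
```

The resulting weights could then be used to split a token budget unevenly across columns.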
Output Formats
ctx_md = to_context(df, format="markdown") # default
ctx_plain = to_context(df, format="plain") # no markdown syntax
ctx_yaml = to_context(df, format="yaml") # requires pyyaml
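Because YAML output depends on the optional pyyaml package, calling code may want to fall back to another format when it is missing. A small sketch using only the standard library follows; `pick_format` is a hypothetical helper, and it assumes PyYAML is importable under its usual module name `yaml`.

```python
import importlib.util


def pick_format(preferred="yaml"):
    """Return `preferred`, unless it is "yaml" and PyYAML is not importable."""
    if preferred == "yaml" and importlib.util.find_spec("yaml") is None:
        return "markdown"  # the documented default format
    return preferred
```

The result can be passed straight through, for example `to_context(df, format=pick_format())`.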
Configuration Object
For full control, use ContextConfig:
from dfcontext import ContextConfig, to_context
config = ContextConfig(
    token_budget=3000,
    format="markdown",
    hint="churn analysis",
    include_schema=True,
    include_stats=True,
    include_samples=True,
    max_sample_rows=5,
)
ctx = to_context(df, config=config)
Column Analysis
Get structured analysis results as Python objects:
from dfcontext import analyze_columns
summaries = analyze_columns(df)
for name, s in summaries.items():
    print(f"{name}: {s.column_type}, {s.unique_count} unique")
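Since the summaries are plain Python objects, they can be post-processed like any other mapping. The sketch below groups column names by type. It relies only on the `column_type` attribute shown above; `group_by_type` is a hypothetical helper, and `str()` is applied because this README does not say whether `column_type` is a string or an enum. The stand-in objects exist only so the sketch runs without dfcontext installed.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_type(summaries):
    """Map each column_type (as a string) to the list of column names."""
    groups = defaultdict(list)
    for name, s in summaries.items():
        groups[str(s.column_type)].append(name)
    return dict(groups)


# Stand-ins for the objects analyze_columns(df) would return;
# the type labels mirror those in the sample output.
demo = {
    "region": SimpleNamespace(column_type="categorical"),
    "sales": SimpleNamespace(column_type="numeric"),
    "quantity": SimpleNamespace(column_type="numeric"),
}
groups = group_by_type(demo)
```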
Token Counting
from dfcontext import count_tokens
tokens = count_tokens("some text")
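The install notes describe tiktoken as optional and needed only for accurate counts, which suggests some approximation is used otherwise. The internals of `count_tokens` are not shown here, so the following is not its implementation, just a sketch of the general try-then-fall-back pattern. `estimate_tokens`, the `cl100k_base` encoding choice, and the four-characters-per-token ratio are all assumptions.

```python
def estimate_tokens(text, encoding_name="cl100k_base"):
    """Count tokens with tiktoken when possible, else use a rough heuristic."""
    try:
        import tiktoken

        return len(tiktoken.get_encoding(encoding_name).encode(text))
    except Exception:
        # tiktoken is missing or its encoding data is unavailable:
        # assume roughly four characters per token.
        return (len(text) + 3) // 4
```

Tokenizers differ between model providers, so either path is best treated as an estimate rather than an exact count for a given LLM.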
Use with Claude
import anthropic
import pandas as pd
from dfcontext import to_context
df = pd.read_csv("sales.csv")
ctx = to_context(df, token_budget=2000, hint="sales trends")
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"{ctx}\n\nWhat are the key sales trends?",
    }],
)
API Reference
| Function | Description |
|---|---|
| to_context(df, ...) | Generate LLM context string from a DataFrame |
| analyze_columns(df) | Get structured column analysis results |
| count_tokens(text) | Count tokens in text |

| Class | Description |
|---|---|
| ContextConfig | Configuration dataclass for to_context() |
| ColumnSummary | Structured result from column analysis |
| BudgetPlan | Token budget allocation plan |
License
MIT