
TokenPack-RAG

Query-aware semantic chunk selection under LLM context-window budgets.

TokenPack-RAG packs the most useful evidence chunks into a smaller LLM-ready context file.

It turns long-context selection into a budgeted context-packing problem: chunks are items, token counts are weights, and query-conditioned evidence scores are values. The default pipeline is the current strongest setting from the paper:

structure-aware semantic chunks + evidence-hybrid scoring + hybrid-greedy budget fill

The practical goal is simple: give your LLM less context while keeping the evidence that matters.
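To make the framing concrete, here is a minimal, self-contained sketch of a density-greedy budget fill. Chunk and greedy_pack are illustrative names only, not TokenPack's API; the real pipeline uses structure-aware chunks, evidence-hybrid scores, and the hybrid-greedy selector.

# Illustrative only: the knapsack-style framing with a density-greedy fill.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    tokens: int   # weight: token count
    score: float  # value: query-conditioned evidence score

def greedy_pack(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Pick chunks by descending score-per-token until the budget is full."""
    picked, used = [], 0
    for c in sorted(chunks, key=lambda c: c.score / max(c.tokens, 1), reverse=True):
        if used + c.tokens <= budget:
            picked.append(c)
            used += c.tokens
    return picked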

What You Get

  • A one-command CLI: tokenpack-rag pack SOURCE --query "..."
  • Automatic token-budget estimation when you do not know what budget to choose.
  • Automatic Markdown output next to your source file, such as paper-tp.md.
  • Budget-valid context selection for long documents, code, PDFs, or mixed folders.
  • Advanced ingest, select, export-context, answer, and benchmark commands for experiments.
  • Optional local MCP server for agent tools such as Claude Desktop, Cursor, or Codex.
  • Optional second-stage prompt compression with LLMLingua / LongLLMLingua.
  • Reproducible paper artifacts under submission/.

Install

From PyPI:

pip install tokenpack-rag

From GitHub today:

pip install "tokenpack-rag @ git+https://github.com/mo-tunn/TokenPack.git"

For PDF parsing, Office files, token counting, compression, and development tools:

pip install "tokenpack-rag[pdf,office,tokens,compression,dev] @ git+https://github.com/mo-tunn/TokenPack.git"

For local agent/MCP usage:

pip install "tokenpack-rag[mcp,pdf,office,tokens] @ git+https://github.com/mo-tunn/TokenPack.git"

For local editable development:

git clone https://github.com/mo-tunn/TokenPack.git
cd TokenPack
pip install -e ".[pdf,office,tokens,compression,dev]"

TokenPack-RAG uses sentence-transformers/all-MiniLM-L6-v2 as the default embedding model. Use --offline-models only when the model is already cached locally.
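One way to warm that cache ahead of time is to load the model once with the sentence-transformers package directly; instantiating it downloads and caches the weights, so later runs can stay offline:

from sentence_transformers import SentenceTransformer

# Downloading once caches the model locally; afterwards --offline-models works.
SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")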

Quick Start

Pack one document into an LLM-ready Markdown context:

tokenpack-rag pack README.md --query "How does TokenPack reduce LLM context cost?"

This writes:

README-tp.md

For a PDF:

tokenpack-rag pack paper.pdf --query "What are the main contributions?"

This writes:

paper-tp.md

For a folder:

tokenpack-rag pack docs/ --query "Summarize the design decisions in this project."

This writes:

docs-tp.md

The output is not a modified PDF. It is a packed Markdown context file that you can paste or upload into your own LLM.

Supported Inputs

TokenPack-RAG accepts a single file or a folder. Folder inputs are scanned recursively and unsupported binary/media files are skipped.

  • Text and docs: .txt, .text, .md, .markdown, .rst, .adoc, .tex, .log
  • PDF: .pdf (requires the pdf extra)
  • Web: .html, .htm
  • Data/config: .json, .jsonl, .csv, .tsv, .yaml, .yml, .toml
  • Office: .docx, .pptx, .xlsx (requires the office extra)
  • Code: .py, .js, .jsx, .ts, .tsx, .java, .go, .rs, .c, .cpp, .cs, .php, .rb, .swift, .kt, .scala, .sh, .ps1, .sql, .css, .xml, and related variants

Office support is optional so the base install stays lighter:

pip install "tokenpack-rag[office]"

Auto Budget

--budget is optional. When you omit it, TokenPack-RAG estimates a budget from the source size:

import math

source_tokens = sum(chunk.token_count for chunk in index.chunks)
raw_budget = math.ceil(source_tokens * 0.50)              # default budget ratio
budget = max(1200, min(raw_budget, 64000))                # clamp to [min-budget, max-budget]
reserve_output = min(4000, max(512, int(budget * 0.10)))  # head-room for the model's answer
selection_budget = budget - reserve_output

Example terminal summary:

Source: paper.pdf
Output: paper-tp.md
Source tokens: 142,000
Auto budget: 64,000 tokens (ratio=50%, capped by max-budget)
Reserved for answer: 4,000
Selection budget: 60,000
Selected: 188 chunks / 59,240 tokens

You can still take control when you want a smaller or larger packed context:

tokenpack-rag pack paper.pdf \
  --query "What evidence supports the main claim?" \
  --budget 32000 \
  --overwrite

Other budget controls:

tokenpack-rag pack paper.pdf --query "..." --budget-ratio 0.35
tokenpack-rag pack paper.pdf --query "..." --max-budget 128000
tokenpack-rag pack paper.pdf --query "..." --reserve-output 2000

The default 64k cap is intentional: TokenPack-RAG does local embedding and selection, so the packing step itself does not spend LLM API tokens. The cap is aimed at modern long-context models while still preventing unexpectedly huge output files.

Output Files

By default, TokenPack-RAG writes the packed context next to the source:

  • paper.pdf → paper-tp.md
  • notes.txt → notes-tp.md
  • docs/ → docs-tp.md
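
The naming rule is simple enough to restate in a few lines of Python. default_output_path is a hypothetical helper written for illustration, not part of the package:

from pathlib import Path

def default_output_path(source: str) -> Path:
    # Mirror the default rule: drop the extension (if any)
    # and append -tp.md next to the source file or folder.
    p = Path(source.rstrip("/"))
    stem = p.stem if p.suffix else p.name
    return p.with_name(f"{stem}-tp.md")

print(default_output_path("paper.pdf"))  # paper-tp.md
print(default_output_path("docs/"))      # docs-tp.md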

Existing output files are protected by default:

tokenpack-rag pack paper.pdf --query "..."

If paper-tp.md already exists, the command stops. Use --overwrite or choose an explicit path:

tokenpack-rag pack paper.pdf --query "..." --overwrite
tokenpack-rag pack paper.pdf --query "..." --out packed-context.md

Internal artifacts go under .tokenpack/runs/<timestamp>/ unless you pass explicit paths:

tokenpack-rag pack paper.pdf \
  --query "..." \
  --index-out .tokenpack/paper.index.json \
  --selection-out paper-tp.selection.json

Optional Compression

TokenPack-RAG is selection-first by default. You can optionally compress the selected evidence with LLMLingua:

tokenpack-rag pack paper.pdf \
  --query "What evidence supports the main claim?" \
  --compress llmlingua \
  --compression-rate 0.85

For LongLLMLingua-style query-conditioned compression:

tokenpack-rag pack paper.pdf \
  --query "What evidence supports the main claim?" \
  --compress llmlingua \
  --longllmlingua \
  --compression-rate 0.85

By default, compression models are expected to be cached locally. Add --allow-download when you intentionally want Hugging Face downloads during compression.
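
Conceptually, this second stage is similar to running LLMLingua directly over the packed Markdown. A standalone sketch, assuming llmlingua's PromptCompressor API (keyword names vary across llmlingua versions, and this is not TokenPack's internal code path):

# Standalone illustration: compress an already-packed context with LLMLingua.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # loads a compression model; downloads unless cached
packed = open("paper-tp.md", encoding="utf-8").read()
result = compressor.compress_prompt(
    packed,
    question="What evidence supports the main claim?",
    rate=0.85,  # assumption: keep roughly 85% of the tokens
)
print(result["compressed_prompt"])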

Use With Agents / MCP

TokenPack-RAG can also run as a local stdio MCP server. This lets an agent call TokenPack directly as a tool, produce a packed Markdown context, and then reason over that selected context.

Install with MCP support:

pipx install "tokenpack-rag[mcp,pdf,office,tokens]"

Add a local MCP server to your agent config:

{
  "mcpServers": {
    "tokenpack-rag": {
      "command": "tokenpack-rag-mcp",
      "args": ["--workspace", "/path/to/project"]
    }
  }
}

Or run it through uvx without a permanent install:

{
  "mcpServers": {
    "tokenpack-rag": {
      "command": "uvx",
      "args": [
        "--from",
        "tokenpack-rag[mcp,pdf,office,tokens]",
        "tokenpack-rag-mcp",
        "--workspace",
        "/path/to/project"
      ]
    }
  }
}

The MCP server exposes:

  • pack_context: packs a file or folder into selected Markdown context and writes the -tp.md artifact.
  • read_packed_context: reads a packed context file, optionally in slices for large contexts.

By default the MCP server can only read and write inside --workspace. Use --allow-any-path only for trusted local setups.

Advanced CLI

The one-command pack workflow is the main user-facing interface. The lower-level commands remain available for experiments and reproducible paper runs.

Build an index:

tokenpack-rag ingest README.md --index .tokenpack/readme-index.json

Select evidence under a manual budget:

tokenpack-rag select \
  --index .tokenpack/readme-index.json \
  --query "How does TokenPack reduce LLM context cost?" \
  --budget 3000 \
  --reserve-output 500 \
  --output .tokenpack/selection.json

Export the selected context:

tokenpack-rag export-context \
  --selection .tokenpack/selection.json \
  --output .tokenpack/context.txt

By default, these commands use:

chunker: structure-aware semantic boundaries
chunk-size-preset: low-budget
scoring: evidence-hybrid
selector: budget-top-k (TokenPack hybrid-greedy)

Historical options such as the knapsack and knapsack-redundancy selectors and semantic-threshold chunking remain available for ablation work, but the main pipeline is hybrid-greedy.

Python API

from tokenpack.embeddings import make_embedder
from tokenpack.pipeline import ingest_path
from tokenpack.scoring import score_chunks
from tokenpack.selectors import select_chunks

# 1. Chunk and embed the source into a persisted index.
embedder = make_embedder()
index = ingest_path(
    "README.md",
    ".tokenpack/readme-index.json",
    embedder=embedder,
    chunker_name="structure-aware",
    target_tokens=250,
    min_tokens=40,
    max_tokens=320,
)

# 2. Embed the query once, then score every chunk against it.
query = "How does TokenPack reduce LLM context cost?"
query_embedding = embedder.embed([query])[0]

scored = score_chunks(
    query_embedding,
    index.chunks,
    index.embeddings,
    scoring="evidence-hybrid",
    query_text=query,
    redundancy_penalty=0.35,
)

# 3. Fill the token budget with the highest-value chunks.
result = select_chunks(
    scored,
    strategy="budget-top-k",
    budget=3000,
    candidate_pool=250,
)

print(result.used_tokens, [item.chunk.id for item in result.selected])

Headline Results

These are the cleanest results from the current paper artifacts. The paper is intentionally conservative: TokenPack-RAG does not claim universal knapsack dominance, but it does show that selection-first context packing is a strong budget-control layer.

  • QASPER, matched ~50% saving: Only TokenPack preserves 0.934 evidence recall vs 0.713 for Only LLMLingua-2.
  • QASPER complete evidence: Only TokenPack preserves complete evidence on 0.870 of questions vs 0.120 for Only LLMLingua-2.
  • QASPER cascade frontier: TokenPack + LLMLingua-2 at rate 0.85 reaches 58.4% saving with 0.851 evidence recall.
  • LongBench v2 generation pilot: TP hybrid-greedy-50 answers 37/83 cases correctly vs 32/83 for full context and 34/83 for production-RAG, a +15.6% relative accuracy gain over full context with 50.6% saving.
  • LongBench aggressive cascade: TP hybrid-greedy-50 + LongLLMLingua-50 keeps the same 37/83 correctness while reaching 74.6% context saving on the 83-case eligible pilot.

The strongest claim is:

Select evidence first, then optionally compress it. Retrieval-time budget selection and prompt compression are not interchangeable.

Reproduce Paper Runs

Fast local tests:

python -m pytest -q

QASPER selector baseline:

python submission/experiments/qasper_selector_eval.py \
  --data-file .tokenpack/data/qasper-validation.parquet \
  --chunker structure-aware \
  --strategies production-rag,budget-top-k,greedy-density,knapsack,knapsack-redundancy \
  --budget-ratios 0.20,0.30,0.40,0.50 \
  --max-papers 500 \
  --max-questions 861 \
  --candidate-pool 300 \
  --chunk-size-preset low-budget \
  --output-dir submission/results/qasper_selector_eval_strong_rerun

LongBench v2 Modal pilot used in the current paper:

python -m modal run submission/longbench_eval/app.py::build_and_run \
  --output-dir submission/results/longbench_v2_modal_hybrid_greedy_83_latency \
  --limit 83 \
  --source-min-tokens 8000 \
  --source-max-tokens 24000 \
  --max-scanned 503 \
  --model-id Qwen/Qwen2.5-14B-Instruct \
  --batch-size 1 \
  --context-order score-then-source \
  --latency-mode

See submission/source_code_manifest.md for the full artifact map.

Repository Layout

src/tokenpack/                     Python package and CLI implementation
tests/                             Unit and smoke tests
examples/                          Small local examples for the CLI
submission/paper/                  LaTeX paper source, tables, figures
submission/experiments/            QASPER, LongBench, compression, and ablation scripts
submission/results/                Paper result artifacts and readouts
submission/longbench_eval/         Modal LongBench v2 generation harness
submission/modal_generation_eval/  Modal QASPER generation/judge harness

Notes

  • The default workflow is output-first: create a packed context file and send that file to your own LLM.
  • Ollama is not required for pack; MCP support is optional and local-first.
  • QASPER metrics are evidence-retention and answer-token-retention proxies, not human-judged generated-answer quality.
  • LongBench v2 accuracy numbers are pilot-scale and should be read descriptively, not as statistically significant wins.
  • Evidence-hybrid scoring weights are engineering defaults. The paper calls out weight calibration as future work.
  • BudgetMem is discussed as related work; the old budgetmem-style proxy is kept only in tokenpack.scoring_experimental, not in the production CLI.

License

TokenPack-RAG is licensed under the Business Source License 1.1. See LICENSE.

Citation

If you use TokenPack-RAG in research, cite the paper PDF in submission/TokenPack-paper.pdf. A BibTeX entry will be added when the public preprint is available.
