TokenPack-RAG
Query-aware semantic chunk selection under LLM context-window budgets. Turn long files into compact, evidence-dense LLM context.
TokenPack-RAG selects the most useful parts of documents, code, PDFs, tables, and folders under a strict token budget. It does not call an LLM during packing: it runs local embeddings, evidence scoring, and budget-aware selection, then writes a Markdown context file you can give to any LLM or agent.
In plain English, TokenPack-RAG does three things:
| Step | What happens |
|---|---|
| Split intelligently | Breaks the source into chunks that respect headings, paragraphs, code blocks, and semantic shifts. |
| Score by evidence value | Ranks chunks by how useful they look for your query, using semantic similarity, keyword support, document position, and structure signals. |
| Pack the best context | Fills your token budget with the highest-value chunks first, avoiding the waste of blindly pasting everything. |
Internally, the default pipeline is:
structure-aware semantic chunks + evidence-hybrid scoring + hybrid-greedy packing
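The packing step can be illustrated with a minimal sketch of greedy budget-filling: score chunks, sort by score, and take the highest-scoring chunks that still fit. The names, scores, and token counts below are illustrative, not TokenPack's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    tokens: int
    score: float  # evidence score from the scoring stage

def greedy_pack(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Fill the token budget with the highest-scoring chunks first."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if used + chunk.tokens <= budget:
            selected.append(chunk)
            used += chunk.tokens
    return selected

packed = greedy_pack(
    [Chunk("intro", 300, 0.2), Chunk("method", 500, 0.9), Chunk("results", 400, 0.8)],
    budget=800,
)
```

Note how "results" is skipped even though it outscores "intro": it no longer fits, which is the budget-aware behavior that distinguishes packing from plain top-k retrieval.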
Why Use It
Long-context LLMs make it tempting to paste everything into the prompt. In practice, that is expensive, slow, and often noisy. Naive RAG has the opposite problem: top-k retrieval can collect locally relevant chunks while missing the best global use of a fixed token budget.
TokenPack-RAG is built for that middle layer:
- Turns a file or folder into a compact, LLM-ready context file with one command.
- Selects globally useful evidence under a token budget instead of blindly taking top-k chunks.
- Reduces redundant or low-utility context before it reaches the LLM.
- Helps agents work with large local workspaces through MCP without uploading everything.
- Supports broad real-world inputs: docs, code, PDFs, HTML, CSV/JSON, and Office files.
- Can optionally run LLMLingua / LongLLMLingua after evidence selection for extra compression.
Install
Basic install:
pip install tokenpack-rag
Recommended document install:
pip install "tokenpack-rag[pdf,office,tokens]"
Agent/MCP install:
pip install "tokenpack-rag[mcp,pdf,office,tokens]"
Development install:
git clone https://github.com/mo-tunn/TokenPack.git
cd TokenPack
pip install -e ".[pdf,office,tokens,compression,mcp,dev]"
TokenPack-RAG uses sentence-transformers/all-MiniLM-L6-v2 as the default embedding model. The CLI tries local model files first and prints progress while loading; first-time users may see a Hugging Face download unless they pass --offline-models.
30-Second Start
Pick the path that matches what you want:
| Goal | Use this | Output |
|---|---|---|
| Fast default | Selection-only packing. No LLM call, no prompt-compression model. | paper-tp.md |
| Best combination | TokenPack selection + LongLLMLingua compression for the strongest current context-saving setup. | smaller paper-tp.md |
| Folder pack | Pack a whole project or document folder into one context file. | docs-tp.md |
Fast default
Use this first for most documents:
tokenpack-rag pack paper.pdf --query "What are the main contributions?"
Writes:
paper-tp.md
Best combination
Use this when you want the most aggressive current setup from the paper-style experiments: select the best evidence first, then compress the selected context with LongLLMLingua.
tokenpack-rag pack paper.pdf \
--query "What evidence supports the main claim?" \
--compress llmlingua \
--longllmlingua \
--compression-rate 0.50 \
--overwrite
This is the setup behind the headline result: about 74.6% context-token saving, a 3.90x mean-latency speedup, and roughly $1.86 saved per 1M input tokens (at the paper's illustrative $2.50 / 1M input-token price), while retaining TokenPack's +15.6% relative pilot lift over full-context prompting. It requires the compression extra and a locally cached compression model unless you intentionally add --allow-download.
Folder pack
Use this for a repo, notes folder, or mixed document set:
tokenpack-rag pack docs/ --query "Summarize the design decisions in this project."
Writes:
docs-tp.md
Manual budget
Use this when you already know your target context size:
tokenpack-rag pack paper.pdf \
--query "What evidence supports the main claim?" \
--budget 32000 \
--overwrite
The output is a packed Markdown context file, not a modified PDF. You can paste it into a chat model, upload it to your own LLM workflow, or let an agent read it through MCP.
Results Snapshot
Technical result details behind the summary
| Setting | Technical Result |
|---|---|
| Relevant evidence kept | TokenPack preserves 93.4% of QASPER evidence vs 71.3% for compression-only. |
| All required evidence kept | TokenPack keeps complete evidence for 87.0% of QASPER questions vs 12.0% for compression-only. |
| Selection + compression | TokenPack + LLMLingua-2 reaches 58.4% context saving while keeping 85.1% of required evidence. |
| Pilot answer accuracy | On an 83-case LongBench v2 pilot, TokenPack improves relative accuracy by 15.6% over full-context prompting while saving 50.6% context. |
| Aggressive cascade | TokenPack + LongLLMLingua reaches 74.6% context saving while retaining TokenPack's +15.6% relative pilot lift over full context. |
| Latency impact | The same cascade reduces mean total latency from 4.140s to 1.060s, a 3.90x speedup in the pilot. |
| Cost-scale example | At the paper's illustrative $2.50 per 1M input-token price, the cascade reduces 1M paid input tokens to about 254k, saving about $1.86. |
The practical takeaway: pack the useful evidence first, then optionally compress it. This is different from blindly compressing the whole retrieved context.
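The cost figure in the table follows directly from the saving rate. A quick sketch of the arithmetic (not a billing tool; the function name is illustrative):

```python
def input_cost_saved(input_tokens: int, saving_rate: float, price_per_million: float) -> float:
    """Dollars saved on input tokens at a given context-saving rate."""
    return input_tokens * saving_rate / 1_000_000 * price_per_million

# 74.6% saving on 1M input tokens at $2.50 / 1M input tokens
saved = input_cost_saved(1_000_000, 0.746, 2.50)
remaining = 1_000_000 * (1 - 0.746)  # roughly 254k paid tokens
```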
For the full methodology, tables, limitations, and experiment details, read the paper: submission/TokenPack-paper.pdf.
Use With Agents / MCP
Run TokenPack-RAG as a local stdio MCP server:
tokenpack-rag-mcp --workspace /path/to/project
Example MCP config:
{
"mcpServers": {
"tokenpack-rag": {
"command": "tokenpack-rag-mcp",
"args": ["--workspace", "/path/to/project"]
}
}
}
Or use uvx without a permanent install:
{
"mcpServers": {
"tokenpack-rag": {
"command": "uvx",
"args": [
"--from",
"tokenpack-rag[mcp,pdf,office,tokens]",
"tokenpack-rag-mcp",
"--workspace",
"/path/to/project"
]
}
}
}
MCP tools:
| Tool | Purpose |
|---|---|
| pack_context | Packs a file or folder into Markdown context and writes the -tp.md artifact. |
| read_packed_context | Reads a packed context artifact, optionally in slices for large files. |
By default the MCP server can only read and write inside --workspace. Use --allow-any-path only for trusted local setups.
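The workspace restriction amounts to resolving each requested path and rejecting anything outside the root. A minimal sketch of that check (not the server's actual code; requires Python 3.9+ for is_relative_to):

```python
from pathlib import Path

def in_workspace(candidate: str, workspace: str) -> bool:
    """True when the resolved path stays inside the workspace root."""
    root = Path(workspace).resolve()
    return Path(candidate).resolve().is_relative_to(root)
```

Resolving before comparing is the important part: it collapses `..` segments, so a path like `workspace/../secrets.txt` is correctly rejected.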
Supported Inputs
TokenPack-RAG accepts a single file or a folder. Folder inputs are scanned recursively and unsupported binary/media files are skipped.
| Category | Extensions |
|---|---|
| Text and docs | .txt, .text, .md, .markdown, .rst, .adoc, .tex, .log |
| PDF | .pdf with the pdf extra |
| Web | .html, .htm |
| Data/config | .json, .jsonl, .csv, .tsv, .yaml, .yml, .toml |
| Office | .docx, .pptx, .xlsx with the office extra |
| Code | .py, .js, .jsx, .ts, .tsx, .java, .go, .rs, .c, .cpp, .cs, .php, .rb, .swift, .kt, .scala, .sh, .ps1, .sql, .css, .xml, and related variants |
Auto Budget
--budget is optional. When omitted, TokenPack-RAG estimates a context budget from the source:
source_tokens = sum(chunk.token_count for chunk in index.chunks)
raw_budget = ceil(source_tokens * 0.50)
budget = clamp(raw_budget, min_budget=1200, max_budget=64000)
reserve_output = min(4000, max(512, int(budget * 0.10)))
selection_budget = budget - reserve_output
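The formulas above can be checked with a standalone sketch; the parameter defaults mirror the pseudocode, and the 142,000-token call reproduces the terminal summary that follows.

```python
import math

def auto_budget(source_tokens: int,
                ratio: float = 0.50,
                min_budget: int = 1_200,
                max_budget: int = 64_000) -> tuple[int, int, int]:
    """Return (budget, reserve_output, selection_budget) for a source size."""
    raw = math.ceil(source_tokens * ratio)
    budget = max(min_budget, min(raw, max_budget))  # clamp to [min, max]
    reserve = min(4_000, max(512, int(budget * 0.10)))
    return budget, reserve, budget - reserve

budget, reserve, selection = auto_budget(142_000)  # the paper.pdf example
```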
Example terminal summary:
Source: paper.pdf
Output: paper-tp.md
Source tokens: 142,000
Auto budget: 64,000 tokens (ratio=50%, capped by max-budget)
Reserved for answer: 4,000
Selection budget: 60,000
Selected: 188 chunks / 59,240 tokens
Useful controls:
tokenpack-rag pack paper.pdf --query "..." --budget-ratio 0.35
tokenpack-rag pack paper.pdf --query "..." --max-budget 128000
tokenpack-rag pack paper.pdf --query "..." --reserve-output 2000
Output Files
Default output paths:
| Source | Output |
|---|---|
| paper.pdf | paper-tp.md |
| notes.txt | notes-tp.md |
| docs/ | docs-tp.md |
Existing outputs are protected:
tokenpack-rag pack paper.pdf --query "..."
If paper-tp.md exists, the command stops. Use:
tokenpack-rag pack paper.pdf --query "..." --overwrite
tokenpack-rag pack paper.pdf --query "..." --out packed-context.md
Internal artifacts go under .tokenpack/runs/<timestamp>/ unless paths are provided:
tokenpack-rag pack paper.pdf \
--query "..." \
--index-out .tokenpack/paper.index.json \
--selection-out paper-tp.selection.json
The default Markdown is intentionally clean: it keeps the query, source, selected-token summary, and source/page markers, but leaves chunk ids, token counts, and artifact paths out of the LLM context. Use debug output only when you are inspecting the pipeline:
tokenpack-rag pack paper.pdf --query "..." --output-detail debug
tokenpack-rag pack paper.pdf --query "..." --output-detail none
Optional Compression
TokenPack-RAG is selection-first by default. You can optionally compress the selected evidence:
tokenpack-rag pack paper.pdf \
--query "What evidence supports the main claim?" \
--compress llmlingua \
--compression-rate 0.85
LongLLMLingua-style query-conditioned compression:
tokenpack-rag pack paper.pdf \
--query "What evidence supports the main claim?" \
--compress llmlingua \
--longllmlingua \
--compression-rate 0.85
By default, compression models are expected to be cached locally. Add --allow-download only when you intentionally want Hugging Face downloads during compression.
Python API
from tokenpack.embeddings import make_embedder
from tokenpack.pipeline import ingest_path
from tokenpack.scoring import score_chunks
from tokenpack.selectors import select_chunks
embedder = make_embedder()
index = ingest_path(
"README.md",
".tokenpack/readme-index.json",
embedder=embedder,
chunker_name="structure-aware",
target_tokens=250,
min_tokens=40,
max_tokens=320,
)
query = "How does TokenPack reduce LLM context cost?"
query_embedding = embedder.embed([query])[0]
scored = score_chunks(
query_embedding,
index.chunks,
index.embeddings,
scoring="evidence-hybrid",
query_text=query,
redundancy_penalty=0.35,
)
result = select_chunks(
scored,
strategy="budget-top-k",
budget=3000,
candidate_pool=250,
)
print(result.used_tokens, [item.chunk.id for item in result.selected])
Advanced CLI
The one-command pack workflow is the main user-facing interface. Lower-level commands remain available for experiments and reproducible paper runs.
tokenpack-rag ingest README.md --index .tokenpack/readme-index.json
tokenpack-rag select \
--index .tokenpack/readme-index.json \
--query "How does TokenPack reduce LLM context cost?" \
--budget 3000 \
--reserve-output 500 \
--output .tokenpack/selection.json
tokenpack-rag export-context \
--selection .tokenpack/selection.json \
--output .tokenpack/context.txt
Defaults:
chunker: structure-aware semantic boundaries
chunk-size-preset: low-budget
scoring: evidence-hybrid
selector: budget-top-k (TokenPack hybrid-greedy)
Historical selectors such as knapsack, knapsack-redundancy, and semantic-threshold chunking remain available for ablation work, but the main pipeline is hybrid-greedy.
Reproduce Paper Runs
LongBench v2 Modal pilot used in the current paper:
python -m modal run submission/longbench_eval/app.py::build_and_run \
--output-dir submission/results/longbench_v2_modal_hybrid_greedy_83_latency \
--limit 83 \
--source-min-tokens 8000 \
--source-max-tokens 24000 \
--max-scanned 503 \
--model-id Qwen/Qwen2.5-14B-Instruct \
--batch-size 1 \
--context-order score-then-source \
--latency-mode
See submission/source_code_manifest.md for the full artifact map.
Repository Layout
src/tokenpack/ Python package and CLI implementation
tests/ Unit and smoke tests
assets/ README visual result assets
examples/ Small local examples for the CLI
submission/paper/ LaTeX paper source, tables, figures
submission/experiments/ QASPER, LongBench, compression, and ablation scripts
submission/results/ Paper result artifacts and readouts
submission/longbench_eval/ Modal LongBench v2 generation harness
submission/modal_generation_eval/ Modal QASPER generation/judge harness
Notes
- The default workflow is output-first: create a packed context file and send that file to your own LLM.
- Ollama is not required for pack; MCP support is optional and local-first.
- Evidence-hybrid scoring weights are engineering defaults. The paper calls out weight calibration as future work.
Limitations
- The LLM answer-quality experiments are pilot-scale and were not fully human-reviewed.
- QASPER results primarily measure evidence preservation, not end-to-end human-judged answer quality.
- LongBench v2 results are descriptive pilot results, not a statistically definitive benchmark claim.
- TokenPack-RAG improves context selection, but it cannot recover information that is missing from the source or unreadable after extraction.
- The default scoring weights are engineering defaults; stronger calibration is future work.
License
TokenPack-RAG is licensed under the Business Source License 1.1. See LICENSE.
Citation
If you use TokenPack-RAG in research, cite the paper PDF in submission/TokenPack-paper.pdf. A BibTeX entry will be added when the public preprint is available.