Foliograph

Pre-compile office documents into compact knowledge graphs for LLM sessions.

Inspired by Graphify for code, Foliograph does the same for office documents: .docx, .pdf, .pptx, .md, and .txt.

Instead of loading entire documents into every session, you build the graph once and navigate by index. Token costs can drop by 60-90% on document-heavy projects.

The skill generates the graph. You keep the graph. The only thing you install is SKILL.md.


Get started in 3 steps (claude.ai)

No terminal. No installation. Works entirely in your browser.

Step 1: Download SKILL.md

Open SKILL.md in this repository, click the Raw button, and save the file.

Step 2: Add it to your Claude Project

  1. Go to claude.ai and open or create a Project
  2. Click the project name at the top of the left sidebar
  3. Click Add content (or the + icon next to Files)
  4. Upload SKILL.md
  5. That is it. The skill is now active for every conversation in this Project.

Step 3: Use it

Upload any .docx, .pdf, .pptx, .md, or .txt file into a conversation and say:

foliograph this

You will get FOLIO_TIPS.md with your document map, key concepts, token savings, and ready-made commands. Ask for a visual dashboard with:

Show me an executive dashboard

What you need:

  • A claude.ai account (Free, Pro, or Team)
  • A Project (available on all plans)
  • The SKILL.md file from this repo

Demo

See Foliograph in action: Watch the Demo Video


The Problem

Every new LLM session on a large document project starts blind. You paste the whole chapter, the whole spec, the whole report, because you don't know what the model will need. By message three you've burned most of your context window on content the model never touched.

Foliograph fixes this structurally:

Without Foliograph:
  Session start -> paste Chapter 4 (8,000 tokens) -> ask one question -> done
  Next session  -> paste Chapter 4 again (8,000 tokens) -> ...

With Foliograph:
  Session start -> load FOLIO_GRAPH.md (~400 tokens) -> "load Chapter 4 > The Swarm Model"
               -> fetch only that section (~600 tokens) -> done

The graph is built once. Every subsequent session pays only the index cost.
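
The per-session arithmetic in the comparison above can be checked with a short sketch. The token figures are the illustrative ones from the example (an 8,000-token chapter, a roughly 400-token graph, a roughly 600-token section); the function name is hypothetical and not part of the foliograph package.

```python
def session_savings(full_doc_tokens: int, index_tokens: int, section_tokens: int) -> float:
    """Fraction of tokens saved per session when loading the index plus one section."""
    with_graph = index_tokens + section_tokens
    return 1 - with_graph / full_doc_tokens


# Illustrative figures from the comparison above
saving = session_savings(full_doc_tokens=8000, index_tokens=400, section_tokens=600)
print(f"{saving:.1%}")  # 87.5%
```

With these numbers, each session costs about 1,000 tokens instead of 8,000. Actual savings depend on how many sections a session needs to load.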


Quickstart (Python CLI)

pip install foliograph
foliograph build my_project/ --name "My Project"

This produces three files in your working directory:

| File | Purpose |
| --- | --- |
| FOLIO_GRAPH.md | Structural skeleton of every document: headings, summaries, word counts, figures, tables |
| FOLIO_INDEX.md | Concept-to-location index (168+ entries for a typical book) |
| FOLIO_SESSION.md | Copy-paste session starter prompt for any LLM |

Installation

# Core (no heavy dependencies)
pip install foliograph

# With Python library support for each format
pip install "foliograph[docx]"
pip install "foliograph[pdf]"
pip install "foliograph[pptx]"
pip install "foliograph[xlsx]"
pip install "foliograph[all]"

CLI usage

# Single file
foliograph build report.docx --name "Q3 Report"

# Multiple files
foliograph build chapter1.docx chapter2.docx appendix.pdf --name "My Book"

# Entire directory (recursive)
foliograph build ./manuscript/ --output ./graph/ --name "My Book"

# Check for drift against existing graph
foliograph check --graph FOLIO_GRAPH.md

# Fetch a specific section to stdout
foliograph fetch "chapter4.docx > The Swarm Model"

# Token savings stats
foliograph stats FOLIO_GRAPH.md

# Generate HTML savings dashboard
foliograph stats-html FOLIO_GRAPH.md
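
To illustrate the "file > Section" reference form used by the fetch command, here is a minimal sketch of how such a reference might be resolved against a Markdown source. It is an assumption-laden illustration, not the package's implementation: `split_reference` and `fetch_markdown_section` are hypothetical names, and only `#`-style Markdown headings are handled.

```python
import re


def split_reference(ref: str) -> tuple[str, str]:
    """Split 'file > Section title' into (file, section title)."""
    filename, _, section = ref.partition(">")
    return filename.strip(), section.strip()


def fetch_markdown_section(text: str, title: str) -> str:
    """Return the section headed `title`, up to the next heading of equal or higher level."""
    lines = text.splitlines()
    heading = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
    start, level = None, 0
    for i, line in enumerate(lines):
        m = heading.match(line)
        if not m:
            continue
        if start is None:
            if m.group(2) == title:
                start, level = i, len(m.group(1))
        elif len(m.group(1)) <= level:
            # A heading at the same or a higher level closes the section
            return "\n".join(lines[start:i]).strip()
    if start is None:
        raise KeyError(title)
    return "\n".join(lines[start:]).strip()


sample = """# Escalation Intelligence
Intro text.
## The Swarm Model
Parallel expert engagement.
### Assembly
Four criteria.
## Next Section
Other text.
"""

filename, section = split_reference("escalation_intelligence.md > The Swarm Model")
print(fetch_markdown_section(sample, section))
```

Nested subsections stay with their parent section, so a single fetch returns everything under the requested heading.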

Python API

from foliograph.builder import build
from foliograph.extractor import extract

# Build graph from a list of files or directories
outputs = build(
    sources=["chapter1.docx", "appendix.pdf", "./slides/"],
    output_dir="./graph/",
    project_name="My Project",
)

# Extract a single document
rec = extract("report.docx")
print(rec.title)
print(rec.total_words)
for section in rec.sections:
    print(f"  {'  ' * section.level}{section.title} ({section.word_count}w)")
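
The attributes used in the example above suggest a record shape roughly like the following. This is a hypothetical sketch inferred only from the attribute names shown (title, total_words, sections, level, word_count); the actual classes in foliograph.extractor may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical stand-in for one heading in a document."""
    title: str
    level: int
    word_count: int


@dataclass
class DocumentRecord:
    """Hypothetical stand-in for the object returned by extract()."""
    title: str
    total_words: int
    sections: list[Section] = field(default_factory=list)


# Values taken from the sample FOLIO_GRAPH.md output in this README
rec = DocumentRecord(
    title="The Swarm Model",
    total_words=2847,
    sections=[
        Section("Why Sequential Escalation Fails at Scale", level=2, word_count=187),
        Section("How AI Assembles the Swarm", level=2, word_count=312),
    ],
)

# Same indentation scheme as the loop above: two spaces per heading level
outline = [f"{'  ' * s.level}{s.title} ({s.word_count}w)" for s in rec.sections]
print("\n".join(outline))
```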

How to use the graph in a session

  1. Start every session by pasting the content of FOLIO_SESSION.md
  2. Ask questions by concept: "What does the book say about Channel Siloing?"
  3. Load sections on demand: "Load escalation_intelligence.md > The Swarm Model"
  4. Never reload a section you've already discussed in the session

Supported formats

| Format | Extension | Extraction method |
| --- | --- | --- |
| Word Document | .docx | extract-text / python-docx |
| PDF | .pdf | pdftotext / pdfminer.six |
| PowerPoint | .pptx | extract-text / python-pptx |
| Excel Workbook | .xlsx | openpyxl |
| Markdown | .md | Native parser |
| Plain Text | .txt | Native parser |
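
The format list above amounts to a lookup from file extension to extraction method. A minimal sketch of that lookup follows; the `EXTRACTORS` mapping and `extraction_methods` function are hypothetical names used for illustration and say nothing about how the package dispatches internally.

```python
from pathlib import Path

# Extension -> extraction methods, copied from the list above
EXTRACTORS = {
    ".docx": ("extract-text", "python-docx"),
    ".pdf": ("pdftotext", "pdfminer.six"),
    ".pptx": ("extract-text", "python-pptx"),
    ".xlsx": ("openpyxl",),
    ".md": ("native parser",),
    ".txt": ("native parser",),
}


def extraction_methods(path: str) -> tuple[str, ...]:
    """Return the listed extraction methods for a file, keyed by its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"unsupported format: {suffix or path}")
    return EXTRACTORS[suffix]


print(extraction_methods("Q3_Report.DOCX"))
print(extraction_methods("appendix.pdf"))
```

Lowercasing the suffix lets files such as REPORT.PDF resolve the same way as report.pdf.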

Output format

FOLIO_GRAPH.md (structure map)

### `chapter4.docx` [DOCX]
**Title:** The Swarm Model
**Words:** 2,847

**Structure:**
- **The Swarm Model**
  > Replacing the Hierarchy with Parallel Expert Engagement.
  - **Why Sequential Escalation Fails at Scale** (187w)
    > The sequential model has a structural bottleneck at every tier boundary.
  - **How AI Assembles the Swarm** (312w)
    > Swarm assembly uses four criteria evaluated simultaneously.

**Key Terms:** Algorithmic Friction, Agent Churn, Escalation Debt, Feedback Loop
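
Because the structure map is plain Markdown, it can be read back with a few lines of code. The sketch below parses the Structure list shown above into entries with depth, title, word count, and summary. It assumes the exact layout of this sample (two-space indentation, bold titles, an optional "(Nw)" suffix, summaries on ">" lines); `parse_structure` is a hypothetical helper, not part of the package.

```python
import re

ENTRY = re.compile(
    r"^(?P<indent>\s*)- \*\*(?P<title>.+?)\*\*(?: \((?P<words>[\d,]+)w\))?\s*$"
)


def parse_structure(block: str) -> list[dict]:
    """Parse a FOLIO_GRAPH.md structure list into a flat list of entries."""
    entries = []
    for line in block.splitlines():
        m = ENTRY.match(line)
        if m:
            words = m.group("words")
            entries.append({
                "depth": len(m.group("indent")) // 2,  # two spaces per nesting level
                "title": m.group("title"),
                "words": int(words.replace(",", "")) if words else None,
                "summary": None,
            })
        elif line.strip().startswith(">") and entries:
            # A quoted line is the summary of the most recent entry
            entries[-1]["summary"] = line.strip()[1:].strip()
    return entries


sample = """- **The Swarm Model**
  > Replacing the Hierarchy with Parallel Expert Engagement.
  - **Why Sequential Escalation Fails at Scale** (187w)
    > The sequential model has a structural bottleneck at every tier boundary.
  - **How AI Assembles the Swarm** (312w)
    > Swarm assembly uses four criteria evaluated simultaneously.
"""

entries = parse_structure(sample)
for e in entries:
    print(e["depth"], e["title"], e["words"])
```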

FOLIO_INDEX.md (concept index)

### S

- **Sentiment Drift** -> `chapter2.docx > Signal 1: Sentiment Drift`
- **Swarm Model** -> `chapter4.docx > The Swarm Model`
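
The index entries follow a regular pattern, so a concept lookup can be sketched in a few lines. The parser below assumes the exact entry layout shown above (bold concept, "->", then a backtick-quoted "file > section" location); `parse_index` is a hypothetical helper, not part of the package.

```python
import re

INDEX_LINE = re.compile(
    r"^- \*\*(?P<concept>.+?)\*\* -> `(?P<file>.+?) > (?P<section>.+)`\s*$"
)


def parse_index(text: str) -> dict[str, tuple[str, str]]:
    """Map each lowercased concept to its (file, section) location."""
    index = {}
    for line in text.splitlines():
        m = INDEX_LINE.match(line.strip())
        if m:
            index[m.group("concept").lower()] = (m.group("file"), m.group("section"))
    return index


sample = """### S

- **Sentiment Drift** -> `chapter2.docx > Signal 1: Sentiment Drift`
- **Swarm Model** -> `chapter4.docx > The Swarm Model`
"""

index = parse_index(sample)
print(index["swarm model"])
```

Lowercasing the keys makes the lookup case-insensitive, which matches how concepts tend to be phrased in questions.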

Real-world example

The examples/sample/ directory contains a worked example showing Foliograph output on a plain markdown document. Open FOLIO_GRAPH.md and FOLIO_INDEX.md to see the structure.


Architecture

foliograph/
├── extractor.py     # Per-format extraction -> DocumentRecord
├── builder.py       # DocumentRecord[] -> FOLIO_GRAPH.md + FOLIO_INDEX.md
├── relationships.py # Cross-document relationship mapping
├── drift.py         # Graph drift detection
├── stats_html.py    # Token savings HTML dashboard
└── cli.py           # foliograph build / check / fetch / stats / stats-html

Contributing

Contributions welcome. The most valuable additions are:

  • Better named-entity extraction
  • .xlsx support (sheet names, column headers, key cell ranges)
  • Google Docs / Notion export support
  • foliograph update command for incremental rebuilds

Open an issue before starting a large feature. Some of these are already in progress.

git clone https://github.com/prasad-m-k/foliograph
cd foliograph
pip install -e ".[dev]"
pytest tests/

License

MIT. See LICENSE.


Author

Prasad MK

Research: ssrn.com/author=10270516


Acknowledgements

Foliograph is directly inspired by Graphify by Safi Shamsi, which demonstrated the same approach for codebases. The core insight, pay the indexing cost once and query from the graph every session, belongs to that project. Foliograph extends it to office documents and to claude.ai chat environments where no terminal or IDE is available.
