Document summarization library using LLM-powered strategies with hierarchical structure preservation

Compakt

Document summarization library that reads PDF and Markdown files, preserves their hierarchical structure, and produces LLM-powered summaries using OpenAI models.

Compakt parses documents into structured chunks, embeds them for similarity search, then applies the best-fit summarization strategy — whether the document has rich headers, no structure at all, or is small enough to process whole.

Features

- Hierarchical structure preservation: extracts and maintains document headers, sections, and subsections throughout the summarization pipeline
- Three-strategy system: automatically selects the best approach per document
  - Brute Force for small documents (up to 50k tokens by default): sends the full text to the LLM
  - Structured Markdown for documents with headers: scope-filtered retrieval per section using fuzzy matching and elbow-filtered similarity search
  - Fallback Unstructured for headerless documents: global similarity search with synthetic structure
- Configurable granularity: the level parameter (1–3) sets summary depth (sections only, subsections, or down to H4 headers)
- Sync and async clients: Compakt for synchronous use, AsyncCompakt for async workflows and batch processing
- Pluggable architecture: Protocol-based interfaces let you swap any component (file reader, embeddings, vector index, summarizer) without changing the pipeline
- PDF and Markdown input: reads local files and HTTP/HTTPS URLs
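
The Structured Markdown strategy above mentions fuzzy matching. As a rough illustration only (this is not Compakt's implementation), the standard library's difflib can map an approximate section title onto the closest header found in a document. The function name `match_header` and the 0.6 cutoff are assumptions made for this sketch.

```python
from difflib import get_close_matches
from typing import Optional


def match_header(query: str, headers: list[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the header most similar to `query`, or None if nothing is close enough."""
    # Compare case-insensitively, but return the header in its original form.
    lowered = {h.lower(): h for h in headers}
    hits = get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


headers = ["Introduction", "Related Work", "Experimental Setup", "Results"]
print(match_header("Experiment setup", headers))  # closest header
print(match_header("Appendix Z", headers))        # nothing above the cutoff
```

A stricter cutoff reduces false matches at the cost of missing headers that were reworded more heavily.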

Installation

Requires Python 3.13+.

pip install compakt

Or with uv:

uv add compakt

Environment Setup

Compakt uses OpenAI models by default. Set your API key:

export OPENAI_API_KEY="your-api-key"

Or create a .env file in your project root:

OPENAI_API_KEY=your-api-key
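
It can help to check for the key before starting a long run. The helper below is a hypothetical convenience, not part of Compakt; it only reads the OPENAI_API_KEY variable described above. Note that a .env file is visible to it only if something has already loaded that file into the process environment.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return OPENAI_API_KEY from `env` (default: os.environ) or raise a clear error."""
    source = os.environ if env is None else env
    key = source.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it or add it to .env")
    return key


# Demonstration with an explicit mapping instead of the real environment.
print(require_api_key({"OPENAI_API_KEY": "your-api-key"}))
```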

Quick Start

Basic Usage

from compakt import Compakt

compakt = Compakt()
result = compakt.summarize("path/to/document.pdf", level=2)

print(result.summary)
print(f"Strategy used: {result.artifacts.strategy}")
print(f"Chunks processed: {len(result.artifacts.chunks)}")

Async Usage

import asyncio
from compakt import AsyncCompakt

async def main():
    compakt = AsyncCompakt()
    result = await compakt.summarize("path/to/document.pdf", level=2)
    print(result.summary)

asyncio.run(main())

Batch Processing

import asyncio
from compakt import AsyncCompakt

async def main():
    compakt = AsyncCompakt()
    files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
    semaphore = asyncio.Semaphore(4)

    async def process(path):
        async with semaphore:
            return await compakt.summarize(path, level=2)

    results = await asyncio.gather(*[process(f) for f in files])
    for r in results:
        print(r.summary)

asyncio.run(main())
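
In the batch example, a single failing document makes asyncio.gather raise, so the caller never receives the other results. One possible variation passes return_exceptions=True and separates successes from failures. It is sketched here with a stand-in coroutine, `fake_summarize`, which is hypothetical and exists only so the snippet runs without an API key.

```python
import asyncio


async def fake_summarize(path: str) -> str:
    """Stand-in for `AsyncCompakt.summarize`; fails for one path on purpose."""
    await asyncio.sleep(0.01)
    if path == "broken.pdf":
        raise ValueError(f"cannot read {path}")
    return f"summary of {path}"


async def summarize_all(paths: list[str], limit: int = 4):
    semaphore = asyncio.Semaphore(limit)  # at most `limit` documents in flight

    async def process(path: str) -> str:
        async with semaphore:
            return await fake_summarize(path)

    # return_exceptions=True returns exceptions as values instead of raising.
    outcomes = await asyncio.gather(*(process(p) for p in paths), return_exceptions=True)
    done = {p: o for p, o in zip(paths, outcomes) if not isinstance(o, BaseException)}
    failed = {p: o for p, o in zip(paths, outcomes) if isinstance(o, BaseException)}
    return done, failed


done, failed = asyncio.run(summarize_all(["doc1.pdf", "broken.pdf", "doc3.pdf"]))
print(done)
print(failed)
```

With a real client, `fake_summarize(path)` would be replaced by the `compakt.summarize(path, level=2)` call from the example above.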

API Reference

`Compakt` / `AsyncCompakt`

Compakt(
    brute_force_token_limit: int = 50_000,
    file_reader: FileReaderAsMarkdown | None = None,
    markdown_tree_parser: MarkdownTreeParser | None = None,
    text_splitter: TextSplitter | None = None,
    vector_index: VectorIndex | None = None,
    strategies: list[SummarizationStrategy] | None = None,
    brute_force_strategy: SummarizationStrategy | None = None,
    encoder: Encoder | None = None,
    chat_model: str = "gpt-4.1-mini",
    embedding_model: str = "text-embedding-3-small",
    encoding_name: str = "cl100k_base",
)

Pass None for any component to use the built-in default. Override specific components while keeping defaults for everything else.

Methods

Method	Description
`summarize(file_path, level=2, retrieval_k=20)`	Summarize a document. Returns `CompaktRunResult`.
`create_tree(markdown)`	Parse markdown string into a header tree (`list[HeaderNode]`).
`count_tokens(text)`	Count tokens using the configured encoder.

Parameters:

- file_path: path to a PDF or Markdown file, or an HTTP/HTTPS URL
- level: summary granularity (1 = sections, 2 = subsections, 3 = H4 headers)
- retrieval_k: number of chunks to retrieve (top-k) before elbow filtering (default: 20)
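
This page does not spell out how elbow filtering works. One common reading, shown here purely as an assumption, is to sort the retrieved similarity scores in descending order and cut at the largest drop between neighbours. The function name `elbow_cut` is hypothetical.

```python
def elbow_cut(scores: list[float]) -> list[float]:
    """Keep the scores that sit above the largest gap between sorted neighbours."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return ranked
    # gaps[i] is the drop from ranked[i] to ranked[i + 1]
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps))  # first index of the largest drop
    return ranked[: cut + 1]


print(elbow_cut([0.91, 0.89, 0.88, 0.52, 0.50, 0.47]))  # [0.91, 0.89, 0.88]
```

Under this reading, retrieval_k sets an upper bound on the candidates, and the elbow decides how many of them are kept.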

`CompaktRunResult`

result.summary       # str — The generated summary
result.artifacts     # CompaktRunArtifacts

`CompaktRunArtifacts`

artifacts.markdown            # str — Raw markdown from file reader
artifacts.markdown_tree       # list[HeaderNode] — Parsed header tree
artifacts.chunks              # list[CompaktChunk] — Text chunks
artifacts.embeddings          # list[CompaktEmbeddingEntry] — Chunk embeddings
artifacts.retrieved_chunks    # dict[str, list[CompaktChunk]] — Chunks retrieved per section
artifacts.document_structure  # DocumentStructure | None — Resolved structure
artifacts.strategy            # str — Name of the strategy used

Architecture

Compakt follows the Ports & Adapters pattern. All core abstractions are Python Protocol classes in src/compakt/core/interfaces/, with concrete implementations in src/compakt/core/adapters/.

Pipeline Flow

File (PDF/MD) → FileReader → Raw Markdown
                                  ↓
                          MarkdownTreeParser → Header Tree
                                  ↓
                            TextSplitter → Chunks
                                  ↓
                            VectorIndex → Embedded & Indexed Chunks
                                  ↓
                    SummarizationStrategy (auto-selected)
                                  ↓
                          CompaktRunResult
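
To make the MarkdownTreeParser step in the flow above more concrete, here is a simplified sketch that turns ATX headers into a nested tree. It uses a regular expression rather than markdown-it-py and does not skip fenced code blocks. Its `Node` class is a hypothetical stand-in whose fields are not taken from Compakt's HeaderNode.

```python
import re
from dataclasses import dataclass, field

HEADER_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


@dataclass
class Node:
    level: int
    title: str
    children: list["Node"] = field(default_factory=list)


def build_tree(markdown: str) -> list[Node]:
    roots: list[Node] = []
    stack: list[Node] = []  # chain of currently open ancestors
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if not match:
            continue
        node = Node(level=len(match.group(1)), title=match.group(2))
        # Close every open header that is at the same depth or deeper.
        while stack and stack[-1].level >= node.level:
            stack.pop()
        (stack[-1].children if stack else roots).append(node)
        stack.append(node)
    return roots


doc = "# Guide\n## Setup\n### Linux\n## Usage\n# Appendix\n"
tree = build_tree(doc)
print([n.title for n in tree])              # ['Guide', 'Appendix']
print([c.title for c in tree[0].children])  # ['Setup', 'Usage']
```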

Strategy Selection

Strategies are evaluated in order. The first whose can_handle() returns True is used:

- BruteForceUnstructuredStrategy: if total tokens ≤ brute_force_token_limit
- StructuredMarkdownStrategy: if the document has headers
- FallbackUnstructuredStrategy: if the document has no headers
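
The first-match rule above can be pictured with a small, self-contained sketch. The `Doc` dataclass, the shortened class names, and the `can_handle(doc)` signature are illustrative assumptions; Compakt's real strategies may take different inputs.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Doc:
    tokens: int
    has_headers: bool


class Strategy(Protocol):
    name: str

    def can_handle(self, doc: Doc) -> bool: ...


class BruteForce:
    name = "brute_force"

    def __init__(self, limit: int = 50_000) -> None:
        self.limit = limit

    def can_handle(self, doc: Doc) -> bool:
        return doc.tokens <= self.limit


class Structured:
    name = "structured_markdown"

    def can_handle(self, doc: Doc) -> bool:
        return doc.has_headers


class Fallback:
    name = "fallback_unstructured"

    def can_handle(self, doc: Doc) -> bool:
        return not doc.has_headers


def select(doc: Doc, strategies: list[Strategy]) -> Strategy:
    # Order matters: the first strategy that accepts the document wins.
    for strategy in strategies:
        if strategy.can_handle(doc):
            return strategy
    raise LookupError("no strategy can handle this document")


ORDER = [BruteForce(), Structured(), Fallback()]
print(select(Doc(tokens=12_000, has_headers=True), ORDER).name)   # brute_force
print(select(Doc(tokens=80_000, has_headers=True), ORDER).name)   # structured_markdown
print(select(Doc(tokens=80_000, has_headers=False), ORDER).name)  # fallback_unstructured
```

Because the brute-force check comes first, a small document with headers is still summarized whole, never section by section.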

Default Components

Component	Default Implementation
File Reader	`PyMuPDFMarkdownFileReader` (pymupdf4llm)
Tree Parser	`MarkdownItTreeParser` (markdown-it-py)
Text Splitter	`LangchainMarkdownTextSplitter`
Encoder	`TiktokenEncoder` (cl100k_base)
Embeddings	`OpenAIEmbeddings` (text-embedding-3-small)
Vector Index	`InMemoryVectorIndex` (cosine similarity)
Structure Resolver	`OpenAIDocumentStructureResolver` (gpt-4.1-mini)
Summarizer	`OpenAISummarizer` (gpt-4.1-mini)
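
For intuition about what an in-memory cosine-similarity index does, the following sketch ranks stored vectors against a query using only the standard library. It is not InMemoryVectorIndex itself; the function names and the toy 2-D vectors are made up for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(key, cosine(query, vec)) for key, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


index = {"intro": [1.0, 0.0], "methods": [0.6, 0.8], "appendix": [0.0, 1.0]}
for key, score in top_k([1.0, 0.1], index, k=2):
    print(key, round(score, 3))
```

A linear scan like this is simple and exact, which is often enough for the chunk counts of a single document.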

Custom Components

Implement any Protocol from compakt.core.interfaces and pass it to the client:

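from compakt import Compakt
from compakt.core.interfaces import Embeddings
# Assumed import path for the default index; adjust it to wherever your
# installed version exports InMemoryVectorIndex.
from compakt.core.adapters import InMemoryVectorIndex


class MyEmbeddings:
    """Satisfies the Embeddings Protocol structurally; no base class needed."""

    def embed(self, payload):
        # your implementation
        ...

    async def aembed(self, payload):
        ...


# Annotating with the Protocol lets a type checker verify the implementation.
embeddings: Embeddings = MyEmbeddings()

compakt = Compakt(
    # swap just the embeddings, keep everything else default
    vector_index=InMemoryVectorIndex(embeddings),
)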
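
MyEmbeddings needs no base class because Protocol interfaces are matched structurally. The sketch below demonstrates this with a hypothetical `EmbeddingsLike` protocol. The method names mirror the example above, but the payload and return types are assumptions, not Compakt's actual signatures.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class EmbeddingsLike(Protocol):
    def embed(self, payload: list[str]) -> list[list[float]]: ...

    async def aembed(self, payload: list[str]) -> list[list[float]]: ...


class LengthEmbeddings:
    """Toy embedder: one 2-D vector per text (character count, word count)."""

    def embed(self, payload: list[str]) -> list[list[float]]:
        return [[float(len(t)), float(len(t.split()))] for t in payload]

    async def aembed(self, payload: list[str]) -> list[list[float]]:
        return self.embed(payload)


embedder = LengthEmbeddings()
# isinstance only checks that the methods exist, not their signatures.
print(isinstance(embedder, EmbeddingsLike))   # True
print(embedder.embed(["hello world"]))        # [[11.0, 2.0]]
print(asyncio.run(embedder.aembed(["hi"])))   # [[2.0, 1.0]]
```

A static type checker gives stronger guarantees than the runtime check, since it also compares argument and return types.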

Development

# Clone and install
git clone https://github.com/justkiet/compakt.git
cd compakt
uv sync

# Run tests
uv run python -m pytest tests/

# Run a single test
uv run python -m pytest tests/test_compakt_integration.py::CompaktIntegrationTest::test_method_name

# Run an example
uv run python examples/basic_usage.py path/to/file.pdf --level 2

License

MIT

Project details

This version

0.1.0

Apr 29, 2026

Download files

Source Distribution

compakt-0.1.0.tar.gz (38.2 kB view details)

Uploaded Apr 29, 2026 Source

Built Distribution

compakt-0.1.0-py3-none-any.whl (44.6 kB view details)

Uploaded Apr 29, 2026 Python 3

File details

Details for the file compakt-0.1.0.tar.gz.

File metadata

Download URL: compakt-0.1.0.tar.gz
Upload date: Apr 29, 2026
Size: 38.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.10

File hashes

Hashes for compakt-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`fe7b2bdae13652c0aeca797dd8acd0b4cc5fd37aab5e803580fe937dd8a62271`
MD5	`e31ed617350a787f3b53d5c8a288a4ba`
BLAKE2b-256	`e93451c672369c5cc7189f93044332426fe72df7ae8832340002e696b3f92c64`

File details

Details for the file compakt-0.1.0-py3-none-any.whl.

File metadata

Download URL: compakt-0.1.0-py3-none-any.whl
Upload date: Apr 29, 2026
Size: 44.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.10

File hashes

Hashes for compakt-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`20b9087bb2ee04b1e57ba94bb373b1c95011026aa5042904db5efd89905db10b`
MD5	`3eb3d4b4acae5ffb3dc8d01082835333`
BLAKE2b-256	`40f9d421dac82b567b63fa861c6e970e1b72dee3edbcfe1fcf6c18523b9f81f2`
