digestkit

Personal content digester framework: fetch → extract → LLM summarize → sink

License: Apache 2.0

Installation

Note: digestkit is not yet published to PyPI. Until the first PyPI release, install directly from the umbrella repository's main branch using a git URL.

From git (current)

pip install "digestkit @ git+https://github.com/koki-nakamura22/inboxkit.git@main#subdirectory=packages/digestkit"

With optional extras:

pip install "digestkit[pdf,notion] @ git+https://github.com/koki-nakamura22/inboxkit.git@main#subdirectory=packages/digestkit"

For uv projects, declare it under [tool.uv.sources]:

[project]
dependencies = ["digestkit"]

[tool.uv.sources]
digestkit = { git = "https://github.com/koki-nakamura22/inboxkit.git", subdirectory = "packages/digestkit", branch = "main" }

Pin to a specific commit for reproducibility by replacing branch = "main" with rev = "<sha>".

From PyPI (planned)

Once digestkit is published, the standard install path will be:

pip install digestkit
pip install "digestkit[pdf,notion]"

Tracking issue: #3.

Quickstart

Create a .env file with your LLM API key:

ANTHROPIC_API_KEY=sk-ant-...

Define and run your digester:

from digestkit import Digester
from digestkit.sources import LocalDirectorySource
from digestkit.extractors import PDFExtractor
from digestkit.summarizers import LLMSummarizer
from digestkit.sinks import SQLiteSink

class PdfDigester(Digester):
    source = LocalDirectorySource("./papers", glob="*.pdf")
    extractor = PDFExtractor()
    summarizer = LLMSummarizer(provider="anthropic", model="claude-haiku-4-5")
    sink = SQLiteSink("digests.db")

if __name__ == "__main__":
    PdfDigester().run()

Programmatic construction (constructor injection)

For dynamic configuration (config files, CLI flags, tests with swapped components), pass the four core dependencies as constructor kwargs instead of subclassing:

digester = Digester(
    source=LocalDirectorySource("./papers", glob="*.pdf"),
    extractor=PDFExtractor(),
    summarizer=LLMSummarizer(provider="anthropic", model="claude-haiku-4-5"),
    sink=SQLiteSink("digests.db"),
)
digester.run()

Both styles are supported and can be mixed: when a subclass defines class attributes, any kwarg passed to __init__ overrides them (kwarg wins). This is the same hybrid pattern used by seen_store and dedup_key.
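The "kwarg wins" rule can be pictured with a minimal sketch. MiniDigester below is a hypothetical stand-in written for illustration, not digestkit's actual implementation; it only models how an instance attribute set from a kwarg can shadow a class-level default.

```python
class MiniDigester:
    """Hypothetical stand-in for Digester; only the override rule is modeled."""

    # Class-level defaults; subclasses override these declaratively.
    source = None
    extractor = None
    summarizer = None
    sink = None

    _COMPONENTS = ("source", "extractor", "summarizer", "sink")

    def __init__(self, **kwargs):
        unknown = set(kwargs) - set(self._COMPONENTS)
        if unknown:
            raise TypeError(f"unexpected keyword arguments: {sorted(unknown)}")
        for name in self._COMPONENTS:
            if name in kwargs:
                # The instance attribute shadows the class attribute: kwarg wins.
                setattr(self, name, kwargs[name])
        missing = [n for n in self._COMPONENTS if getattr(self, n) is None]
        if missing:
            raise ValueError(f"missing components: {missing}")


class PaperDigester(MiniDigester):
    # Plain strings stand in for real source/extractor/summarizer/sink objects.
    source = "papers-dir"
    extractor = "pdf"
    summarizer = "llm"
    sink = "sqlite"


default = PaperDigester()                      # uses the class attributes
test_double = PaperDigester(sink="in-memory")  # swaps only the sink
```

Because the override lives on the instance, the subclass itself is left untouched, which is what makes swapping a single component in tests cheap.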

Notion DB → web fetch → summarize → Slack

A common pipeline: read the URL property of each row in a Notion database, fetch and summarize each page, and post the result to Slack. Setting NotionDatabaseSource(url_property=...) makes item.payload a URL string, so WebPageExtractor can consume it directly (the original Notion page object remains available at item.metadata["page"]):

from digestkit import Digester
from digestkit.sources.notion_database import NotionDatabaseSource
from digestkit.extractors.webpage import WebPageExtractor
from digestkit.summarizers import LLMSummarizer
from digestkit.sinks.slack import SlackSink

digester = Digester(
    source=NotionDatabaseSource(
        database_id="<your-db-id>",
        url_property="URL",                 # makes item.payload a URL string
        status_property="Status",
        status_value_success="Processed",   # example option name; match your database
        query_filter={"property": "Status", "select": {"equals": "Unread"}},
    ),
    extractor=WebPageExtractor(),
    summarizer=LLMSummarizer(provider="anthropic", model="claude-haiku-4-5"),
    sink=SlackSink(webhook_url="https://hooks.slack.com/..."),
)
digester.run()

NotionDatabaseSource transparently handles Notion 3.x's Data Sources API (data_sources/{id}/query). The first fetch() makes a single databases.retrieve call to detect whether data_sources are present and caches the result on the instance (no further retrieve calls after that). DBs created on 3.x use the new API; legacy DBs fall back to the older databases/{id}/query automatically — callers don't need to think about API versions.
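The detect-once-then-cache behavior described above might look roughly like the following. Everything here is an assumption made for illustration: FakeNotionClient is a stub rather than the notion-client API, QueryRouter is a hypothetical name, and picking the first data source is a simplification.

```python
class FakeNotionClient:
    """Hypothetical stub that counts retrieve calls; not the notion-client API."""

    def __init__(self, has_data_sources):
        self.has_data_sources = has_data_sources
        self.retrieve_calls = 0

    def retrieve_database(self, database_id):
        self.retrieve_calls += 1
        sources = [{"id": "ds-1"}] if self.has_data_sources else []
        return {"id": database_id, "data_sources": sources}


class QueryRouter:
    """Chooses the query endpoint once and remembers the answer."""

    def __init__(self, client, database_id):
        self._client = client
        self._database_id = database_id
        self._data_source_id = None
        self._detected = False

    def endpoint(self):
        if not self._detected:
            # Single retrieve call; the result is cached on the instance.
            db = self._client.retrieve_database(self._database_id)
            sources = db.get("data_sources") or []
            # Assumption: use the first data source when several exist.
            self._data_source_id = sources[0]["id"] if sources else None
            self._detected = True
        if self._data_source_id is not None:
            return f"data_sources/{self._data_source_id}/query"
        return f"databases/{self._database_id}/query"
```

Calling endpoint() repeatedly on the same router triggers only one retrieve call, which mirrors the "no further retrieve calls" guarantee stated above.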

Long documents (chunked / map-reduce)

For documents that exceed a model's context window (long PDFs, book chapters), use ChunkedLLMSummarizer. It splits the input, summarizes each chunk (map), and recursively merges the partial summaries (reduce). Inputs that fit in the window fall back to a single LLM call automatically.

from digestkit.summarizers import ChunkedLLMSummarizer

summarizer = ChunkedLLMSummarizer(
    provider="anthropic",
    model="claude-haiku-4-5",
    chunk_size=80_000,    # tokens per chunk; defaults to model max - reserve_tokens
    chunk_overlap=0,
    prompts=ChunkedLLMSummarizer.DEFAULT_PROMPTS,  # opt-in length control
    default_length="standard",
)

length ("short" / "standard" / "detailed") is applied only at the final reduce step; intermediate stages use a neutral merge prompt to avoid over-compressing mid-pipeline. On a per-chunk LLM failure the call fails fast with the chunk index in the error message.
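To make the map and reduce stages concrete, here is a minimal sketch of the general technique, not ChunkedLLMSummarizer's actual code. The function names, the fan_in parameter, and the stage labels are hypothetical, tokens are approximated by whitespace-separated words, and the summarize callable stands in for an LLM call.

```python
from typing import Callable, List


def split_into_chunks(words: List[str], chunk_size: int, overlap: int = 0) -> List[List[str]]:
    """Split a word list into windows of chunk_size, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks, start = [], 0
    while True:
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
        start += step
    return chunks


def map_reduce_summarize(
    text: str,
    summarize: Callable[[str, str], str],  # (text, stage) -> summary
    chunk_size: int,
    fan_in: int = 4,
    overlap: int = 0,
) -> str:
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    words = text.split()
    if len(words) <= chunk_size:
        # Input fits in one window: fall back to a single call.
        return summarize(text, "final")

    # Map: summarize each chunk, failing fast with the chunk index.
    partials = []
    for index, chunk in enumerate(split_into_chunks(words, chunk_size, overlap)):
        try:
            partials.append(summarize(" ".join(chunk), "map"))
        except Exception as exc:
            raise RuntimeError(f"summarization failed on chunk {index}") from exc

    # Reduce: merge groups with a neutral prompt until one group remains.
    while len(partials) > fan_in:
        partials = [
            summarize("\n".join(partials[i:i + fan_in]), "merge")
            for i in range(0, len(partials), fan_in)
        ]
    # Only the last call carries the length instruction ("final" stage).
    return summarize("\n".join(partials), "final")
```

Keeping the length instruction out of the "map" and "merge" stages is what prevents detail from being squeezed out before the final pass.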

Anthropic prompt caching (cache_control)

LLMSummarizer.system_prompt accepts either a str or a list of LiteLLM content blocks (list[dict]). The list form lets you enable Anthropic prompt caching (cache_control: {"type": "ephemeral"}), so that on repeat requests the input tokens of a long system prompt are billed at the cache-hit rate:

from digestkit.summarizers import LLMSummarizer

summarizer = LLMSummarizer(
    provider="anthropic",
    model="claude-sonnet-4-6",
    system_prompt=[
        {
            "type": "text",
            "text": "<long system prompt (JSON schema, output examples, ...)>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
)

Passing system_prompt as a plain str keeps the previous behavior (backward compatible).

If you just want to cache the entire system prompt without dealing with content blocks yourself, use the system_prompt_cache=True shortcut:

summarizer = LLMSummarizer(
    provider="anthropic",
    model="claude-sonnet-4-6",
    system_prompt="<long system prompt>",
    system_prompt_cache=True,   # auto-wraps the str in an ephemeral cache_control block
)

For finer control (for example, caching only some of several blocks), use the list form shown above. system_prompt_cache=True and the list form are mutually exclusive, because both would set cache_control.
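As a rough illustration of what the shortcut does, the helper below wraps a plain string in a single ephemeral block and rejects the combination of the flag with the list form. build_system_content is a hypothetical name, and raising ValueError on the conflicting combination is an assumption, not confirmed library behavior.

```python
from typing import Any, Dict, List, Union

ContentBlocks = List[Dict[str, Any]]


def build_system_content(
    system_prompt: Union[str, ContentBlocks],
    system_prompt_cache: bool = False,
) -> Union[str, ContentBlocks]:
    """Hypothetical helper: normalize the two system_prompt forms."""
    if isinstance(system_prompt, str):
        if not system_prompt_cache:
            # Plain string, no caching: passed through unchanged.
            return system_prompt
        # Shortcut: wrap the whole prompt in one cached content block.
        return [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ]
    if system_prompt_cache:
        # Assumption: the conflicting combination is rejected outright.
        raise ValueError(
            "system_prompt_cache=True cannot be combined with content blocks; "
            "set cache_control on the blocks instead"
        )
    return list(system_prompt)
```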

See the LiteLLM docs for details: https://docs.litellm.ai/docs/providers/anthropic#prompt-caching

Configuration

Set your LLM provider API key in a .env file (loaded automatically via python-dotenv):

ANTHROPIC_API_KEY=sk-ant-...  # Anthropic Claude
OPENAI_API_KEY=sk-...         # OpenAI GPT
GOOGLE_API_KEY=...            # Google Gemini
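A small preflight check can confirm that a key is visible before a run starts. This sketch uses only the standard library and assumes the .env file has already been loaded into the environment; configured_providers and the provider labels are hypothetical names, not part of digestkit.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the listing above; the labels are illustrative.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def configured_providers(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [
        provider
        for provider, key in PROVIDER_KEYS.items()
        if env.get(key, "").strip()
    ]
```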

CLI

digestkit run my_digester.py

Architecture

digestkit implements a 1:1 pipeline: for each item fetched from the source, it extracts text, sends it to an LLM for summarization, and writes the result to the configured sink. Items that fail at any stage are collected in RunResult.failures; the pipeline continues rather than aborting on first error.
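The continue-on-failure behavior can be sketched as a simple loop. This is an illustration under stated assumptions rather than digestkit's code: run_pipeline, the successes field, and the (item, stage, exception) failure tuple are hypothetical; only the RunResult.failures name comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class RunResult:
    # `failures` is named in the README; `successes` and the tuple layout
    # are assumptions for this sketch.
    successes: List[Any] = field(default_factory=list)
    failures: List[Tuple[Any, str, Exception]] = field(default_factory=list)


def run_pipeline(
    items: Iterable[Any],
    extract: Callable[[Any], str],
    summarize: Callable[[str], str],
    write: Callable[[Any, str], None],
) -> RunResult:
    result = RunResult()
    for item in items:
        stage = "extract"
        try:
            text = extract(item)
            stage = "summarize"
            summary = summarize(text)
            stage = "sink"
            write(item, summary)
        except Exception as exc:
            # Record the failure and move on to the next item.
            result.failures.append((item, stage, exc))
            continue
        result.successes.append(item)
    return result
```

Recording the stage alongside the exception makes it easier to tell a fetch or parsing problem apart from an LLM or sink error when reviewing a run.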

digestkit is the Phase 1 component of the inboxkit umbrella monorepo, which will also host future packages for RAG ingestion and personal knowledge bases.

Optional Dependencies

| Extra | Packages | Use case |
| --- | --- | --- |
| pdf | pypdf | Extract text from PDF files |
| web | trafilatura, httpx | Fetch and extract web articles |
| notion | notion-client | Fetch pages from Notion |
| slack | httpx | Fetch messages from Slack |
| email | | Fetch emails (IMAP/SMTP) |
| all | all of the above | Install everything |

Install any extra with pip install "digestkit[<extra>]" (the quotes keep shells such as zsh from interpreting the brackets).

Contributing

See the umbrella CONTRIBUTING.md for development setup, lint / format / typecheck targets, and the pre-commit hook.
