Personal content digester framework: fetch → extract → LLM summarize → sink

Maintainers

koki-n22

digestkit

Installation

Note: digestkit is not yet published to PyPI. Until the first PyPI release, install directly from the umbrella repository's main branch using a git URL.

From git (current)

pip install "digestkit @ git+https://github.com/koki-nakamura22/inboxkit.git@main#subdirectory=packages/digestkit"

With optional extras:

pip install "digestkit[pdf,notion] @ git+https://github.com/koki-nakamura22/inboxkit.git@main#subdirectory=packages/digestkit"

For uv projects, declare it under [tool.uv.sources]:

[project]
dependencies = ["digestkit"]

[tool.uv.sources]
digestkit = { git = "https://github.com/koki-nakamura22/inboxkit.git", subdirectory = "packages/digestkit", branch = "main" }

Pin to a specific commit for reproducibility by replacing branch = "main" with rev = "<sha>".

From PyPI (planned)

Once digestkit is published, the standard install path will be:

pip install digestkit
pip install "digestkit[pdf,notion]"

Tracking issue: #3.

Quickstart

Create a .env file with your LLM API key:

ANTHROPIC_API_KEY=sk-ant-...

Define and run your digester:

from digestkit import Digester
from digestkit.sources import LocalDirectorySource
from digestkit.extractors import PDFExtractor
from digestkit.summarizers import LLMSummarizer
from digestkit.sinks import SQLiteSink

class PdfDigester(Digester):
    source = LocalDirectorySource("./papers", glob="*.pdf")
    extractor = PDFExtractor()
    summarizer = LLMSummarizer(provider="anthropic", model="claude-haiku-4-5")
    sink = SQLiteSink("digests.db")

if __name__ == "__main__":
    PdfDigester().run()

Programmatic construction (constructor injection)

For dynamic configuration (config files, CLI flags, tests with swapped components), pass the four core dependencies as constructor kwargs instead of subclassing:

digester = Digester(
    source=LocalDirectorySource("./papers", glob="*.pdf"),
    extractor=PDFExtractor(),
    summarizer=LLMSummarizer(provider="anthropic", model="claude-haiku-4-5"),
    sink=SQLiteSink("digests.db"),
)
digester.run()

Both styles are supported and can be mixed: when a subclass defines class attributes, any kwarg passed to __init__ overrides them (kwarg wins). This is the same hybrid pattern used by seen_store and dedup_key.
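The "kwarg wins" rule can be illustrated with a minimal sketch. MiniDigester and PaperDigester below are hypothetical stand-ins written for this example, not digestkit's actual base class, and only two of the four dependencies are modeled:

```python
from typing import Any, Optional


class MiniDigester:
    """Hypothetical stand-in for a Digester-like base class (not digestkit's code)."""

    # Class-level defaults; subclasses may override these.
    source: Optional[Any] = None
    sink: Optional[Any] = None

    def __init__(self, *, source: Optional[Any] = None, sink: Optional[Any] = None) -> None:
        # A kwarg that was actually passed shadows the class attribute on this
        # instance only; otherwise the class attribute stays visible.
        if source is not None:
            self.source = source
        if sink is not None:
            self.sink = sink


class PaperDigester(MiniDigester):
    # Subclass style: declare dependencies as class attributes.
    source = "./papers"
    sink = "digests.db"


default = PaperDigester()                 # uses both class attributes
override = PaperDigester(sink="test.db")  # kwarg wins for sink only
```

Because the override lives on the instance, the subclass itself is left unchanged, which is convenient when a test swaps in a temporary sink.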

Notion DB → web fetch → summarize → Slack

A common pipeline: walk the URL property of a Notion database, fetch and summarize each linked page, and post the result to Slack. Specifying NotionDatabaseSource(url_property=...) makes item.payload a URL string, so WebPageExtractor can consume it directly (the original Notion page object remains available at item.metadata["page"]):

from digestkit import Digester
from digestkit.sources.notion_database import NotionDatabaseSource
from digestkit.extractors.webpage import WebPageExtractor
from digestkit.summarizers import LLMSummarizer
from digestkit.sinks.slack import SlackSink

digester = Digester(
    source=NotionDatabaseSource(
        database_id="<your-db-id>",
        url_property="URL",                 # makes item.payload a URL string
        status_property="Status",
        status_value_success="処理済み",     # "Processed"
        query_filter={"property": "Status", "select": {"equals": "未読"}},  # "Unread"
    ),
    extractor=WebPageExtractor(),
    summarizer=LLMSummarizer(provider="anthropic", model="claude-haiku-4-5"),
    sink=SlackSink(webhook_url="https://hooks.slack.com/..."),
)
digester.run()

NotionDatabaseSource transparently handles Notion 3.x's Data Sources API (data_sources/{id}/query). The first fetch() makes a single databases.retrieve call to detect whether data_sources are present, then caches the result on the instance, so no further retrieve calls follow. Databases created on 3.x use the new API; legacy databases fall back to the older databases/{id}/query endpoint automatically, so callers don't need to think about API versions.
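The detect-once-then-cache behavior described above can be sketched as follows. FakeNotionClient, retrieve_database, and QueryRouter are hypothetical names invented for this illustration; they are neither the notion-client API nor digestkit's internals:

```python
from typing import Any, Dict, List, Optional


class FakeNotionClient:
    """Hypothetical stub that counts retrieve calls; not the notion-client API."""

    def __init__(self, data_sources: List[Dict[str, Any]]) -> None:
        self._data_sources = data_sources
        self.retrieve_calls = 0

    def retrieve_database(self, database_id: str) -> Dict[str, Any]:
        self.retrieve_calls += 1
        return {"id": database_id, "data_sources": self._data_sources}


class QueryRouter:
    """Chooses a query path, probing the database at most once per instance."""

    def __init__(self, client: FakeNotionClient, database_id: str) -> None:
        self._client = client
        self._database_id = database_id
        self._data_source_id: Optional[str] = None
        self._detected = False  # flips to True after the first probe

    def query_path(self) -> str:
        if not self._detected:
            # Single probe: does this database expose data sources?
            db = self._client.retrieve_database(self._database_id)
            sources = db.get("data_sources") or []
            self._data_source_id = sources[0]["id"] if sources else None
            self._detected = True
        if self._data_source_id is not None:
            return f"data_sources/{self._data_source_id}/query"
        # Legacy database: fall back to the older endpoint.
        return f"databases/{self._database_id}/query"
```

A separate boolean flag is used rather than checking the cached id, because "no data sources" is itself a valid cached result that should not trigger another probe.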

Long documents (chunked / map-reduce)

For documents that exceed a model's context window (long PDFs, book chapters), use ChunkedLLMSummarizer. It splits the input, summarizes each chunk (map), and recursively merges the partial summaries (reduce). Inputs that fit in the window fall back to a single LLM call automatically.

from digestkit.summarizers import ChunkedLLMSummarizer

summarizer = ChunkedLLMSummarizer(
    provider="anthropic",
    model="claude-haiku-4-5",
    chunk_size=80_000,    # tokens per chunk; defaults to model max - reserve_tokens
    chunk_overlap=0,
    prompts=ChunkedLLMSummarizer.DEFAULT_PROMPTS,  # opt-in length control
    default_length="standard",
)

length ("short" / "standard" / "detailed") is applied only at the final reduce step; intermediate stages use a neutral merge prompt to avoid over-compressing mid-pipeline. On a per-chunk LLM failure the call fails fast with the chunk index in the error message.
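As a rough illustration of the map-reduce flow, here is a minimal sketch under simplifying assumptions: chunks are measured in characters rather than tokens, chunk overlap is omitted, and summarize is a caller-supplied stand-in for the LLM call. The function names, the stage labels ("map", "merge", "final"), and the group_size parameter are all hypothetical and not part of digestkit:

```python
from typing import Callable, List

# summarize(stage, text) -> summary; stage is "map", "merge", or "final".
Summarize = Callable[[str, str], str]


def split_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive, non-overlapping pieces of chunk_size chars."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def summarize_long(text: str, summarize: Summarize, chunk_size: int,
                   group_size: int = 4) -> str:
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    # Input fits in one window: a single call, no map-reduce.
    if len(text) <= chunk_size:
        return summarize("final", text)

    # Map: summarize each chunk, failing fast with the chunk index.
    partials: List[str] = []
    for index, chunk in enumerate(split_chunks(text, chunk_size)):
        try:
            partials.append(summarize("map", chunk))
        except Exception as exc:
            raise RuntimeError(f"chunk {index} failed: {exc}") from exc

    # Reduce: merge groups with a neutral stage until one group remains.
    while len(partials) > group_size:
        partials = [
            summarize("merge", "\n".join(partials[i:i + group_size]))
            for i in range(0, len(partials), group_size)
        ]

    # Length control would apply only here, at the final step.
    return summarize("final", "\n".join(partials))
```

Keeping the "final" stage separate from "merge" mirrors the idea that a length preference should shape only the last summary, not the intermediate ones.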

Anthropic prompt caching (cache_control)

LLMSummarizer.system_prompt accepts either a str or a list of LiteLLM content blocks (list[dict]). The list form lets you enable Anthropic prompt caching (cache_control: {"type": "ephemeral"}) so that the input tokens of a long system prompt are billed at the cache-hit rate:

from digestkit.summarizers import LLMSummarizer

summarizer = LLMSummarizer(
    provider="anthropic",
    model="claude-sonnet-4-6",
    system_prompt=[
        {
            "type": "text",
            "text": "<long system prompt (JSON schema, output examples, ...)>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
)

Passing system_prompt as a plain str keeps the previous behavior (backward compatible).

If you just want to cache the entire system prompt without dealing with content blocks yourself, use the system_prompt_cache=True shortcut:

summarizer = LLMSummarizer(
    provider="anthropic",
    model="claude-sonnet-4-6",
    system_prompt="<long system prompt>",
    system_prompt_cache=True,   # auto-wraps the str in an ephemeral cache_control block
)

For finer control (for example, caching only some of several blocks), use the list form shown above. system_prompt_cache=True and the list form are mutually exclusive, because both would try to set cache_control.
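To make the relationship between the two forms concrete, the sketch below shows one way a plain string could be wrapped into a single cached content block, and how the mutually exclusive combination could be rejected. build_system_content is a hypothetical helper written for this example; digestkit's actual implementation may differ:

```python
from typing import Any, Dict, List, Union

SystemPrompt = Union[str, List[Dict[str, Any]]]


def build_system_content(system_prompt: SystemPrompt, cache: bool = False) -> SystemPrompt:
    """Hypothetical helper: normalize a system prompt for a caching-aware call."""
    if isinstance(system_prompt, list):
        if cache:
            # The list form already carries its own cache_control markers,
            # so combining it with the shortcut is ambiguous.
            raise ValueError("cache=True cannot be combined with the list form")
        return system_prompt
    if not cache:
        # Plain string without caching: passed through unchanged.
        return system_prompt
    # Shortcut: wrap the whole string in one ephemeral-cached text block.
    return [
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ]
```

Raising on the conflicting combination, rather than silently picking one, keeps the caller in control of which blocks are cached.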

See the LiteLLM docs for details: https://docs.litellm.ai/docs/providers/anthropic#prompt-caching

Configuration

Set your LLM provider API key in a .env file (loaded automatically via python-dotenv):

ANTHROPIC_API_KEY=sk-ant-...  # Anthropic Claude
OPENAI_API_KEY=sk-...         # OpenAI GPT
GOOGLE_API_KEY=...            # Google Gemini

CLI

digestkit run my_digester.py

Architecture

digestkit implements a 1:1 pipeline: for each item fetched from the source, it extracts text, sends it to an LLM for summarization, and writes the result to the configured sink. Items that fail at any stage are collected in RunResult.failures; the pipeline continues rather than aborting on first error.
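The continue-on-failure behavior can be sketched as a simple loop. MiniRunResult and run_pipeline are hypothetical names for this illustration; the real RunResult may carry different fields, and the three callables stand in for the extractor, summarizer, and sink:

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class MiniRunResult:
    """Hypothetical result object: a success count plus collected failures."""
    succeeded: int = 0
    # Each failure records (item, stage name, exception).
    failures: List[Tuple[str, str, Exception]] = field(default_factory=list)


def run_pipeline(
    items: Iterable[str],
    extract: Callable[[str], str],
    summarize: Callable[[str], str],
    write: Callable[[str, str], None],
) -> MiniRunResult:
    result = MiniRunResult()
    for item in items:
        stage = "extract"
        try:
            text = extract(item)
            stage = "summarize"
            summary = summarize(text)
            stage = "sink"
            write(item, summary)
        except Exception as exc:
            # Record the failure and move on to the next item.
            result.failures.append((item, stage, exc))
            continue
        result.succeeded += 1
    return result
```

Tracking the current stage in a local variable lets each failure report where it happened without wrapping every step in its own try block.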

digestkit is the Phase 1 component of the inboxkit umbrella monorepo, which will also host future packages for RAG ingestion and personal knowledge bases.

Optional Dependencies

Extra	Packages	Use case
`pdf`	pypdf	Extract text from PDF files
`web`	trafilatura, httpx	Fetch and extract web articles
`notion`	notion-client	Fetch pages from Notion
`slack`	httpx	Fetch messages from Slack
`email`	—	Fetch emails (IMAP/SMTP)
`all`	all of the above	Install everything

Install any extra with pip install "digestkit[<extra>]" (the quotes keep the shell from interpreting the brackets).

Contributing

See the umbrella CONTRIBUTING.md for development setup, lint / format / typecheck targets, and the pre-commit hook.

Project details

This version

0.1.0

May 17, 2026

Download files

Source Distribution

digestkit-0.1.0.tar.gz (72.3 kB)

Uploaded May 17, 2026 Source

Built Distribution

digestkit-0.1.0-py3-none-any.whl (36.3 kB)

Uploaded May 17, 2026 Python 3

File details

Details for the file digestkit-0.1.0.tar.gz.

File metadata

Download URL: digestkit-0.1.0.tar.gz
Upload date: May 17, 2026
Size: 72.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for digestkit-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b4cd5c40d3be012cf2614d066e4b4f6e5232b40f4f7ec2d7acb857cf479ced8d`
MD5	`fb1d0582fd39ff9dcc7c66d809cb84be`
BLAKE2b-256	`ae836002acee86a9c21a14e7b32ef1050ef79bbfccbe2cecfbd68d528f062d92`

Provenance

The following attestation bundles were made for digestkit-0.1.0.tar.gz:

Publisher: publish.yml on koki-nakamura22/inboxkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: digestkit-0.1.0.tar.gz
- Subject digest: b4cd5c40d3be012cf2614d066e4b4f6e5232b40f4f7ec2d7acb857cf479ced8d
- Sigstore transparency entry: 1561983389
- Sigstore integration time: May 17, 2026
Source repository:
- Permalink: koki-nakamura22/inboxkit@a6316e1f385b2c08bf72f48ca106add1d3284f7e
- Branch / Tag: refs/tags/digestkit-v0.1.0
- Owner: https://github.com/koki-nakamura22
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a6316e1f385b2c08bf72f48ca106add1d3284f7e
- Trigger Event: push

File details

Details for the file digestkit-0.1.0-py3-none-any.whl.

File metadata

Download URL: digestkit-0.1.0-py3-none-any.whl
Upload date: May 17, 2026
Size: 36.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for digestkit-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2cc1eee58508d94f75028c2b710256cd9c2449ebac66c21bed50b18b42f69b6c`
MD5	`7a403408406505641ed05aef42e17715`
BLAKE2b-256	`2dd23cdb51b9b0ef82a96a78d9623f1a2413b1ec659773ccd8ebb420d44d085c`

Provenance

The following attestation bundles were made for digestkit-0.1.0-py3-none-any.whl:

Publisher: publish.yml on koki-nakamura22/inboxkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: digestkit-0.1.0-py3-none-any.whl
- Subject digest: 2cc1eee58508d94f75028c2b710256cd9c2449ebac66c21bed50b18b42f69b6c
- Sigstore transparency entry: 1561983518
- Sigstore integration time: May 17, 2026
Source repository:
- Permalink: koki-nakamura22/inboxkit@a6316e1f385b2c08bf72f48ca106add1d3284f7e
- Branch / Tag: refs/tags/digestkit-v0.1.0
- Owner: https://github.com/koki-nakamura22
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a6316e1f385b2c08bf72f48ca106add1d3284f7e
- Trigger Event: push
