Personal content digester framework: fetch → extract → LLM summarize → sink
digestkit
Installation
Note: digestkit is not yet published to PyPI. Until the first PyPI release, install directly from the umbrella repository's
main branch using a git URL.
From git (current)
pip install "digestkit @ git+https://github.com/koki-nakamura22/inboxkit.git@main#subdirectory=packages/digestkit"
With optional extras:
pip install "digestkit[pdf,notion] @ git+https://github.com/koki-nakamura22/inboxkit.git@main#subdirectory=packages/digestkit"
For uv projects, declare it under [tool.uv.sources]:
[project]
dependencies = ["digestkit"]
[tool.uv.sources]
digestkit = { git = "https://github.com/koki-nakamura22/inboxkit.git", subdirectory = "packages/digestkit", branch = "main" }
Pin to a specific commit for reproducibility by replacing branch = "main" with rev = "<sha>".
From PyPI (planned)
Once digestkit is published, the standard install path will be:
pip install digestkit
pip install "digestkit[pdf,notion]"
Tracking issue: #3.
Quickstart
Create a .env file with your LLM API key:
ANTHROPIC_API_KEY=sk-ant-...
Define and run your digester:
from digestkit import Digester
from digestkit.sources import LocalDirectorySource
from digestkit.extractors import PDFExtractor
from digestkit.summarizers import LLMSummarizer
from digestkit.sinks import SQLiteSink
class PdfDigester(Digester):
    source = LocalDirectorySource("./papers", glob="*.pdf")
    extractor = PDFExtractor()
    summarizer = LLMSummarizer(provider="anthropic", model="claude-haiku-4-5")
    sink = SQLiteSink("digests.db")

if __name__ == "__main__":
    PdfDigester().run()
Programmatic construction (constructor injection)
For dynamic configuration (config files, CLI flags, tests with swapped components), pass the four core dependencies as constructor kwargs instead of subclassing:
digester = Digester(
    source=LocalDirectorySource("./papers", glob="*.pdf"),
    extractor=PDFExtractor(),
    summarizer=LLMSummarizer(provider="anthropic", model="claude-haiku-4-5"),
    sink=SQLiteSink("digests.db"),
)
digester.run()
Both styles are supported and can be mixed: when a subclass defines class attributes,
any kwarg passed to __init__ overrides them (kwarg wins). This is the same hybrid
pattern used by seen_store and dedup_key.
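The override rule can be illustrated with a small stand-in class. This is only a sketch of the "kwarg wins" behavior described above, not digestkit's actual implementation; `MiniDigester`, `PaperDigester`, and the string placeholders used as components are hypothetical.

```python
class MiniDigester:
    """Hypothetical stand-in for Digester; illustrates 'kwarg wins' only."""

    # Class-level defaults that subclasses may override.
    source = None
    sink = None

    def __init__(self, **overrides):
        for name, value in overrides.items():
            # Reject names that are not declared as class attributes.
            if not hasattr(type(self), name):
                raise TypeError(f"unknown component: {name!r}")
            # The instance attribute shadows the class attribute.
            setattr(self, name, value)


class PaperDigester(MiniDigester):
    # Declarative style: components defined as class attributes.
    source = "papers-dir"
    sink = "sqlite"


default = PaperDigester()
swapped = PaperDigester(sink="in-memory")  # e.g. a test double

print(default.source, default.sink)
print(swapped.source, swapped.sink)
```

The subclass keeps its declarative defaults, while a test or a CLI entry point can swap a single component without defining another subclass.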
Notion DB → web fetch → summarize → Slack
A common pipeline: walk a Notion database's URL property, fetch + summarize
each page, post to Slack. Specifying NotionDatabaseSource(url_property=...)
makes item.payload a URL string so WebPageExtractor connects directly
(the original Notion page object remains available at item.metadata["page"]):
from digestkit import Digester
from digestkit.sources.notion_database import NotionDatabaseSource
from digestkit.extractors.webpage import WebPageExtractor
from digestkit.summarizers import LLMSummarizer
from digestkit.sinks.slack import SlackSink
digester = Digester(
    source=NotionDatabaseSource(
        database_id="<your-db-id>",
        url_property="URL",  # mode that makes item.payload a URL string
        status_property="Status",
        status_value_success="処理済み",  # "Processed"
        query_filter={"property": "Status", "select": {"equals": "未読"}},  # "Unread"
    ),
    extractor=WebPageExtractor(),
    summarizer=LLMSummarizer(provider="anthropic", model="claude-haiku-4-5"),
    sink=SlackSink(webhook_url="https://hooks.slack.com/..."),
)
digester.run()
NotionDatabaseSource transparently handles Notion 3.x's Data Sources API
(data_sources/{id}/query). The first fetch() makes a single
databases.retrieve call to detect whether data_sources are present and
caches the result on the instance (no further retrieve calls after that).
Databases created on 3.x use the new API; legacy databases fall back
automatically to the older databases/{id}/query endpoint, so callers don't
need to think about API versions.
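The detect-once behavior described above might look roughly like the following. This is a sketch under stated assumptions, not the real NotionDatabaseSource: `FakeNotionClient`, `ApiDetectingSource`, the `retrieve_database` method, and the shape of the retrieve response (a `data_sources` list of objects with an `id`) are all hypothetical stand-ins.

```python
class FakeNotionClient:
    """Hypothetical stub standing in for a Notion client; counts calls."""

    def __init__(self, has_data_sources):
        self.has_data_sources = has_data_sources
        self.retrieve_calls = 0

    def retrieve_database(self, database_id):
        self.retrieve_calls += 1
        sources = [{"id": "ds-1"}] if self.has_data_sources else []
        return {"id": database_id, "data_sources": sources}


class ApiDetectingSource:
    """Sketch of the detect-once-then-cache pattern (not the real class)."""

    def __init__(self, client, database_id):
        self._client = client
        self._database_id = database_id
        self._data_source_id = None
        self._detected = False  # flips to True after the first retrieve

    def query_path(self):
        if not self._detected:
            # Single retrieve call; the outcome is cached on the instance.
            db = self._client.retrieve_database(self._database_id)
            sources = db.get("data_sources") or []
            self._data_source_id = sources[0]["id"] if sources else None
            self._detected = True
        if self._data_source_id is not None:
            return f"data_sources/{self._data_source_id}/query"
        # Legacy database: fall back to the older endpoint.
        return f"databases/{self._database_id}/query"


client = FakeNotionClient(has_data_sources=True)
source = ApiDetectingSource(client, "db-123")
print(source.query_path())
print(source.query_path())
print(client.retrieve_calls)
```

Caching on the instance keeps repeated fetches cheap, at the cost of not noticing a database that is migrated while the process is running.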
Long documents (chunked / map-reduce)
For documents that exceed a model's context window (long PDFs, book chapters), use
ChunkedLLMSummarizer. It splits the input, summarizes each chunk (map), and
recursively merges the partial summaries (reduce). Inputs that fit in the window
fall back to a single LLM call automatically.
from digestkit.summarizers import ChunkedLLMSummarizer
summarizer = ChunkedLLMSummarizer(
    provider="anthropic",
    model="claude-haiku-4-5",
    chunk_size=80_000,  # tokens per chunk; defaults to model max - reserve_tokens
    chunk_overlap=0,
    prompts=ChunkedLLMSummarizer.DEFAULT_PROMPTS,  # opt-in length control
    default_length="standard",
)
length ("short" / "standard" / "detailed") is applied only at the final
reduce step; intermediate stages use a neutral merge prompt to avoid
over-compressing mid-pipeline. On a per-chunk LLM failure the call fails fast with
the chunk index in the error message.
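The map-reduce flow can be sketched in a few lines. The code below is an illustration under simplifying assumptions, not ChunkedLLMSummarizer's implementation: it counts characters instead of tokens, merges partial summaries in groups of `group_size`, and takes a caller-supplied `summarize(text, length)` callable in place of a real LLM call. All function and parameter names are hypothetical.

```python
def split_into_chunks(text, chunk_size):
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def map_reduce_summarize(text, summarize, chunk_size,
                         group_size=2, final_length="standard"):
    """Summarize text, chunking only when it does not fit in one call."""
    if chunk_size < 1 or group_size < 2:
        raise ValueError("chunk_size must be >= 1 and group_size >= 2")

    # Input fits in the window: a single call, no chunking.
    if len(text) <= chunk_size:
        return summarize(text, length=final_length)

    # Map: summarize each chunk with a neutral prompt (length=None).
    partials = []
    for index, chunk in enumerate(split_into_chunks(text, chunk_size)):
        try:
            partials.append(summarize(chunk, length=None))
        except Exception as exc:
            # Fail fast and report which chunk broke.
            raise RuntimeError(f"summarization failed on chunk {index}") from exc

    # Reduce: merge groups of partial summaries until one group remains.
    while len(partials) > group_size:
        partials = [
            summarize("\n".join(partials[i:i + group_size]), length=None)
            for i in range(0, len(partials), group_size)
        ]

    # The requested length is applied only at the final merge.
    return summarize("\n".join(partials), length=final_length)


def toy_summarize(text, length=None):
    """Toy stand-in for an LLM call: keeps the first three characters."""
    return text[:3]


print(map_reduce_summarize("abcdefghij", toy_summarize, chunk_size=4))
```

Passing `length=None` for intermediate calls mirrors the idea that only the final reduce step should shape the output length.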
Anthropic prompt caching (cache_control)
LLMSummarizer.system_prompt accepts either a str or a list of LiteLLM
content blocks (list[dict]). The list form lets you enable Anthropic
prompt caching (cache_control: {"type": "ephemeral"}) so that the input
tokens of a long system prompt are billed at the cache-hit rate:
from digestkit.summarizers import LLMSummarizer
summarizer = LLMSummarizer(
    provider="anthropic",
    model="claude-sonnet-4-6",
    system_prompt=[
        {
            "type": "text",
            "text": "<long system prompt (JSON schema, output examples, ...)>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
)
Passing system_prompt as a plain str keeps the previous behavior
(backward compatible).
If you just want to cache the entire system prompt without dealing with
content blocks yourself, use the system_prompt_cache=True shortcut:
summarizer = LLMSummarizer(
    provider="anthropic",
    model="claude-sonnet-4-6",
    system_prompt="<long system prompt>",
    system_prompt_cache=True,  # auto-wraps the str in an ephemeral cache_control block
)
For finer control (caching only some of several blocks, etc.), use the list
form shown above. system_prompt_cache=True and the list form are mutually
exclusive (they fight over control of cache_control).
See the LiteLLM docs for details: https://docs.litellm.ai/docs/providers/anthropic#prompt-caching
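To make the relationship between the two forms concrete, here is a minimal sketch of how a str plus system_prompt_cache=True could be turned into the list form, and why combining the flag with the list form is rejected. `normalize_system_prompt` is a hypothetical helper written for this illustration; it is not part of digestkit's API, and the library's real handling may differ.

```python
def normalize_system_prompt(system_prompt, system_prompt_cache=False):
    """Hypothetical helper: resolve the str / list forms of system_prompt."""
    if isinstance(system_prompt, str):
        if not system_prompt_cache:
            # Plain str without caching is passed through unchanged.
            return system_prompt
        # Shortcut: wrap the whole prompt in one cached content block.
        return [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ]
    if system_prompt_cache:
        # The list form already controls cache_control per block.
        raise ValueError(
            "system_prompt_cache=True cannot be combined with a list system_prompt"
        )
    return list(system_prompt)


print(normalize_system_prompt("You are a summarizer.", system_prompt_cache=True))
```

Rejecting the combination avoids silently overwriting cache_control settings the caller placed on individual blocks.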
Configuration
Set your LLM provider API key in a .env file (loaded automatically via python-dotenv):
ANTHROPIC_API_KEY=sk-ant-... # Anthropic Claude
OPENAI_API_KEY=sk-... # OpenAI GPT
GOOGLE_API_KEY=... # Google Gemini
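A quick pre-flight check can confirm which keys are visible to the process before a run. The sketch below uses only the standard library and reads the environment directly (it does not load the .env file itself). `configured_providers` is a hypothetical helper, and the provider labels "openai" and "gemini" are assumptions; only "anthropic" appears in the examples above.

```python
import os

# Environment variable name -> provider label (labels partly assumed).
PROVIDER_KEYS = {
    "ANTHROPIC_API_KEY": "anthropic",
    "OPENAI_API_KEY": "openai",
    "GOOGLE_API_KEY": "gemini",
}


def configured_providers(environ=None):
    """Return the provider labels whose API key is set and non-blank."""
    env = os.environ if environ is None else environ
    return [
        provider
        for key, provider in PROVIDER_KEYS.items()
        if env.get(key, "").strip()
    ]


if not configured_providers():
    print("No LLM API key found; set one of:", ", ".join(PROVIDER_KEYS))
```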
CLI
digestkit run my_digester.py
Architecture
digestkit implements a 1:1 pipeline: for each item fetched from the source, it extracts
text, sends it to an LLM for summarization, and writes the result to the configured sink.
Items that fail at any stage are collected in RunResult.failures; the pipeline continues
rather than aborting on first error.
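The continue-on-error loop can be sketched as follows. This is an illustration of the behavior described above rather than digestkit's code: `run_pipeline`, `Failure`, and the `successes` field are hypothetical, and only the `RunResult.failures` name comes from this README.

```python
from dataclasses import dataclass, field


@dataclass
class Failure:
    """Hypothetical record of one failed item."""
    item: object
    stage: str
    error: Exception


@dataclass
class RunResult:
    """Sketch of a run summary; only `failures` is named in the README."""
    successes: int = 0
    failures: list = field(default_factory=list)


def run_pipeline(items, extract, summarize, write):
    """Process each item independently: extract, summarize, then write."""
    result = RunResult()
    for item in items:
        stage = "extract"
        try:
            text = extract(item)
            stage = "summarize"
            summary = summarize(text)
            stage = "sink"
            write(item, summary)
        except Exception as exc:
            # Record the failure and move on to the next item.
            result.failures.append(Failure(item, stage, exc))
            continue
        result.successes += 1
    return result


def demo_extract(item):
    """Toy extractor that rejects empty input."""
    if not item:
        raise ValueError("empty item")
    return item.upper()


written = []
result = run_pipeline(
    ["a", "", "b"],
    extract=demo_extract,
    summarize=lambda text: text + "!",
    write=lambda item, summary: written.append((item, summary)),
)
print(result.successes, len(result.failures), written)
```

Recording the stage alongside the exception makes it easier to tell a parsing problem from an LLM or sink outage when reviewing the failures afterwards.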
digestkit is the Phase 1 component of the inboxkit umbrella monorepo, which will also host future packages for RAG ingestion and personal knowledge bases.
Optional Dependencies
| Extra | Packages | Use case |
|---|---|---|
| pdf | pypdf | Extract text from PDF files |
| web | trafilatura, httpx | Fetch and extract web articles |
| notion | notion-client | Fetch pages from Notion |
| slack | httpx | Fetch messages from Slack |
| email | none | Fetch emails (IMAP/SMTP) |
| all | all of the above | Install everything |
Install any extra with pip install digestkit[<extra>].
Contributing
See the umbrella CONTRIBUTING.md for development setup, lint / format / typecheck targets, and the pre-commit hook.