Source-to-intelligence platform: turn YouTube, websites, and arXiv papers into a structured, reusable corpus with per-source insights, cross-source synthesis, and Deep Research reports.
Distill
Installed as distillr on PyPI; the CLI is distill.
Turn YouTube, websites, and arXiv papers into a structured, reusable corpus of insights, syntheses, and reports — all plain markdown on your disk.
pip install distillr
distill papers "temporal knowledge graph" --topic tkg --limit 20
That one command searches arXiv, downloads 20 PDFs, extracts full text, runs structured analysis on each, and writes a cross-paper synthesis. For a 20-paper run like the one below, expect single-digit minutes and about $1 in model spend. Terminal output during the run looks like this:
Papers: temporal knowledge graph
Topic: tkg | Selected papers: 20
[1/20] Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge
Graphs and Agentic Memory
[2/20] Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities
...
6m 47s ~$1.01 (391,278 in / 38,117 out)
paper.md 90.4 KB
insights.md 8.1 KB
...
paper_synthesis.md 11.8 KB
corpus_synthesis.md 10.5 KB
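If you script around runs, the summary line in the sample output (duration, estimated cost, token counts) is straightforward to parse. The sketch below assumes the line keeps exactly the shape shown above; `RunSummary` and `parse_summary` are illustrative names, not part of Distill.

```python
import re
from typing import NamedTuple


class RunSummary(NamedTuple):
    seconds: int
    cost_usd: float
    tokens_in: int
    tokens_out: int


# Matches lines shaped like: "6m 47s ~$1.01 (391,278 in / 38,117 out)".
# Assumption: the format stays as shown in the sample output.
_SUMMARY = re.compile(
    r"(?:(?P<min>\d+)m\s+)?(?P<sec>\d+)s\s+~\$(?P<cost>[\d.]+)\s+"
    r"\((?P<tin>[\d,]+) in / (?P<tout>[\d,]+) out\)"
)


def parse_summary(line: str) -> RunSummary:
    """Extract duration, cost, and token counts from a run summary line."""
    m = _SUMMARY.search(line)
    if m is None:
        raise ValueError(f"not a summary line: {line!r}")
    minutes = int(m.group("min") or 0)
    return RunSummary(
        seconds=minutes * 60 + int(m.group("sec")),
        cost_usd=float(m.group("cost")),
        tokens_in=int(m.group("tin").replace(",", "")),
        tokens_out=int(m.group("tout").replace(",", "")),
    )


summary = parse_summary("6m 47s ~$1.01 (391,278 in / 38,117 out)")
print(summary)
```

Dividing `summary.cost_usd` by the number of selected papers gives a rough per-paper cost for budgeting larger runs.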
What you get
One local library/ directory of plain markdown. No database, no cloud lock-in, no proprietary format. Open it in any text editor, Obsidian, VS Code, or feed it into another tool.
Three source types, same pipeline shape (capture → analyze → synthesize → report):
- YouTube — channels, topic searches, videos, Shorts
- Websites — vendor sites, research hubs, curated URL sets (browser-first crawl with PDF/embedded-video ingestion)
- arXiv papers — phrase-matched search, full-PDF extraction, structured per-paper insights, cross-paper synthesis
Plus an MCP server so AI assistants and agent systems can query the library directly.
Quick start
pip install distillr
playwright install chromium # for YouTube search + website capture
distill doctor # verify API keys + system health
Set two keys in .env (copy from .env.example):
XAI_API_KEY=xai-... # Grok models
GEMINI_API_KEY=AIza... # Gemini Deep Research (reports + briefings)
Then try any of:
# Goal-aware cross-source discovery (papers + videos, reranked against a goal)
distill discover "help an AI become a great music composer" --topic music --preview
distill discover --goal-file private/my-goal.md --topic research --yes
# Get smart on a YouTube topic, fast
distill latest "Microsoft Fabric best practices" --limit 10 --report
# Discover and ingest arXiv papers — expands the query, LLM-reranks candidates,
# picks the top N (use --preview to see the shortlist without ingesting)
distill papers "agent memory systems" --topic memory --limit 20
distill papers "agent memory systems" --topic memory --limit 20 --preview
# Distill a vendor/research site
distill site-batch configs/example_seeds.json --topic example --seed-only
The full command reference lives in docs/usage.md.
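To drive the CLI from a script, one option is to build the argument list in Python and hand it to `subprocess`. This is a minimal sketch that uses only the `distill papers` flags shown above (`--topic`, `--limit`, `--preview`); the helper names `papers_command` and `run` are hypothetical, and docs/usage.md remains the authority on available options.

```python
import shlex
import subprocess


def papers_command(query: str, topic: str, limit: int, preview: bool = False) -> list[str]:
    """Build argv for `distill papers` using only flags shown in the quick start."""
    argv = ["distill", "papers", query, "--topic", topic, "--limit", str(limit)]
    if preview:
        argv.append("--preview")
    return argv


def run(argv: list[str]) -> int:
    """Run the command and return its exit code (requires `distill` on PATH)."""
    return subprocess.run(argv, check=False).returncode


cmd = papers_command("agent memory systems", topic="memory", limit=20, preview=True)
print(shlex.join(cmd))  # prints the shell-quoted command without running it
```

Passing a list rather than a single string avoids shell quoting problems when the query contains spaces.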
Mental model
library/
└── topics/<topic>/
    ├── channels/<creator>/videos/<video>/
    │   ├── transcript.txt
    │   └── insights.md
    ├── sites/<hostname>/pages/<page>/
    │   ├── content.md
    │   └── insights.md
    ├── papers/<paper>/
    │   ├── paper.md
    │   └── insights.md
    ├── topic_synthesis.md    # cross-source
    └── corpus_synthesis.md   # mixed-source view
You build a topic library over time. Ingest once, refresh on a cadence, generate a report or briefing when you need one.
See docs/outputs.md for what every artifact contains.
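Because the library is plain files, other tools can walk it directly. The sketch below collects every insights.md per topic, assuming the directory layout shown above; `insights_by_topic` is an illustrative helper, not something Distill ships.

```python
from pathlib import Path


def insights_by_topic(library: Path) -> dict[str, list[Path]]:
    """Map each topic name to its insights.md files, per the layout above."""
    result: dict[str, list[Path]] = {}
    topics_dir = library / "topics"
    if not topics_dir.is_dir():
        return result
    for topic in sorted(p for p in topics_dir.iterdir() if p.is_dir()):
        # rglob covers channels/, sites/, and papers/ at their differing depths.
        result[topic.name] = sorted(topic.rglob("insights.md"))
    return result


if __name__ == "__main__":
    for topic, files in insights_by_topic(Path("library")).items():
        print(f"{topic}: {len(files)} insights file(s)")
```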
Sample output
A per-paper insights.md (excerpt):
---
paper_title: "Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs"
paper_id: 2604.11544v1
analyzed_by: grok-4.20-0309-reasoning
source_mode: full_pdf
---
### Core Contribution
1. Continuous functional rotation θ_r(τ) = s · α_r · τ · ω instead of discrete
timestamp lookup tables. Zero-shot interpolation of unseen dates.
2. Semantic Speed Gate: MLP that reads only text embedding ϕ(r) and outputs α_r.
Learns relational volatility from data.
3. Geometric shadowing in complex space: obsolete facts rotated out of phase so
the correct fact outranks contradictions via the scoring function alone.
### Methods and Evidence
- On ICEWS05-15, RoMem-ChronoR reaches 72.6 MRR (vs vanilla ChronoR 68.4).
- Zero-shot domain transfer to FinTMMBench: 0.728 MRR, 0.673 R@5.
- All baselines use identical answer LLM and judge for fairness.
### Limits and Open Questions
- Computational cost at millions-of-facts scale is motivation but no latency,
memory, or throughput numbers are reported.
- Gate pretrained only on ICEWS05-15 political events; generalization to
highly ambiguous relations is not quantified.
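The front matter in the excerpt above can be read without a YAML dependency when you only need the flat fields. This sketch assumes the header contains nothing but `key: value` lines between two `---` delimiters, as in the excerpt; `parse_front_matter` is an illustrative name, and a real YAML parser is the safer choice if the header ever grows nested values.

```python
def parse_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a document into (front-matter fields, body).

    Handles only flat `key: value` pairs between two `---` lines;
    it is not a general YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    fields: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return fields, "\n".join(lines[i + 1:])
        # Split on the first colon only, so titles containing ":" stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    return {}, text  # no closing delimiter: treat everything as body


sample = """---
paper_id: 2604.11544v1
source_mode: full_pdf
---
### Core Contribution
"""
meta, body = parse_front_matter(sample)
print(meta["paper_id"], meta["source_mode"])
```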
A cross-paper paper_synthesis.md (excerpt):
## Strongest Research Signals
- Append-only temporal representations improve long-horizon extrapolation:
RoMem (arXiv:2604.11544), EST (arXiv:2602.12389v3), and CID-TKG converge on
persistent or dual-view entity state over destructive overwriting, with
consistent MRR/Hits@K gains on ICEWS and GDELT.
- Semantic gating scales better than manual relation tagging: RoMem's Semantic
Speed Gate and EST's energy-barrier gate both learn relational volatility
from text embeddings rather than schema tags…
For multi-topic literature reviews, stakeholder briefings, or agent grounding, two commands take a user-written context file that shapes the output: distill research-brief (Gemini Deep Research, web-augmented) and distill synthesize (Grok 4.20 single-call, corpus-only). See docs/usage.md#research-briefings-and-deep-synthesis.
Dashboard
distill # terminal home screen
distill serve # local web dashboard at http://127.0.0.1:8899
The terminal home screen shows tracked topics, channel and topic watches, recent runs, failures, and rolling spend. The web dashboard adds clickable drill-downs to per-topic, per-channel, and per-video views with rendered markdown, plus cost history and watchlist status. Both auto-refresh and read directly from library files — no database.
MCP server
Claude Desktop / Claude Code config:
{ "mcpServers": { "distill": { "command": "distill-mcp" } } }
Distill exposes 8 tools, 12 resources, and 4 prompts. See docs/mcp.md for the list.
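If your client config already lists other MCP servers, the distill entry needs to be merged in rather than pasted over the file. A minimal sketch of that merge follows; `add_distill_server` is an illustrative helper, the `other` entry is a made-up example, and the location of the config file depends on your client, so it is not assumed here.

```python
import json


def add_distill_server(config: dict) -> dict:
    """Return a copy of `config` with the distill MCP server entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["distill"] = {"command": "distill-mcp"}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing config with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-mcp"}}}
print(json.dumps(add_distill_server(existing), indent=2))
```

The function copies the dictionaries it changes, so the original config object is left untouched.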
Cost
Bulk video analysis costs about $0.006 per video. Gemini Deep Research accounts for most of the cost of a paid report ($2–3 per report). A distill synthesize pass over a multi-topic corpus is about $0.50. Every run logs actual versus estimated cost to library/cost_log.jsonl; distill costs shows the history.
Full cost model in docs/cost.md.
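For a quick budget check before a run, the per-unit figures quoted in this section can be combined with simple arithmetic. The sketch below uses only those figures and treats them as rough constants; `estimate_usd` is an illustrative helper, and docs/cost.md remains the full cost model.

```python
VIDEO_USD = 0.006          # per video, from the figures in this section
REPORT_USD = (2.0, 3.0)    # Gemini Deep Research report, low and high
SYNTHESIZE_USD = 0.50      # approximate, per distill synthesize pass


def estimate_usd(videos: int = 0, reports: int = 0, synth_passes: int = 0) -> tuple[float, float]:
    """Return a (low, high) spend estimate in USD."""
    base = videos * VIDEO_USD + synth_passes * SYNTHESIZE_USD
    low = base + reports * REPORT_USD[0]
    high = base + reports * REPORT_USD[1]
    return round(low, 2), round(high, 2)


low, high = estimate_usd(videos=100, reports=1, synth_passes=1)
print(f"${low:.2f} to ${high:.2f}")  # prints "$3.10 to $4.10"
```

In this example the 100 videos contribute $0.60 of the total, so the report accounts for most of the spend.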
Docs
- docs/usage.md: full command reference
- docs/architecture.md: data flow, 4-phase report pipeline, model routing, security hardening
- docs/outputs.md: what every artifact contains
- docs/cost.md: cost model, examples, guardrails
- docs/mcp.md: MCP tools, resources, prompts
- docs/briefing-contexts/TEMPLATE.md: starting point for --context-file prompts
- private/README.md: where personal/client-specific files go (git-ignored)
Roadmap and changelog
- docs/CHANGELOG.md: what shipped in 0.1.0
- ROADMAP.md: what's next
Contributing
See docs/CONTRIBUTING.md for dev setup, quality gates, and scope. Security disclosures go through docs/SECURITY.md.
License
MIT — see LICENSE.
File details
Details for the file distillr-0.2.0.tar.gz.
File metadata
- Download URL: distillr-0.2.0.tar.gz
- Upload date:
- Size: 269.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 23ee1a45f01c7e5787e0574790508a4409aa8f1ceffddced75727846968649c8 |
| MD5 | 519ab52c85422c2122280f75acc28999 |
| BLAKE2b-256 | d9e7127ad8d98c3c1746437c5b1a28b2141f104d6821e5eb3c07b187a866c05d |
Provenance
The following attestation bundles were made for distillr-0.2.0.tar.gz:
Publisher: publish.yml on blisspixel/distillr
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: distillr-0.2.0.tar.gz
- Subject digest: 23ee1a45f01c7e5787e0574790508a4409aa8f1ceffddced75727846968649c8
- Sigstore transparency entry: 1393618970
- Sigstore integration time:
- Permalink: blisspixel/distillr@f2182e37dfeab296b5b8d18264504f07f98669f5
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/blisspixel
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f2182e37dfeab296b5b8d18264504f07f98669f5
- Trigger Event: push
File details
Details for the file distillr-0.2.0-py3-none-any.whl.
File metadata
- Download URL: distillr-0.2.0-py3-none-any.whl
- Upload date:
- Size: 202.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8eac24da7f0df8a7a358d616fd539b0645045027bb1866e2d23d0e239a20d430 |
| MD5 | 0416133d2298a9255c5e66099b54d575 |
| BLAKE2b-256 | 74c320913cc25a5428856d63bbee9ff722017846ba6a52cff0081be91b246e01 |
Provenance
The following attestation bundles were made for distillr-0.2.0-py3-none-any.whl:
Publisher: publish.yml on blisspixel/distillr
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: distillr-0.2.0-py3-none-any.whl
- Subject digest: 8eac24da7f0df8a7a358d616fd539b0645045027bb1866e2d23d0e239a20d430
- Sigstore transparency entry: 1393618975
- Sigstore integration time:
- Permalink: blisspixel/distillr@f2182e37dfeab296b5b8d18264504f07f98669f5
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/blisspixel
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f2182e37dfeab296b5b8d18264504f07f98669f5
- Trigger Event: push