Source-to-intelligence platform: turn YouTube, websites, and arXiv papers into a structured, reusable corpus with per-source insights, cross-source synthesis, and Deep Research reports.

Maintainers

blisspixel

Project description

Distill

Installed as distillr on PyPI; the CLI is distill.

Turn YouTube, websites, and arXiv papers into a durable, AI-ready research corpus — all plain Markdown on your disk, with stable filenames, YAML metadata, and source receipts.

pip install distillr
distill papers "temporal knowledge graph" --topic tkg --limit 20

That one command searches arXiv, downloads 20 PDFs, extracts full text, runs structured analysis on each, and writes a cross-paper synthesis. For a 20-paper run like the example below, expect single-digit minutes and under a dollar in model spend on the grok-4.3 default. Terminal output during the run looks like this:

Papers: temporal knowledge graph
Topic: tkg | Selected papers: 20

  [1/20] Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge
         Graphs and Agentic Memory
  [2/20] Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities
  ...

  6m 47s  ~$0.58 (391,278 in / 38,117 out)

  time_is_not_a_label_260411544_Paper.md     90.4 KB
  time_is_not_a_label_260411544_Insights.md   8.1 KB
  ...
  tkg_Paper_Synthesis.md  11.8 KB
  tkg_Corpus_Synthesis.md 10.5 KB

Why not just ask Deep Research?

ChatGPT, Gemini Deep Research, and Perplexity are excellent oracles: ask a question, get an answer. Distill is an engine. It automates the tedious ingestion layer, keeps the raw transcripts and paper text next to the analysis, and turns each run into a permanent local corpus that future tools can reuse.

That matters when you are doing thesis work, competitive analysis, technical due diligence, or building a startup knowledge base. You can verify the receipts, refresh the corpus over time, query it through MCP from Claude Desktop / Cursor / other agents, and open the same Markdown folder in Obsidian, Logseq, VS Code, Notion import, or plain filesystem search.

What you get

One local library/ directory of plain Markdown. No database, no cloud lock-in, no proprietary format. Files use globally descriptive names plus YAML frontmatter so knowledge-base tools, Dataview-style plugins, and AI coding assistants can understand them without guessing from generic insights.md tabs.

Four source types, same pipeline shape (capture -> analyze -> synthesize -> report):

YouTube — channels, topic searches, videos, Shorts
Websites — vendor sites, research hubs, curated URL sets (browser-first crawl with PDF/embedded-video ingestion)
arXiv papers — phrase-matched search, full-PDF extraction, structured per-paper insights, cross-paper synthesis
X (Twitter) posts — via distill ingest <tweet-url>; uses the public syndication embed endpoint (no anti-bot scraping). When a tweet has a native video attachment, the audio is transcribed via local-first Whisper (faster-whisper on GPU/CPU, OpenAI Whisper as cloud fallback) with a vocabulary hint derived from the source metadata to keep proper nouns intact.

Plus an MCP server so AI assistants and agent systems can query the library directly.

Quick start

pip install distillr
playwright install chromium     # for YouTube search + website capture
distill doctor                  # verify API keys + system health

Set two keys in .env (copy from .env.example):

XAI_API_KEY=xai-...             # Grok models
GEMINI_API_KEY=AIza...          # Gemini Deep Research (reports + briefings)

Or run locally with Ollama (no API keys needed for ingestion):

ollama pull qwen3.5:27b         # download recommended model for 24GB GPU
echo "DISTILL_PROVIDER=ollama" >> .env
distill doctor                  # verify local setup

Then try any of:

# Goal-aware cross-source discovery (papers + videos + curated sites, reranked against a goal)
distill discover "help an AI become a great music composer" --topic music --preview
distill discover --goal-file private/my-goal.md --topic research --yes
distill discover --goal-file private/agent365-goal.md --topic agent365 --site-seeds private/agent365_sites.json --site-limit 10 --preview

# Get smart on a YouTube topic, fast
distill latest "Microsoft Fabric best practices" --limit 10 --report

# Discover and ingest arXiv papers — expands the query, LLM-reranks candidates,
# picks the top N (use --preview to see the shortlist without ingesting)
distill papers "agent memory systems" --topic memory --limit 20
distill papers "agent memory systems" --topic memory --limit 20 --preview

# Distill a vendor/research site
distill site-batch configs/example_seeds.json --topic example --seed-only

The full command reference lives in docs/usage.md.

Mental model

library/
  └── topics/<topic>/
       ├── channels/<creator>/videos/<video>/
       │     ├── <video-slug>_Transcript.txt
       │     └── <video-slug>_Insights.md
       ├── sites/<hostname>/pages/<page>/
       │     ├── <page-slug>_Content.md
       │     └── <page-slug>_Insights.md
       ├── papers/<paper>/
       │     ├── <paper-slug>_Paper.md
       │     └── <paper-slug>_Insights.md
       ├── <topic>_Topic_Synthesis.md      # cross-source
       └── <topic>_Corpus_Synthesis.md     # mixed-source view

You build a topic library over time. Ingest once, refresh on a cadence, generate a report or briefing when you need one. Older insights.md-style libraries are still readable, but new Markdown writes use the stable knowledge-base naming scheme.

See docs/outputs.md for what every artifact contains.
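Because the layout above is plain files with predictable suffixes, a topic can be inventoried with a few lines of standard-library Python. The following is a minimal sketch, not part of Distill: the helper name `inventory_topic` is hypothetical, and the suffix list is taken only from the tree shown above.

```python
from collections import Counter
from pathlib import Path

# Suffixes copied from the layout above; the helper itself is hypothetical.
ARTIFACT_SUFFIXES = (
    "_Transcript.txt",
    "_Content.md",
    "_Paper.md",
    "_Insights.md",
    "_Synthesis.md",
)


def inventory_topic(topic_dir: Path) -> Counter:
    """Count artifacts under one topic directory, keyed by filename suffix."""
    counts: Counter = Counter()
    for path in topic_dir.rglob("*"):
        if not path.is_file():
            continue
        for suffix in ARTIFACT_SUFFIXES:
            if path.name.endswith(suffix):
                counts[suffix] += 1
                break  # each file is counted under one suffix only
    return counts
```

Calling `inventory_topic(Path("library/topics/tkg"))` would return a count per artifact kind; files that match none of the suffixes are ignored.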

Sample output

A cross-paper <topic>_Paper_Synthesis.md (excerpt):

## Strongest Research Signals

- Append-only temporal representations improve long-horizon extrapolation:
  RoMem (arXiv:2604.11544), EST (arXiv:2602.12389v3), and CID-TKG converge on
  persistent or dual-view entity state over destructive overwriting, with
  consistent MRR/Hits@K gains on ICEWS and GDELT.

- Semantic gating scales better than manual relation tagging: RoMem's Semantic
  Speed Gate and EST's energy-barrier gate both learn relational volatility
  from text embeddings rather than schema tags…

Per-paper <paper-slug>_Insights.md (excerpt):

---
title: "Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs"
type: "insights"
topic: "tkg"
source: "arxiv"
source_id: "2604.11544v1"
url: "https://arxiv.org/abs/2604.11544v1"
authors: ["Alice Example", "Bob Example"]
tags: ["distill/tkg", "source/arxiv", "cs.AI"]
synthesis_scope: "single-paper"
analyzed_by: grok-4.20-0309-reasoning
source_mode: full_pdf
---

### Core Contribution
1. Continuous functional rotation θ_r(τ) = s · α_r · τ · ω instead of discrete
   timestamp lookup tables. Zero-shot interpolation of unseen dates.
2. Semantic Speed Gate: MLP that reads only text embedding ϕ(r) and outputs α_r.
   Learns relational volatility from data.
3. Geometric shadowing in complex space: obsolete facts rotated out of phase so
   the correct fact outranks contradictions via the scoring function alone.

### Methods and Evidence
- On ICEWS05-15, RoMem-ChronoR reaches 72.6 MRR (vs vanilla ChronoR 68.4).
- Zero-shot domain transfer to FinTMMBench: 0.728 MRR, 0.673 R@5.
- All baselines use identical answer LLM and judge for fairness.

### Limits and Open Questions
- Computational cost at millions-of-facts scale is motivation but no latency,
  memory, or throughput numbers are reported.
- Gate pretrained only on ICEWS05-15 political events; generalization to
  highly ambiguous relations is not quantified.
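The frontmatter in the excerpt above is flat `key: value` metadata, so downstream tools can read it without loading the whole file into a knowledge-base app. The sketch below is an assumption-laden illustration, not Distill's own reader: `read_frontmatter` is a hypothetical name, and it handles only the value shapes visible in the excerpt (quoted strings, lists of quoted strings, bare scalars). A real YAML parser is the safer choice for anything richer.

```python
import json


def read_frontmatter(text: str) -> dict:
    """Parse a flat `key: value` block delimited by `---` lines.

    Assumption: values are JSON-compatible (quoted strings, lists of quoted
    strings) or bare scalars, as in the excerpt. Not a general YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing delimiter reached
        key, sep, raw = line.partition(":")  # split on the first colon only
        if not sep:
            continue
        raw = raw.strip()
        try:
            meta[key.strip()] = json.loads(raw)
        except json.JSONDecodeError:
            meta[key.strip()] = raw  # bare scalar such as full_pdf
    return {}  # no closing delimiter: treat as no frontmatter
```

Splitting on the first colon keeps titles and URLs that contain colons intact, because the remainder of the line is parsed as one value.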

For multi-topic literature reviews, stakeholder briefings, or agent grounding, distill research-brief (Gemini Deep Research, web-augmented) and distill synthesize (Grok 4.20 single-call, corpus-only) take a user-written context file that shapes the output. See docs/usage.md#research-briefings-and-deep-synthesis.

Dashboard

distill                         # terminal home screen
distill serve                   # local web dashboard at http://127.0.0.1:8899

The terminal home screen shows tracked topics, channel and topic watches, recent runs, failures, and rolling spend. The web dashboard adds clickable drill-downs to per-topic, per-channel, and per-video views with rendered markdown, plus cost history and watchlist status. Both auto-refresh and read directly from library files — no database.

MCP server, and agent-discoverable directories

Distill is built for two parallel agent-integration paths:

Path 1 — MCP (structured queries). Claude Desktop / Claude Code config:

{ "mcpServers": { "distill": { "command": "distill-mcp" } } }

Distill exposes 22 tools, 12 resources, and 4 prompts. See docs/mcp.md for the list.
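If the client already has other MCP servers configured, the entry above has to be merged rather than pasted over the file. The following sketch shows one way to do that on an in-memory config; `add_distill_server` and the `other` server entry are hypothetical, and where the config file lives depends on the client and is not covered here.

```python
import json


def add_distill_server(config: dict) -> dict:
    """Return a copy of an MCP client config with the distill entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # copy so the input is untouched
    servers["distill"] = {"command": "distill-mcp"}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing config with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-mcp"}}}
print(json.dumps(add_distill_server(existing), indent=2))
```

The function copies the nested mapping before editing it, so existing servers are preserved and the original dictionary is not mutated.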

Path 2 — file system (the corpus IS the interface). When a coding agent cds into library/topics/<your-topic>/, the directory is plain Markdown with stable filenames and YAML frontmatter, so grep, cat, ls, and find are first-class query primitives — no schema to learn, no MCP setup required. From 0.8.4 forward, every topic directory ships an auto-generated CLAUDE.md orientation file that agents which auto-load it (Claude Code, Cursor, others) pick up automatically. This matches what Anthropic's Agent SDK material recommends for agent design: file system + composable tools as the substrate, with structured APIs layered on top when they help, not as the only entry point.
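For agents or scripts that prefer Python over shell tools, the same grep-style query is a short loop over the Markdown files. This is a minimal sketch under the assumption that a case-insensitive substring match is enough; `grep_markdown` is a hypothetical helper, not something Distill ships.

```python
from pathlib import Path


def grep_markdown(root: Path, needle: str) -> list[tuple[Path, int, str]]:
    """Case-insensitive substring search over every .md file under root."""
    needle = needle.lower()
    hits = []
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((path, lineno, line.strip()))
    return hits
```

Each hit carries the file path and line number, which lets a caller point back to the exact artifact that supports a claim.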

Cost

On the grok-4.3 default ($1.25/$2.50 per 1M tokens), bulk video analysis runs ~$0.03/video and a full paper ~$0.03; Gemini Deep Research dominates paid reports (~$2–3/report); distill synthesize is ~$0.20–0.40 for a multi-topic corpus pass. grok-4.3 is the cloud floor: xAI retired the cheaper fast tiers (grok-4-1-fast etc.) on 2026-05-15, and those slugs now redirect to grok-4.3 and bill at grok-4.3 rates (see docs/migration-grok-4.3.md). The only cheaper path is running analysis on a local model (Ollama/LM Studio). distill eval --models grok-4.3,<local-model> measures the cost × quality tradeoff over frozen fixtures and recommends the cheapest model that clears your quality bar before you switch. Every run logs actual vs estimated cost to cost_log.jsonl, and the pre-run estimate self-calibrates against that history; distill costs shows it.
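The per-paper figure can be checked against the 20-paper example run near the top of this page. The sketch below assumes that $1.25 is the input rate and $2.50 the output rate per 1M tokens, and that the example run was billed at those rates; the function name `run_cost` is hypothetical.

```python
INPUT_USD_PER_M = 1.25   # assumed input rate per 1M tokens
OUTPUT_USD_PER_M = 2.50  # assumed output rate per 1M tokens


def run_cost(tokens_in: int, tokens_out: int) -> float:
    """Model spend in USD for one run at the per-million-token rates above."""
    return (tokens_in * INPUT_USD_PER_M + tokens_out * OUTPUT_USD_PER_M) / 1_000_000


# Token counts from the example run: 391,278 in / 38,117 out across 20 papers.
total = run_cost(391_278, 38_117)
print(f"total ≈ ${total:.2f}, per paper ≈ ${total / 20:.3f}")
# prints: total ≈ $0.58, per paper ≈ $0.029
```

Under those assumptions the arithmetic reproduces the ~$0.58 shown in the terminal output, and dividing by 20 papers gives roughly $0.03 per paper.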

Full cost model in docs/cost.md.

Docs

docs/usage.md — full command reference
docs/invariants.md — design charter: what distill is, is not, and the rules that don't bend
docs/architecture.md — data flow, 4-phase report pipeline, model routing, security hardening
docs/outputs.md — what every artifact contains
docs/cost.md — cost model, examples, guardrails
docs/mcp.md — MCP tools, resources, prompts
docs/migration-grok-4.3.md — Grok 4.3 migration guide (model retirement May 15, 2026)
docs/briefing-contexts/TEMPLATE.md — starting point for --context-file prompts
private/README.md — where personal/client-specific files go (git-ignored)

Roadmap and changelog

docs/CHANGELOG.md — what shipped
ROADMAP.md — what's next

Recent: 0.8.7 Security hardening (shipped 2026-05-30). Indirect-prompt-injection resistance threaded into every per-source analysis prompt (ingested sources are treated as untrusted data, not instructions), and the local dashboard's rendered HTML is now sanitized (nh3) to close a stored-XSS vector — see the Security posture section for the threat model and what's deliberately out of scope. Recent work also includes 0.8.4 Agent-discoverable library (auto-generated per-topic and library-root CLAUDE.md; distill claude-md --all backfills — see ROADMAP.md) and 0.8.5–0.8.6 (Gemini model refresh + cost-tracking completeness).

Recent: 0.9.1–0.9.4 Discovery-loop close-out (shipped 2026-06-01). The pre-run spend estimate now scales per-video cost by duration and self-calibrates against cost_log.jsonl history, reporting an honest range (0.9.1); discover --preview saves the exact ranked shortlist under an id so discover --from-preview <id> ingests precisely what you previewed (0.9.2); on a fresh topic discover leads with a size-then-approve menu — Excellent / Including good / Everything worthwhile, each with its own spend — instead of silently auto-ingesting (0.9.3); and --rigor now works across discover/papers/latest on per-command calibrated thresholds (0.9.4). See docs/CHANGELOG.md.

Also recent: 0.9.6–0.9.7 distill eval, a cost × quality model-selection harness: sweep candidate models over frozen fixtures and get a deterministic recommendation for the cheapest model that clears your quality bar, with a pairwise order-randomized judge (advisory, bias-cancelled) setting a confidence flag, verbosity-resistant scoring, 3 fixtures/workload, and a drift-tracking results log. It is the way to decide whether a free local model can replace the grok-4.3 cloud floor. See docs/usage.md.
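The selection rule described here, the cheapest model that clears a quality bar, can be illustrated in a few lines. This is a sketch of the general idea only, not distill eval's implementation: the `Candidate` type, the tie-break order, and the numbers in the usage lines are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    model: str
    cost_usd: float  # measured cost over the fixture set
    quality: float   # any score where higher is better


def cheapest_passing(candidates: list[Candidate], quality_bar: float) -> Optional[Candidate]:
    """Pick the cheapest candidate whose quality meets the bar, or None."""
    passing = [c for c in candidates if c.quality >= quality_bar]
    # Ties on cost go to the higher quality, then to the model name, so the
    # result does not depend on input order.
    return min(passing, key=lambda c: (c.cost_usd, -c.quality, c.model), default=None)


candidates = [
    Candidate("grok-4.3", cost_usd=0.58, quality=0.90),     # illustrative numbers
    Candidate("local-model", cost_usd=0.00, quality=0.82),  # illustrative numbers
]
print(cheapest_passing(candidates, quality_bar=0.85))
```

With a bar of 0.85 the free local model is rejected despite its cost, which is the situation the harness is meant to surface before a switch.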

Next: source breadth + audio capability — the five-adapter set (podcasts, GitHub repos, generic audio/video files, Substack, X hardening — see ROADMAP.md). Then the self-maintaining audit (distill audit bundles the health/link/gap checks into one report + action menu), and 0.10 (operational polish + run-time verify hook + the distill ask output->input loop + sub-agent-friendly MCP tools).

Contributing

See docs/CONTRIBUTING.md for dev setup, quality gates, and scope. Security disclosures go through docs/SECURITY.md.

License

MIT — see LICENSE.

Release history Release notifications | RSS feed

0.12.13

Jun 12, 2026

0.12.12

Jun 12, 2026

0.12.11

Jun 12, 2026

0.12.10

Jun 12, 2026

0.12.9

Jun 12, 2026

0.12.8

Jun 12, 2026

0.12.7

Jun 12, 2026

0.12.6

Jun 12, 2026

0.12.5

Jun 12, 2026

0.12.4

Jun 12, 2026

0.12.3

Jun 12, 2026

0.12.2

Jun 12, 2026

0.12.1

Jun 12, 2026

0.12.0

Jun 12, 2026

0.11.2

Jun 12, 2026

0.11.1

Jun 11, 2026

0.11.0

Jun 11, 2026

0.10.2

Jun 11, 2026

0.10.1

Jun 11, 2026

0.10.0

Jun 11, 2026

0.9.31

Jun 11, 2026

0.9.30

Jun 11, 2026

0.9.29

Jun 11, 2026

0.9.28

Jun 11, 2026

0.9.27

Jun 11, 2026

0.9.26

Jun 11, 2026

0.9.25

Jun 9, 2026

0.9.24

Jun 9, 2026

0.9.23

Jun 7, 2026

0.9.22

Jun 6, 2026

0.9.21

Jun 6, 2026

0.9.20

Jun 6, 2026

This version

0.9.19

Jun 6, 2026

0.9.18

Jun 6, 2026

0.9.17

Jun 6, 2026

0.9.16

Jun 6, 2026

0.9.15

Jun 6, 2026

0.9.14

Jun 6, 2026

0.9.13

Jun 1, 2026

0.9.12

Jun 1, 2026

0.9.11

Jun 1, 2026

0.9.10

Jun 1, 2026

0.9.9

Jun 1, 2026

0.9.8

Jun 1, 2026

0.9.7

Jun 1, 2026

0.9.6

Jun 1, 2026

0.9.5

Jun 1, 2026

0.9.4

Jun 1, 2026

0.9.0

May 30, 2026

0.8.12

May 30, 2026

0.8.11

May 30, 2026

0.8.10

May 30, 2026

0.8.9

May 30, 2026

0.8.8

May 30, 2026

0.8.7

May 30, 2026

0.8.6

May 30, 2026

0.8.5

May 30, 2026

0.8.4

May 30, 2026

0.8.3

May 30, 2026

0.8.2

May 30, 2026

0.8.1

May 16, 2026

0.8.0.3

May 16, 2026

0.8.0.2

May 15, 2026

0.8.0.1

May 15, 2026

0.8.0

May 15, 2026

0.7.2

May 15, 2026

0.7.1

May 8, 2026

0.7.0

May 8, 2026

0.6.1

May 7, 2026

0.6.0

May 7, 2026

0.5.0

May 5, 2026

0.4.0

May 4, 2026

0.3.2

May 3, 2026

0.3.1

May 3, 2026

0.3.0

Apr 28, 2026

0.2.0

Apr 27, 2026

0.1.0

Apr 26, 2026

Download files

Source Distribution

distillr-0.9.19.tar.gz (388.4 kB)

Uploaded Jun 6, 2026 Source

Built Distribution

distillr-0.9.19-py3-none-any.whl (478.9 kB)

Uploaded Jun 6, 2026 Python 3

File details

Details for the file distillr-0.9.19.tar.gz.

File metadata

Download URL: distillr-0.9.19.tar.gz
Upload date: Jun 6, 2026
Size: 388.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for distillr-0.9.19.tar.gz
Algorithm	Hash digest
SHA256	`7f5208c352a79ffd182488fa51d58e736a7797d8a7c805a9950f2ff33bf4e922`
MD5	`74544336f3caa6f8b9cdf9b397d08bcc`
BLAKE2b-256	`53e45093d115d8f670f41c2f3e1c9d9f5ee71ff25e32b4f32a2e1e8d94c324c7`

Provenance

The following attestation bundles were made for distillr-0.9.19.tar.gz:

Publisher: publish.yml on blisspixel/distillr

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: distillr-0.9.19.tar.gz
- Subject digest: 7f5208c352a79ffd182488fa51d58e736a7797d8a7c805a9950f2ff33bf4e922
- Sigstore transparency entry: 1740754588
- Sigstore integration time: Jun 6, 2026
Source repository:
- Permalink: blisspixel/distillr@8e1f58cf6be6e8b9e3ffbf9c294599516646a7ed
- Branch / Tag: refs/tags/v0.9.19
- Owner: https://github.com/blisspixel
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8e1f58cf6be6e8b9e3ffbf9c294599516646a7ed
- Trigger Event: push

File details

Details for the file distillr-0.9.19-py3-none-any.whl.

File metadata

Download URL: distillr-0.9.19-py3-none-any.whl
Upload date: Jun 6, 2026
Size: 478.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for distillr-0.9.19-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a0b87430e7950ab1d6b67301224d7d16b05c6f445fd90bac87f39a730336b48e`
MD5	`8c32cd8f5cc181a91601cd9e5451a1eb`
BLAKE2b-256	`9eea5edf23ae7d29b7a8d8c8700b6314cdb066e3687be9ec44db789fd1ff3626`

Provenance

The following attestation bundles were made for distillr-0.9.19-py3-none-any.whl:

Publisher: publish.yml on blisspixel/distillr

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: distillr-0.9.19-py3-none-any.whl
- Subject digest: a0b87430e7950ab1d6b67301224d7d16b05c6f445fd90bac87f39a730336b48e
- Sigstore transparency entry: 1740754616
- Sigstore integration time: Jun 6, 2026
Source repository:
- Permalink: blisspixel/distillr@8e1f58cf6be6e8b9e3ffbf9c294599516646a7ed
- Branch / Tag: refs/tags/v0.9.19
- Owner: https://github.com/blisspixel
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8e1f58cf6be6e8b9e3ffbf9c294599516646a7ed
- Trigger Event: push
