kabigon
A Python library that extracts content from URLs and converts the result to text or markdown. Feed it a YouTube video, a tweet, a Reddit thread, a PDF, or any web page — kabigon picks the right loader automatically.
Features
- Smart routing: recognises YouTube, Twitter/X, Truth Social, Reddit, Instagram Reels, PTT, GitHub, BBC, CNN, PDFs, and generic web pages, then selects the best extraction pipeline
- Automatic fallback: if the primary loader fails, the remaining loaders are tried in order without repeating work
- Async-first: built on async/await; a synchronous wrapper is provided for convenience
- Single-line API: kabigon.load_url_sync(url) is all you need to get started
- Extensible: add a new loader by subclassing Loader and implementing one method
Table of Contents
- Installation
- Quick Start
- CLI
- Python API
- Supported Sources
- Architecture
- Configuration
- Troubleshooting
- Development
- License
Installation
# Install as a CLI tool
uv tool install kabigon
# Or run directly without installing
uvx kabigon <url>
After installation, install a browser for Playwright (required for generic web scraping):
playwright install chromium
Quick Start
import kabigon
# One line to load any URL
text = kabigon.load_url_sync("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(text)
CLI
# Load content from a URL (auto-selects the best loader)
kabigon https://www.youtube.com/watch?v=dQw4w9WgXcQ
# List all available loaders
kabigon --list
# Use a specific loader (or a comma-separated chain)
kabigon --loader youtube https://www.youtube.com/watch?v=dQw4w9WgXcQ
kabigon --loader youtube,playwright https://www.youtube.com/watch?v=dQw4w9WgXcQ
Without --loader, kabigon routes the URL to a source-specific pipeline first, then falls back to the remaining default loaders without repeating already-attempted ones.
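The "source-specific first, then remaining defaults, no repeats" behaviour can be pictured as building one ordered, de-duplicated list of loader names. The sketch below is illustrative only: `build_plan` is a hypothetical helper, not kabigon's internal API, and the names `youtube-ytdlp` and `httpx` are assumed for the example.

```python
def build_plan(primary: list[str], defaults: list[str]) -> list[str]:
    """Return primary loaders first, then any defaults not already listed."""
    plan: list[str] = []
    seen: set[str] = set()
    for name in [*primary, *defaults]:
        if name not in seen:  # skip loaders that are already in the plan
            seen.add(name)
            plan.append(name)
    return plan


# Hypothetical loader names, used for illustration only.
plan = build_plan(["youtube", "youtube-ytdlp"], ["playwright", "youtube", "httpx"])
print(plan)
```

Because `youtube` already appears among the primary loaders, it is not tried a second time when the defaults are appended.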
More examples:
kabigon https://x.com/elonmusk/status/123456789
kabigon https://truthsocial.com/@realDonaldTrump/posts/123456
kabigon https://reddit.com/r/python/comments/xyz/...
kabigon https://github.com/user/repo/blob/main/README.md
kabigon https://example.com/document.pdf
Python API
Sync
import kabigon
# Automatic loader selection
text = kabigon.load_url_sync("https://www.google.com")
print(text)
Async
import asyncio
import kabigon
async def main() -> None:
    text = await kabigon.load_url("https://www.google.com")
    print(text)

    # Parallel batch loading
    urls = [
        "https://x.com/user/status/123",
        "https://youtube.com/watch?v=abc",
        "https://reddit.com/r/python/comments/xyz",
    ]
    results = await asyncio.gather(*[kabigon.load_url(url) for url in urls])
    for url, content in zip(urls, results, strict=True):
        print(f"{url}: {len(content)} chars")

asyncio.run(main())
Custom Loader Chains
Use Compose to build a custom pipeline that tries loaders in order:
from kabigon.loaders import Compose, TwitterLoader, YoutubeLoader, PlaywrightLoader
loader = Compose([
    TwitterLoader(),
    YoutubeLoader(),
    PlaywrightLoader(),  # generic fallback
])
text = loader.load_sync("https://x.com/user/status/123")
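To make the "tries loaders in order" behaviour concrete, here is a rough, self-contained sketch of a first-success chain. It is not kabigon's Compose implementation: `FirstSuccess`, `ChainError`, `FailingLoader`, and `EchoLoader` are hypothetical stand-ins, and the assumption is that a loader signals "cannot handle this URL" by raising an exception.

```python
import asyncio


class ChainError(Exception):
    """Raised when every loader in the chain fails (hypothetical name)."""


class FailingLoader:
    """Stand-in loader that always fails."""

    async def load(self, url: str) -> str:
        raise ValueError("unsupported URL")


class EchoLoader:
    """Stand-in loader that always succeeds."""

    async def load(self, url: str) -> str:
        return f"loaded {url}"


class FirstSuccess:
    """Try each loader in order and return the first result that succeeds."""

    def __init__(self, loaders: list) -> None:
        self.loaders = loaders

    async def load(self, url: str) -> str:
        errors: list[Exception] = []
        for loader in self.loaders:
            try:
                return await loader.load(url)
            except Exception as exc:  # remember the failure, move to the next loader
                errors.append(exc)
        raise ChainError(f"all {len(errors)} loaders failed for {url}")

    def load_sync(self, url: str) -> str:
        return asyncio.run(self.load(url))


chain = FirstSuccess([FailingLoader(), EchoLoader()])
text = chain.load_sync("https://example.com")
print(text)
```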
Utility Functions
import kabigon
# Show which loaders kabigon would use for a URL
plan = kabigon.explain_plan("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(plan)
# List all registered loader names
loaders = kabigon.available_loaders()
print(loaders)
API Summary
| Style | One-liner | Custom chain |
|---|---|---|
| Sync | kabigon.load_url_sync(url) | loader.load_sync(url) |
| Async | await kabigon.load_url(url) | await loader.load(url) |
| Batch | await asyncio.gather(*[kabigon.load_url(u) for u in urls]) | await asyncio.gather(*[loader.load(u) for u in urls]) |
Supported Sources
| Source | Loader | Notes |
|---|---|---|
| YouTube | YoutubeLoader | Transcript extraction via youtube-transcript-api |
| YouTube | YoutubeYtdlpLoader | Audio download + Whisper transcription |
| Twitter / X | TwitterLoader | Supports x.com, fxtwitter.com, vxtwitter.com, and others |
| Truth Social | TruthSocialLoader | Post content extraction |
| Reddit | RedditLoader | Posts and comments; auto-redirects to old.reddit.com |
| Instagram Reels | ReelLoader | Audio transcription via yt-dlp + Whisper |
| GitHub | GitHubLoader | File content from github.com/.../blob/... and raw.githubusercontent.com |
| BBC | BBCLoader | Article-aware HTML parsing |
| CNN | CNNLoader | Article-aware HTML parsing |
| PDF | PDFLoader | Text extraction from remote or local PDF files |
| PTT | PttLoader | Taiwan PTT (BBS) forum posts |
| Generic web | PlaywrightLoader | Full browser rendering via Playwright |
| Generic web | HttpxLoader | Lightweight HTTP fetch + HTML-to-markdown |
| Generic web | FirecrawlLoader | Web extraction via the Firecrawl API |
| Audio / Video | YtdlpLoader | Generic audio transcription via yt-dlp + Whisper |
Architecture
kabigon follows a layered architecture:
Interface (CLI) → Application (routing, strategy, planning) → Domain (Loader ABC, models, errors)
↓
Loaders (concrete implementations)
Request flow:
- The URL enters via the CLI or load_url().
- routing.py matches the URL against known patterns (YouTube, Twitter, …) to select a source-specific pipeline.
- strategy.py + planner.py build a LoaderPlan: the primary loaders followed by fallback loaders (de-duplicated).
- executor.py instantiates the loaders; Compose runs them in sequence and returns the first successful result.
To add a new loader, create a file in src/kabigon/loaders/, subclass Loader, implement async def load(self, url: str) -> str, register it in infrastructure/registry.py, and add a routing rule if the loader handles a specific domain.
Configuration
Environment Variables
| Variable | Purpose |
|---|---|
| FFMPEG_PATH | Custom path to the FFmpeg binary (used by Whisper / yt-dlp audio transcription) |
| FIRECRAWL_API_KEY | API key for the Firecrawl loader |
Docker
A Dockerfile is provided for containerised usage:
docker build -t kabigon .
# "kabigon" after the image name is the CLI command
docker run --rm kabigon kabigon https://example.com
The image includes Playwright with Chromium and uses xvfb-run for headless browser rendering.
Troubleshooting
Playwright browser not installed
Executable doesn't exist at /path/to/chromium
Install the browser after installing kabigon:
playwright install chromium
FFmpeg not found
ffmpeg not found
Install FFmpeg:
# Ubuntu / Debian
sudo apt-get install ffmpeg
# macOS
brew install ffmpeg
Or point to a custom binary:
export FFMPEG_PATH=/path/to/ffmpeg
Timeout errors
Timeout 30000ms exceeded
Increase the timeout for slow-loading pages:
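from kabigon.loaders import PlaywrightLoader
url = "https://example.com"  # the slow-loading page
loader = PlaywrightLoader(timeout=60_000)  # milliseconds, matching the error message above
text = loader.load_sync(url)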
CAPTCHA / rate limiting
Some websites block automated access. kabigon automatically uses old.reddit.com for Reddit to avoid CAPTCHAs. For other sites, consider adding delays between requests or implementing retry logic.
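One possible shape for such retry logic is exponential backoff around any async loader function. This is a minimal sketch, not a kabigon feature: `load_with_retry` and `Flaky` are hypothetical names, and `Flaky` simulates a loader that times out twice before succeeding.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def load_with_retry(
    load: Callable[[str], Awaitable[str]],
    url: str,
    attempts: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Call load(url), sleeping base_delay * 2**n seconds after the n-th failure."""
    for attempt in range(attempts):
        try:
            return await load(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2**attempt)
    raise ValueError("attempts must be at least 1")


class Flaky:
    """Stand-in loader that fails twice, then succeeds."""

    def __init__(self) -> None:
        self.calls = 0

    async def __call__(self, url: str) -> str:
        self.calls += 1
        if self.calls < 3:
            raise TimeoutError("simulated timeout")
        return f"loaded {url}"


flaky = Flaky()
result = asyncio.run(load_with_retry(flaky, "https://example.com", base_delay=0.01))
print(result, "after", flaky.calls, "calls")
```

In real use, `kabigon.load_url` could be passed as `load`; a longer `base_delay` also spaces requests out, which may help with rate limits.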
Development
Setup
git clone https://github.com/narumiruna/kabigon.git
cd kabigon
uv sync
playwright install chromium
Testing
# Full suite with coverage
uv run pytest -v -s --cov=src tests
# Single file
uv run pytest -v -s tests/loaders/test_youtube.py
# Single test
uv run pytest -v -s tests/loaders/test_youtube.py::test_name
Linting and Type Checking
uv run ruff check . # lint
uv run ruff format . # format
uv run ty check . # type check
uv run ruff check --fix . # auto-fix lint issues
Building and Publishing
uv build --wheel
uv publish
Adding a New Loader
- Create src/kabigon/loaders/<source>.py and subclass Loader.
- Implement async def load(self, url: str) -> str.
- Export the class from src/kabigon/loaders/__init__.py.
- Register the loader in src/kabigon/infrastructure/registry.py.
- Add a URL-matching rule in src/kabigon/application/routing.py (if domain-specific).
- Add tests in tests/loaders/.
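The first two steps above might look like the sketch below. To keep it runnable on its own, it defines a stand-in `Loader` base class instead of importing kabigon's. `ExampleLoader` and its behaviour are hypothetical, and raising an exception for unsupported URLs so that a chain can move on to the next loader is an assumption.

```python
import asyncio
from abc import ABC, abstractmethod


class Loader(ABC):
    """Stand-in for kabigon's Loader base class (assumed shape)."""

    @abstractmethod
    async def load(self, url: str) -> str: ...

    def load_sync(self, url: str) -> str:
        return asyncio.run(self.load(url))


class ExampleLoader(Loader):
    """Hypothetical loader that only accepts example.com URLs."""

    async def load(self, url: str) -> str:
        if "example.com" not in url:
            raise ValueError(f"not an example.com URL: {url}")
        # A real loader would fetch the page and convert it to markdown here.
        return f"# Example\n\nContent extracted from {url}"


text = ExampleLoader().load_sync("https://example.com/post/1")
print(text)
```

A matching test in tests/loaders/ could assert on the returned text for a known URL and on the raised error for an unsupported one.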
License