kabigon
A Python library and CLI tool that extracts content from URLs and returns plain text or markdown. Point it at a YouTube video, a tweet, a Reddit thread, a PDF, or any web page — kabigon selects the right loader automatically.
Intended for developers and data engineers who need reliable, source-aware text extraction without writing per-site scraping logic.
Features
- Automatic loader selection for YouTube, Twitter/X, Truth Social, Reddit, Instagram Reels, PTT, GitHub, BBC, CNN, PDF, and generic web pages
- Fallback chain: if the primary loader fails, remaining loaders are tried in order without repeating already-attempted ones
- Async-first (async/await) with a synchronous wrapper for scripts and notebooks
- Single-line Python API: kabigon.load_url_sync(url)
- CLI for ad-hoc extraction and debugging
- Extensible: add a loader by subclassing Loader and implementing one method
Requirements
- Python 3.12+
- Playwright Chromium browser (for generic web scraping)
- FFmpeg (only for audio/video transcription loaders)
- FIRECRAWL_API_KEY environment variable (only for the Firecrawl loader)
Installation
# Install as a CLI tool
uv tool install kabigon
# Or run directly without installing
uvx kabigon <url>
After installation, install the Chromium browser for Playwright:
playwright install chromium
Quick Start
import kabigon
text = kabigon.load_url_sync("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(text)
Usage
CLI
# Auto-select the best loader
kabigon https://www.youtube.com/watch?v=dQw4w9WgXcQ
kabigon https://x.com/user/status/123456789
kabigon https://reddit.com/r/python/comments/xyz/
kabigon https://github.com/user/repo/blob/main/README.md
kabigon https://example.com/document.pdf
Python — sync
import kabigon
text = kabigon.load_url_sync("https://www.google.com")
print(text)
Python — async
import asyncio
import kabigon
async def main() -> None:
    text = await kabigon.load_url("https://www.google.com")
    print(text)
asyncio.run(main())
Parallel batch loading
import asyncio
import kabigon
async def main() -> None:
    urls = [
        "https://x.com/user/status/123",
        "https://youtube.com/watch?v=abc",
        "https://reddit.com/r/python/comments/xyz",
    ]
    results = await asyncio.gather(*[kabigon.load_url(url) for url in urls])
    for url, content in zip(urls, results, strict=True):
        print(f"{url}: {len(content)} chars")
asyncio.run(main())
API Reference
All public functions are importable from the kabigon package.
| Function | Signature | Description |
|---|---|---|
| load_url_sync | (url: str) -> str | Load a URL synchronously using automatic loader selection |
| load_url | async (url: str) -> str | Load a URL asynchronously using automatic loader selection |
| available_loaders | () -> list[str] | Return names of all registered loaders |
| explain_plan | (url: str) -> dict[str, object] | Return the planned loader chain for a URL without executing it |
import kabigon
# Inspect which loaders would be used for a URL
plan = kabigon.explain_plan("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(plan)
# List all loader names
print(kabigon.available_loaders())
Architecture
The automatic path uses kabigon.pipelines to select a source-aware pipeline, then kabigon.load_chain builds one ordered execution plan. Each loader is constructed only when its turn is reached; the first non-empty string is returned, and if every planned loader fails, kabigon raises LoaderError with the attempted loader details.
Mermaid source: docs/architecture/url-processing.mmd
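The fallback behaviour described above can be illustrated with a simplified, synchronous sketch. It is not kabigon's implementation: `run_chain` and `ChainError` are hypothetical names (the latter standing in for LoaderError), and the real load chain is asynchronous. The sketch only shows the three properties named in the text: lazy construction, first non-empty result wins, and an error that carries the attempted loaders.

```python
from collections.abc import Callable


class ChainError(Exception):
    """Stand-in for LoaderError; records (loader name, failure reason) pairs."""

    def __init__(self, attempts: list[tuple[str, str]]) -> None:
        self.attempts = attempts
        detail = "; ".join(f"{name}: {reason}" for name, reason in attempts)
        super().__init__(f"all loaders failed ({detail})")


def run_chain(
    url: str,
    plan: list[tuple[str, Callable[[], Callable[[str], str]]]],
) -> str:
    """Try each (name, factory) pair in order; return the first non-empty string."""
    attempts: list[tuple[str, str]] = []
    for name, factory in plan:
        try:
            loader = factory()  # constructed only when its turn is reached
            text = loader(url)
        except Exception as exc:
            attempts.append((name, repr(exc)))
            continue
        if text:
            return text
        attempts.append((name, "empty result"))
    # Every planned loader failed or returned nothing.
    raise ChainError(attempts)
```

Passing factories rather than loader instances is what keeps later loaders from being built when an earlier one succeeds.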
Commands
kabigon <url>
Load content from a URL. Automatically selects the best loader.
kabigon https://www.youtube.com/watch?v=dQw4w9WgXcQ
kabigon --list
Print all available loaders and their descriptions.
kabigon --list
kabigon --loader <names> <url>
Override automatic loader selection with a comma-separated list of loader names, tried in order.
kabigon --loader twitter,playwright https://x.com/user/status/123
Use this only for debugging or testing specific loaders. The automatic path is preferred for normal use.
Configuration
Environment variables
| Variable | Required | Purpose |
|---|---|---|
| FIRECRAWL_API_KEY | For firecrawl loader | API key for the Firecrawl web extraction service |
| FFMPEG_PATH | Optional | Custom path to the FFmpeg binary used by Whisper / yt-dlp |
Docker
A Dockerfile is provided. The image includes Playwright with Chromium and runs xvfb-run for headless rendering.
docker build -t kabigon .
docker run --rm kabigon kabigon https://example.com
Project Structure
src/kabigon/
├── core/ # Loader ABC, exceptions, and shared helpers
├── loaders/ # Concrete loader implementations (one file per source)
├── pipelines/ # Pipeline catalog: maps URL patterns to loader chains
├── api.py # Public Python interface (load_url, explain_plan, …)
├── cli.py # Typer CLI entrypoint
└── load_chain.py # Chain execution and fallback logic
tests/
├── loaders/ # Per-loader unit tests
examples/ # Runnable usage samples
URL-to-pipeline matching lives in kabigon.pipelines; loader ordering and fallback policy live in kabigon.load_chain.
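To make the split between matching and ordering concrete, here is a minimal sketch of how a hostname could map to an ordered loader plan. The catalog contents and the `plan_for` function are hypothetical and do not reflect the real contents of kabigon.pipelines; the "youtube" and "reddit" loader names in particular are assumptions.

```python
from urllib.parse import urlparse

# Hypothetical catalog: hostname suffix -> ordered loader names.
CATALOG = {
    "youtube.com": ["youtube"],
    "x.com": ["twitter", "playwright"],
    "reddit.com": ["reddit"],
}
# Generic loaders appended to every plan (assumption).
FALLBACK = ["playwright"]


def plan_for(url: str) -> list[str]:
    """Return the ordered loader names for a URL, without duplicates."""
    host = (urlparse(url).hostname or "").lower()
    chain: list[str] = []
    for suffix, names in CATALOG.items():
        # Match the domain itself or any subdomain such as www.youtube.com.
        if host == suffix or host.endswith("." + suffix):
            chain.extend(names)
    for name in FALLBACK:
        # Skip fallbacks that the source-specific pipeline already planned.
        if name not in chain:
            chain.append(name)
    return chain
```

With this catalog, `plan_for("https://x.com/user/status/123")` lists "playwright" once even though both the pipeline and the fallback name it.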
Development
git clone https://github.com/narumiruna/kabigon.git
cd kabigon
uv sync
playwright install chromium
Lint, format, and type-check:
uv run ruff check . # lint
uv run ruff format . # format
uv run ty check . # type check
uv run ruff check --fix . # auto-fix lint issues
Testing
# Full suite with coverage
uv run pytest -v -s --cov=src tests
# Single loader file
uv run pytest -v -s tests/loaders/test_youtube.py
# Single test
uv run pytest -v -s tests/loaders/test_youtube.py::test_name
Tests must be deterministic and must not rely on live network calls.
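One common way to keep a test off the network is to stub the loader with `unittest.mock.AsyncMock`. The sketch below shows the pattern on a hypothetical function, `first_line`; it is not taken from the project's test suite.

```python
import asyncio
from unittest.mock import AsyncMock


async def first_line(url, load):
    """Code under test: return the first line of whatever the loader yields."""
    text = await load(url)
    return text.splitlines()[0] if text else ""


def test_first_line_uses_stubbed_loader():
    # AsyncMock stands in for a real loader, so no network request is made.
    fake_load = AsyncMock(return_value="Title\nBody")
    assert asyncio.run(first_line("https://example.com", fake_load)) == "Title"
    # The stub also records how it was awaited.
    fake_load.assert_awaited_once_with("https://example.com")
```

Because the stub returns a fixed string, the test gives the same result on every run.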
Troubleshooting
Playwright browser not installed
Executable doesn't exist at /path/to/chromium
playwright install chromium
FFmpeg not found
ffmpeg not found
Install FFmpeg or point to a custom binary:
# Ubuntu / Debian
sudo apt-get install ffmpeg
# macOS
brew install ffmpeg
# Custom binary
export FFMPEG_PATH=/path/to/ffmpeg
Playwright timeout
Timeout 30000ms exceeded
Increase the timeout for slow-loading pages:
from kabigon.loaders import PlaywrightLoader
url = "https://example.com"  # replace with the slow-loading page
loader = PlaywrightLoader(timeout=60_000)  # timeout in milliseconds
text = loader.load_sync(url)
CAPTCHA / rate limiting
Some sites block automated access. kabigon automatically redirects Reddit requests to old.reddit.com to avoid CAPTCHAs. For other sites, add delays between requests or implement retry logic in your calling code.
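Retry logic in calling code can be as small as the sketch below, which retries with exponential backoff. `load_with_retry`, `attempts`, and `base_delay` are hypothetical names; the loader is passed in so that `kabigon.load_url` can be used. Catching every `Exception` is a simplification; real code would likely retry only on errors that look transient.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def load_with_retry(
    url: str,
    load: Callable[[str], Awaitable[str]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Retry a loader with exponential backoff (hypothetical helper)."""
    for attempt in range(attempts):
        try:
            return await load(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next try.
            await asyncio.sleep(base_delay * 2**attempt)
    raise ValueError("attempts must be at least 1")
```

Usage would look like `text = await load_with_retry(url, kabigon.load_url)`.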
Contributing
To add a new loader:
- Create src/kabigon/loaders/<source>.py and subclass Loader.
- Implement async def load(self, url: str) -> str.
- Export the class from src/kabigon/loaders/__init__.py.
- Register the loader in src/kabigon/loader_registry.py.
- If the loader handles a specific source, add a pipeline entry in src/kabigon/pipelines/catalog.py.
- Update load-chain and planning consistency tests if the execution plan changes.
- Add loader tests in tests/loaders/.
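The first two steps might look roughly like the sketch below. To keep it runnable on its own, it defines a stand-in `Loader` base class rather than importing the real one, so the actual base class may differ in detail; `EchoLoader` is a hypothetical example that does no fetching.

```python
import asyncio
from abc import ABC, abstractmethod


class Loader(ABC):
    """Stand-in for kabigon's Loader base class (assumed shape)."""

    @abstractmethod
    async def load(self, url: str) -> str: ...


class EchoLoader(Loader):
    """Hypothetical loader that returns a fixed document instead of fetching."""

    async def load(self, url: str) -> str:
        # A real loader would fetch the page here and convert it to text.
        return f"# Content from {url}"


def demo() -> str:
    """Run the loader once from synchronous code."""
    return asyncio.run(EchoLoader().load("https://example.com"))
```

In the real project the class would subclass the library's Loader and then be exported and registered as listed in the remaining steps.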
License