kabigon

Requires Python 3.12+. Licensed under MIT.

A Python library that extracts content from URLs and converts the result to text or markdown. Feed it a YouTube video, a tweet, a Reddit thread, a PDF, or any web page — kabigon picks the right loader automatically.

Features

  • Smart routing — recognises YouTube, Twitter/X, Truth Social, Reddit, Instagram Reels, PTT, GitHub, BBC, CNN, PDFs, and generic web pages, then selects the best extraction pipeline
  • Automatic fallback — if the primary loader fails, remaining loaders are tried in order without repeating work
  • Async-first — built on async/await; a synchronous wrapper is provided for convenience
  • Single-line API — kabigon.load_url_sync(url) is all you need to get started
  • Extensible — add a new loader by subclassing Loader and implementing one method

Installation

# Install as a CLI tool
uv tool install kabigon

# Or run directly without installing
uvx kabigon <url>

After installation, install a browser for Playwright (required for generic web scraping):

playwright install chromium

Quick Start

import kabigon

# One line to load any URL
text = kabigon.load_url_sync("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(text)

CLI

# Load content from a URL (auto-selects the best loader)
kabigon https://www.youtube.com/watch?v=dQw4w9WgXcQ

# List all available loaders
kabigon --list

# Use a specific loader (or a comma-separated chain)
kabigon --loader youtube https://www.youtube.com/watch?v=dQw4w9WgXcQ
kabigon --loader youtube,playwright https://www.youtube.com/watch?v=dQw4w9WgXcQ

Without --loader, kabigon routes the URL to a source-specific pipeline first, then falls back to the remaining default loaders without repeating already-attempted ones.
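The "without repeating" behaviour amounts to an order-preserving merge of two loader lists. The following is a minimal sketch of that idea, not kabigon's actual planner: build_plan is a hypothetical helper, and the loader names are illustrative only.

```python
from collections.abc import Iterable


def build_plan(primary: Iterable[str], defaults: Iterable[str]) -> list[str]:
    """Return the primary loaders followed by the defaults, skipping repeats."""
    plan: list[str] = []
    seen: set[str] = set()
    for name in [*primary, *defaults]:
        if name not in seen:
            seen.add(name)
            plan.append(name)
    return plan


# Hypothetical loader names, chosen only for illustration.
plan = build_plan(["youtube", "playwright"], ["playwright", "httpx", "youtube"])
print(plan)  # ['youtube', 'playwright', 'httpx']
```

Source-specific loaders keep their position at the front, and a loader that appears in both lists is attempted only once.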

More examples:

kabigon https://x.com/elonmusk/status/123456789
kabigon https://truthsocial.com/@realDonaldTrump/posts/123456
kabigon https://reddit.com/r/python/comments/xyz/...
kabigon https://github.com/user/repo/blob/main/README.md
kabigon https://example.com/document.pdf

Python API

Sync

import kabigon

# Automatic loader selection
text = kabigon.load_url_sync("https://www.google.com")
print(text)

Async

import asyncio
import kabigon

async def main() -> None:
    text = await kabigon.load_url("https://www.google.com")
    print(text)

    # Parallel batch loading
    urls = [
        "https://x.com/user/status/123",
        "https://youtube.com/watch?v=abc",
        "https://reddit.com/r/python/comments/xyz",
    ]
    results = await asyncio.gather(*[kabigon.load_url(url) for url in urls])
    for url, content in zip(urls, results, strict=True):
        print(f"{url}: {len(content)} chars")

asyncio.run(main())

Custom Loader Chains

Use Compose to build a custom pipeline that tries loaders in order:

from kabigon.loaders import Compose, TwitterLoader, YoutubeLoader, PlaywrightLoader

loader = Compose([
    TwitterLoader(),
    YoutubeLoader(),
    PlaywrightLoader(),  # generic fallback
])
text = loader.load_sync("https://x.com/user/status/123")
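Conceptually, a chain of this kind is a loop that returns the first result that does not raise. The sketch below illustrates that pattern with plain async functions. It is not Compose's implementation: first_success, always_fails, and echo are hypothetical, and the exception types kabigon actually catches are not stated here.

```python
import asyncio
from collections.abc import Awaitable, Callable

LoadFn = Callable[[str], Awaitable[str]]


async def first_success(url: str, loaders: list[LoadFn]) -> str:
    """Try each loader in order and return the first result that does not raise."""
    errors: list[Exception] = []
    for load in loaders:
        try:
            return await load(url)
        except Exception as exc:  # a real chain may catch a narrower error type
            errors.append(exc)
    raise RuntimeError(f"all {len(loaders)} loaders failed for {url}: {errors}")


async def always_fails(url: str) -> str:
    """Stand-in for a loader that does not support the URL."""
    raise ValueError("unsupported URL")


async def echo(url: str) -> str:
    """Stand-in for a generic fallback loader."""
    return f"loaded {url}"


print(asyncio.run(first_success("https://example.com", [always_fails, echo])))
```

Ordering matters: placing specialised loaders first and a generic one last means the generic loader only runs when nothing more specific succeeds.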

Utility Functions

import kabigon

# Show which loaders kabigon would use for a URL
plan = kabigon.explain_plan("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(plan)

# List all registered loader names
loaders = kabigon.available_loaders()
print(loaders)

API Summary

Style | One-liner | Custom chain
Sync | kabigon.load_url_sync(url) | loader.load_sync(url)
Async | await kabigon.load_url(url) | await loader.load(url)
Batch | await asyncio.gather(*[kabigon.load_url(u) for u in urls]) | await asyncio.gather(*[loader.load(u) for u in urls])

Supported Sources

Source | Loader | Notes
YouTube | YoutubeLoader | Transcript extraction via youtube-transcript-api
YouTube | YoutubeYtdlpLoader | Audio download + Whisper transcription
Twitter / X | TwitterLoader | Supports x.com, fxtwitter.com, vxtwitter.com, and others
Truth Social | TruthSocialLoader | Post content extraction
Reddit | RedditLoader | Posts and comments; auto-redirects to old.reddit.com
Instagram Reels | ReelLoader | Audio transcription via yt-dlp + Whisper
GitHub | GitHubLoader | File content from github.com/.../blob/... and raw.githubusercontent.com
BBC | BBCLoader | Article-aware HTML parsing
CNN | CNNLoader | Article-aware HTML parsing
PDF | PDFLoader | Text extraction from remote or local PDF files
PTT | PttLoader | Taiwan PTT (BBS) forum posts
Generic web | PlaywrightLoader | Full browser rendering via Playwright
Generic web | HttpxLoader | Lightweight HTTP fetch + HTML-to-markdown
Generic web | FirecrawlLoader | Web extraction via the Firecrawl API
Audio / Video | YtdlpLoader | Generic audio transcription via yt-dlp + Whisper

Architecture

kabigon follows a layered architecture:

Interface (CLI)  →  Application (routing, strategy, planning)  →  Domain (Loader ABC, models, errors)
                                                                ↓
                                                          Loaders (concrete implementations)

Request flow:

  1. The URL enters via the CLI or load_url().
  2. routing.py matches the URL against known patterns (YouTube, Twitter, …) to select a source-specific pipeline.
  3. strategy.py + planner.py build a LoaderPlan — the primary loaders followed by fallback loaders (de-duplicated).
  4. executor.py instantiates the loaders; Compose runs them in sequence and returns the first successful result.
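Step 2 of the flow above can be pictured as a lookup from the URL's host to a pipeline name. The sketch below is an illustration under assumptions, not the contents of routing.py: the ROUTES table, the pipeline names, and the route function are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical host-to-pipeline table; kabigon's real rules live in routing.py.
ROUTES: dict[str, str] = {
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "x.com": "twitter",
    "twitter.com": "twitter",
    "reddit.com": "reddit",
}


def route(url: str, default: str = "generic") -> str:
    """Pick a pipeline name from the URL's host, ignoring a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower().removeprefix("www.")
    for domain, pipeline in ROUTES.items():
        # Match the domain itself or any subdomain such as old.reddit.com.
        if host == domain or host.endswith("." + domain):
            return pipeline
    return default


print(route("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))  # youtube
print(route("https://example.com/page"))  # generic
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives, for example a host such as notx.com being treated as x.com.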

To add a new loader, create a file in src/kabigon/loaders/, subclass Loader, implement async def load(self, url: str) -> str, register it in infrastructure/registry.py, and add a routing rule if the loader handles a specific domain.

Configuration

Environment Variables

Variable | Purpose
FFMPEG_PATH | Custom path to the FFmpeg binary (used by Whisper / yt-dlp audio transcription)
FIRECRAWL_API_KEY | API key for the Firecrawl loader

Docker

A Dockerfile is provided for containerised usage:

docker build -t kabigon .

# "kabigon" after the image name is the CLI command
docker run --rm kabigon kabigon https://example.com

The image includes Playwright with Chromium and uses xvfb-run for headless browser rendering.

Troubleshooting

Playwright browser not installed

Executable doesn't exist at /path/to/chromium

Install the browser after installing kabigon:

playwright install chromium

FFmpeg not found

ffmpeg not found

Install FFmpeg:

# Ubuntu / Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

Or point to a custom binary:

export FFMPEG_PATH=/path/to/ffmpeg

Timeout errors

Timeout 30000ms exceeded

Increase the timeout for slow-loading pages:

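from kabigon.loaders import PlaywrightLoader

url = "https://example.com"  # replace with the slow-loading page

# 60 seconds, in milliseconds (the error above reports 30000ms)
loader = PlaywrightLoader(timeout=60_000)
text = loader.load_sync(url)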

CAPTCHA / rate limiting

Some websites block automated access. kabigon automatically uses old.reddit.com for Reddit to avoid CAPTCHAs. For other sites, consider adding delays between requests or implementing retry logic.
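One way to implement the retry logic mentioned above is exponential backoff with a little random jitter. The following is a minimal sketch, assuming any awaitable loader function. load_with_retry and its parameters are hypothetical and not part of kabigon.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable


async def load_with_retry(
    url: str,
    load: Callable[[str], Awaitable[str]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Call load(url), retrying with exponential backoff and jitter on failure."""
    for attempt in range(attempts):
        try:
            return await load(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay * 2**attempt seconds, plus up to 10% random jitter.
            delay = base_delay * (2**attempt)
            await asyncio.sleep(delay * (1 + 0.1 * random.random()))
    raise ValueError("attempts must be at least 1")
```

With kabigon this could be called as await load_with_retry(url, kabigon.load_url). The jitter spreads out retries so that several concurrent tasks do not hit the same site at the same instant.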

Development

Setup

git clone https://github.com/narumiruna/kabigon.git
cd kabigon
uv sync
playwright install chromium

Testing

# Full suite with coverage
uv run pytest -v -s --cov=src tests

# Single file
uv run pytest -v -s tests/loaders/test_youtube.py

# Single test
uv run pytest -v -s tests/loaders/test_youtube.py::test_name

Linting and Type Checking

uv run ruff check .       # lint
uv run ruff format .      # format
uv run ty check .         # type check
uv run ruff check --fix . # auto-fix lint issues

Building and Publishing

uv build --wheel
uv publish

Adding a New Loader

  1. Create src/kabigon/loaders/<source>.py and subclass Loader.
  2. Implement async def load(self, url: str) -> str.
  3. Export the class from src/kabigon/loaders/__init__.py.
  4. Register the loader in src/kabigon/infrastructure/registry.py.
  5. Add a URL-matching rule in src/kabigon/application/routing.py (if domain-specific).
  6. Add tests in tests/loaders/.
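Steps 1 and 2 might look roughly like the sketch below. To keep it runnable without kabigon installed, Loader here is a stand-in abstract class reduced to the one required method, and ExampleLoader is a hypothetical loader. In a real loader you would import Loader from kabigon instead, and the exception type that signals "not handled" is an assumption.

```python
import asyncio
from abc import ABC, abstractmethod


class Loader(ABC):
    """Stand-in for kabigon's Loader base class (assumed shape: one async method)."""

    @abstractmethod
    async def load(self, url: str) -> str: ...


class ExampleLoader(Loader):
    """Hypothetical loader that only accepts URLs on example.com."""

    async def load(self, url: str) -> str:
        if "example.com" not in url:
            # Raising lets a fallback chain move on to the next loader.
            raise ValueError(f"unsupported URL: {url}")
        # A real loader would fetch and convert the page here.
        return f"# Example\n\nContent extracted from {url}"


print(asyncio.run(ExampleLoader().load("https://example.com/article")))
```

A matching test in tests/loaders/ would then cover both a supported URL and an unsupported one.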

License

MIT
