kabigon

A Python library and CLI tool that extracts content from URLs and returns plain text or markdown. Point it at a YouTube video, a tweet, a Reddit thread, a PDF, or any web page — kabigon selects the right loader automatically.

Intended for developers and data engineers who need reliable, source-aware text extraction without writing per-site scraping logic.

Features

  • Automatic loader selection for YouTube, Twitter/X, Truth Social, Reddit, Instagram Reels, PTT, GitHub, BBC, CNN, PDF, and generic web pages
  • Fallback chain: if the primary loader fails, remaining loaders are tried in order without repeating already-attempted ones
  • Async-first (async/await) with a synchronous wrapper for scripts and notebooks
  • Single-line Python API: kabigon.load_url_sync(url)
  • CLI for ad-hoc extraction and debugging
  • Extensible: add a loader by subclassing Loader and implementing one method

Requirements

  • Python 3.12+
  • Playwright Chromium browser (for generic web scraping)
  • FFmpeg (only for audio/video transcription loaders)
  • FIRECRAWL_API_KEY environment variable (only for the Firecrawl loader)

Installation

# Install as a CLI tool
uv tool install kabigon

# Or run directly without installing
uvx kabigon <url>

After installation, install the Chromium browser for Playwright:

playwright install chromium

Quick Start

import kabigon

text = kabigon.load_url_sync("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(text)

Usage

CLI

# Auto-select the best loader
kabigon https://www.youtube.com/watch?v=dQw4w9WgXcQ
kabigon https://x.com/user/status/123456789
kabigon https://reddit.com/r/python/comments/xyz/
kabigon https://github.com/user/repo/blob/main/README.md
kabigon https://example.com/document.pdf

Python — sync

import kabigon

text = kabigon.load_url_sync("https://www.google.com")
print(text)

Python — async

import asyncio
import kabigon

async def main() -> None:
    text = await kabigon.load_url("https://www.google.com")
    print(text)

asyncio.run(main())

Parallel batch loading

import asyncio
import kabigon

async def main() -> None:
    urls = [
        "https://x.com/user/status/123",
        "https://youtube.com/watch?v=abc",
        "https://reddit.com/r/python/comments/xyz",
    ]
    results = await asyncio.gather(*[kabigon.load_url(url) for url in urls])
    for url, content in zip(urls, results, strict=True):
        print(f"{url}: {len(content)} chars")

asyncio.run(main())

API Reference

All public functions are importable from the kabigon package.

| Function | Signature | Description |
| --- | --- | --- |
| load_url_sync | (url: str) -> str | Load a URL synchronously using automatic loader selection |
| load_url | async (url: str) -> str | Load a URL asynchronously using automatic loader selection |
| available_loaders | () -> list[str] | Return names of all registered loaders |
| explain_plan | (url: str) -> dict[str, object] | Return the planned loader chain for a URL without executing it |

import kabigon

# Inspect which loaders would be used for a URL
plan = kabigon.explain_plan("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(plan)

# List all loader names
print(kabigon.available_loaders())

Architecture

[Figure: Kabigon URL processing architecture]

The automatic path uses kabigon.pipelines to select a source-aware pipeline, then kabigon.load_chain builds one ordered execution plan. Each loader is constructed only when its turn is reached; the first non-empty string is returned, and if every planned loader fails, kabigon raises LoaderError with the attempted loader details.

Mermaid source: docs/architecture/url-processing.mmd
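
The fallback behavior described in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not kabigon's actual implementation: run_chain, the Plan alias, and the shape of LoaderError here are hypothetical stand-ins, and the real code in kabigon.load_chain may differ.

```python
import asyncio
from collections.abc import Awaitable, Callable


class LoaderError(Exception):
    """Stand-in for kabigon's LoaderError; the real class may differ."""

    def __init__(self, url: str, attempts: list[tuple[str, str]]) -> None:
        self.url = url
        self.attempts = attempts  # (loader name, failure reason) pairs
        details = "; ".join(f"{name}: {reason}" for name, reason in attempts)
        super().__init__(f"all loaders failed for {url} ({details})")


LoadFn = Callable[[str], Awaitable[str]]
# Each plan entry pairs a loader name with a factory, so a loader is only
# built when the chain actually reaches it.
Plan = list[tuple[str, Callable[[], LoadFn]]]


async def run_chain(url: str, plan: Plan) -> str:
    """Try each planned loader in order and return the first non-empty text."""
    attempts: list[tuple[str, str]] = []
    for name, factory in plan:
        try:
            load = factory()  # constructed lazily, at its turn
            text = await load(url)
        except Exception as exc:  # a failing loader falls through to the next
            attempts.append((name, repr(exc)))
            continue
        if text:  # the first non-empty string wins
            return text
        attempts.append((name, "empty result"))
    # Every planned loader failed or returned nothing.
    raise LoaderError(url, attempts)
```

Passing factories rather than loader instances is what keeps later loaders from being constructed when an earlier one succeeds.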

Commands

kabigon <url>

Load content from a URL. Automatically selects the best loader.

kabigon https://www.youtube.com/watch?v=dQw4w9WgXcQ

kabigon --list

Print all available loaders and their descriptions.

kabigon --list

kabigon --loader <names> <url>

Override automatic loader selection with a comma-separated list of loader names, tried in order.

kabigon --loader twitter,playwright https://x.com/user/status/123

Use this only for debugging or testing specific loaders. The automatic path is preferred for normal use.

Configuration

Environment variables

| Variable | Required | Purpose |
| --- | --- | --- |
| FIRECRAWL_API_KEY | For firecrawl loader | API key for the Firecrawl web extraction service |
| FFMPEG_PATH | Optional | Custom path to the FFmpeg binary used by Whisper / yt-dlp |

Docker

A Dockerfile is provided. The image includes Playwright with Chromium and runs xvfb-run for headless rendering.

docker build -t kabigon .
docker run --rm kabigon kabigon https://example.com

Project Structure

src/kabigon/
├── core/          # Loader ABC, exceptions, and shared helpers
├── loaders/       # Concrete loader implementations (one file per source)
├── pipelines/     # Pipeline catalog: maps URL patterns to loader chains
├── api.py         # Public Python interface (load_url, explain_plan, …)
├── cli.py         # Typer CLI entrypoint
└── load_chain.py  # Chain execution and fallback logic
tests/
└── loaders/       # Per-loader unit tests
examples/          # Runnable usage samples

URL-to-pipeline matching lives in kabigon.pipelines; loader ordering and fallback policy live in kabigon.load_chain.

Development

git clone https://github.com/narumiruna/kabigon.git
cd kabigon
uv sync
playwright install chromium

Lint, format, and type-check:

uv run ruff check .        # lint
uv run ruff format .       # format
uv run ty check .          # type check
uv run ruff check --fix .  # auto-fix lint issues

Testing

# Full suite with coverage
uv run pytest -v -s --cov=src tests

# Single loader file
uv run pytest -v -s tests/loaders/test_youtube.py

# Single test
uv run pytest -v -s tests/loaders/test_youtube.py::test_name

Tests must be deterministic and must not rely on live network calls.
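
One common way to meet this rule is to isolate the network call in a single method and replace it with a canned response in the test. The sketch below illustrates the idea with a hypothetical TitleLoader and _fetch method; neither exists in kabigon, and the project's real tests may be structured differently.

```python
import asyncio
import re
from unittest.mock import AsyncMock, patch


class TitleLoader:
    """Hypothetical loader: the network call lives in _fetch so tests can replace it."""

    async def _fetch(self, url: str) -> str:
        raise NotImplementedError("a real loader would perform the HTTP request here")

    async def load(self, url: str) -> str:
        html = await self._fetch(url)
        match = re.search(r"<title>(.*?)</title>", html, re.DOTALL)
        return match.group(1).strip() if match else ""


def test_load_extracts_title() -> None:
    canned = "<html><head><title> Hello </title></head></html>"
    fetch = AsyncMock(return_value=canned)
    # Swap the network call for the canned response; no socket is opened.
    with patch.object(TitleLoader, "_fetch", new=fetch):
        text = asyncio.run(TitleLoader().load("https://example.com"))
    assert text == "Hello"
    fetch.assert_awaited_once_with("https://example.com")
```

Because the response is fixed in the test itself, the result does not depend on the remote site being reachable or unchanged.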

Troubleshooting

Playwright browser not installed

Executable doesn't exist at /path/to/chromium

Install the Chromium browser that Playwright needs:

playwright install chromium

FFmpeg not found

ffmpeg not found

Install FFmpeg or point to a custom binary:

# Ubuntu / Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# Custom binary
export FFMPEG_PATH=/path/to/ffmpeg

Playwright timeout

Timeout 30000ms exceeded

Increase the timeout for slow-loading pages:

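from kabigon.loaders import PlaywrightLoader

url = "https://example.com"  # placeholder: the slow page you want to load

# 60_000 ms, double the 30000 ms limit shown in the error above
loader = PlaywrightLoader(timeout=60_000)
text = loader.load_sync(url)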

CAPTCHA / rate limiting

Some sites block automated access. kabigon automatically redirects Reddit requests to old.reddit.com to avoid CAPTCHAs. For other sites, add delays between requests or implement retry logic in your calling code.
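
As a starting point for retry logic in calling code, here is a minimal sketch of exponential backoff with jitter. The load_with_retry helper and its parameters are hypothetical and not part of kabigon. It retries on any Exception, which is an assumption; narrowing that to the errors you actually see is usually better.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable


async def load_with_retry(
    load: Callable[[str], Awaitable[str]],
    url: str,
    *,
    attempts: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Call load(url), retrying failures with exponential backoff and jitter.

    Hypothetical helper, not part of kabigon.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await load(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Wait 1x, 2x, 4x ... the base delay, plus up to 10% random jitter.
            delay = base_delay * 2**attempt
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

With kabigon installed, this would be called as await load_with_retry(kabigon.load_url, url). The jitter spreads out retries so that parallel callers do not all hit the site again at the same moment.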

Contributing

To add a new loader:

  1. Create src/kabigon/loaders/<source>.py and subclass Loader.
  2. Implement async def load(self, url: str) -> str.
  3. Export the class from src/kabigon/loaders/__init__.py.
  4. Register the loader in src/kabigon/loader_registry.py.
  5. If the loader handles a specific source, add a pipeline entry in src/kabigon/pipelines/catalog.py.
  6. Update load-chain and planning consistency tests if the execution plan changes.
  7. Add loader tests in tests/loaders/.
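
Steps 1 and 2 might look roughly like the sketch below. To stay self-contained it defines a stand-in Loader base class with only the async load method named above; in the real project you would import kabigon's own Loader instead. ExampleNotesLoader, the notes.example.com host, and the choice of ValueError for unsupported URLs are all hypothetical.

```python
import asyncio
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class Loader(ABC):
    """Stand-in for kabigon's Loader base class; import the real one in practice."""

    @abstractmethod
    async def load(self, url: str) -> str: ...


class ExampleNotesLoader(Loader):
    """Hypothetical loader for a made-up host, notes.example.com."""

    async def load(self, url: str) -> str:
        parsed = urlparse(url)
        if parsed.hostname != "notes.example.com":
            # Failing fast on foreign URLs lets a fallback chain move on.
            raise ValueError(f"unsupported URL: {url}")
        # A real loader would fetch and parse the page here.
        note_id = parsed.path.strip("/")
        return f"# Note {note_id}\n\n(placeholder content)"


if __name__ == "__main__":
    print(asyncio.run(ExampleNotesLoader().load("https://notes.example.com/42")))
```

The remaining steps (export, registration, pipeline entry, tests) depend on the project files named in the list and are not shown here.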

License

MIT
