Project description

kabigon

A Python library and CLI tool that extracts content from URLs and returns plain text or markdown. Point it at a YouTube video, a tweet, a Reddit thread, a PDF, or any web page — kabigon selects the right loader automatically.

Intended for developers and data engineers who need reliable, source-aware text extraction without writing per-site scraping logic.

Features

Automatic loader selection for YouTube, Twitter/X, Truth Social, Reddit, Instagram Reels, PTT, GitHub, BBC, CNN, PDF, and generic web pages
Fallback chain: if the primary loader fails, remaining loaders are tried in order without repeating already-attempted ones
Async-first (async/await) with a synchronous wrapper for scripts and notebooks
Single-line Python API: kabigon.load_url_sync(url)
CLI for ad-hoc extraction and debugging
Extensible: add a loader by subclassing Loader and implementing one method

Requirements

Python 3.12+
Playwright Chromium browser (for generic web scraping)
FFmpeg (only for audio/video transcription loaders)
FIRECRAWL_API_KEY environment variable (only for the Firecrawl loader)

Installation

# Install as a CLI tool
uv tool install kabigon

# Or run directly without installing
uvx kabigon <url>

After installation, install the Chromium browser for Playwright:

playwright install chromium

Quick Start

import kabigon

text = kabigon.load_url_sync("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(text)

Usage

CLI

# Auto-select the best loader
kabigon https://www.youtube.com/watch?v=dQw4w9WgXcQ
kabigon https://x.com/user/status/123456789
kabigon https://reddit.com/r/python/comments/xyz/
kabigon https://github.com/user/repo/blob/main/README.md
kabigon https://example.com/document.pdf

Python — sync

import kabigon

text = kabigon.load_url_sync("https://www.google.com")
print(text)

Python — async

import asyncio
import kabigon

async def main() -> None:
    text = await kabigon.load_url("https://www.google.com")
    print(text)

asyncio.run(main())

Parallel batch loading

import asyncio
import kabigon

async def main() -> None:
    urls = [
        "https://x.com/user/status/123",
        "https://youtube.com/watch?v=abc",
        "https://reddit.com/r/python/comments/xyz",
    ]
    results = await asyncio.gather(*[kabigon.load_url(url) for url in urls])
    for url, content in zip(urls, results, strict=True):
        print(f"{url}: {len(content)} chars")

asyncio.run(main())

API Reference

All public functions are importable from the kabigon package.

Function	Signature	Description
`load_url_sync`	`(url: str) -> str`	Load a URL synchronously using automatic loader selection
`load_url`	`async (url: str) -> str`	Load a URL asynchronously using automatic loader selection
`available_loaders`	`() -> list[str]`	Return names of all registered loaders
`explain_plan`	`(url: str) -> dict[str, object]`	Return the planned loader chain for a URL without executing it

import kabigon

# Inspect which loaders would be used for a URL
plan = kabigon.explain_plan("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(plan)

# List all loader names
print(kabigon.available_loaders())

Architecture

[Figure: Kabigon URL processing architecture]

The automatic path uses kabigon.pipelines to select a source-aware pipeline, then kabigon.load_chain builds one ordered execution plan. Each loader is constructed only when its turn is reached; the first non-empty string is returned, and if every planned loader fails, kabigon raises LoaderError with the attempted loader details.

Mermaid source: docs/architecture/url-processing.mmd
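The fallback behaviour described above can be sketched roughly as follows. This is not kabigon's implementation: `run_chain`, `ChainError`, and the shape of `plan` are hypothetical stand-ins, and treating an empty result as a failed attempt is an assumption drawn from the "first non-empty string" wording.

```python
import asyncio
from collections.abc import Awaitable, Callable

# Hypothetical types: a factory builds a loader callable on demand.
LoadFn = Callable[[str], Awaitable[str]]


class ChainError(Exception):
    """Illustrative stand-in for kabigon's LoaderError."""

    def __init__(self, url: str, attempts: list[tuple[str, str]]) -> None:
        self.attempts = attempts  # (loader name, failure reason) pairs
        detail = "; ".join(f"{name}: {reason}" for name, reason in attempts)
        super().__init__(f"all loaders failed for {url} ({detail})")


async def run_chain(url: str, plan: list[tuple[str, Callable[[], LoadFn]]]) -> str:
    attempts: list[tuple[str, str]] = []
    for name, factory in plan:
        try:
            load = factory()  # constructed only when its turn is reached
            text = await load(url)
        except Exception as exc:  # a failing loader moves the chain forward
            attempts.append((name, repr(exc)))
            continue
        if text:  # the first non-empty string wins
            return text
        attempts.append((name, "empty result"))
    # Every planned loader failed: report what was tried.
    raise ChainError(url, attempts)
```

Passing factories rather than loader instances is what keeps later loaders from being built when an earlier one succeeds.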

Commands

`kabigon <url>`

Load content from a URL. Automatically selects the best loader.

kabigon https://www.youtube.com/watch?v=dQw4w9WgXcQ

`kabigon --list`

Print all available loaders and their descriptions.

kabigon --list

`kabigon --loader <names> <url>`

Override automatic loader selection with a comma-separated list of loader names, tried in order.

kabigon --loader twitter,playwright https://x.com/user/status/123

Use this only for debugging or testing specific loaders. The automatic path is preferred for normal use.

Configuration

Environment variables

Variable	Required	Purpose
`FIRECRAWL_API_KEY`	For `firecrawl` loader	API key for the Firecrawl web extraction service
`FFMPEG_PATH`	Optional	Custom path to the FFmpeg binary used by Whisper / yt-dlp
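A small preflight check can surface missing configuration before a long batch run. The sketch below is a hypothetical helper, not something kabigon ships; it assumes that an unset `FFMPEG_PATH` means `ffmpeg` is looked up on `PATH`, which the table above does not confirm.

```python
import os
import shutil
from collections.abc import Mapping


def preflight(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return human-readable warnings about optional configuration."""
    warnings: list[str] = []

    # Only the firecrawl loader needs this key.
    if not env.get("FIRECRAWL_API_KEY"):
        warnings.append("FIRECRAWL_API_KEY is unset (needed by the firecrawl loader)")

    # Assumption: fall back to `ffmpeg` on PATH when FFMPEG_PATH is unset.
    candidate = env.get("FFMPEG_PATH") or "ffmpeg"
    if shutil.which(candidate, path=env.get("PATH")) is None:
        warnings.append(f"FFmpeg not found at {candidate!r} (needed for transcription)")

    return warnings
```

An empty list means both optional dependencies look usable; neither warning matters if you only load ordinary web pages.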

Docker

A Dockerfile is provided. The image includes Playwright with Chromium and runs xvfb-run for headless rendering.

docker build -t kabigon .
docker run --rm kabigon kabigon https://example.com

Project Structure

src/kabigon/
├── core/          # Loader ABC, exceptions, and shared helpers
├── loaders/       # Concrete loader implementations (one file per source)
├── pipelines/     # Pipeline catalog: maps URL patterns to loader chains
├── api.py         # Public Python interface (load_url, explain_plan, …)
├── cli.py         # Typer CLI entrypoint
└── load_chain.py  # Chain execution and fallback logic
tests/
└── loaders/       # Per-loader unit tests
examples/          # Runnable usage samples

URL-to-pipeline matching lives in kabigon.pipelines; loader ordering and fallback policy live in kabigon.load_chain.

Development

git clone https://github.com/narumiruna/kabigon.git
cd kabigon
uv sync
playwright install chromium

Lint, format, and type-check:

uv run ruff check .        # lint
uv run ruff format .       # format
uv run ty check .          # type check
uv run ruff check --fix .  # auto-fix lint issues

Testing

# Full suite with coverage
uv run pytest -v -s --cov=src tests

# Single loader file
uv run pytest -v -s tests/loaders/test_youtube.py

# Single test
uv run pytest -v -s tests/loaders/test_youtube.py::test_name

Tests must be deterministic and must not rely on live network calls.
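One way to keep a test off the network is to inject a stubbed async fetch function. The example below is only a sketch: `summarize` is a hypothetical function under test, and the project's actual fixtures and mocking conventions are not shown in this README.

```python
import asyncio
from unittest.mock import AsyncMock


async def summarize(url: str, fetch) -> str:
    """Hypothetical code under test: fetch text and report its length."""
    text = await fetch(url)
    return f"{url}: {len(text)} chars"


def test_summarize_uses_stubbed_loader() -> None:
    # AsyncMock returns the canned value when awaited, so no network is used.
    fake_fetch = AsyncMock(return_value="hello world")
    result = asyncio.run(summarize("https://example.com", fake_fetch))
    assert result == "https://example.com: 11 chars"
    fake_fetch.assert_awaited_once_with("https://example.com")
```

Because the stub always returns the same string, the test gives the same result on every run.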

Troubleshooting

Playwright browser not installed

Error message:

Executable doesn't exist at /path/to/chromium

Install the browser:

playwright install chromium

FFmpeg not found

Error message:

ffmpeg not found

Install FFmpeg or point to a custom binary:

# Ubuntu / Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# Custom binary
export FFMPEG_PATH=/path/to/ffmpeg

Playwright timeout

Error message:

Timeout 30000ms exceeded

Increase the timeout for slow-loading pages:

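from kabigon.loaders import PlaywrightLoader

url = "https://example.com"  # placeholder for the slow-loading page
loader = PlaywrightLoader(timeout=60_000)  # assuming milliseconds, as in the error above
text = loader.load_sync(url)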

CAPTCHA / rate limiting

Some sites block automated access. kabigon automatically redirects Reddit requests to old.reddit.com to avoid CAPTCHAs. For other sites, add delays between requests or implement retry logic in your calling code.
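Retry logic in calling code might look like the sketch below, which retries with exponential backoff. `load_with_retry` and its parameters are illustrative names, not part of kabigon; pass `kabigon.load_url` as `fetch` in real use. It retries on any exception for simplicity, so you may want to narrow the `except` clause to the errors you consider transient.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def load_with_retry(
    url: str,
    fetch: Callable[[str], Awaitable[str]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Call `fetch(url)`, retrying failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Wait 1x, 2x, 4x, ... the base delay before the next try.
            await asyncio.sleep(base_delay * 2**attempt)
    raise ValueError("attempts must be at least 1")
```

With the defaults, a URL that keeps failing is tried three times, with waits of 1 and 2 seconds in between.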

Contributing

To add a new loader:

1. Create src/kabigon/loaders/<source>.py and subclass Loader.
2. Implement async def load(self, url: str) -> str.
3. Export the class from src/kabigon/loaders/__init__.py.
4. Register the loader in src/kabigon/loader_registry.py.
5. If the loader handles a specific source, add a pipeline entry in src/kabigon/pipelines/catalog.py.
6. Update load-chain and planning consistency tests if the execution plan changes.
7. Add loader tests in tests/loaders/.
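The first two steps can be sketched as below. The `Loader` class here is a local stand-in so the example runs without kabigon installed; in the real package you would subclass kabigon's own Loader instead (its import path is not shown in this README). `EchoLoader` is a hypothetical name.

```python
import asyncio
from abc import ABC, abstractmethod


class Loader(ABC):
    """Stand-in for kabigon's Loader base class so this sketch runs alone."""

    @abstractmethod
    async def load(self, url: str) -> str: ...


class EchoLoader(Loader):
    """Hypothetical loader that returns a fixed string instead of fetching."""

    async def load(self, url: str) -> str:
        # Raising on unsupported input lets a fallback chain try the next loader.
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"unsupported URL: {url}")
        return f"content for {url}"


print(asyncio.run(EchoLoader().load("https://example.com")))
```

A real loader would replace the body of `load` with the fetch-and-extract logic for its source.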

License

MIT

Project details

Release history

Version	Release date
0.19.2 (this version)	May 3, 2026
0.19.1	May 3, 2026
0.19.0	May 2, 2026
0.18.2	Apr 5, 2026
0.18.1	Apr 3, 2026
0.18.0	Mar 31, 2026
0.17.6	Mar 30, 2026
0.17.5	Mar 30, 2026
0.17.4	Mar 23, 2026
0.17.3	Mar 17, 2026
0.17.2	Mar 17, 2026
0.17.1	Mar 17, 2026
0.17.0	Mar 17, 2026
0.16.4	Feb 12, 2026
0.16.3	Feb 12, 2026
0.16.2	Feb 12, 2026
0.16.1	Feb 5, 2026
0.16.0	Jan 25, 2026
0.15.0	Jan 21, 2026
0.14.3	Jan 15, 2026
0.14.2	Jan 15, 2026
0.14.1	Jan 12, 2026
0.14.0	Jan 4, 2026
0.13.0	Jan 3, 2026
0.12.0	Jan 3, 2026
0.11.0	Jan 3, 2026
0.10.1	Jan 2, 2026
0.10.0	Jan 2, 2026
0.9.4	Oct 29, 2025
0.9.3	Oct 29, 2025
0.9.2	Sep 30, 2025
0.8.15	Sep 9, 2025
0.8.14	Sep 1, 2025
0.8.13	Sep 1, 2025
0.8.12	Aug 26, 2025
0.8.11	Jul 17, 2025
0.8.10	May 13, 2025
0.8.9	May 13, 2025
0.8.8	May 12, 2025
0.8.7	May 10, 2025
0.8.6	May 8, 2025
0.8.5	May 6, 2025
0.8.4	May 3, 2025
0.8.3	May 3, 2025
0.8.2	May 3, 2025
0.8.1	May 1, 2025
0.8.0	May 1, 2025
0.7.0	May 1, 2025
0.6.1	May 1, 2025
0.6.0	Apr 28, 2025
0.5.3	Apr 8, 2025
0.5.2	Mar 23, 2025
0.5.1	Mar 23, 2025
0.5.0	Mar 22, 2025
0.4.2	Mar 21, 2025
0.4.1	Mar 15, 2025
0.4.0	Mar 10, 2025
0.3.1	Feb 17, 2025
0.3.0	Feb 11, 2025
0.2.3	Feb 9, 2025
0.2.2	Feb 9, 2025
0.2.1	Feb 9, 2025
0.2.0	Feb 9, 2025
0.1.0	Feb 9, 2025

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

kabigon-0.19.2-py3-none-any.whl (39.3 kB)

Uploaded May 3, 2026 Python 3

File details

Details for the file kabigon-0.19.2-py3-none-any.whl.

File metadata

Download URL: kabigon-0.19.2-py3-none-any.whl
Upload date: May 3, 2026
Size: 39.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.8

File hashes

Hashes for kabigon-0.19.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`72212b6ab72b0d97ed09b367ce325aa2d7f4b33e459adbd7df95715847c3eb79`
MD5	`6ca856f788ded008a9e5c964249ca595`
BLAKE2b-256	`5e6cd203d3f69025df780f1e9e460d18a3c7b21c3708881d044a50c9b29f48df`
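To check a downloaded wheel against the SHA256 digest listed above, a short script along these lines should work. The helper names `sha256_of` and `verify` are illustrative, and the wheel is assumed to have been downloaded to a local path of your choosing.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for kabigon-0.19.2-py3-none-any.whl.
EXPECTED_SHA256 = "72212b6ab72b0d97ed09b367ce325aa2d7f4b33e459adbd7df95715847c3eb79"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `verify(Path("kabigon-0.19.2-py3-none-any.whl"))` returns True only if the local file matches the published digest.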
