kabigon

A Python library that extracts content from URLs and converts the result to text or markdown. Feed it a YouTube video, a tweet, a Reddit thread, a PDF, or any web page — kabigon picks the right loader automatically.

Features

Smart routing — recognises YouTube, Twitter/X, Truth Social, Reddit, Instagram Reels, PTT, GitHub, BBC, CNN, PDFs, and generic web pages, then selects the best extraction pipeline
Automatic fallback — if the primary loader fails, remaining loaders are tried in order without repeating work
Async-first — built on async/await; a synchronous wrapper is provided for convenience
Single-line API — kabigon.load_url_sync(url) is all you need to get started
Extensible — add a new loader by subclassing Loader and implementing one method

Table of Contents

Installation
Quick Start
CLI
Python API
Supported Sources
Architecture
Configuration
Troubleshooting
Development
License

Installation

# Install as a CLI tool
uv tool install kabigon

# Or run directly without installing
uvx kabigon <url>

After installation, install a browser for Playwright (required for generic web scraping):

playwright install chromium

Quick Start

import kabigon

# One line to load any URL
text = kabigon.load_url_sync("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(text)

CLI

# Load content from a URL (auto-selects the best loader)
kabigon https://www.youtube.com/watch?v=dQw4w9WgXcQ

# List all available loaders
kabigon --list

# Advanced: bypass automatic planning with a specific loader or loader chain
kabigon --loader youtube https://www.youtube.com/watch?v=dQw4w9WgXcQ

By default, kabigon routes the URL to a source-specific pipeline first, then falls back to the remaining default loaders without repeating already-attempted ones. Prefer this automatic path unless you are debugging or intentionally bypassing pipeline planning.

More examples:

kabigon https://x.com/elonmusk/status/123456789
kabigon https://truthsocial.com/@realDonaldTrump/posts/123456
kabigon https://reddit.com/r/python/comments/xyz/...
kabigon https://github.com/user/repo/blob/main/README.md
kabigon https://example.com/document.pdf

Python API

Sync

import kabigon

# Automatic loader selection
text = kabigon.load_url_sync("https://www.google.com")
print(text)

Async

import asyncio
import kabigon

async def main() -> None:
    text = await kabigon.load_url("https://www.google.com")
    print(text)

    # Parallel batch loading
    urls = [
        "https://x.com/user/status/123",
        "https://youtube.com/watch?v=abc",
        "https://reddit.com/r/python/comments/xyz",
    ]
    results = await asyncio.gather(*[kabigon.load_url(url) for url in urls])
    for url, content in zip(urls, results, strict=True):
        print(f"{url}: {len(content)} chars")

asyncio.run(main())

Advanced Loader Selection

Most callers should use kabigon.load_url() or kabigon.load_url_sync() so pipeline planning, targeted loaders, and fallback policy stay in one place. For debugging or advanced experiments, the CLI can run an explicit loader order:

kabigon --loader twitter,playwright https://x.com/user/status/123

Utility Functions

import kabigon

# Show which loaders kabigon would use for a URL
plan = kabigon.explain_plan("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(plan)

# List all registered loader names
loaders = kabigon.available_loaders()
print(loaders)

API Summary

Style	Recommended interface	Advanced loader order
Sync	`kabigon.load_url_sync(url)`	`kabigon --loader name,name URL`
Async	`await kabigon.load_url(url)`	Use individual Loader adapters directly
Batch	`await asyncio.gather(*[kabigon.load_url(u) for u in urls])`	Use individual Loader adapters directly

Supported Sources

Source	Loader	Notes
YouTube	`YoutubeLoader`	Transcript extraction via `youtube-transcript-api`
YouTube	`YoutubeYtdlpLoader`	Audio download + Whisper transcription
Twitter / X	`TwitterLoader`	Supports `x.com`, `fxtwitter.com`, `vxtwitter.com`, and others
Truth Social	`TruthSocialLoader`	Post content extraction
Reddit	`RedditLoader`	Posts and comments; auto-redirects to `old.reddit.com`
Instagram Reels	`ReelLoader`	Audio transcription via yt-dlp + Whisper
GitHub	`GitHubLoader`	File content from `github.com/.../blob/...` and `raw.githubusercontent.com`
BBC	`BBCLoader`	Article-aware HTML parsing
CNN	`CNNLoader`	Article-aware HTML parsing
PDF	`PDFLoader`	Text extraction from remote or local PDF files
PTT	`PttLoader`	Taiwan PTT (BBS) forum posts
Generic web	`PlaywrightLoader`	Full browser rendering via Playwright
Generic web	`HttpxLoader`	Lightweight HTTP fetch + HTML-to-markdown
Generic web	`FirecrawlLoader`	Web extraction via the Firecrawl API
Audio / Video	`YtdlpLoader`	Generic audio transcription via yt-dlp + Whisper

Architecture

kabigon follows a layered architecture:

Interface (CLI)  →  Application (pipeline catalog, load chain)  →  Domain (Loader ABC, errors)
                                                                 ↓
                                                           Loaders (concrete implementations)

Request flow:

1. The URL enters via the CLI or load_url().
2. pipeline_catalog.py matches known sources (YouTube, Twitter, …) and returns the matched pipeline metadata.
3. load_chain.py turns that into a runnable load chain: targeted loaders followed by fallback loaders, plus explanation metadata.
4. load_chain.py executes the ordered Loader attempts and returns the first successful result.
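
The planning part of this flow can be pictured as a function from a URL to an ordered list of loader names. The sketch below is illustrative only and is not kabigon's implementation: `CATALOG`, `DEFAULT_LOADERS`, `plan_loaders`, and the loader names in them are assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical catalog: domain -> targeted loader names, in order.
CATALOG: dict[str, list[str]] = {
    "youtube.com": ["youtube", "youtube-ytdlp"],
    "x.com": ["twitter", "playwright"],
}

# Hypothetical default order used as fallback for every URL.
DEFAULT_LOADERS: list[str] = ["playwright", "httpx"]


def plan_loaders(url: str) -> list[str]:
    """Return targeted loaders first, then defaults, without duplicates."""
    host = (urlparse(url).hostname or "").lower()
    targeted: list[str] = []
    for domain, names in CATALOG.items():
        # Match the bare domain and any subdomain such as "www.".
        if host == domain or host.endswith("." + domain):
            targeted = names
            break
    plan: list[str] = []
    for name in [*targeted, *DEFAULT_LOADERS]:
        if name not in plan:  # never schedule the same loader twice
            plan.append(name)
    return plan
```

With this catalog, an `x.com` URL yields `twitter`, `playwright`, `httpx`: the fallback list contributes only the loaders that the targeted list has not already scheduled.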

explain_plan() returns planning metadata (Pipeline, Targeted loader, Fallback loader, requirement, and Execution plan) without constructing concrete loaders. Actual loading, by contrast, builds and executes the runnable Load chain.

There are three fallback levels to keep distinct:

1. Fallback loaders are added to the Execution plan after Targeted loaders when policy allows it.
2. The Load chain executes that ordered Loader list and records why each Loader failed.
3. A Loader may also have Loader-internal fallback, such as trying multiple source-specific fetch strategies inside one Loader.
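
The second level, executing an ordered list and recording each failure, could look roughly like this. It is a sketch of the idea rather than the code in load_chain.py; `run_chain`, `AllLoadersFailed`, and the `(name, load)` pair shape are hypothetical.

```python
import asyncio
from collections.abc import Awaitable, Callable

LoadFn = Callable[[str], Awaitable[str]]


class AllLoadersFailed(Exception):
    """Raised when every loader in the chain failed (hypothetical name)."""

    def __init__(self, url: str, failures: dict[str, Exception]) -> None:
        reasons = "; ".join(f"{name}: {error}" for name, error in failures.items())
        super().__init__(f"could not load {url} ({reasons})")
        self.failures = failures


async def run_chain(url: str, chain: list[tuple[str, LoadFn]]) -> str:
    """Try each (name, load) pair in order and return the first success."""
    failures: dict[str, Exception] = {}
    for name, load in chain:
        try:
            return await load(url)
        except Exception as error:
            # Record why this loader failed, then move on to the next one.
            failures[name] = error
    raise AllLoadersFailed(url, failures)
```

Loader-internal fallback, the third level, would live inside a single `load` callable and stay invisible to this loop: the chain only sees that loader's final result or final error.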

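To add a new loader, create a file in src/kabigon/loaders/, subclass Loader, implement async def load(self, url: str) -> str, export the class from src/kabigon/loaders/__init__.py, register it in src/kabigon/loader_registry.py, and add a Pipeline catalog entry in src/kabigon/pipelines/catalog.py if the loader handles a specific source. The full checklist, including tests, is under Adding a New Loader in the Development section.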

Configuration

Environment Variables

Variable	Purpose
`FFMPEG_PATH`	Custom path to the FFmpeg binary (used by Whisper / yt-dlp audio transcription)
`FIRECRAWL_API_KEY`	API key for the Firecrawl loader

Docker

A Dockerfile is provided for containerised usage:

docker build -t kabigon .

# "kabigon" after the image name is the CLI command
docker run --rm kabigon kabigon https://example.com

The image includes Playwright with Chromium and uses xvfb-run for headless browser rendering.

Troubleshooting

Playwright browser not installed

Executable doesn't exist at /path/to/chromium

Install the browser after installing kabigon:

playwright install chromium

FFmpeg not found

ffmpeg not found

Install FFmpeg:

# Ubuntu / Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

Or point to a custom binary:

export FFMPEG_PATH=/path/to/ffmpeg

Timeout errors

Timeout 30000ms exceeded

Increase the timeout for slow-loading pages:

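from kabigon.loaders import PlaywrightLoader

url = "https://example.com"  # replace with the slow-loading page

# The timeout is in milliseconds: 60_000 ms doubles the 30000 ms limit from the error above
loader = PlaywrightLoader(timeout=60_000)
text = loader.load_sync(url)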

CAPTCHA / rate limiting

Some websites block automated access. kabigon automatically uses old.reddit.com for Reddit to avoid CAPTCHAs. For other sites, consider adding delays between requests or implementing retry logic.
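
One way to add delays and retries is a small wrapper with exponential backoff. This is a minimal sketch, not a kabigon feature: `load_with_retry`, `attempts`, and `base_delay` are hypothetical names, and `load` stands for any async loader function such as `kabigon.load_url`.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable


async def load_with_retry(
    url: str,
    load: Callable[[str], Awaitable[str]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Call `load(url)` up to `attempts` times with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await load(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, 2x, 4x, ... plus random jitter so that
            # parallel callers do not retry in lockstep.
            delay = base_delay * 2**attempt
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

Retrying on every exception is deliberately crude. In real use it is usually better to retry only errors that look transient, since a blocked or missing page will fail the same way each time.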

Development

Setup

git clone https://github.com/narumiruna/kabigon.git
cd kabigon
uv sync
playwright install chromium

Testing

# Full suite with coverage
uv run pytest -v -s --cov=src tests

# Single file
uv run pytest -v -s tests/loaders/test_youtube.py

# Single test
uv run pytest -v -s tests/loaders/test_youtube.py::test_name

Linting and Type Checking

uv run ruff check .       # lint
uv run ruff format .      # format
uv run ty check .         # type check
uv run ruff check --fix . # auto-fix lint issues

Building and Publishing

uv build --wheel
uv publish

Adding a New Loader

1. Create src/kabigon/loaders/<source>.py and subclass Loader.
2. Implement async def load(self, url: str) -> str.
3. Export the class from src/kabigon/loaders/__init__.py.
4. Register the loader in src/kabigon/loader_registry.py.
5. Add a Pipeline catalog entry in src/kabigon/pipelines/catalog.py if the loader handles a specific source.
6. Add or update Load chain and planning consistency tests when the Execution plan should change.
7. Add Loader tests in tests/loaders/.
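
The first two steps might look like the sketch below. The `Loader` class here is a local stand-in so the example runs without kabigon installed; a real loader would import kabigon's own base class, whose import path is not shown in this README. `PlainTextLoader` and its injected `fetch` parameter are hypothetical.

```python
import asyncio
from abc import ABC, abstractmethod
from collections.abc import Awaitable, Callable


class Loader(ABC):
    """Local stand-in for kabigon's Loader base class (assumed shape)."""

    @abstractmethod
    async def load(self, url: str) -> str: ...


class PlainTextLoader(Loader):
    """Hypothetical loader for URLs ending in .txt."""

    def __init__(self, fetch: Callable[[str], Awaitable[str]]) -> None:
        # `fetch` is injected so the network call can be replaced in tests.
        self._fetch = fetch

    async def load(self, url: str) -> str:
        if not url.lower().endswith(".txt"):
            # Raising signals that this loader cannot handle the URL. Which
            # exception type kabigon expects is an assumption of this sketch.
            raise ValueError(f"not a plain-text URL: {url}")
        text = await self._fetch(url)
        return text.strip()
```

Injecting `fetch` keeps the Loader test in tests/loaders/ free of network access: the test passes a fake coroutine and asserts on the returned text.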

License

MIT
