# kabigon
A URL content loader library that extracts content from various sources (YouTube, Instagram Reels, Twitter/X, Reddit, Truth Social, GitHub files, PDFs, web pages) and converts them to text/markdown format.
## Features

- ✨ **Multi-Platform Support**: YouTube, Twitter/X, Truth Social, Reddit, Instagram Reels, PTT, GitHub files, PDFs, and generic web pages
- 🔄 **Async-First Design**: Built with async/await for efficient parallel processing
- 🎯 **Smart Fallback**: Automatically tries multiple extraction strategies until one succeeds
- 🚀 **Simple API**: Single-line usage with sensible defaults, or full control with custom loader chains
- 🔌 **Extensible**: Easy to add new loaders for additional platforms
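The smart-fallback behavior can be pictured as a loop that tries each loader in order and returns the first result that succeeds. The sketch below is a toy illustration of that idea; the loader classes and error handling are stand-ins, not kabigon's real internals:

```python
import asyncio

# Toy loaders standing in for real ones: the first always fails,
# the second always succeeds.
class FailingLoader:
    async def load(self, url: str) -> str:
        raise ValueError("cannot handle this URL")

class EchoLoader:
    async def load(self, url: str) -> str:
        return f"content from {url}"

async def load_with_fallback(loaders, url: str) -> str:
    # Try each loader in order; collect errors and move on until one works.
    errors: list[Exception] = []
    for loader in loaders:
        try:
            return await loader.load(url)
        except Exception as exc:  # a real chain would narrow this
            errors.append(exc)
    raise RuntimeError(f"all loaders failed: {errors}")

result = asyncio.run(
    load_with_fallback([FailingLoader(), EchoLoader()], "https://example.com")
)
print(result)  # content from https://example.com
```

This is the same shape of behavior the `Compose` loader shown in the Usage section provides.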
## Installation

```shell
uv tool install kabigon
# or run it directly without installing
uvx kabigon <url>

# Install Playwright browsers (required for browser-based loaders)
uvx playwright install chromium
# or
uvx playwright install chrome
```
## Usage

### CLI

```shell
uvx kabigon <url>

# List available loaders
uvx kabigon --list

# Use a specific loader chain
uvx kabigon --loader youtube,playwright https://www.youtube.com/watch?v=dQw4w9WgXcQ
uvx kabigon --loader twitter https://x.com/elonmusk/status/123456789

# Automatic loader selection
uvx kabigon https://www.youtube.com/watch?v=dQw4w9WgXcQ
uvx kabigon https://truthsocial.com/@realDonaldTrump/posts/123456
uvx kabigon https://reddit.com/r/python/comments/xyz/...
uvx kabigon https://github.com/anthropics/claude-code/blob/main/plugins/ralph-wiggum/README.md
uvx kabigon https://example.com/document.pdf
```
### Python API - Sync

```python
import kabigon

url = "https://www.google.com.tw"

# Simplest usage - automatically uses the best loader
content = kabigon.load_url_sync(url)
print(content)

# Or use a specific loader
content = kabigon.PlaywrightLoader().load_sync(url)
print(content)

# With multiple loaders (tries each in order)
loader = kabigon.Compose([
    kabigon.TwitterLoader(),
    kabigon.TruthSocialLoader(),
    kabigon.YoutubeLoader(),
    kabigon.RedditLoader(),
    kabigon.PDFLoader(),
    kabigon.PlaywrightLoader(),  # Fallback for generic URLs
])
content = loader.load_sync(url)
print(content)
```
### Python API - Async

```python
import asyncio

import kabigon

async def main():
    url = "https://www.google.com.tw"

    # Simplest usage - automatically uses the best loader
    content = await kabigon.load_url(url)
    print(content)

    # Or use a specific loader
    loader = kabigon.PlaywrightLoader()
    content = await loader.load(url)
    print(content)

    # Batch processing multiple URLs in parallel
    urls = [
        "https://x.com/user1/status/123",
        "https://truthsocial.com/@user/posts/456",
        "https://youtube.com/watch?v=abc",
        "https://reddit.com/r/python/comments/xyz",
    ]
    loader = kabigon.Compose([
        kabigon.TwitterLoader(),
        kabigon.TruthSocialLoader(),
        kabigon.YoutubeLoader(),
        kabigon.RedditLoader(),
        kabigon.PlaywrightLoader(),
    ])

    # Parallel processing through the loader chain
    results = await asyncio.gather(*[loader.load(url) for url in urls])
    for url, content in zip(urls, results):
        print(f"{url}: {len(content)} chars")

asyncio.run(main())
```
## API Comparison

| Usage | Simplest | Custom Loader Chain |
|---|---|---|
| Sync | `kabigon.load_url_sync(url)` | `loader.load_sync(url)` |
| Async | `await kabigon.load_url(url)` | `await loader.load(url)` |
| Batch Async | `await asyncio.gather(*[kabigon.load_url(url) for url in urls])` | `await asyncio.gather(*[loader.load(url) for url in urls])` |
## Supported Sources

| Source | Loader | Description |
|---|---|---|
| YouTube | `YoutubeLoader` | Extracts video transcripts |
| YouTube | `YoutubeYtdlpLoader` | Audio transcription via yt-dlp + Whisper |
| Twitter/X | `TwitterLoader` | Extracts tweet content |
| Truth Social | `TruthSocialLoader` | Extracts Truth Social posts |
| Reddit | `RedditLoader` | Extracts Reddit posts and comments |
| Instagram Reels | `ReelLoader` | Audio transcription + metadata |
| GitHub | `GitHubLoader` | Fetches GitHub web pages and file content (supports repo URLs and github.com/.../blob/... file URLs) |
| BBC | `BBCLoader` | BBC article extraction with article-aware parsing |
| CNN | `CNNLoader` | CNN article extraction with article-aware parsing |
| PDF | `PDFLoader` | Extracts text from PDF files (URL or local) |
| PTT | `PttLoader` | Taiwan PTT forum posts |
| Generic Web | `PlaywrightLoader` | Browser-based scraping for any website |
| Generic Web | `HttpxLoader` | Simple HTTP requests with markdown conversion |
## Examples

See the `examples/` directory for more usage examples:

- `simple_usage.py` - Basic single-line usage
- `async_usage.py` - Async usage and parallel batch processing
- `twitter.py` - Twitter/X post extraction
- `truthsocial.py` - Truth Social post extraction
- `read_reddit.py` - Reddit post and comments extraction
- `ptt.py` - PTT forum post extraction
- `fetch_billgertz_tweet.py` - Real-world Twitter scraping example
## Troubleshooting

### Playwright browser not installed

Error: `Executable doesn't exist at /path/to/chromium`

Solution: Install Playwright browsers after installing kabigon:

```shell
playwright install chromium
```
### FFmpeg not found (for audio transcription)

Error: `ffmpeg not found`

Solution: Install FFmpeg for your platform:

```shell
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# Windows: download from https://ffmpeg.org/download.html
```

Or set a custom FFmpeg path:

```shell
export FFMPEG_PATH=/path/to/ffmpeg
```
### Timeout errors

Error: `Timeout 30000ms exceeded`

Solution: Increase the timeout for slow-loading pages:

```python
import kabigon

# Increase the timeout to 60 seconds (the value is in milliseconds)
loader = kabigon.PlaywrightLoader(timeout=60_000)
content = loader.load_sync(url)
```
### CAPTCHA or rate limiting

Some websites may show CAPTCHAs or block automated access. For Reddit, kabigon automatically uses old.reddit.com to avoid CAPTCHAs. For other sites, you may need to:

- Add delays between requests
- Use a custom user agent
- Implement retry logic with exponential backoff
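Retry with exponential backoff can be sketched as a small wrapper around any loader call. This helper is not part of kabigon; `fetch` below stands in for any awaitable call such as `loader.load(url)`:

```python
import asyncio
import random

async def retry_with_backoff(fetch, attempts: int = 4, base_delay: float = 0.01) -> str:
    """Call `fetch` until it succeeds, sleeping longer after each failure."""
    for attempt in range(attempts):
        try:
            return await fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries, surface the last error
            # Delay grows as base_delay * 2^attempt, with up to 100% jitter
            # so parallel clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            await asyncio.sleep(delay)

# Demo: a flaky call that fails twice, then succeeds on the third attempt.
calls = {"count": 0}

async def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("rate limited")
    return "ok"

result = asyncio.run(retry_with_backoff(flaky))
print(result)  # ok
```

In real use you would pass a larger `base_delay` (a second or more) and catch only the exception types that indicate rate limiting.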
## Development

### Setup

```shell
# Clone the repository
git clone https://github.com/narumiruna/kabigon.git
cd kabigon

# Install dependencies with uv
uv sync

# Install Playwright browsers
playwright install chromium
```
### Testing

```shell
# Run all tests with coverage
uv run pytest -v -s --cov=src tests

# Run a specific test file
uv run pytest -v -s tests/loaders/test_youtube.py
```

Current test coverage: 69% (37 tests passing)
### Linting and Type Checking

```shell
# Run linter
uv run ruff check .

# Run type checker
uv run ty check .

# Auto-fix linting issues
uv run ruff check --fix .

# Format code
uv run ruff format .
```
### Building and Publishing

```shell
# Build wheel
uv build --wheel

# Publish to PyPI
uv publish
```
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

When adding a new loader:

1. Create a new file in `src/kabigon/loaders/`
2. Inherit from the `Loader` base class
3. Implement `async def load(url: str) -> str`
4. Add domain validation
5. Add tests in `tests/loaders/`
6. Update documentation
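As a rough template for those steps, a new loader might look like the sketch below. The `Loader` base class shown here is a self-contained stand-in so the example runs on its own; check `src/kabigon/loaders/` for kabigon's actual interface, and `ExampleComLoader` is a hypothetical name:

```python
import asyncio
from abc import ABC, abstractmethod
from urllib.parse import urlparse

# Stand-in base class; kabigon's real Loader may differ in detail.
class Loader(ABC):
    @abstractmethod
    async def load(self, url: str) -> str: ...

class ExampleComLoader(Loader):
    """Hypothetical loader for example.com pages."""

    def check_url(self, url: str) -> bool:
        # Domain validation: accept example.com and its subdomains only.
        host = urlparse(url).netloc
        return host == "example.com" or host.endswith(".example.com")

    async def load(self, url: str) -> str:
        if not self.check_url(url):
            raise ValueError(f"Unsupported URL: {url}")
        # A real loader would fetch the page and convert it to markdown here.
        return f"# Content of {url}"

loader = ExampleComLoader()
content = asyncio.run(loader.load("https://example.com/post/1"))
print(content)  # # Content of https://example.com/post/1
```

Keeping domain validation in its own method makes the loader easy to unit-test without network access.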
See CLAUDE.md for detailed development guidelines.
## License
MIT License - see LICENSE file for details.