ContentAPI Python SDK
Official Python SDK for ContentAPI — extract structured content from any URL.
Features
- 🌐 Web extraction — Clean markdown/text from any webpage, with JS rendering
- 🎬 YouTube — Transcripts, metadata, comments, chapters, summaries, channels, playlists
- 🐦 Twitter/X — Tweet and thread extraction
- 🤖 Reddit — Post extraction
- 🔍 Web search — Search the web programmatically
- 🧠 AI extraction — Extract structured data with JSON schema or natural language
- 📝 AI summarization — Summarize any content with AI
- 🔗 Site crawling — Crawl entire websites (async with polling)
- 🔄 URL monitoring — Detect changes on web pages
- 📦 Batch — Extract multiple URLs in a single request
- ⚡ Async support — Full async/await with httpx
- 🔄 Auto-retry — Exponential backoff on rate limits and server errors
- 📐 Type-safe — Pydantic v2 models with full type hints
Installation
```bash
pip install contentapi
```
Quick Start
```python
from contentapi import ContentAPI

client = ContentAPI(api_key="sk_live_...")

# Extract web content
result = client.web.extract("https://example.com")
print(result.title)       # "Example Domain"
print(result.content)     # Extracted content as markdown
print(result.word_count)  # 17
```
Usage
Web Extraction
```python
# Default extraction
result = client.web.extract("https://example.com")

# JavaScript rendering (for SPAs)
result = client.web.extract("https://spa-app.com", render_js=True)

# Bypass robots.txt
result = client.web.extract("https://example.com", ignore_robots=True)

# RAG chunking
result = client.web.extract("https://example.com", chunk_size=500, chunk_overlap=50)

# Access structured data
print(result.title)
print(result.content)
print(result.word_count)
```
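The chunking parameters split the extracted content into overlapping pieces sized for RAG pipelines. As a rough mental model (an illustrative sketch of word-based overlap chunking, not the SDK's actual algorithm), the splitting works like this:

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into word-based chunks where consecutive chunks share
    `chunk_overlap` words (illustrative only, not the SDK's implementation)."""
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary fully visible in at least one chunk, which generally improves retrieval quality.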
YouTube
```python
# Get transcript (with Whisper AI fallback for videos without captions)
transcript = client.youtube.transcript("https://youtube.com/watch?v=dQw4w9WgXcQ")
print(transcript.title)
print(transcript.full_text)
for segment in transcript.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")

# Get video metadata
metadata = client.youtube.metadata("https://youtube.com/watch?v=dQw4w9WgXcQ")
print(metadata.view_count, metadata.published_at)

# Get top comments
comments = client.youtube.comments("https://youtube.com/watch?v=dQw4w9WgXcQ", limit=20)
for c in comments.comments:
    print(f"@{c.author}: {c.text} ({c.likes} likes)")

# Get chapters from description
chapters = client.youtube.chapters("https://youtube.com/watch?v=dQw4w9WgXcQ")
for ch in chapters.chapters:
    print(f"{ch.formatted_time} - {ch.title}")

# AI-generated summary
summary = client.youtube.summary("https://youtube.com/watch?v=dQw4w9WgXcQ")
print(summary.summary)
print(summary.key_points)
print(summary.topics)

# Channel metadata + recent videos
channel = client.youtube.channel("@mkbhd")
print(f"{channel.name} - {channel.subscribers} subscribers")
for video in channel.recent_videos:
    print(f"  {video.title} ({video.views} views)")

# Playlist extraction
playlist = client.youtube.playlist("https://youtube.com/playlist?list=PLrAXt...")
for video in playlist.videos:
    print(f"#{video.position} {video.title}")
```
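Transcript segments carry start times in plain seconds. If you want to print them in the familiar YouTube timestamp style, a small helper (not part of the SDK, just a convenience sketch) does the formatting:

```python
def format_timestamp(seconds: float) -> str:
    """Render a start time in seconds as M:SS or H:MM:SS, YouTube-style."""
    total = int(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}" if h else f"{m}:{s:02d}"
```

For example, you could print `f"[{format_timestamp(segment.start)}] {segment.text}"` in the transcript loop above.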
AI Schema Extraction
```python
# Extract structured data with a JSON schema
data = client.ai.extract(
    url="https://news.ycombinator.com",
    schema={
        "top_stories": [{"title": "string", "points": "number", "url": "string"}]
    },
)
print(data.extracted)  # Structured data matching your schema

# Or use natural language
data = client.ai.extract(
    url="https://amazon.com/product/...",
    prompt="Extract the product name, price, and rating",
)
print(data.extracted)
```
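AI extraction is probabilistic, so it can be worth sanity-checking the returned data before using it downstream. Here is a minimal shallow validator for the simple `"string"`/`"number"` schema style shown above (an illustrative sketch written for this README, not an SDK feature):

```python
def matches_schema(value, schema) -> bool:
    """Shallow check that extracted data matches a simple schema sketch
    where leaf types are named "string", "number", or "boolean"."""
    type_map = {"string": str, "number": (int, float), "boolean": bool}
    if isinstance(schema, dict):
        return isinstance(value, dict) and all(
            key in value and matches_schema(value[key], sub)
            for key, sub in schema.items()
        )
    if isinstance(schema, list):
        # A one-element list schema means "list of items shaped like this".
        return isinstance(value, list) and all(
            matches_schema(item, schema[0]) for item in value
        )
    return isinstance(value, type_map.get(schema, object))
```

For stricter guarantees you could instead model the expected shape as a Pydantic model (the SDK already depends on Pydantic v2) and call `model_validate` on `data.extracted`.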
AI Summarization
```python
result = client.ai.summarize(
    content="Long article text here...",
    title="Optional title",
)
print(result.summary)     # Concise 2-3 sentence summary
print(result.key_points)  # List of key takeaways
print(result.topics)      # Auto-detected topics
```
Site Crawling
```python
import time

# Start an async crawl
crawl = client.crawl.start(
    url="https://docs.example.com",
    max_pages=50,
    include_patterns=["/docs/*"],
    webhook_url="https://myapp.com/hook",  # Optional: get notified when done
)
print(f"Crawl started: {crawl.crawl_id}")

# Poll for results
while True:
    status = client.crawl.get(crawl.crawl_id)
    if status.status in ("completed", "failed"):
        break
    print(f"Progress: {status.pages_completed}/{status.pages_found}")
    time.sleep(5)

# Access results
for page in status.results:
    print(f"{page.url} — {page.word_count} words")
    print(page.content[:200])
```
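The polling loop above runs forever if a crawl never finishes. It can be factored into a reusable helper with a timeout; this is an illustrative utility written for this README, not an SDK function:

```python
import time

def poll_until(fetch, is_done, interval: float = 5.0, timeout: float = 600.0):
    """Call fetch() every `interval` seconds until is_done(result) is true,
    returning the final result, or raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("polling timed out")
        time.sleep(interval)
```

With the crawl example it would be used as `status = poll_until(lambda: client.crawl.get(crawl.crawl_id), lambda s: s.status in ("completed", "failed"))`.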
URL Monitoring (Change Detection)
```python
# Create a monitor
monitor = client.monitor.create(
    url="https://competitor.com/pricing",
    interval_hours=24,
    webhook_url="https://myapp.com/changes",
)
print(f"Monitor active: {monitor.monitor_id}")

# List all monitors
monitors = client.monitor.list()
for m in monitors.monitors:
    print(f"{m.url} — next check: {m.next_check}")

# Get change history
details = client.monitor.get(monitor.monitor_id)
for check in details.checks:
    if check.changed:
        print(f"Changed at {check.checked_at}: {check.diff_summary}")

# Delete a monitor
client.monitor.delete(monitor.monitor_id)
```
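If you keep your own snapshots of monitored pages, Python's standard `difflib` gives a quick client-side view of what changed, independent of the `diff_summary` the API returns. A minimal sketch:

```python
import difflib

def summarize_change(old: str, new: str, limit: int = 5) -> list[str]:
    """Return the first few added ("+") and removed ("-") lines
    between two text snapshots, using a unified diff."""
    diff = difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="")
    changes = [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
    return changes[:limit]
```

This is only a line-level approximation; rendered pages often differ in markup noise, so comparing the extracted markdown (rather than raw HTML) gives cleaner diffs.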
Twitter / X
```python
tweet = client.twitter.tweet("https://x.com/user/status/123456789")
print(tweet.content)

thread = client.twitter.thread("https://x.com/user/status/123456789")
for t in thread.tweets:
    print(t.text, t.likes)
```
Reddit
```python
post = client.reddit.post("https://reddit.com/r/Python/comments/abc123/my_post/")
print(post.title, post.score)
print(post.content)
```
Web Search
```python
results = client.search("python RAG tutorial", count=5)
for item in results.results:
    print(f"{item.title}: {item.url}")
```
Batch Extraction
```python
batch = client.batch([
    "https://example.com",
    "https://youtube.com/watch?v=dQw4w9WgXcQ",
])
print(f"{batch.summary.succeeded}/{batch.summary.total} succeeded")
```
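If your plan caps the number of URLs per batch request (a plausible constraint; the actual limit is not documented here), you can split a large URL list into fixed-size groups before calling `client.batch`:

```python
def batched(items: list, size: int) -> list[list]:
    """Split a list into consecutive sublists of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Used as `for group in batched(urls, 10): client.batch(group)`, with the group size set to whatever limit applies to your account.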
Async Usage
```python
import asyncio
from contentapi import ContentAPI

async def main():
    async with ContentAPI(api_key="sk_live_...", async_mode=True) as client:
        # Parallel requests
        web, yt = await asyncio.gather(
            client.web.extract("https://example.com"),
            client.youtube.transcript("https://youtube.com/watch?v=dQw4w9WgXcQ"),
        )
        print(web.title, yt.full_text[:100])

asyncio.run(main())
```
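A bare `asyncio.gather` over hundreds of URLs fires every request at once and invites rate limiting. A semaphore caps how many are in flight at a time; this helper is a generic asyncio pattern, not an SDK feature:

```python
import asyncio

async def gather_limited(coros, limit: int = 5):
    """Await coroutines with at most `limit` running concurrently,
    returning results in the original order (like asyncio.gather)."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

Used as `results = await gather_limited([client.web.extract(u) for u in urls], limit=5)` inside an async context.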
Error Handling
```python
from contentapi import (
    ContentAPI,
    ContentAPIError,
    AuthenticationError,
    RateLimitError,
    QuotaExceededError,
    ExtractionError,
)

try:
    result = client.web.extract("https://example.com")
except AuthenticationError:
    print("Invalid API key!")
except RateLimitError as e:
    print(f"Rate limited! Retry after {e.retry_after}s")
except QuotaExceededError:
    print("Out of credits!")
except ExtractionError as e:
    print(f"Extraction failed: {e.message}")
```
The SDK automatically retries on 429 and 503 errors with exponential backoff.
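As a rough sketch of what an exponential backoff schedule with jitter looks like (illustrative only; the SDK's exact delays and jitter strategy are not specified here):

```python
import random

def backoff_delays(max_retries: int = 3, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Delay before each retry: base * 2**attempt, capped at `cap`,
    plus up to 10% random jitter to avoid synchronized retries."""
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        delays.append(delay + random.uniform(0, delay * 0.1))
    return delays
```

The jitter matters when many clients hit a rate limit simultaneously: without it, they would all retry at the same instant and collide again.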
Configuration
```python
client = ContentAPI(
    api_key="sk_live_...",               # Required
    base_url="https://api.example.com",  # Custom base URL
    timeout=60.0,                        # Request timeout (seconds)
    max_retries=3,                       # Max retry attempts
)
```
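A common pattern is to read these settings from the environment instead of hard-coding the API key. The variable names below are conventions chosen for this sketch, not ones the SDK reads automatically:

```python
import os

def load_settings(env=os.environ) -> dict:
    """Build ContentAPI constructor kwargs from environment variables
    (CONTENTAPI_* names are assumptions of this sketch)."""
    return {
        "api_key": env.get("CONTENTAPI_API_KEY", ""),
        "timeout": float(env.get("CONTENTAPI_TIMEOUT", "60")),
        "max_retries": int(env.get("CONTENTAPI_MAX_RETRIES", "3")),
    }
```

Then construct the client with `client = ContentAPI(**load_settings())`, keeping the live key out of source control.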
Also Available
- TypeScript SDK — `npm install contentapi`
- MCP Server — `npx @contentapi/mcp-server` (for Claude, Cursor, Windsurf)
- LangChain — `pip install langchain-contentapi`
- LlamaIndex — `pip install llamaindex-contentapi`
Requirements
- Python ≥ 3.9
- httpx ≥ 0.25
- pydantic ≥ 2.0
License
MIT — see LICENSE.