Python SDK for PostCrawl - The Fastest LLM-Ready Social Media Crawler
PostCrawl Python SDK
Official Python SDK for PostCrawl - The Fastest LLM-Ready Social Media Crawler. Extract and search content from Reddit and TikTok with a simple, type-safe Python interface.
Features
- 🔍 Search across Reddit and TikTok with advanced filtering
- 📊 Extract content from social media URLs with optional comments
- 🚀 Combined search and extract in a single operation
- 🏷️ Type-safe with Pydantic models and full type hints
- ⚡ Async/await support with synchronous convenience methods
- 🛡️ Comprehensive error handling with detailed exceptions
- 📈 Rate limiting support with credit tracking
- 🔄 Automatic retries for network errors
- 🎯 Platform-specific models for Reddit and TikTok data with strong typing
- 📝 Rich content formatting with markdown support
- 🐍 Python 3.10+ with modern type annotations and snake_case naming
Installation
Using uv (Recommended)
uv is a fast Python package manager that we recommend:
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Add postcrawl to your project
uv add postcrawl
Using pip
pip install postcrawl
Optional: Environment Variables
For loading API keys from .env files:
uv add python-dotenv
# or
pip install python-dotenv
Requirements
- Python 3.10 or higher
- PostCrawl API key (Get one for free)
Quick Start
Async Usage (Recommended)
import asyncio
from postcrawl import PostCrawlClient

async def main():
    # Initialize the client with your API key
    async with PostCrawlClient(api_key="sk_your_api_key_here") as pc:
        # Search for content
        results = await pc.search(
            social_platforms=["reddit"],
            query="machine learning",
            results=10,
            page=1
        )
        # Process results
        for post in results:
            print(f"{post.title} - {post.url}")
            print(f" Date: {post.date}")
            print(f" Snippet: {post.snippet[:100]}...")

# Run the async function
asyncio.run(main())
Synchronous Usage
from postcrawl import PostCrawlClient

# Initialize the client
pc = PostCrawlClient(api_key="sk_your_api_key_here")

# Search synchronously
results = pc.search_sync(
    social_platforms=["reddit", "tiktok"],
    query="artificial intelligence",
    results=5
)

# Extract content from URLs
posts = pc.extract_sync(
    urls=["https://reddit.com/r/...", "https://tiktok.com/@..."],
    include_comments=True
)
API Reference
Search
results = await pc.search(
    social_platforms=["reddit", "tiktok"],
    query="your search query",
    results=10,  # 1-100
    page=1  # pagination
)
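The page parameter suggests that larger result sets are fetched one page at a time. A minimal sketch of collecting several pages follows. The helper name collect_pages and the rule of stopping at the first empty page are assumptions made for illustration, not documented SDK behavior, and fake_search is only a stand-in so the sketch runs without network access.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def collect_pages(
    fetch_page: Callable[[int], Awaitable[list[Any]]],
    max_pages: int = 5,
) -> list[Any]:
    """Call fetch_page(1), fetch_page(2), ... and concatenate the results.

    Stops early when a page comes back empty (assumed end-of-results signal).
    """
    collected: list[Any] = []
    for page in range(1, max_pages + 1):
        batch = await fetch_page(page)
        if not batch:
            break
        collected.extend(batch)
    return collected


# Hypothetical stand-in for a real search call: two pages of data, then nothing.
async def fake_search(page: int) -> list[str]:
    data = {1: ["a", "b"], 2: ["c"]}
    return data.get(page, [])


if __name__ == "__main__":
    print(asyncio.run(collect_pages(fake_search)))
```

With the SDK, fetch_page could plausibly be a small wrapper such as lambda page: pc.search(social_platforms=["reddit"], query="machine learning", results=10, page=page). Each page is a separate request, so it counts against your credits.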
Extract
posts = await pc.extract(
    urls=["https://reddit.com/...", "https://tiktok.com/..."],
    include_comments=True,
    response_mode="raw",
    comment_filter_config={
        "min_score": 10,
        "max_depth": 2
    }
)
Search and Extract
posts = await pc.search_and_extract(
    social_platforms=["reddit"],
    query="search query",
    results=5,
    page=1,
    include_comments=True,
    response_mode="markdown",
    comment_filter_config={
        "tier_limits": {"0": 5, "1": 3},
        "preserve_high_quality_threads": True
    }
)
Comment Filtering
The comment_filter_config parameter filters comments server-side, which reduces data transfer and improves performance. It accepts either a plain dictionary or a CommentFilterConfig instance:
from postcrawl.types import CommentFilterConfig

posts = await pc.extract(
    urls=["..."],
    include_comments=True,
    comment_filter_config=CommentFilterConfig(
        # Limit comments by depth level
        tier_limits={
            "0": 10,  # Max 10 top-level comments
            "1": 5,   # Max 5 replies per comment
            "2": 2    # Max 2 nested replies
        },
        # Minimum score/likes threshold
        min_score=10,
        # Minimum quality relative to top comment (0.0-1.0)
        top_comment_percentile=0.1,
        # Maximum depth to traverse
        max_depth=5,
        # Preserve more replies for high-quality threads
        preserve_high_quality_threads=True,
        high_quality_thread_score=100
    )
)
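To build intuition for tier_limits and min_score, here is a rough client-side sketch of per-depth limiting on a nested comment tree. It is not the server's algorithm. The comment shape (dicts with "score" and "replies" keys), the function name filter_comments, and the choice to keep the highest-scoring comments first are all assumptions made for illustration.

```python
from typing import Any

# Hypothetical comment shape: {"score": int, "replies": [comment, ...]}
Comment = dict[str, Any]


def filter_comments(
    comments: list[Comment],
    tier_limits: dict[str, int],
    min_score: int = 0,
    depth: int = 0,
) -> list[Comment]:
    """Keep at most tier_limits[str(depth)] comments per level, dropping low scores."""
    # Drop comments below the score threshold.
    kept = [c for c in comments if c.get("score", 0) >= min_score]
    # Assumption: when a limit applies, the highest-scoring comments survive.
    kept.sort(key=lambda c: c.get("score", 0), reverse=True)
    limit = tier_limits.get(str(depth))
    if limit is not None:
        kept = kept[:limit]
    # Apply the same rules one level down for each surviving comment.
    return [
        {**c, "replies": filter_comments(c.get("replies", []), tier_limits, min_score, depth + 1)}
        for c in kept
    ]
```

Filtering on the server, as the SDK does, avoids downloading comments that would be discarded anyway. A local pass like this is only useful for post-processing data you already have.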
Synchronous Methods
# All methods have synchronous versions
results = pc.search_sync(...)
posts = pc.extract_sync(...)
combined = pc.search_and_extract_sync(...)
Examples
Check out the examples/ directory for complete working examples:
- search_101.py - Basic search functionality demo
- extract_101.py - Content extraction demo
- search_and_extract_101.py - Combined operation demo
Run examples with:
# Using uv (recommended)
uv run python examples/search_101.py
# Or with standard Python
cd examples
python search_101.py
Response Models
SearchResult
Response from the search endpoint:
- title: Title of the search result
- url: URL of the search result
- snippet: Text snippet from the content
- date: Date of the post (e.g., "Dec 28, 2024")
- image_url: URL of associated image (can be empty string)
ExtractedPost
- url: Original URL
- source: Platform name ("reddit" or "tiktok")
- raw: Raw content data (RedditPost or TiktokPost object), strongly typed
- markdown: Markdown formatted content (when response_mode="markdown")
- error: Error message if extraction failed
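Because each ExtractedPost carries its own error field, a batch extraction can partly succeed. A small sketch of separating successes from failures follows. The helper name split_by_error is hypothetical, and SimpleNamespace objects stand in for real ExtractedPost instances so the snippet runs without the SDK.

```python
from types import SimpleNamespace


def split_by_error(posts):
    """Return (succeeded, failed), based on whether each post's error is set."""
    succeeded = [p for p in posts if not p.error]
    failed = [p for p in posts if p.error]
    return succeeded, failed


# Stand-in objects exposing only the url and error attributes.
demo = [
    SimpleNamespace(url="https://example.com/a", error=None),
    SimpleNamespace(url="https://example.com/b", error="not found"),
]
ok, failed = split_by_error(demo)
print(len(ok), len(failed))  # prints: 1 1
```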
Working with Platform-Specific Types
The SDK provides type-safe access to platform-specific data:
from postcrawl import PostCrawlClient, RedditPost, TiktokPost

# Extract content with proper type handling
posts = await pc.extract(urls=["https://reddit.com/..."])

for post in posts:
    if post.error:
        print(f"Error: {post.error}")
    elif isinstance(post.raw, RedditPost):
        # Access Reddit-specific fields with snake_case attributes
        print(f"Subreddit: r/{post.raw.subreddit_name}")
        print(f"Score: {post.raw.score}")
        print(f"Title: {post.raw.title}")
        print(f"Upvotes: {post.raw.upvotes}")
        print(f"Created: {post.raw.created_at}")
        if post.raw.comments:
            print(f"Comments: {len(post.raw.comments)}")
    elif isinstance(post.raw, TiktokPost):
        # Access TikTok-specific fields with snake_case attributes
        print(f"Username: @{post.raw.username}")
        print(f"Likes: {post.raw.likes}")
        print(f"Total Comments: {post.raw.total_comments}")
        print(f"Created: {post.raw.created_at}")
        if post.raw.hashtags:
            print(f"Hashtags: {', '.join(post.raw.hashtags)}")
Error Handling
from postcrawl.exceptions import (
    AuthenticationError,       # Invalid API key
    InsufficientCreditsError,  # Not enough credits
    RateLimitError,            # Rate limit exceeded
    ValidationError            # Invalid parameters
)
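Of these exceptions, a rate-limit error is the one most likely to succeed on a later attempt, while the others usually call for a fix on the caller's side. Below is a generic sketch of retrying an async call with exponential backoff. The helper name call_with_backoff and its parameters are assumptions made for illustration. It is not part of the SDK, and it deliberately avoids importing postcrawl so that it runs standalone.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_backoff(
    make_call: Callable[[], Awaitable[T]],
    retry_on: tuple[type[Exception], ...],
    max_attempts: int = 4,
    base_delay: float = 1.0,
) -> T:
    """Await make_call(), retrying with exponential backoff on retry_on errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

With the SDK, this might be called as await call_with_backoff(lambda: pc.search(social_platforms=["reddit"], query="machine learning", results=10), retry_on=(RateLimitError,)). Errors such as AuthenticationError or ValidationError are left out of retry_on on purpose, since repeating the same request would fail the same way.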
Development
This project uses uv for dependency management. See DEVELOPMENT.md for detailed setup and contribution guidelines.
Quick Development Setup
# Clone the repository
git clone https://github.com/post-crawl/python-sdk.git
cd python-sdk
# Install dependencies
uv sync
# Run tests
make test
# Run all checks (format, lint, test)
make check
# Build the package
make build
Available Commands
make help # Show all available commands
make format # Format code with black and ruff
make lint # Run linting and type checking
make test # Run test suite
make check # Run format, lint, and tests
make build # Build distribution packages
make verify # Verify package installation
make publish-test # Publish to TestPyPI
API Key Management
Environment Variables (Recommended)
Store your API key securely in environment variables:
export POSTCRAWL_API_KEY="sk_your_api_key_here"
Or use a .env file:
# .env
POSTCRAWL_API_KEY=sk_your_api_key_here
Then load it in your code:
import os
from dotenv import load_dotenv
from postcrawl import PostCrawlClient
load_dotenv()
pc = PostCrawlClient(api_key=os.getenv("POSTCRAWL_API_KEY"))
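Note that os.getenv returns None when the variable is missing, so a typo in the variable name only surfaces later as an authentication failure. A small sketch of failing fast instead follows. The helper name require_api_key is hypothetical and not part of the SDK.

```python
import os


def require_api_key(var_name: str = "POSTCRAWL_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is missing or blank."""
    key = os.getenv(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it or add it to your .env file"
        )
    return key
```

Call load_dotenv() first if you rely on a .env file, then pass the result as api_key=require_api_key().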
Security Best Practices
- Never hardcode API keys in your source code
- Add .env to .gitignore to prevent accidental commits
- Use environment variables in production
- Rotate keys regularly through the PostCrawl dashboard
- Set key permissions to limit access to specific operations
Rate Limits & Credits
PostCrawl uses a credit-based system:
- Search: ~1 credit per 10 results
- Extract: ~1 credit per URL (without comments)
- Extract with comments: ~3 credits per URL
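The approximate rates above can be turned into a rough budget estimate before running a batch. The sketch below is only as accurate as those listed figures. The function name estimate_credits is hypothetical, and rounding search cost up to the next block of 10 results is an assumption.

```python
import math


def estimate_credits(
    search_results: int = 0,
    urls: int = 0,
    include_comments: bool = False,
) -> int:
    """Rough credit estimate from the approximate per-operation rates."""
    # ~1 credit per 10 search results, rounded up (assumed rounding).
    search_cost = math.ceil(search_results / 10)
    # ~3 credits per URL with comments, ~1 without.
    per_url = 3 if include_comments else 1
    return search_cost + urls * per_url


print(estimate_credits(search_results=25))               # prints: 3
print(estimate_credits(urls=4, include_comments=True))   # prints: 12
```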
Rate limit details are returned in response headers and exposed on the client as rate_limit_info after a request:
pc = PostCrawlClient(api_key="sk_...")
results = await pc.search(...)
print(f"Rate limit: {pc.rate_limit_info['limit']}")
print(f"Remaining: {pc.rate_limit_info['remaining']}")
print(f"Reset at: {pc.rate_limit_info['reset']}")
Support
- Documentation: github.com/post-crawl/python-sdk
- Issues: github.com/post-crawl/python-sdk/issues
- Email: support@postcrawl.com
License
MIT License - see LICENSE file for details.
File details
Details for the file postcrawl-1.2.0.tar.gz.
File metadata
- Download URL: postcrawl-1.2.0.tar.gz
- Upload date:
- Size: 18.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.7.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b431037ad80e6c811a36612abe11f9437e84937c901a5d5d6baf4030a291cb3e |
| MD5 | 2ca77299817d4b454264bbd8f56cf00e |
| BLAKE2b-256 | 337ae3c60d28e0730cdbe93d75aee931783afa9cae7ed113c3d140d2a72a8645 |
File details
Details for the file postcrawl-1.2.0-py3-none-any.whl.
File metadata
- Download URL: postcrawl-1.2.0-py3-none-any.whl
- Upload date:
- Size: 18.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.7.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5f8ffea3acfb7a6289a20751e5bb627d343de6e6fb89d1f405cd99f60566e7e1 |
| MD5 | 7932defa1872eb90ceb0db19c6179ff4 |
| BLAKE2b-256 | eeb43e128e45fb6eeb493a139b076bbde9ba50ccaa6a14b04a9c188a6436de6c |