Python SDK for PostCrawl - The Fastest LLM Ready Social Media Crawler

PostCrawl Python SDK

Official Python SDK for PostCrawl - The Fastest LLM-Ready Social Media Crawler. Extract and search content from Reddit and TikTok with a simple, type-safe Python interface.

Features

  • 🔍 Search across Reddit and TikTok with advanced filtering
  • 📊 Extract content from social media URLs with optional comments
  • 🚀 Combined search and extract in a single operation
  • 🏷️ Type-safe with Pydantic models and full type hints
  • ⚡ Async/await support with synchronous convenience methods
  • 🛡️ Comprehensive error handling with detailed exceptions
  • 📈 Rate limiting support with credit tracking
  • 🔄 Automatic retries for network errors
  • 🎯 Platform-specific models for Reddit and TikTok data with strong typing
  • 📝 Rich content formatting with markdown support
  • 🐍 Python 3.10+ with modern type annotations and snake_case naming

Installation

Using uv (Recommended)

uv is a fast Python package manager that we recommend:

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Add postcrawl to your project
uv add postcrawl

Using pip

pip install postcrawl

Optional: Environment Variables

To load API keys from a .env file, also install python-dotenv:

uv add python-dotenv
# or
pip install python-dotenv

Requirements

  • Python 3.10 or later

Quick Start

Async Usage (Recommended)

import asyncio
from postcrawl import PostCrawlClient

async def main():
    # Initialize the client with your API key
    async with PostCrawlClient(api_key="sk_your_api_key_here") as pc:
        # Search for content
        results = await pc.search(
            social_platforms=["reddit"],
            query="machine learning",
            results=10,
            page=1
        )

        # Process results
        for post in results:
            print(f"{post.title} - {post.url}")
            print(f"  Date: {post.date}")
            print(f"  Snippet: {post.snippet[:100]}...")

# Run the async function
asyncio.run(main())

Synchronous Usage

from postcrawl import PostCrawlClient

# Initialize the client
pc = PostCrawlClient(api_key="sk_your_api_key_here")

# Search synchronously
results = pc.search_sync(
    social_platforms=["reddit", "tiktok"],
    query="artificial intelligence",
    results=5
)

# Extract content from URLs
posts = pc.extract_sync(
    urls=["https://reddit.com/r/...", "https://tiktok.com/@..."],
    include_comments=True
)

API Reference

Search

results = await pc.search(
    social_platforms=["reddit", "tiktok"],
    query="your search query",
    results=10,  # 1-100
    page=1       # pagination
)
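
The page parameter lets you walk through results beyond the first page. The following is a minimal sketch of that loop, assuming an empty page marks the end of the results (this README does not confirm that). The names iter_pages, collect, and fetch_page are hypothetical helpers, not part of the SDK.

```python
import asyncio


async def iter_pages(fetch_page, max_pages=5):
    """Yield items page by page until a page comes back empty.

    fetch_page is any async callable that takes a 1-based page number
    and returns a list. This helper is hypothetical, not an SDK function.
    """
    for page in range(1, max_pages + 1):
        items = await fetch_page(page)
        if not items:  # assumption: an empty page means no more results
            return
        for item in items:
            yield item


async def collect(fetch_page, max_pages=5):
    """Gather everything iter_pages yields into one list."""
    return [item async for item in iter_pages(fetch_page, max_pages)]


if __name__ == "__main__":
    async def fake_fetch(page):  # stand-in for pc.search(..., page=page)
        return {1: ["a", "b"], 2: ["c"]}.get(page, [])

    print(asyncio.run(collect(fake_fetch)))
```

With the SDK, fetch_page could be something like lambda page: pc.search(social_platforms=["reddit"], query="your search query", results=10, page=page). Each page is a separate request, so the max_pages cap also bounds credit usage.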

Extract

posts = await pc.extract(
    urls=["https://reddit.com/...", "https://tiktok.com/..."],
    include_comments=True,
    response_mode="raw",
    comment_filter_config={
        "min_score": 10,
        "max_depth": 2
    }
)

Search and Extract

posts = await pc.search_and_extract(
    social_platforms=["reddit"],
    query="search query",
    results=5,
    page=1,
    include_comments=True,
    response_mode="markdown",
    comment_filter_config={
        "tier_limits": {"0": 5, "1": 3},
        "preserve_high_quality_threads": True
    }
)

Comment Filtering

The comment_filter_config parameter, passed as a plain dictionary or as a CommentFilterConfig object, filters comments server-side to reduce data transfer and improve performance:

from postcrawl.types import CommentFilterConfig

posts = await pc.extract(
    urls=["..."],
    include_comments=True,
    comment_filter_config=CommentFilterConfig(
        # Limit comments by depth level
        tier_limits={
            "0": 10, # Max 10 top-level comments
            "1": 5,  # Max 5 replies per comment
            "2": 2   # Max 2 nested replies
        },
        
        # Minimum score/likes threshold
        min_score=10,
        
        # Minimum quality relative to top comment (0.0-1.0)
        top_comment_percentile=0.1,
        
        # Maximum depth to traverse
        max_depth=5,
        
        # Preserve more replies for high-quality threads
        preserve_high_quality_threads=True,
        high_quality_thread_score=100
    )
)
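
To build intuition for tier_limits and min_score, here is a small local illustration of per-depth pruning on a comment tree. It is only a sketch: the server-side algorithm is not documented in this README, the Comment class is a hypothetical stand-in for the SDK's models, and ranking by score before applying each limit is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Comment:  # hypothetical shape, not the SDK's comment model
    text: str
    score: int
    replies: list["Comment"] = field(default_factory=list)


def prune(comments, tier_limits, min_score=0, depth=0):
    """Keep the top-scoring comments at each depth, up to tier_limits[str(depth)]."""
    # Drop comments below the score threshold
    kept = [c for c in comments if c.score >= min_score]
    # Assumption: higher-scoring comments are kept first
    kept.sort(key=lambda c: c.score, reverse=True)
    limit = tier_limits.get(str(depth))
    if limit is not None:
        kept = kept[:limit]
    # Recurse into replies with the next depth's limit
    return [
        Comment(c.text, c.score, prune(c.replies, tier_limits, min_score, depth + 1))
        for c in kept
    ]
```

Applied to a thread with three top-level comments, tier_limits={"0": 2, "1": 1} would keep the two best top-level comments and the single best reply under each.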

Synchronous Methods

# All methods have synchronous versions
results = pc.search_sync(...)
posts = pc.extract_sync(...)
combined = pc.search_and_extract_sync(...)

Examples

Check out the examples/ directory for complete working examples:

  • search_101.py - Basic search functionality demo
  • extract_101.py - Content extraction demo
  • search_and_extract_101.py - Combined operation demo

Run examples with:

# Using uv (recommended)
uv run python examples/search_101.py

# Or with standard Python
cd examples
python search_101.py

Response Models

SearchResult

Response from the search endpoint:

  • title: Title of the search result
  • url: URL of the search result
  • snippet: Text snippet from the content
  • date: Date of the post (e.g., "Dec 28, 2024")
  • image_url: URL of associated image (can be empty string)
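
Because date arrives as display text such as "Dec 28, 2024", sorting or filtering by date needs a parsing step. A minimal sketch, assuming every date follows the format of that one example. parse_post_date is a hypothetical helper, not part of the SDK.

```python
from datetime import date, datetime
from typing import Optional


def parse_post_date(text: str) -> Optional[date]:
    """Parse a date such as "Dec 28, 2024"; return None if it does not match."""
    try:
        # %b is the abbreviated month name in the current locale (English by default)
        return datetime.strptime(text.strip(), "%b %d, %Y").date()
    except ValueError:
        return None
```

Returning None instead of raising keeps one unexpected date string from aborting a loop over many search results.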

ExtractedPost

  • url: Original URL
  • source: Platform name ("reddit" or "tiktok")
  • raw: Raw content data (RedditPost or TiktokPost object) - strongly typed
  • markdown: Markdown formatted content (when response_mode="markdown")
  • error: Error message if extraction failed

Working with Platform-Specific Types

The SDK provides type-safe access to platform-specific data:

import asyncio
from postcrawl import PostCrawlClient, RedditPost, TiktokPost

async def main():
    async with PostCrawlClient(api_key="sk_your_api_key_here") as pc:
        # Extract content with proper type handling
        posts = await pc.extract(urls=["https://reddit.com/..."])

        for post in posts:
            if post.error:
                print(f"Error: {post.error}")
            elif isinstance(post.raw, RedditPost):
                # Access Reddit-specific fields with snake_case attributes
                print(f"Subreddit: r/{post.raw.subreddit_name}")
                print(f"Score: {post.raw.score}")
                print(f"Title: {post.raw.title}")
                print(f"Upvotes: {post.raw.upvotes}")
                print(f"Created: {post.raw.created_at}")
                if post.raw.comments:
                    print(f"Comments: {len(post.raw.comments)}")
            elif isinstance(post.raw, TiktokPost):
                # Access TikTok-specific fields with snake_case attributes
                print(f"Username: @{post.raw.username}")
                print(f"Likes: {post.raw.likes}")
                print(f"Total Comments: {post.raw.total_comments}")
                print(f"Created: {post.raw.created_at}")
                if post.raw.hashtags:
                    print(f"Hashtags: {', '.join(post.raw.hashtags)}")

asyncio.run(main())

Error Handling

from postcrawl.exceptions import (
    AuthenticationError,      # Invalid API key
    InsufficientCreditsError, # Not enough credits
    RateLimitError,          # Rate limit exceeded
    ValidationError          # Invalid parameters
)
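
A common way to use these exceptions is to retry only the ones that are worth retrying, such as RateLimitError, and let the others propagate. The sketch below is a generic helper under that assumption. call_with_backoff is a hypothetical name, not an SDK function, and the delays are illustrative rather than values recommended by PostCrawl.

```python
import asyncio
import random


async def call_with_backoff(make_call, retry_on, attempts=3, base_delay=0.5):
    """Await make_call(), retrying when it raises one of the retry_on exceptions.

    make_call is a zero-argument callable that returns a fresh awaitable
    each time, because a coroutine object cannot be awaited twice.
    """
    for attempt in range(attempts):
        try:
            return await make_call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff with up to 10% jitter: ~0.5s, ~1s, ~2s, ...
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


# Possible usage with the SDK (not executed here):
#   results = await call_with_backoff(
#       lambda: pc.search(social_platforms=["reddit"], query="...", results=10),
#       retry_on=RateLimitError,
#   )
```

Errors such as AuthenticationError or ValidationError are deliberately not retried here, since repeating the same request would not change the outcome.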

Development

This project uses uv for dependency management. See DEVELOPMENT.md for detailed setup and contribution guidelines.

Quick Development Setup

# Clone the repository
git clone https://github.com/post-crawl/python-sdk.git
cd python-sdk

# Install dependencies
uv sync

# Run tests
make test

# Run all checks (format, lint, test)
make check

# Build the package
make build

Available Commands

make help         # Show all available commands
make format       # Format code with black and ruff
make lint         # Run linting and type checking
make test         # Run test suite
make check        # Run format, lint, and tests
make build        # Build distribution packages
make verify       # Verify package installation
make publish-test # Publish to TestPyPI

API Key Management

Environment Variables (Recommended)

Store your API key securely in environment variables:

export POSTCRAWL_API_KEY="sk_your_api_key_here"

Or use a .env file:

# .env
POSTCRAWL_API_KEY=sk_your_api_key_here

Then load it in your code:

import os
from dotenv import load_dotenv
from postcrawl import PostCrawlClient

load_dotenv()
pc = PostCrawlClient(api_key=os.getenv("POSTCRAWL_API_KEY"))
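
Note that os.getenv returns None when the variable is missing, so a misconfigured environment may only surface later as an authentication failure. A minimal sketch of failing fast instead follows. require_api_key is a hypothetical helper, not part of the SDK.

```python
import os


def require_api_key(env=None, name="POSTCRAWL_API_KEY"):
    """Return the API key from the environment, or raise if it is missing."""
    # env defaults to the real process environment; a dict can be passed in tests
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it or add it to your .env file")
    return key
```

The client could then be created with PostCrawlClient(api_key=require_api_key()) after load_dotenv() has run.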

Security Best Practices

  • Never hardcode API keys in your source code
  • Add .env to .gitignore to prevent accidental commits
  • Use environment variables in production
  • Rotate keys regularly through the PostCrawl dashboard
  • Set key permissions to limit access to specific operations

Rate Limits & Credits

PostCrawl uses a credit-based system:

  • Search: ~1 credit per 10 results
  • Extract: ~1 credit per URL (without comments)
  • Extract with comments: ~3 credits per URL
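
The approximate rates above can be turned into a rough budget check before running a large job. This is only a sketch: the function names are hypothetical, the rates are the approximate figures listed above, and rounding search results up to the next block of 10 is an assumption.

```python
import math


def estimate_search_credits(results):
    """Roughly 1 credit per 10 results, rounded up (the rounding is an assumption)."""
    return math.ceil(results / 10)


def estimate_extract_credits(url_count, include_comments=False):
    """Roughly 1 credit per URL, or 3 per URL when comments are included."""
    return url_count * (3 if include_comments else 1)
```

For example, a 25-result search followed by extracting 4 URLs with comments would come to about 3 + 12 = 15 credits under these assumptions.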

The API returns rate limit details in response headers, and the client exposes them through rate_limit_info:

# Run inside an async function, or use search_sync() instead
pc = PostCrawlClient(api_key="sk_...")
results = await pc.search(...)

print(f"Rate limit: {pc.rate_limit_info['limit']}")
print(f"Remaining: {pc.rate_limit_info['remaining']}")
print(f"Reset at: {pc.rate_limit_info['reset']}")

Support

License

MIT License - see LICENSE file for details.
