Official Python SDK for the Olyptik API

Olyptik Python SDK

The Olyptik Python SDK provides a simple and intuitive interface for web crawling and content extraction. It supports both synchronous and asynchronous programming patterns with full type hints.

Installation

Install the SDK using pip:

pip install olyptik

Configuration

First, initialize the SDK with your API key, which you can get from the settings page. You can pass the key directly or supply it through environment variables.

from olyptik import Olyptik

# Initialize with API key
client = Olyptik(api_key="your_api_key_here")
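
The text mentions environment variables without showing how they would be used. A minimal sketch, assuming you store the key in a variable named `OLYPTIK_API_KEY` (a hypothetical name chosen for this example; the source does not say whether the SDK reads any variable on its own), loads the key explicitly and fails fast when it is missing:

```python
import os


def load_api_key(env_var: str = "OLYPTIK_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is unset or blank.

    The variable name is an assumption for illustration, not an SDK convention.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your Olyptik API key")
    return key
```

You could then construct the client with `Olyptik(api_key=load_api_key())`, which keeps the key out of source control.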

Synchronous Usage

Start a crawl

crawl = client.run_crawl({
    "startUrl": "https://example.com",
    "maxResults": 50
})

print(f"Crawl started with ID: {crawl.id}")
print(f"Status: {crawl.status}")

The same call with every option set explicitly:

# Start a crawl
crawl = client.run_crawl({
    "startUrl": "https://example.com",
    "maxResults": 50,
    "maxDepth": 2,
    "engineType": "auto",
    "includeLinks": True,
    "timeout": 60,
    "useSitemap": False,
    "useStaticIps": False
})

print(f"Crawl started with ID: {crawl.id}")
print(f"Status: {crawl.status}")

Get crawl results

results = client.get_crawl_results(crawl.id)
for result in results.results:
    print(f"URL: {result.url}")
    print(f"Title: {result.title}")
    print(f"Depth: {result.depthOfUrl}")
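
Beyond printing, a common next step is to keep the extracted markdown. The sketch below writes each result to its own file. It relies only on the `url` and `markdown` attributes shown in this document; `slugify` and `save_markdown` are hypothetical helpers, not part of the SDK, and two URLs that reduce to the same slug would overwrite each other.

```python
import re
from pathlib import Path
from typing import Iterable, List


def slugify(url: str) -> str:
    """Turn a URL into a filesystem-safe name (hypothetical helper)."""
    slug = re.sub(r"^https?://", "", url)
    slug = re.sub(r"[^A-Za-z0-9]+", "-", slug).strip("-")
    return slug or "index"


def save_markdown(results: Iterable, out_dir: str) -> List[Path]:
    """Write each result's `markdown` to `<out_dir>/<slug>.md` and return the paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for result in results:
        path = target / f"{slugify(result.url)}.md"
        path.write_text(result.markdown, encoding="utf-8")
        written.append(path)
    return written
```

With the objects from the example above, this would be called as `save_markdown(results.results, "crawl_output")`.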

Abort a crawl

aborted_crawl = client.abort_crawl(crawl.id)
print(f"Crawl aborted with ID: {aborted_crawl.id}")

Asynchronous Usage

For better performance with I/O operations, use the async client:

Start a crawl

import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })

        print(f"Crawl started with ID: {crawl.id}")
        print(f"Status: {crawl.status}")

asyncio.run(main())

The same call with every option set explicitly:

import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # Start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50,
            "maxDepth": 2,
            "engineType": "auto",
            "includeLinks": True,
            "timeout": 60,
            "useSitemap": False,
            "useStaticIps": False
        })

        print(f"Crawl started with ID: {crawl.id}")
        print(f"Status: {crawl.status}")

asyncio.run(main())

Get crawl results

import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # First start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })
        
        # Get crawl results
        results = await client.get_crawl_results(crawl.id)
        for result in results.results:
            print(f"URL: {result.url}")
            print(f"Title: {result.title}")
            print(f"Depth: {result.depthOfUrl}")

asyncio.run(main())

Abort a crawl

import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # First start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })
        
        # Abort the crawl
        aborted_crawl = await client.abort_crawl(crawl.id)
        print(f"Crawl aborted with ID: {aborted_crawl.id}")

asyncio.run(main())
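
Because the async client's methods are coroutines, several crawls can in principle be started at once rather than one after another. The sketch below uses `asyncio.gather` for that. It assumes a single `AsyncOlyptik` client can serve concurrent calls, which this document does not confirm, and `start_crawls` is a hypothetical helper, not an SDK function.

```python
import asyncio
from typing import Any, Iterable, List


async def start_crawls(client: Any, urls: Iterable[str], max_results: int = 50) -> List[Any]:
    """Start one crawl per URL concurrently and return the crawls in input order."""
    pending = [
        client.run_crawl({"startUrl": url, "maxResults": max_results})
        for url in urls
    ]
    # gather() awaits all coroutines concurrently and preserves input order.
    return await asyncio.gather(*pending)
```

Inside an `async with AsyncOlyptik(...) as client:` block this would be awaited as `crawls = await start_crawls(client, ["https://example.com", "https://example.org"])`. Keep the list short if your account is rate limited.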

Configuration Options

StartCrawlPayload

run_crawl accepts a payload with the following properties:

Property      Type     Required  Default  Description
startUrl      string   Yes       -        The URL to start crawling from
maxResults    number   Yes       -        Maximum number of results to collect (1-10,000)
maxDepth      number   No        10       Maximum depth of pages to crawl (1-100)
includeLinks  boolean  No        true     Whether to include links in the crawl results' markdown
useSitemap    boolean  No        false    Whether to use sitemap.xml to crawl the website
timeout       number   No        60       Timeout duration in minutes
engineType    string   No        "auto"   The engine to use: "auto", "cheerio" (fast, static sites), "playwright" (dynamic sites)
useStaticIps  boolean  No        false    Whether to use static IPs for the crawl
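
The ranges listed above can be checked locally before a request is sent. This is a sketch of such a check, not the SDK's or the server's own validation: `check_payload` is a hypothetical helper, the http(s) prefix test is an added assumption, `engineType` is checked in its string form only, and the helper does not enforce which properties are required.

```python
from typing import Any, Dict, List

ALLOWED_ENGINES = {"auto", "cheerio", "playwright"}


def _in_range(value: Any, low: int, high: int) -> bool:
    """True if value is an integer within [low, high]."""
    return isinstance(value, int) and low <= value <= high


def check_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a run_crawl payload (empty if none)."""
    problems = []
    start_url = payload.get("startUrl")
    # Assumption: start URLs are absolute http(s) URLs.
    if not isinstance(start_url, str) or not start_url.startswith(("http://", "https://")):
        problems.append("startUrl must be an http(s) URL")
    if "maxResults" in payload and not _in_range(payload["maxResults"], 1, 10_000):
        problems.append("maxResults must be between 1 and 10,000")
    if "maxDepth" in payload and not _in_range(payload["maxDepth"], 1, 100):
        problems.append("maxDepth must be between 1 and 100")
    if payload.get("engineType", "auto") not in ALLOWED_ENGINES:
        problems.append("engineType must be one of: " + ", ".join(sorted(ALLOWED_ENGINES)))
    return problems
```

Calling `check_payload(payload)` before `run_crawl(payload)` gives readable messages for out-of-range values instead of waiting for an API error.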

Engine Types

Choose the appropriate engine for your crawling needs:

from olyptik import EngineType

# Available engine types
EngineType.AUTO        # Automatically choose the best engine
EngineType.PLAYWRIGHT  # Use Playwright for JavaScript-heavy sites
EngineType.CHEERIO     # Use Cheerio for faster, static content crawling

Crawl Status

Monitor your crawl status using the CrawlStatus enum:

from olyptik import CrawlStatus

# Possible status values
CrawlStatus.RUNNING    # Crawl is currently running
CrawlStatus.SUCCEEDED  # Crawl completed successfully
CrawlStatus.FAILED     # Crawl failed due to an error
CrawlStatus.TIMED_OUT  # Crawl exceeded timeout limit
CrawlStatus.ABORTED    # Crawl was manually aborted
CrawlStatus.ERROR      # Crawl encountered an error
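
Going by the comments above, RUNNING is the only state in which a crawl is still in progress. A small predicate makes that explicit. The sketch compares member names, which every Python `Enum` exposes through `.name`; `StandInStatus` and its values are placeholders for illustration, since the real enum's values are not documented here.

```python
from enum import Enum


class StandInStatus(Enum):
    """Placeholder mirroring the member names above; values are invented."""
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    TIMED_OUT = "timed_out"
    ABORTED = "aborted"
    ERROR = "error"


def is_finished(status: Enum) -> bool:
    """True once a crawl has left the RUNNING state."""
    return status.name != "RUNNING"


def succeeded(status: Enum) -> bool:
    """True only for the SUCCEEDED state."""
    return status.name == "SUCCEEDED"
```

With the SDK's own enum you would pass `crawl.status` directly, for example `if is_finished(crawl.status) and not succeeded(crawl.status): ...`.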

Error Handling

The SDK raises ApiError for failed API requests, exposing message and status_code, and OlyptikError for other SDK errors:

from olyptik import Olyptik, OlyptikError, ApiError

client = Olyptik(api_key="your_api_key_here")

try:
    crawl = client.run_crawl({
        "startUrl": "https://example.com",
        "maxResults": 10
    })
except ApiError as e:
    print(f"API Error: {e.message}")
    print(f"Status Code: {e.status_code}")
except OlyptikError as e:
    print(f"SDK Error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Data Models

CrawlResult

Each crawl result contains:

from dataclasses import dataclass

@dataclass
class CrawlResult:
    crawlId: str          # Unique identifier for the crawl
    brandId: str          # Brand identifier
    url: str              # The crawled URL
    title: str            # Page title
    markdown: str         # Extracted content in markdown format
    depthOfUrl: int       # How deep this URL was in the crawl
    createdAt: str        # When the result was created
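
If `CrawlResult` is a dataclass as shown, the standard library's `dataclasses.asdict` can turn results into plain dictionaries for storage. The sketch below serializes results as JSON Lines. The class is a local mirror of the fields listed above so the example runs without the SDK, and `to_jsonl` is a hypothetical helper.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass
class CrawlResult:  # local mirror of the fields listed above, for illustration
    crawlId: str
    brandId: str
    url: str
    title: str
    markdown: str
    depthOfUrl: int
    createdAt: str


def to_jsonl(results: Iterable[CrawlResult]) -> str:
    """Serialize results as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in results)
```

Newlines inside the markdown are escaped by `json.dumps`, so each result stays on a single line.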

Crawl

Crawl metadata includes:

from dataclasses import dataclass
from typing import List, Optional

from olyptik import CrawlStatus

@dataclass
class Crawl:
    id: str                    # Unique crawl identifier
    status: CrawlStatus        # Current status
    startUrls: List[str]       # Starting URLs
    includeLinks: bool         # Whether links are included
    maxDepth: int              # Maximum crawl depth
    maxResults: int            # Maximum number of results
    brandId: str               # Brand identifier
    createdAt: str             # Creation timestamp
    completedAt: Optional[str] # Completion timestamp
    durationInSeconds: int     # Total duration
    numberOfResults: int       # Number of results found
    useSitemap: bool           # Whether sitemap was used
    timeout: int               # Timeout setting

Best Practices

1. Use Async for Better Performance

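import asyncio
from olyptik import AsyncOlyptik

payload = {"startUrl": "https://example.com", "maxResults": 50}

async def main():
    # ✅ Good: Use the async client for I/O-intensive operations
    async with AsyncOlyptik(api_key="your_api_key") as client:
        crawl = await client.run_crawl(payload)
        results = await client.get_crawl_results(crawl.id)

    # ❌ Avoid: The synchronous client blocks the event loop inside a coroutine
    # client = Olyptik(api_key="your_api_key")

asyncio.run(main())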

2. Choose the Right Engine

from olyptik import Olyptik, EngineType

client = Olyptik(api_key="your_api_key")

# ✅ Good: Choose engine based on site type
# For JavaScript-heavy sites
crawl = client.run_crawl({
    "startUrl": "https://spa-app.com",
    "maxResults": 50,
    "engineType": EngineType.PLAYWRIGHT
})

# For static content sites
crawl = client.run_crawl({
    "startUrl": "https://blog.example.com",
    "maxResults": 50,
    "engineType": EngineType.CHEERIO
})

Troubleshooting

Common Issues

Import Error: Make sure you have installed the package correctly:

pip install --upgrade olyptik

Authentication Error: Verify your API key is correct and has sufficient permissions.

Timeout Issues: Increase the timeout value for large crawls:

crawl = client.run_crawl({
    "startUrl": "https://example.com",
    "maxResults": 50,
    "timeout": 300  # In minutes, per the options table (default: 60)
})

Rate Limiting: The SDK automatically handles retries, but you can implement additional backoff:

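import time
from olyptik import Olyptik, ApiError

client = Olyptik(api_key="your_api_key_here")
payload = {"startUrl": "https://example.com", "maxResults": 50}

try:
    crawl = client.run_crawl(payload)
except ApiError as e:
    if e.status_code == 429:
        time.sleep(60)  # Wait 1 minute, then retry once
        crawl = client.run_crawl(payload)
    else:
        raise  # Not a rate limit: let other API errors propagate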
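
A fixed one-minute wait with a single retry can be generalized to exponential backoff. The sketch below is one way to do that and is not part of the SDK: `call_with_backoff` is a hypothetical helper that retries any callable when the raised exception carries `status_code == 429`, the attribute used in the example above. The `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying with exponential backoff on status_code 429 errors."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            rate_limited = getattr(exc, "status_code", None) == 429
            # Re-raise anything that is not a rate limit, and the final failure.
            if not rate_limited or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # base_delay, then 2x, 4x, ...
    raise ValueError("attempts must be at least 1")
```

Usage would look like `crawl = call_with_backoff(lambda: client.run_crawl(payload), base_delay=15)`. Since the SDK is said to retry on its own, keep `attempts` small to avoid stacking long waits.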
