
Official Python SDK for the Olyptik API

Olyptik Python SDK

The Olyptik Python SDK provides a simple and intuitive interface for web crawling and content extraction. It supports both synchronous and asynchronous programming patterns with full type hints.

Installation

Install the SDK using pip:

pip install olyptik

Configuration

Initialize the SDK with your API key, which you can find on the settings page. Pass the key directly to the client or load it from an environment variable.

from olyptik import Olyptik

# Initialize with API key
client = Olyptik(api_key="your_api_key_here")
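
The text above mentions environment variables without naming one. The following is a minimal sketch, assuming a variable called OLYPTIK_API_KEY. That name is hypothetical and chosen for illustration; this document does not say whether the SDK reads any variable on its own, so the sketch reads the key explicitly and leaves passing it to the client to you.

```python
import os


def load_api_key(env=None, var_name="OLYPTIK_API_KEY"):
    """Return the API key stored in an environment variable.

    OLYPTIK_API_KEY is a hypothetical name used for illustration; this
    document does not say which variable, if any, the SDK reads itself.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} before creating the client")
    return key


# Usage with the real client (requires the olyptik package):
#   from olyptik import Olyptik
#   client = Olyptik(api_key=load_api_key())
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the first API call.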

Synchronous Usage

Start a crawl

Minimal settings crawl:

crawl = client.run_crawl({
    "startUrl": "https://example.com",
    "maxResults": 50
})

print(f"Crawl started with ID: {crawl.id}")
print(f"Status: {crawl.status}")

Full example:

# Start a crawl
crawl = client.run_crawl({
    "startUrl": "https://example.com",
    "maxResults": 50,
    "maxDepth": 2,
    "engineType": "auto",
    "includeLinks": True,
    "timeout": 60,
    "useSitemap": False,
    "entireWebsite": False,
    "excludeNonMainTags": True,
    "useStaticIps": False
})

print(f"Crawl started with ID: {crawl.id}")
print(f"Status: {crawl.status}")

Query crawls

from olyptik import CrawlStatus

result = client.query_crawls({
    "startUrls": ["https://example.com"],
    "status": [CrawlStatus.SUCCEEDED],
    "page": 0,
})

print("Crawls: ", result.results)
print("Page: ", result.page)
print("Total pages: ", result.totalPages)
print("Count of items per page: ", result.limit)
print("Total matched crawls: ", result.totalResults)
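
Because query results are paginated, collecting every matching crawl means walking the pages. The helper below is a sketch, not part of the SDK: iter_all_crawls is a hypothetical name, and it assumes only what the example above shows, namely that query_crawls accepts a "page" key starting at 0 and returns an object with results and totalPages.

```python
def iter_all_crawls(client, query):
    """Yield every crawl matching `query`, walking pages from 0 upward.

    Assumes `client.query_crawls` accepts a "page" key and returns an
    object with `results` and `totalPages`, as in the example above.
    """
    page = 0
    while True:
        response = client.query_crawls({**query, "page": page})
        yield from response.results
        page += 1
        # Stop once the page counter passes the last page reported by the API.
        if page >= response.totalPages:
            break


# Usage with the real client (requires the olyptik package):
#   for crawl in iter_all_crawls(client, {"startUrls": ["https://example.com"]}):
#       print(crawl.id, crawl.status)
```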

Get crawl results

Retrieve the results of a crawl by its ID. Results are paginated: pass the page number (the example starts at page 0) and the number of results per page.

limit = 50
page = 0
results = client.get_crawl_results(crawl.id, page, limit)
for result in results.results:
    print(f"URL: {result.url}")
    print(f"Title: {result.title}")
    print(f"Depth: {result.depthOfUrl}")
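
A crawl can produce more results than fit on one page. The sketch below gathers all of them by requesting consecutive pages. fetch_all_results is a hypothetical helper, and it rests on one assumption that this document does not confirm: a page holding fewer than limit items is the last one.

```python
def fetch_all_results(client, crawl_id, limit=50):
    """Collect every result of a crawl, page by page.

    Assumption: a page holding fewer than `limit` items is the last one.
    """
    collected = []
    page = 0
    while True:
        batch = client.get_crawl_results(crawl_id, page, limit).results
        collected.extend(batch)
        if len(batch) < limit:
            return collected
        page += 1


# Usage with the real client (requires the olyptik package):
#   all_results = fetch_all_results(client, crawl.id)
```

When the total is an exact multiple of limit, the loop makes one extra request that returns an empty page and then stops.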

Abort a crawl

aborted_crawl = client.abort_crawl(crawl.id)
print(f"Crawl aborted with ID: {aborted_crawl.id}")

Asynchronous Usage

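For I/O-bound workloads, use the async client, AsyncOlyptik. The examples below call the same methods as the synchronous client, each awaited inside an async with block.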

Start a crawl

Minimal settings crawl:

import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })

        print(f"Crawl started with ID: {crawl.id}")
        print(f"Status: {crawl.status}")

asyncio.run(main())

Full example:

import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # Start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50,
            "maxDepth": 2,
            "engineType": "auto",
            "includeLinks": True,
            "timeout": 60,
            "useSitemap": False,
            "entireWebsite": False,
            "excludeNonMainTags": True,
            "useStaticIps": False
        })

        print(f"Crawl started with ID: {crawl.id}")
        print(f"Status: {crawl.status}")

asyncio.run(main())

Query crawls

import asyncio
from olyptik import AsyncOlyptik, CrawlStatus

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        result = await client.query_crawls({
            "startUrls": ["https://example.com"],
            "status": [CrawlStatus.SUCCEEDED],
            "page": 0,
        })
        
        print("Crawls: ", result.results)
        print("Page: ", result.page)
        print("Total pages: ", result.totalPages)
        print("Count of items per page: ", result.limit)
        print("Total matched crawls: ", result.totalResults)

asyncio.run(main())

Get crawl results

import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # First start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })
        
        # Get crawl results
        limit = 50
        page = 0
        results = await client.get_crawl_results(crawl.id, page, limit)
        for result in results.results:
            print(f"URL: {result.url}")
            print(f"Title: {result.title}")
            print(f"Depth: {result.depthOfUrl}")

asyncio.run(main())

Abort a crawl

import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # First start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })
        
        # Abort the crawl
        aborted_crawl = await client.abort_crawl(crawl.id)
        print(f"Crawl aborted with ID: {aborted_crawl.id}")

asyncio.run(main())
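
The async client pays off when several requests are in flight at once. The sketch below starts one crawl per URL with asyncio.gather. start_crawls is a hypothetical helper, and it assumes a single AsyncOlyptik instance can serve concurrent calls, which this document does not state.

```python
import asyncio


async def start_crawls(client, start_urls, max_results=50):
    """Start one crawl per URL concurrently and return them in input order."""
    tasks = [
        client.run_crawl({"startUrl": url, "maxResults": max_results})
        for url in start_urls
    ]
    # gather preserves the order of its arguments in the returned list.
    return await asyncio.gather(*tasks)


# Usage with the real client (requires the olyptik package):
#   async def main():
#       async with AsyncOlyptik(api_key="your_api_key_here") as client:
#           crawls = await start_crawls(client, ["https://example.com"])
#   asyncio.run(main())
```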

Configuration Options

StartCrawlPayload

The following options configure a crawl. You must provide at least one of maxResults, useSitemap, or entireWebsite.

Property | Type | Required | Default | Description
startUrl | string | | - | The URL to start crawling from
maxResults | number | | - | Maximum number of results to collect (1-5,000)
useSitemap | boolean | | false | Whether to use sitemap.xml to crawl the website
entireWebsite | boolean | | false | Whether to use sitemap.xml and all found links to crawl the website
maxDepth | number | | 10 | Maximum depth of pages to crawl (1-100)
includeLinks | boolean | | true | Whether to include links in the crawl results' markdown
excludeNonMainTags | boolean | | true | Whether to exclude non-main HTML tags (header, footer, aside, etc.) from the crawl results
timeout | number | | 60 | Timeout duration in minutes
engineType | string | | "auto" | The engine to use: "auto", "cheerio" (fast, static sites), "playwright" (dynamic sites)
useStaticIps | boolean | | false | Whether to use static IPs for the crawl
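
The constraints in this section can be checked locally before a request is sent. The sketch below uses a hypothetical helper, build_crawl_payload, that enforces the "at least one of" rule and the documented ranges for maxResults and maxDepth. Whether the API applies exactly the same checks is an assumption.

```python
def build_crawl_payload(start_url, max_results=None, use_sitemap=False,
                        entire_website=False, max_depth=10):
    """Build a StartCrawlPayload dict, checking the documented constraints."""
    if max_results is None and not use_sitemap and not entire_website:
        raise ValueError("Provide maxResults, useSitemap, or entireWebsite")
    if max_results is not None and not 1 <= max_results <= 5000:
        raise ValueError("maxResults must be between 1 and 5,000")
    if not 1 <= max_depth <= 100:
        raise ValueError("maxDepth must be between 1 and 100")
    payload = {
        "startUrl": start_url,
        "useSitemap": use_sitemap,
        "entireWebsite": entire_website,
        "maxDepth": max_depth,
    }
    # Only send maxResults when the caller set it.
    if max_results is not None:
        payload["maxResults"] = max_results
    return payload


# Usage with the real client (requires the olyptik package):
#   crawl = client.run_crawl(build_crawl_payload("https://example.com", max_results=50))
```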

Engine Types

Choose the appropriate engine for your crawling needs:

from olyptik import EngineType

# Available engine types
EngineType.AUTO        # Automatically choose the best engine
EngineType.PLAYWRIGHT  # Use Playwright for JavaScript-heavy sites
EngineType.CHEERIO     # Use Cheerio for faster, static content crawling

Crawl Status

Monitor your crawl status using the CrawlStatus enum:

from olyptik import CrawlStatus

# Possible status values
CrawlStatus.RUNNING    # Crawl is currently running
CrawlStatus.SUCCEEDED  # Crawl completed successfully
CrawlStatus.FAILED     # Crawl failed due to an error
CrawlStatus.TIMED_OUT  # Crawl exceeded timeout limit
CrawlStatus.ABORTED    # Crawl was manually aborted
CrawlStatus.ERROR      # Crawl encountered an error
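
When monitoring a crawl, it helps to know whether its status can still change. The sketch below treats every status except RUNNING as final, which is an assumption drawn from the descriptions above rather than a documented guarantee. is_finished is a hypothetical helper; it compares enum member names, so it does not depend on the enum's underlying values.

```python
FINISHED_STATUS_NAMES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED", "ERROR"}


def is_finished(status):
    """Return True when a crawl status means no further work will happen.

    Accepts an enum member (its `name` is used) or a plain string.
    """
    name = getattr(status, "name", status)
    return str(name).upper() in FINISHED_STATUS_NAMES


# Usage with the real SDK (requires the olyptik package):
#   if is_finished(crawl.status):
#       print("Crawl is done:", crawl.status)
```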

Error Handling

The SDK raises exceptions when a request fails. Wrap your calls in try/except blocks and catch ApiError to handle error responses from the API:

from olyptik import Olyptik, ApiError

client = Olyptik(api_key="your_api_key_here")

try:
    crawl = client.run_crawl({
        "startUrl": "https://example.com",
        "maxResults": 10
    })
except ApiError as e:
    # API returned an error response
    print(f"API Error: {e.message}")
    print(f"Status Code: {e.status_code}")
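
Some failures are temporary, so a retry with a growing delay can help. The sketch below is a generic helper, not an SDK feature: call_with_retries and its parameters are hypothetical names. It retries on whichever exception types you pass and doubles the wait after each failed attempt.

```python
import time


def call_with_retries(fn, retry_on=(Exception,), attempts=3, base_delay=1.0,
                      sleep=time.sleep):
    """Call fn(); on a listed exception wait and retry, doubling the delay."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Usage with the real client (requires the olyptik package):
#   from olyptik import ApiError
#   crawl = call_with_retries(
#       lambda: client.run_crawl({"startUrl": "https://example.com", "maxResults": 10}),
#       retry_on=(ApiError,),
#   )
```

Retrying every ApiError is a simplification. An invalid payload fails the same way each time, so you may prefer to inspect e.status_code and retry only the errors you consider temporary.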

Data Models

CrawlResult

Each crawl result contains:

from dataclasses import dataclass

@dataclass
class CrawlResult:
    crawlId: str     # Unique identifier for the crawl
    teamId: str      # Team identifier
    url: str         # The crawled URL
    title: str       # Page title
    markdown: str    # Extracted content in markdown format
    depthOfUrl: int  # How deep this URL was in the crawl
    createdAt: str   # When the result was created
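
Since each result carries its content as markdown, a common next step is writing it to disk. The sketch below uses only the url, title, and markdown fields listed above. save_markdown and its file naming scheme are hypothetical choices made for illustration.

```python
import re
from pathlib import Path


def save_markdown(results, out_dir):
    """Write each result's markdown to out_dir, one file per crawled URL."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, result in enumerate(results):
        # Turn the URL into a filesystem-safe name; the index keeps names unique.
        slug = re.sub(r"[^A-Za-z0-9]+", "-", result.url).strip("-")[:80]
        path = out / f"{index:04d}-{slug}.md"
        path.write_text(f"# {result.title}\n\n{result.markdown}\n", encoding="utf-8")
        written.append(path)
    return written


# Usage with the real client (requires the olyptik package):
#   results = client.get_crawl_results(crawl.id, 0, 50)
#   save_markdown(results.results, "crawl_output")
```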

Crawl

Crawl metadata includes:

from dataclasses import dataclass
from typing import List, Optional

from olyptik import CrawlStatus, EngineType

@dataclass
class Crawl:
    id: str                     # Unique crawl identifier
    status: CrawlStatus         # Current status
    startUrls: List[str]        # Starting URLs
    includeLinks: bool          # Whether links are included
    maxDepth: int               # Maximum crawl depth
    maxResults: int             # Maximum number of results
    teamId: str                 # Team identifier
    createdAt: str              # Creation timestamp
    completedAt: Optional[str]  # Completion timestamp
    durationInSeconds: int      # Total duration
    numberOfResults: int        # Number of results found
    useSitemap: bool            # Whether sitemap was used
    entireWebsite: bool         # Whether to use both sitemap and all found links
    excludeNonMainTags: bool    # Whether non-main HTML tags were excluded
    timeout: int                # Timeout setting
    useStaticIps: bool          # Whether static IPs were used
    engineType: EngineType      # Engine type used
