Official Python SDK for Olyptik API
Olyptik Python SDK
The Olyptik Python SDK provides a simple and intuitive interface for web crawling and content extraction. It supports both synchronous and asynchronous programming patterns with full type hints.
Installation
Install the SDK using pip:
pip install olyptik
Configuration
First, initialize the SDK with your API key, which you can get from the settings page. You can either pass the key directly or read it from an environment variable.
from olyptik import Olyptik
# Initialize with API key
client = Olyptik(api_key="your_api_key_here")
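The environment-variable route mentioned above could look like the following minimal sketch. The helper `load_api_key` and the variable name `OLYPTIK_API_KEY` are assumptions made for illustration; nothing here implies that the SDK reads this variable on its own.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None,
                 var_name: str = "OLYPTIK_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    Both this helper and the default variable name are illustrative
    assumptions; they are not part of the SDK.
    """
    source = os.environ if env is None else env
    key = source.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage (requires the olyptik package and a real key):
# from olyptik import Olyptik
# client = Olyptik(api_key=load_api_key())
```

Failing early with a clear message avoids sending requests with an empty key.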
Synchronous Usage
Start a crawl
Minimal settings crawl:
crawl = client.run_crawl({
    "startUrl": "https://example.com",
    "maxResults": 50
})
print(f"Crawl started with ID: {crawl.id}")
print(f"Status: {crawl.status}")
Full example:
# Start a crawl
crawl = client.run_crawl({
    "startUrl": "https://example.com",
    "maxResults": 50,
    "maxDepth": 2,
    "engineType": "auto",
    "includeLinks": True,
    "timeout": 60,
    "useSitemap": False,
    "entireWebsite": False,
    "excludeNonMainTags": True,
    "deduplicateContent": True,
    "extraction": "",
    "useStaticIps": False
})
print(f"Crawl started with ID: {crawl.id}")
print(f"Status: {crawl.status}")
Query crawls
from olyptik import CrawlStatus
result = client.query_crawls({
    "startUrls": ["https://example.com"],
    "status": [CrawlStatus.SUCCEEDED],
    "page": 0,
})
print("Crawls: ", result.results)
print("Page: ", result.page)
print("Total pages: ", result.totalPages)
print("Count of items per page: ", result.limit)
print("Total matched crawls: ", result.totalResults)
Getting Crawl Results
Retrieve the results of your crawl using the crawl ID. The results are paginated, and you can specify the page number and limit per page.
limit = 50
page = 0
results = client.get_crawl_results(crawl.id, page, limit)
for result in results.results:
    print(f"URL: {result.url}")
    print(f"Title: {result.title}")
    print(f"Depth: {result.depthOfUrl}")
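Because results are paginated, collecting a whole crawl means requesting page after page. The sketch below shows one way to do that. It assumes zero-based page numbers (as in the example above) and treats a page with fewer than `limit` items as the last one; `iter_all_results` is a hypothetical helper, not an SDK function.

```python
from typing import Any, Callable, Iterator


def iter_all_results(fetch_page: Callable[[int, int], Any],
                     limit: int = 50,
                     first_page: int = 0) -> Iterator[Any]:
    """Yield every result by requesting pages until one comes back short.

    fetch_page(page, limit) must return an object with a `.results` list,
    like the response printed in the example above.
    """
    page = first_page
    while True:
        batch = list(fetch_page(page, limit).results)
        yield from batch
        if len(batch) < limit:  # a short (or empty) page is assumed to be the last one
            return
        page += 1


# Usage with the client from the earlier examples:
# for result in iter_all_results(
#         lambda page, limit: client.get_crawl_results(crawl.id, page, limit)):
#     print(result.url)
```

Passing the fetch function in as a callable keeps the helper independent of the client, so it can be tried out with a stub.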
Abort a crawl
aborted_crawl = client.abort_crawl(crawl.id)
print(f"Crawl aborted with ID: {aborted_crawl.id}")
Get crawl logs
Retrieve logs for a specific crawl to monitor its progress and debug issues:
page = 1
limit = 1200
logs = client.get_crawl_logs(crawl.id, page, limit)
for log in logs.results:
    print(f"[{log.level}] {log.message}: {log.description}")
Scrape multiple URLs
Scrape up to 30 URLs at once without following links:
scrape_response = client.scrape({
    "urls": ["https://example.com", "https://example.com/about"],
    "includeLinks": True,
    "excludeNonMainTags": True,
    "deduplicateContent": True,
    "extraction": "",
    "timeout": 5,
    "engineType": "auto",
    "useStaticIps": False
})
for result in scrape_response.results:
    if result.isSuccess:
        print(f"URL: {result.url}")
        print(f"Title: {result.title}")
        print(f"Links found: {len(result.links)}")
    else:
        print(f"Failed to scrape {result.url}: {result.errorMessage}")
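Since a single scrape call accepts at most 30 URLs, a longer list has to be split into batches first. A minimal sketch of that splitting step follows; `chunk_urls` and `MAX_URLS_PER_SCRAPE` are hypothetical names introduced here, not part of the SDK.

```python
from typing import Iterator, List, Sequence

MAX_URLS_PER_SCRAPE = 30  # limit stated in the text above


def chunk_urls(urls: Sequence[str],
               size: int = MAX_URLS_PER_SCRAPE) -> Iterator[List[str]]:
    """Split a URL list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


# Usage with the client from the earlier examples:
# for batch in chunk_urls(all_urls):
#     response = client.scrape({"urls": batch})
```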
Asynchronous Usage
For better performance with I/O operations, use the async client:
Start a crawl
Minimal settings crawl:
import asyncio
from olyptik import AsyncOlyptik


async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })
        print(f"Crawl started with ID: {crawl.id}")
        print(f"Status: {crawl.status}")


asyncio.run(main())
Full example:
import asyncio
from olyptik import AsyncOlyptik


async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # Start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50,
            "maxDepth": 2,
            "engineType": "auto",
            "includeLinks": True,
            "timeout": 60,
            "useSitemap": False,
            "entireWebsite": False,
            "deduplicateContent": True,
            "excludeNonMainTags": True,
            "extraction": "",
            "useStaticIps": False
        })
        print(f"Crawl started with ID: {crawl.id}")
        print(f"Status: {crawl.status}")


asyncio.run(main())
Query crawls
import asyncio
from olyptik import AsyncOlyptik, CrawlStatus


async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        result = await client.query_crawls({
            "startUrls": ["https://example.com"],
            "status": [CrawlStatus.SUCCEEDED],
            "page": 0,
        })
        print("Crawls: ", result.results)
        print("Page: ", result.page)
        print("Total pages: ", result.totalPages)
        print("Count of items per page: ", result.limit)
        print("Total matched crawls: ", result.totalResults)


asyncio.run(main())
Get crawl results
import asyncio
from olyptik import AsyncOlyptik


async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # First start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })
        # Get crawl results
        limit = 50
        page = 0
        results = await client.get_crawl_results(crawl.id, page, limit)
        for result in results.results:
            print(f"URL: {result.url}")
            print(f"Title: {result.title}")
            print(f"Depth: {result.depthOfUrl}")


asyncio.run(main())
Abort a crawl
import asyncio
from olyptik import AsyncOlyptik


async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # First start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })
        # Abort the crawl
        aborted_crawl = await client.abort_crawl(crawl.id)
        print(f"Crawl aborted with ID: {aborted_crawl.id}")


asyncio.run(main())
Get crawl logs
import asyncio
from olyptik import AsyncOlyptik


async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # First start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })
        # Get crawl logs
        page = 1
        limit = 1200
        logs = await client.get_crawl_logs(crawl.id, page, limit)
        for log in logs.results:
            print(f"[{log.level}] {log.message}: {log.description}")


asyncio.run(main())
Scrape multiple URLs
import asyncio
from olyptik import AsyncOlyptik


async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        scrape_response = await client.scrape({
            "urls": ["https://example.com", "https://example.com/about"],
            "includeLinks": True,
            "excludeNonMainTags": True,
            "deduplicateContent": True,
            "extraction": "",
            "timeout": 5,
            "engineType": "auto",
            "useStaticIps": False
        })
        for result in scrape_response.results:
            if result.isSuccess:
                print(f"URL: {result.url}")
                print(f"Title: {result.title}")
                print(f"Links found: {len(result.links)}")
            else:
                print(f"Failed to scrape {result.url}: {result.errorMessage}")


asyncio.run(main())
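The async examples above await one request at a time. The performance benefit of the async client shows up when several requests are in flight together, which `asyncio.gather` makes possible. The sketch below starts one crawl per URL concurrently. `start_crawls` is a hypothetical helper, and whether the API limits the number of simultaneous crawls is not stated here, so treat the concurrency level as an assumption to verify.

```python
import asyncio
from typing import Any, Iterable, List


async def start_crawls(client: Any, start_urls: Iterable[str],
                       max_results: int = 50) -> List[Any]:
    """Start one crawl per URL concurrently and return them in input order."""
    tasks = [
        client.run_crawl({"startUrl": url, "maxResults": max_results})
        for url in start_urls
    ]
    # gather() awaits all requests at once instead of one after another
    return await asyncio.gather(*tasks)


# Usage (requires the olyptik package and a real key):
# async def main():
#     async with AsyncOlyptik(api_key="your_api_key_here") as client:
#         crawls = await start_crawls(client, ["https://example.com",
#                                              "https://example.org"])
#         print([crawl.id for crawl in crawls])
# asyncio.run(main())
```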
Configuration Options
StartCrawlPayload
The crawl configuration options available:
You must provide at least one of the following: maxResults, useSitemap, or entireWebsite.
| Property | Type | Required | Default | Description |
|---|---|---|---|---|
| startUrl | string | ✅ | - | The URL to start crawling from |
| maxResults | number | ❌ | - | Maximum number of results to collect (1-5,000) |
| useSitemap | boolean | ❌ | false | Whether to use sitemap.xml to crawl the website |
| entireWebsite | boolean | ❌ | false | Whether to use sitemap.xml and all found links to crawl the website |
| maxDepth | number | ❌ | 10 | Maximum depth of pages to crawl (1-100) |
| includeLinks | boolean | ❌ | true | Whether to include links in the crawl results' markdown |
| excludeNonMainTags | boolean | ❌ | true | Whether to exclude non-main HTML tags (header, footer, aside, etc.) from the crawl results |
| deduplicateContent | boolean | ❌ | true | Remove duplicate content from markdown that appears on multiple pages |
| extraction | string | ❌ | "" | Instructions defining how the AI should extract specific content from the crawl results |
| timeout | number | ❌ | 60 | Timeout duration in minutes |
| engineType | string | ❌ | "auto" | The engine to use: "auto", "cheerio" (fast, static sites), "playwright" (dynamic sites) |
| useStaticIps | boolean | ❌ | false | Whether to use static IPs for the crawl |
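The constraints in this table can be checked locally before a request is sent. The following is a minimal sketch of such a check, not the SDK's own validation. `check_crawl_payload` is a hypothetical helper, and it interprets "provide at least one of" as: `maxResults` is set, or `useSitemap` or `entireWebsite` is true.

```python
from typing import Any, Dict, List


def check_crawl_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a crawl payload (empty if none)."""
    problems = []
    if not payload.get("startUrl"):
        problems.append("startUrl is required")
    # At least one of these must be provided, per the note above the table
    if not (payload.get("maxResults") is not None
            or payload.get("useSitemap")
            or payload.get("entireWebsite")):
        problems.append("provide maxResults, useSitemap, or entireWebsite")
    max_results = payload.get("maxResults")
    if max_results is not None and not 1 <= max_results <= 5000:
        problems.append("maxResults must be between 1 and 5000")
    max_depth = payload.get("maxDepth")
    if max_depth is not None and not 1 <= max_depth <= 100:
        problems.append("maxDepth must be between 1 and 100")
    return problems
```

Returning a list of messages rather than raising on the first problem lets the caller report every issue in one pass.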
StartScrapePayload
The scrape configuration options available:
| Property | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | string[] | ✅ | - | Array of URLs to scrape (max 30 URLs) |
| includeLinks | boolean | ❌ | true | Whether to include links in the scrape results' markdown |
| excludeNonMainTags | boolean | ❌ | true | Whether to exclude non-main HTML tags (header, footer, aside, etc.) from the scrape results |
| deduplicateContent | boolean | ❌ | true | Remove duplicate content from markdown that appears in multiple scraped pages |
| extraction | string | ❌ | "" | Instructions defining how the AI should extract specific content from the scrape results |
| timeout | number | ❌ | 5 | Timeout duration in minutes |
| engineType | string | ❌ | "auto" | The engine to use: "auto", "cheerio" (fast, static sites), "playwright" (dynamic sites) |
| useStaticIps | boolean | ❌ | false | Whether to use static IPs for the scrape |
Engine Types
Choose the appropriate engine for your crawling needs:
from olyptik import EngineType
# Available engine types
EngineType.AUTO # Automatically choose the best engine
EngineType.PLAYWRIGHT # Use Playwright for JavaScript-heavy sites
EngineType.CHEERIO # Use Cheerio for faster, static content crawling
Crawl Status
Monitor your crawl status using the CrawlStatus enum:
from olyptik import CrawlStatus
# Possible status values
CrawlStatus.RUNNING # Crawl is currently running
CrawlStatus.SUCCEEDED # Crawl completed successfully
CrawlStatus.FAILED # Crawl failed due to an error
CrawlStatus.TIMED_OUT # Crawl exceeded timeout limit
CrawlStatus.ABORTED # Crawl was manually aborted
CrawlStatus.ERROR # Crawl encountered an error
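Of the statuses listed, only RUNNING means the crawl is still in progress, so monitoring usually comes down to polling until one of the other states appears. The sketch below illustrates that loop under stated assumptions: `wait_until_final` and `is_final` are hypothetical helpers, statuses are matched by enum member name (or by an upper-cased string), and how the current status is fetched is left to the caller through `get_status`.

```python
import time
from typing import Any, Callable

# Status names taken from the list above; RUNNING is the only non-final one
FINAL_STATUS_NAMES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED", "ERROR"}


def is_final(status: Any) -> bool:
    """True if the status is one of the non-running states.

    Accepts an enum member (compared by its name) or a plain string.
    """
    name = getattr(status, "name", None) or str(status)
    return name.upper() in FINAL_STATUS_NAMES


def wait_until_final(get_status: Callable[[], Any],
                     poll_seconds: float = 5.0,
                     max_wait_seconds: float = 600.0,
                     sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call get_status() repeatedly until it returns a final status."""
    waited = 0.0
    while True:
        status = get_status()
        if is_final(status):
            return status
        if waited >= max_wait_seconds:
            raise TimeoutError("crawl did not finish in time")
        sleep(poll_seconds)
        waited += poll_seconds
```

The `sleep` parameter exists so the loop can be exercised in tests without real waiting.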
Crawl Log Level
Monitor log levels using the CrawlLogLevel enum:
from olyptik import CrawlLogLevel
# Possible log levels
CrawlLogLevel.INFO # Informational messages
CrawlLogLevel.DEBUG # Debug messages
CrawlLogLevel.WARN # Warning messages
CrawlLogLevel.ERROR # Error messages
Error Handling
The SDK raises an ApiError when the API returns an error response. Wrap your calls in try/except blocks:
from olyptik import Olyptik, ApiError
client = Olyptik(api_key="your_api_key_here")
try:
    crawl = client.run_crawl({
        "startUrl": "https://example.com",
        "maxResults": 10
    })
except ApiError as e:
    # API returned an error response
    print(f"API Error: {e.message}")
    print(f"Status Code: {e.status_code}")
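Some error responses are transient, and a retry with backoff may succeed where the first call failed. The sketch below is one possible approach rather than SDK behavior. `call_with_retries` is a hypothetical helper, the set of retryable status codes is an assumption, and the only thing it relies on is the `status_code` attribute shown in the example above.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def call_with_retries(action: Callable[[], T],
                      attempts: int = 3,
                      base_delay: float = 1.0,
                      retry_statuses: Iterable[int] = (429, 500, 502, 503, 504),
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run action(), retrying with exponential backoff on selected status codes.

    Any exception exposing a `status_code` attribute (as ApiError does in the
    example above) is inspected; everything else is re-raised immediately.
    """
    retryable = set(retry_statuses)
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status not in retryable or attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")


# Usage with the client from the earlier examples:
# crawl = call_with_retries(lambda: client.run_crawl({
#     "startUrl": "https://example.com",
#     "maxResults": 10
# }))
```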
Data Models
CrawlResult
Each crawl result contains:
@dataclass
class CrawlResult:
    crawlId: str     # Unique identifier for the crawl
    teamId: str      # Team identifier
    url: str         # The crawled URL
    title: str       # Page title
    markdown: str    # Extracted content in markdown format
    depthOfUrl: int  # How deep this URL was in the crawl
    createdAt: str   # When the result was created
Crawl
Crawl metadata includes:
@dataclass
class Crawl:
    id: str                        # Unique crawl identifier
    status: CrawlStatus            # Current status
    startUrls: List[str]           # Starting URLs
    includeLinks: bool             # Whether links are included
    maxDepth: int                  # Maximum crawl depth
    maxResults: int                # Maximum number of results
    teamId: str                    # Team identifier
    createdAt: str                 # Creation timestamp
    completedAt: Optional[str]     # Completion timestamp
    durationInSeconds: int         # Total duration
    totalPages: int                # Number of results found
    useSitemap: bool               # Whether sitemap was used
    entireWebsite: Optional[bool]  # Whether to use both sitemap and all found links
    deduplicateContent: bool       # Whether duplicate content appearing on multiple pages is removed from the markdown
    extraction: Optional[str]      # Extraction instructions for the AI, if any
    excludeNonMainTags: bool       # Whether non-main HTML tags were excluded
    timeout: int                   # Timeout setting
    useStaticIps: bool             # Whether static IPs were used
    engineType: EngineType         # Engine type used
CrawlLog
Each crawl log entry contains:
@dataclass
class CrawlLog:
    id: str                         # Unique log identifier
    message: str                    # Log message
    level: CrawlLogLevel            # Log level (info, debug, warn, error)
    description: str                # Detailed description
    crawlId: str                    # Crawl identifier
    teamId: Optional[str]           # Team identifier
    data: Optional[Dict[str, Any]]  # Additional log data
    createdAt: Optional[str]        # Creation timestamp
ScrapeResponse
The response from a scrape operation:
@dataclass
class ScrapeResponse:
    id: str                   # Unique scrape identifier
    teamId: str               # Team identifier
    projectId: str            # Project identifier
    results: List[UrlResult]  # Array of scrape results
    timeout: int              # Timeout in minutes
    origin: str               # Origin of the scrape ("api" or "web")
    createdAt: str            # Creation timestamp
    updatedAt: str            # Last update timestamp
UrlResult
Each URL scrape result contains:
@dataclass
class UrlResult:
    url: str                               # The URL that was scraped
    isSuccess: bool                        # Whether the scrape was successful
    title: str                             # Page title
    markdown: str                          # Extracted content in markdown format
    links: List[str]                       # Links found on the page
    duplicatesRemovedCount: Optional[int]  # Number of duplicate content blocks removed
    errorCode: Optional[int]               # Error code if the scrape failed
    errorMessage: Optional[str]            # Error message if the scrape failed