Official Python SDK for Olyptik API
Olyptik Python SDK
The Olyptik Python SDK provides a simple and intuitive interface for web crawling and content extraction. It supports both synchronous and asynchronous programming patterns with full type hints.
Installation
Install the SDK using pip:
pip install olyptik
Configuration
First, initialize the SDK with your API key, which you can get from the settings page. You can either pass the key directly or use environment variables.
from olyptik import Olyptik
# Initialize with API key
client = Olyptik(api_key="your_api_key_here")
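The environment-variable route mentioned above could look like the following minimal sketch. The variable name `OLYPTIK_API_KEY` and the helper `load_api_key` are assumptions made for illustration: this README does not name a variable that the SDK reads on its own, so the key is read explicitly and then passed to the constructor.

```python
import os


def load_api_key(env_var: str = "OLYPTIK_API_KEY") -> str:
    """Read the API key from the environment, failing loudly if it is missing.

    The default variable name is a hypothetical choice, not an SDK convention.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to your Olyptik API key")
    return api_key


# Usage (requires the olyptik package):
# from olyptik import Olyptik
# client = Olyptik(api_key=load_api_key())
```

Failing early with a clear message tends to be easier to debug than an authentication error from the first API call.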
Synchronous Usage
Start a crawl
Minimal settings crawl:
crawl = client.run_crawl({
    "startUrl": "https://example.com",
    "maxResults": 50
})
print(f"Crawl started with ID: {crawl.id}")
print(f"Status: {crawl.status}")
Full example:
# Start a crawl
crawl = client.run_crawl({
    "startUrl": "https://example.com",
    "maxResults": 50,
    "maxDepth": 2,
    "engineType": "auto",
    "includeLinks": True,
    "timeout": 60,
    "useSitemap": False,
    "entireWebsite": False,
    "excludeNonMainTags": True,
    "useStaticIps": False
})
print(f"Crawl started with ID: {crawl.id}")
print(f"Status: {crawl.status}")
Query crawls
from olyptik import CrawlStatus
result = client.query_crawls({
    "startUrls": ["https://example.com"],
    "status": [CrawlStatus.SUCCEEDED],
    "page": 0,
})
print("Crawls: ", result.results)
print("Page: ", result.page)
print("Total pages: ", result.totalPages)
print("Count of items per page: ", result.limit)
print("Total matched crawls: ", result.totalResults)
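Because `query_crawls` returns one page at a time, collecting every matching crawl means requesting successive pages. The sketch below assumes that pages are zero-based (as in the example above) and that `totalPages` is the total number of pages; `iter_all_crawls` is a hypothetical helper, not part of the SDK.

```python
from typing import Any, Dict, Iterator


def iter_all_crawls(client: Any, query: Dict[str, Any]) -> Iterator[Any]:
    """Yield every crawl matching `query`, requesting one page at a time."""
    page = 0
    while True:
        # Build a new dict per request so the caller's query is not mutated.
        response = client.query_crawls({**query, "page": page})
        yield from response.results
        page += 1
        # Assumption: pages are zero-based, so the last index is totalPages - 1.
        if page >= response.totalPages:
            break


# Usage:
# for crawl in iter_all_crawls(client, {"startUrls": ["https://example.com"]}):
#     print(crawl.id, crawl.status)
```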
Get crawl results
Retrieve the results of your crawl using the crawl ID. The results are paginated, and you can specify the page number and limit per page.
limit = 50
page = 0
results = client.get_crawl_results(crawl.id, page, limit)
for result in results.results:
    print(f"URL: {result.url}")
    print(f"Title: {result.title}")
    print(f"Depth: {result.depthOfUrl}")
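To read every result of a crawl rather than a single page, the pages can be walked in a loop. This README does not say which pagination fields the results response carries, so the sketch below relies only on `results` and assumes that a page with fewer than `limit` items is the last one. `iter_crawl_results` is a hypothetical helper name.

```python
from typing import Any, Iterator


def iter_crawl_results(client: Any, crawl_id: str, limit: int = 50) -> Iterator[Any]:
    """Yield every result of a crawl by walking pages until one comes back short."""
    page = 0
    while True:
        response = client.get_crawl_results(crawl_id, page, limit)
        batch = list(response.results)
        yield from batch
        # Assumption: a page with fewer than `limit` items is the last page.
        if len(batch) < limit:
            break
        page += 1


# Usage:
# for result in iter_crawl_results(client, crawl.id):
#     print(result.url, result.title)
```

When the total number of results is an exact multiple of `limit`, this approach makes one extra request that returns an empty page.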
Abort a crawl
aborted_crawl = client.abort_crawl(crawl.id)
print(f"Crawl aborted with ID: {aborted_crawl.id}")
Asynchronous Usage
For better performance with I/O operations, use the async client:
Start a crawl
Minimal settings crawl:
import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })
        print(f"Crawl started with ID: {crawl.id}")
        print(f"Status: {crawl.status}")

asyncio.run(main())
Full example:
import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # Start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50,
            "maxDepth": 2,
            "engineType": "auto",
            "includeLinks": True,
            "timeout": 60,
            "useSitemap": False,
            "entireWebsite": False,
            "excludeNonMainTags": True,
            "useStaticIps": False
        })
        print(f"Crawl started with ID: {crawl.id}")
        print(f"Status: {crawl.status}")

asyncio.run(main())
Query crawls
import asyncio
from olyptik import AsyncOlyptik, CrawlStatus

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        result = await client.query_crawls({
            "startUrls": ["https://example.com"],
            "status": [CrawlStatus.SUCCEEDED],
            "page": 0,
        })
        print("Crawls: ", result.results)
        print("Page: ", result.page)
        print("Total pages: ", result.totalPages)
        print("Count of items per page: ", result.limit)
        print("Total matched crawls: ", result.totalResults)

asyncio.run(main())
Get crawl results
import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # First start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })
        # Get crawl results
        limit = 50
        page = 0
        results = await client.get_crawl_results(crawl.id, page, limit)
        for result in results.results:
            print(f"URL: {result.url}")
            print(f"Title: {result.title}")
            print(f"Depth: {result.depthOfUrl}")

asyncio.run(main())
Abort a crawl
import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # First start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })
        # Abort the crawl
        aborted_crawl = await client.abort_crawl(crawl.id)
        print(f"Crawl aborted with ID: {aborted_crawl.id}")

asyncio.run(main())
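Since the async client's methods are awaitable, several crawls can be started concurrently instead of one after another. The following is a minimal sketch using `asyncio.gather`; `start_crawls` is a hypothetical helper, and whether the API limits the number of simultaneous crawls is not stated in this README.

```python
import asyncio
from typing import Any, List


async def start_crawls(client: Any, urls: List[str], max_results: int = 50) -> List[Any]:
    """Start one crawl per URL concurrently and return the crawls in input order."""
    pending = [
        client.run_crawl({"startUrl": url, "maxResults": max_results})
        for url in urls
    ]
    # gather runs the coroutines concurrently and preserves argument order.
    return await asyncio.gather(*pending)


# Usage (requires the olyptik package):
# async def main():
#     async with AsyncOlyptik(api_key="your_api_key_here") as client:
#         crawls = await start_crawls(client, ["https://example.com", "https://example.org"])
#         for crawl in crawls:
#             print(crawl.id, crawl.status)
# asyncio.run(main())
```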
Configuration Options
StartCrawlPayload
The following crawl configuration options are available.
You must provide at least one of maxResults, useSitemap, or entireWebsite.
| Property | Type | Required | Default | Description |
|---|---|---|---|---|
| startUrl | string | ✅ | - | The URL to start crawling from |
| maxResults | number | ❌ | - | Maximum number of results to collect (1-5,000) |
| useSitemap | boolean | ❌ | false | Whether to use sitemap.xml to crawl the website |
| entireWebsite | boolean | ❌ | false | Whether to use sitemap.xml and all found links to crawl the website |
| maxDepth | number | ❌ | 10 | Maximum depth of pages to crawl (1-100) |
| includeLinks | boolean | ❌ | true | Whether to include links in the crawl results' markdown |
| excludeNonMainTags | boolean | ❌ | true | Whether to exclude non-main HTML tags (header, footer, aside, etc.) from the crawl results |
| timeout | number | ❌ | 60 | Timeout duration in minutes |
| engineType | string | ❌ | "auto" | The engine to use: "auto", "cheerio" (fast, static sites), "playwright" (dynamic sites) |
| useStaticIps | boolean | ❌ | false | Whether to use static IPs for the crawl |
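The constraints in the table can be checked on the client side before a request is sent. The sketch below is one possible interpretation, not the SDK's own validation: `validate_start_crawl_payload` is a hypothetical helper, and it assumes that the "at least one of" rule is satisfied by a numeric maxResults or by a flag set to True.

```python
from typing import Any, Dict, List


def validate_start_crawl_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a crawl payload (empty if none)."""
    problems = []
    if not payload.get("startUrl"):
        problems.append("startUrl is required")

    # Assumption: a numeric maxResults or a truthy flag satisfies the rule.
    has_scope = (
        payload.get("maxResults") is not None
        or bool(payload.get("useSitemap"))
        or bool(payload.get("entireWebsite"))
    )
    if not has_scope:
        problems.append("provide at least one of maxResults, useSitemap, entireWebsite")

    max_results = payload.get("maxResults")
    if max_results is not None and not 1 <= max_results <= 5000:
        problems.append("maxResults must be between 1 and 5,000")

    max_depth = payload.get("maxDepth")
    if max_depth is not None and not 1 <= max_depth <= 100:
        problems.append("maxDepth must be between 1 and 100")

    engine = payload.get("engineType")
    if engine is not None and engine not in ("auto", "cheerio", "playwright"):
        problems.append('engineType must be "auto", "cheerio", or "playwright"')
    return problems
```

Returning all problems at once, rather than raising on the first, lets a caller report every issue in a single pass.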
Engine Types
Choose the appropriate engine for your crawling needs:
from olyptik import EngineType
# Available engine types
EngineType.AUTO # Automatically choose the best engine
EngineType.PLAYWRIGHT # Use Playwright for JavaScript-heavy sites
EngineType.CHEERIO # Use Cheerio for faster, static content crawling
Crawl Status
Monitor your crawl status using the CrawlStatus enum:
from olyptik import CrawlStatus
# Possible status values
CrawlStatus.RUNNING # Crawl is currently running
CrawlStatus.SUCCEEDED # Crawl completed successfully
CrawlStatus.FAILED # Crawl failed due to an error
CrawlStatus.TIMED_OUT # Crawl exceeded timeout limit
CrawlStatus.ABORTED # Crawl was manually aborted
CrawlStatus.ERROR # Crawl encountered an error
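Every status other than RUNNING describes a finished crawl, so waiting for completion amounts to polling until the status changes. This README does not show a method for fetching a single crawl by ID, so the sketch below leaves the lookup to a caller-supplied function. `wait_for_completion`, `fetch_crawl`, and the timing parameters are all hypothetical names chosen for illustration.

```python
import time
from typing import Any, Callable


def wait_for_completion(
    fetch_crawl: Callable[[], Any],
    running_status: Any,
    poll_seconds: float = 5.0,
    max_wait_seconds: float = 600.0,
) -> Any:
    """Poll `fetch_crawl` until the crawl's status is no longer `running_status`."""
    deadline = time.monotonic() + max_wait_seconds
    while True:
        crawl = fetch_crawl()
        if crawl.status != running_status:
            return crawl  # SUCCEEDED, FAILED, TIMED_OUT, ABORTED, or ERROR
        if time.monotonic() >= deadline:
            raise TimeoutError("crawl is still running after max_wait_seconds")
        time.sleep(poll_seconds)


# Usage, where my_lookup is your own function returning the latest crawl object:
# finished = wait_for_completion(my_lookup, CrawlStatus.RUNNING)
```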
Error Handling
The SDK raises exceptions for various scenarios. Wrap your calls in try/except blocks:
from olyptik import Olyptik, ApiError
client = Olyptik(api_key="your_api_key_here")
try:
    crawl = client.run_crawl({
        "startUrl": "https://example.com",
        "maxResults": 10
    })
except ApiError as e:
    # API returned an error response
    print(f"API Error: {e.message}")
    print(f"Status Code: {e.status_code}")
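For errors that may be temporary, the same try/except pattern can be wrapped in a small retry loop with exponential backoff. This is a generic sketch rather than an SDK feature: `call_with_retries` is a hypothetical helper, and which status codes are worth retrying is an assumption that you should check against the API's behavior.

```python
import time
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    retry_on: Type[Exception],
    should_retry: Callable[[Exception], bool] = lambda exc: True,
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Call `func`, retrying with exponential backoff on selected errors."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as exc:
            # Re-raise on the final attempt or when the error is not retryable.
            if attempt == attempts or not should_retry(exc):
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


# Usage (requires the olyptik package):
# crawl = call_with_retries(
#     lambda: client.run_crawl({"startUrl": "https://example.com", "maxResults": 10}),
#     retry_on=ApiError,
#     should_retry=lambda e: e.status_code >= 500,  # assumption: 5xx errors are transient
# )
```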
Data Models
CrawlResult
Each crawl result contains:
from dataclasses import dataclass

@dataclass
class CrawlResult:
    crawlId: str     # Unique identifier for the crawl
    teamId: str      # Team identifier
    url: str         # The crawled URL
    title: str       # Page title
    markdown: str    # Extracted content in markdown format
    depthOfUrl: int  # How deep this URL was in the crawl
    createdAt: str   # When the result was created
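As an illustration of working with these fields, the sketch below writes each result's markdown to its own file. It uses only the `url`, `title`, and `markdown` attributes listed above; `save_results_as_markdown` and the file-naming scheme are hypothetical choices.

```python
import re
from pathlib import Path
from typing import Any, Iterable, List


def save_results_as_markdown(results: Iterable[Any], out_dir: str) -> List[Path]:
    """Write each result's markdown to `out_dir`, one file per crawled URL."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for index, result in enumerate(results):
        # Build a filesystem-safe name from the URL; the index keeps names unique.
        slug = re.sub(r"[^A-Za-z0-9]+", "-", result.url).strip("-")[:80]
        path = target / f"{index:04d}-{slug}.md"
        path.write_text(f"# {result.title}\n\n{result.markdown}\n", encoding="utf-8")
        written.append(path)
    return written


# Usage:
# results = client.get_crawl_results(crawl.id, 0, 50)
# save_results_as_markdown(results.results, "crawl_output")
```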
Crawl
Crawl metadata includes:
from dataclasses import dataclass
from typing import List, Optional

from olyptik import CrawlStatus, EngineType

@dataclass
class Crawl:
    id: str                     # Unique crawl identifier
    status: CrawlStatus         # Current status
    startUrls: List[str]        # Starting URLs
    includeLinks: bool          # Whether links are included
    maxDepth: int               # Maximum crawl depth
    maxResults: int             # Maximum number of results
    teamId: str                 # Team identifier
    createdAt: str              # Creation timestamp
    completedAt: Optional[str]  # Completion timestamp
    durationInSeconds: int      # Total duration
    numberOfResults: int        # Number of results found
    useSitemap: bool            # Whether sitemap was used
    entireWebsite: bool         # Whether to use both sitemap and all found links
    excludeNonMainTags: bool    # Whether non-main HTML tags were excluded
    timeout: int                # Timeout setting
    useStaticIps: bool          # Whether static IPs were used
    engineType: EngineType      # Engine type used