Scrapeless Python SDK
The official Python SDK for Scrapeless AI - End-to-End Data Infrastructure for AI Developers & Enterprises.
📑 Table of Contents
- 🌟 Features
- 📦 Installation
- 🚀 Quick Start
- 📖 Usage Examples
- 🔧 API Reference
- 📚 Examples
- 📄 License
- 📞 Support
- 🏢 About Scrapeless
🌟 Features
- Browser: Advanced browser session management supporting Playwright and pyppeteer frameworks, with configurable anti-detection capabilities (e.g., fingerprint spoofing, CAPTCHA solving) and extensible automation workflows.
- Universal Scraping API: Web interaction and data extraction with full browser capabilities. Execute JavaScript rendering, simulate user interactions (clicks, scrolls), bypass anti-scraping measures, and export structured data in multiple formats.
- Crawl: Extract data from single pages or traverse entire domains, exporting in formats including Markdown, JSON, HTML, screenshots, and links.
- Scraping API: Direct data extraction APIs for websites (e.g., e-commerce, travel platforms). Retrieve structured product information, pricing, and reviews with pre-built connectors.
- Deep SerpApi: Google SERP data extraction API. Fetch organic results, news, images, and more with customizable parameters and real-time updates.
- Proxies: Geo-targeted proxy network with 195+ countries. Optimize requests for better success rates and regional data access.
- Actor: Deploy custom crawling and data processing workflows at scale with built-in scheduling and resource management.
- Storage Solutions: Scalable data storage solutions for crawled content, supporting seamless integration with cloud services and databases.
📦 Installation
Install the SDK using pip:

```bash
pip install scrapeless
```
🚀 Quick Start
Prerequisite
Log in to the Scrapeless Dashboard and get your API key.
Basic Setup
```python
from scrapeless import Scrapeless

client = Scrapeless({
    'api_key': 'your-api-key'  # Get your API key from https://scrapeless.com
})
```
Environment Variables
You can also configure the SDK using environment variables:
```bash
# Required
SCRAPELESS_API_KEY=your-api-key

# Optional - custom API endpoints
SCRAPELESS_BASE_API_URL=https://api.scrapeless.com
SCRAPELESS_ACTOR_API_URL=https://actor.scrapeless.com
SCRAPELESS_STORAGE_API_URL=https://storage.scrapeless.com
SCRAPELESS_BROWSER_API_URL=https://browser.scrapeless.com
SCRAPELESS_CRAWL_API_URL=https://api.scrapeless.com
```
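When no explicit config is passed, the client falls back to these variables. The lookup is roughly equivalent to the sketch below; the `resolve_config` helper is illustrative, not part of the SDK:

```python
import os

# Defaults mirroring the documented endpoints (illustrative only)
_DEFAULTS = {
    'SCRAPELESS_BASE_API_URL': 'https://api.scrapeless.com',
    'SCRAPELESS_ACTOR_API_URL': 'https://actor.scrapeless.com',
    'SCRAPELESS_STORAGE_API_URL': 'https://storage.scrapeless.com',
    'SCRAPELESS_BROWSER_API_URL': 'https://browser.scrapeless.com',
    'SCRAPELESS_CRAWL_API_URL': 'https://api.scrapeless.com',
}

def resolve_config() -> dict:
    """Build a config dict from the environment, applying endpoint defaults."""
    api_key = os.environ.get('SCRAPELESS_API_KEY')
    if not api_key:
        raise ValueError('SCRAPELESS_API_KEY is required')
    config = {'api_key': api_key}
    for var, default in _DEFAULTS.items():
        # SCRAPELESS_BASE_API_URL -> base_api_url, etc.
        config[var.removeprefix('SCRAPELESS_').lower()] = os.environ.get(var, default)
    return config
```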
📖 Usage Examples
Browser
Advanced browser session management supporting Playwright and Pyppeteer frameworks, with configurable anti-detection capabilities (e.g., fingerprint spoofing, CAPTCHA solving) and extensible automation workflows:
```python
import asyncio

import pyppeteer
from scrapeless import Scrapeless
from scrapeless.types import ICreateBrowser

client = Scrapeless()

async def example():
    # Create a browser session
    config = ICreateBrowser(
        session_name='sdk_test',
        session_ttl=180,
        proxy_country='US',
        session_recording=True
    )
    session = client.browser.create(config).__dict__
    browser_ws_endpoint = session['browser_ws_endpoint']
    print('Browser WebSocket endpoint created:', browser_ws_endpoint)

    # Connect to the session using pyppeteer
    browser = await pyppeteer.connect({'browserWSEndpoint': browser_ws_endpoint})

    # Open a new page and navigate to a website
    page = await browser.newPage()
    await page.goto('https://www.scrapeless.com')
    await browser.disconnect()

asyncio.run(example())
```
Crawl
Extract data from single pages or traverse entire domains, exporting in formats including Markdown, JSON, HTML, screenshots, and links.
```python
from scrapeless import Scrapeless

client = Scrapeless()

result = client.scraping_crawl.scrape_url("https://example.com")
print(result)
```
Scraping API
Direct data extraction APIs for websites (e.g., e-commerce, travel platforms). Retrieve structured product information, pricing, and reviews with pre-built connectors:
```python
from scrapeless import Scrapeless
from scrapeless.types import ScrapingTaskRequest

client = Scrapeless()

request = ScrapingTaskRequest(
    actor='scraper.google.search',
    input={'q': 'nike site:www.nike.com'}
)
result = client.scraping.scrape(request=request)
print(result)
```
Deep SerpApi
Google SERP data extraction API. Fetch organic results, news, images, and more with customizable parameters and real-time updates:
```python
from scrapeless import Scrapeless
from scrapeless.types import ScrapingTaskRequest

client = Scrapeless()

request = ScrapingTaskRequest(
    actor='scraper.google.search',
    input={'q': 'nike site:www.nike.com'}
)
result = client.deepserp.scrape(request=request)
print(result)
```
Actor
Deploy custom crawling and data processing workflows at scale with built-in scheduling and resource management:
```python
from scrapeless import Scrapeless
from scrapeless.types import IRunActorData, IActorRunOptions

client = Scrapeless()

data = IRunActorData(
    input={'url': 'https://example.com'},
    run_options=IActorRunOptions(
        CPU=2,
        memory=2048,
        timeout=600,
    )
)
run = client.actor.run(
    actor_id='your_actor_id',
    data=data
)
print('Actor run result:', run)
```
Error Handling
The SDK raises ScrapelessError for API-related errors:
```python
from scrapeless import Scrapeless, ScrapelessError

client = Scrapeless()

try:
    result = client.scraping.scrape({'url': 'invalid-url'})
except ScrapelessError as error:
    print(f"Scrapeless API error: {error}")
    if hasattr(error, 'status_code'):
        print(f"Status code: {error.status_code}")
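Transient API failures (timeouts, rate limits) are often worth retrying. The SDK does not document a built-in retry policy, so here is a generic exponential-backoff wrapper you can put around any call; `with_retries` is illustrative, not part of the SDK:

```python
import time

def with_retries(call, exceptions=Exception, attempts=3, base_delay=1.0):
    """Retry `call` on failure with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(attempts):
        try:
            return call()
        except exceptions:
            if attempt == attempts - 1:
                raise  # Out of retries: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))

# Hypothetical usage with the SDK:
# result = with_retries(
#     lambda: client.scraping.scrape(request=request),
#     exceptions=ScrapelessError,
# )
```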
🔧 API Reference
Client Configuration
```python
from scrapeless.types import ScrapelessConfig

config = ScrapelessConfig(
    api_key='',                # Your API key
    timeout=30000,             # Request timeout in milliseconds (default: 30000)
    base_api_url='',           # Base API URL
    actor_api_url='',          # Actor service URL
    storage_api_url='',        # Storage service URL
    browser_api_url='',        # Browser service URL
    scraping_crawl_api_url=''  # Crawl service URL
)
```
Available Services
The SDK provides the following services through the main client:
- client.browser - Browser automation with Playwright/pyppeteer support, anti-detection tools (fingerprinting, CAPTCHA solving), and extensible workflows.
- client.universal - JS rendering, user simulation (clicks/scrolls), anti-block bypass, and structured data export.
- client.scraping_crawl - Recursive site crawling with multi-format export (Markdown, JSON, HTML, screenshots, links).
- client.scraping - Pre-built connectors for sites (e.g., e-commerce, travel) to extract product data, pricing, and reviews.
- client.deepserp - Search engine results extraction.
- client.proxies - Proxy management.
- client.actor - Scalable workflow automation with built-in scheduling and resource management.
- client.storage - Data storage solutions.
📚 Examples
Check out the examples directory for comprehensive usage examples:
- Browser
- Playwright Integration
- Pyppeteer Integration
- Scraping API
- Actor
- Storage Usage
- Proxies
- Deep SerpApi
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
📞 Support
- 📖 Documentation: https://docs.scrapeless.com
- 💬 Community: Join our Discord
- 🐛 Issues: GitHub Issues
- 📧 Email: support@scrapeless.com
🏢 About Scrapeless
Scrapeless is a powerful web scraping and browser automation platform that helps businesses extract data from any website at scale. Our platform provides:
- High-performance web scraping infrastructure
- Global proxy network
- Browser automation capabilities
- Enterprise-grade reliability and support
Visit scrapeless.com to learn more and get started.
Made with ❤️ by the Scrapeless team