Python SDK for Spider Cloud API

Spider Cloud Python SDK

The Spider Cloud Python SDK is a toolkit for scraping single pages, crawling websites at scale, extracting links, and taking screenshots. It collects data in formats suited to large language models (LLMs) and wraps the Spider Cloud API in a small Python interface.

Installation

To install the Spider Cloud Python SDK, you can use pip:

pip install spider_client

Usage

  1. Get an API key from spider.cloud
  2. Set the API key as an environment variable named SPIDER_API_KEY or pass it as a parameter to the Spider class.
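Step 2 can be sketched as follows. This is a minimal sketch and not part of the SDK: `resolve_api_key` is a hypothetical helper that prefers an explicitly passed key, falls back to the SPIDER_API_KEY environment variable, and fails early when neither is set.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper (not part of the SDK): choose the API key to use."""
    # Read the real process environment unless a mapping is supplied,
    # which keeps the function easy to test.
    env = os.environ if env is None else env
    key = explicit_key or env.get("SPIDER_API_KEY")
    if not key:
        raise ValueError("Set SPIDER_API_KEY or pass an API key explicitly.")
    return key
```

The returned value could then be passed on as `Spider(api_key=resolve_api_key())`.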

Here's an example of how to use the SDK:

from spider import Spider

# Initialize the Spider with your API key
app = Spider(api_key='your_api_key')

# Scrape a single URL
url = 'https://spider.cloud'
scraped_data = app.scrape_url(url)

# Crawl a website
crawler_params = {
    'limit': 1,
    'proxy_enabled': True,
    'metadata': False,
    'request': 'http'
}
crawl_result = app.crawl_url(url, params=crawler_params)

Scraping a URL

To scrape data from a single URL:

url = 'https://example.com'
scraped_data = app.scrape_url(url)

Crawling a Website

To automate crawling a website:

url = 'https://example.com'
crawl_params = {
    'limit': 200,
    'request': 'smart_mode'
}
crawl_result = app.crawl_url(url, params=crawl_params)
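The parameter dictionaries in this README share several keys, so it may help to keep one set of defaults and override only what changes per call. The sketch below assumes nothing about the SDK beyond the keys already shown here. `BASE_PARAMS` and `make_params` are hypothetical names introduced for illustration.

```python
from typing import Any, Dict

# Values copied from the crawl example in this README; adjust as needed.
BASE_PARAMS: Dict[str, Any] = {
    "limit": 1,
    "proxy_enabled": True,
    "metadata": False,
    "request": "http",
}


def make_params(**overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: merge overrides into a copy of BASE_PARAMS."""
    # Building a new dict means BASE_PARAMS itself is never mutated.
    return {**BASE_PARAMS, **overrides}


crawl_params = make_params(limit=200, request="smart_mode")
```

The resulting dictionary can be passed as `params=crawl_params` in the same way as the hand-written ones.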

Crawl Streaming

Stream the crawl in chunks so that large crawls can be processed as results arrive. Each streamed JSON object is passed to the callback:

    def handle_json(json_obj: dict) -> None:
        assert json_obj["url"] is not None

    url = 'https://example.com'
    crawl_params = {
        'limit': 200,
    }
    response = app.crawl_url(
        url,
        params=crawl_params,
        stream=True,
        callback=handle_json,
    )
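A callback does not have to be a plain function. The sketch below uses a small callable object to accumulate results across streamed chunks. It assumes, as the example above does, that each streamed object is a dict with a "url" key. `PageCollector` is a hypothetical name, and the objects are fed in by hand here so the snippet runs without network access.

```python
from typing import Any, Dict, List


class PageCollector:
    """Hypothetical callback object that records the URL of each streamed page."""

    def __init__(self) -> None:
        self.urls: List[str] = []
        self.skipped = 0

    def __call__(self, json_obj: Dict[str, Any]) -> None:
        url = json_obj.get("url")
        if url is None:
            # Count objects without a URL instead of failing the whole stream.
            self.skipped += 1
            return
        self.urls.append(url)


collector = PageCollector()
# Simulated stream: in real use the SDK would invoke the callback instead.
for obj in [
    {"url": "https://example.com/"},
    {"url": None},
    {"url": "https://example.com/about"},
]:
    collector(obj)
```

In a real crawl the instance would be passed as `callback=collector` in place of `handle_json`.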

Search

Perform a search for websites to crawl or gather search results:

query = 'a sports website'
crawl_params = {
    'request': 'smart_mode',
    'search_limit': 5,
    'limit': 5,
    'fetch_page_content': True
}
crawl_result = app.search(query, params=crawl_params)

Retrieving Links from URL(s)

Extract all links from a specified URL:

url = 'https://example.com'
links = app.links(url)

Transform

Convert HTML to Markdown or plain text:

data = [ { 'html': '<html><body><h1>Hello world</h1></body></html>' } ]
params = {
    'readability': False,
    'return_format': 'markdown',
}
result = app.transform(data, params=params)
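When several documents need converting, the `data` list can be built from a collection of HTML strings. This is a minimal sketch that assumes only the `{'html': ...}` item shape shown above. `build_transform_payload` is a hypothetical helper, not an SDK function.

```python
from typing import Dict, Iterable, List


def build_transform_payload(html_documents: Iterable[str]) -> List[Dict[str, str]]:
    """Hypothetical helper: wrap each non-empty HTML string in an {'html': ...} item."""
    # Whitespace-only strings are dropped so they are not sent for conversion.
    return [{"html": html} for html in html_documents if html.strip()]


data = build_transform_payload([
    "<html><body><h1>Hello world</h1></body></html>",
    "   ",
    "<p>Second document</p>",
])
```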

Taking Screenshots of URL(s)

Capture a screenshot of a given URL:

url = 'https://example.com'
screenshot = app.screenshot(url)

Checking Available Credits

You can check the remaining credits on your account:

credits = app.get_credits()

Streaming

To stream the response, pass True as the third argument:

url = 'https://example.com'

crawler_params = {
    'limit': 1,
    'proxy_enabled': True,
    'metadata': False,
    'request': 'http'
}

links = app.links(url, crawler_params, True)

Content-Type

Set the response content type with the fourth argument. The following Content-Type values are supported:

  1. application/json
  2. text/csv
  3. application/xml
  4. application/jsonl

url = 'https://example.com'

crawler_params = {
    'limit': 1,
    'proxy_enabled': True,
    'metadata': False,
    'request': 'http'
}

# Stream JSON Lines back to the client
crawl_result = app.crawl_url(url, crawler_params, True, "application/jsonl")
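JSON Lines carries one JSON object per line, so a payload can be decoded line by line with the standard library. The sketch below assumes the body is available as text. How the SDK exposes the body is not covered here, and `parse_jsonl` is a hypothetical helper.

```python
import json
from typing import Any, List


def parse_jsonl(text: str) -> List[Any]:
    """Hypothetical helper: decode a JSON Lines payload into Python objects."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # Blank lines are tolerated and skipped.
            continue
        records.append(json.loads(line))
    return records


sample = '{"url": "https://example.com"}\n\n{"url": "https://example.com/a"}\n'
records = parse_jsonl(sample)
```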

Error Handling

The SDK handles errors returned by the Spider Cloud API and raises appropriate exceptions. If an error occurs during a request, an exception will be raised with a descriptive error message.
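Because failed requests surface as exceptions, callers may want to retry transient failures. The sketch below is one possible approach and not an SDK feature. The exception types raised by the SDK are not listed in this README, so it catches `Exception` broadly, and `call_with_retries` is a hypothetical helper.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> T:
    """Hypothetical helper: call func, retrying when it raises an exception."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as error:  # Broad on purpose: SDK exception types are assumed unknown.
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Every attempt failed, so re-raise the most recent error.
    assert last_error is not None
    raise last_error
```

A request could then be wrapped as `call_with_retries(lambda: app.scrape_url(url))`.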

Contributing

Contributions to the Spider Cloud Python SDK are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.

License

The Spider Cloud Python SDK is open-source and released under the MIT License.
