
LlamaIndex tools for web scraping with ScraperAPI


ScraperAPI LlamaIndex Tools Integration

This tool connects to ScraperAPI, a web scraping API that handles proxies, browsers, and CAPTCHAs, enabling your LlamaIndex agent to scrape web pages and extract structured data from Amazon, Google, eBay, Walmart, and Redfin.

Installation

pip install llama-index-tools-scraperapi
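The examples below read the API key from the `SCRAPERAPI_API_KEY` environment variable, so set it once before running them (the key shown is a placeholder):

```shell
# The tool spec reads the key from this environment variable
export SCRAPERAPI_API_KEY="your-api-key"  # placeholder value
```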

Usage

import asyncio
import os
from llama_index.tools.scraperapi import ScraperAPIToolSpec
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

async def main():
    scraper_tool = ScraperAPIToolSpec(
        api_key=os.environ["SCRAPERAPI_API_KEY"],
    )
    agent = FunctionAgent(
        tools=scraper_tool.to_tool_list(),
        llm=OpenAI(model="gpt-4.1"),
    )

    response = await agent.run(
        "Scrape https://example.com and summarize the content"
    )
    print(response)

asyncio.run(main())

Scrape a Web Page

import os
from llama_index.tools.scraperapi import ScraperAPIToolSpec

tool = ScraperAPIToolSpec(api_key=os.environ["SCRAPERAPI_API_KEY"])

# Returns markdown content by default
content = tool.scrape("https://example.com")
print(content)

# Get plain text instead
content = tool.scrape("https://example.com", output_format="text")

# Enable JS rendering for dynamic pages
content = tool.scrape("https://example.com", render=True)

Amazon

# Product details by ASIN
product = tool.amazon_product(asin="B07FTKQ97Q")

# Search products
results = tool.amazon_search(query="wireless headphones")

# All seller offers for a product
offers = tool.amazon_offers(asin="B07FTKQ97Q")

Google

# Web search (structured SERP)
results = tool.google_search(query="Python web scraping tutorial")

# Shopping results
shopping = tool.google_shopping(query="laptop")

# News articles
news = tool.google_news(query="AI", tbs="w")  # past week

# Maps / places search
places = tool.google_maps_search(query="pizza", latitude=40.7128, longitude=-74.0060)

# Job listings
jobs = tool.google_jobs(query="python developer", gl="us")

eBay

# Product details
product = tool.ebay_product(product_id="166619046796")

# Search with filters
results = tool.ebay_search(query="vintage watch", sort_by="price_lowest", condition="used")

Walmart

# Product details
product = tool.walmart_product(product_id="5253396052")

# Search
results = tool.walmart_search(query="laptop")

# Browse category
items = tool.walmart_category(category="3944_1089430_37807")

# Product reviews
reviews = tool.walmart_reviews(product_id="5253396052", sort="helpful")

Redfin

# Search listings
listings = tool.redfin_search(url="https://www.redfin.com/city/30749/CA/San-Francisco")

# Agent details
agent = tool.redfin_agent(url="https://www.redfin.com/real-estate-agents/agent-name")

# For-sale listing
listing = tool.redfin_forsale(url="https://www.redfin.com/CA/San-Francisco/123-Main-St")

# For-rent listing
rental = tool.redfin_forrent(url="https://www.redfin.com/CA/San-Francisco/456-Oak-Ave")

Geo-targeted Scraping

tool = ScraperAPIToolSpec(
    api_key=os.environ["SCRAPERAPI_API_KEY"],
    country_code="uk",
)

# All requests will use UK proxies by default
content = tool.scrape("https://example.co.uk")

# Override per request
content = tool.scrape("https://example.de", country_code="de")

Available Tools

Scraping:

  • scrape: Scrape any web page and return content as markdown, text, or JSON.

Amazon (Structured Data):

  • amazon_product: Get product details by ASIN.
  • amazon_search: Search Amazon products.
  • amazon_offers: Get all seller offers for a product.

Google (Structured Data):

  • google_search: Google SERP search results.
  • google_shopping: Google Shopping product results.
  • google_news: Google News articles.
  • google_maps_search: Google Maps places search.
  • google_jobs: Google Jobs listings.

eBay (Structured Data):

  • ebay_product: Get product details by product ID.
  • ebay_search: Search eBay listings.

Redfin (Structured Data):

  • redfin_search: Search Redfin listings.
  • redfin_agent: Get agent profile details.
  • redfin_forsale: Get for-sale listing details.
  • redfin_forrent: Get for-rent listing details.

Walmart (Structured Data):

  • walmart_product: Get product details by product ID.
  • walmart_search: Search Walmart products.
  • walmart_category: Browse a Walmart category.
  • walmart_reviews: Get product reviews.

Error Handling

All API errors raise ScraperAPIError, so you can handle them specifically:

import os
from llama_index.tools.scraperapi import ScraperAPIToolSpec, ScraperAPIError

tool = ScraperAPIToolSpec(api_key=os.environ["SCRAPERAPI_API_KEY"])

try:
    result = tool.scrape("https://example.com")
except ScraperAPIError as e:
    print(f"Scraping failed: {e}")

Configuration

Parameter      Type   Default    Description
api_key        str    required   ScraperAPI key
render         bool   False      Enable JS rendering by default
country_code   str    None       Default geo-targeting country code
device_type    str    None       "desktop" or "mobile"
timeout        int    70         Request timeout in seconds
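Putting the table together, a fully configured tool spec might look like the sketch below. The parameter names and defaults come from the table above; the specific values chosen here are illustrative:

```python
import os
from llama_index.tools.scraperapi import ScraperAPIToolSpec

tool = ScraperAPIToolSpec(
    api_key=os.environ["SCRAPERAPI_API_KEY"],  # required
    render=True,            # enable JS rendering for all requests
    country_code="us",      # default geo-targeting country code
    device_type="mobile",   # "desktop" or "mobile"
    timeout=70,             # request timeout in seconds (the default)
)
```

Per-request arguments (as in the geo-targeting example) override these constructor defaults.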
