
Project description

Scrapfly SDK

Installation

pip install scrapfly-sdk

You can also install extra dependencies:

  • pip install "scrapfly-sdk[speedups]" for performance improvements
  • pip install "scrapfly-sdk[concurrency]" for out-of-the-box concurrency (asyncio / thread)
  • pip install "scrapfly-sdk[scrapy]" for Scrapy integration
  • pip install "scrapfly-sdk[all]" for everything above

To use the built-in HTML parser (via the ScrapeApiResponse.selector property), either parsel or scrapy must also be installed.
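As a sketch of what the selector property enables (assuming scrapfly-sdk and parsel are installed and you have a valid API key; the CSS selector below is a hypothetical example, not a documented one):

```python
# Sketch: ScrapeApiResponse.selector exposes a parsel.Selector over the response HTML.
# Requires `pip install "scrapfly-sdk[all]"` and a valid Scrapfly API key.

def first_product_title(api_key: str) -> str:
    from scrapfly import ScrapflyClient, ScrapeConfig  # imported lazily: optional dependency

    client = ScrapflyClient(key=api_key)
    result = client.scrape(ScrapeConfig(url="https://web-scraping.dev/products"))
    # .selector requires parsel or scrapy; the CSS expression is illustrative only
    return result.selector.css("h3 a::text").get()
```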

For usage references and examples, please check out the /examples folder in this repository.

This SDK covers the following Scrapfly API endpoints:

Integrations

The Scrapfly Python SDK is integrated with LlamaIndex and LangChain. Both frameworks allow using Large Language Models (LLMs) with augmented context.

This augmented context is built by grounding LLMs in private or domain-specific data for common use cases:

  • Question-Answering Chatbots (commonly referred to as RAG systems, which stands for "Retrieval-Augmented Generation")
  • Document Understanding and Extraction
  • Autonomous Agents that can perform research and take actions

In the context of web scraping, web page data can be extracted as text or Markdown using Scrapfly's format feature, making the scraped data directly usable with LLMs.
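For illustration, the format feature boils down to a `format` parameter on the scrape API request. The helper below is a hypothetical sketch of assembling such a query string, not part of the SDK:

```python
from urllib.parse import urlencode

def scrape_api_query(api_key: str, url: str, fmt: str = "markdown") -> str:
    """Build a Scrapfly scrape API query string requesting LLM-friendly output.

    `fmt` mirrors Scrapfly's format feature: "markdown" or "text".
    Hypothetical helper for illustration only.
    """
    if fmt not in ("markdown", "text"):
        raise ValueError("format must be 'markdown' or 'text'")
    return urlencode({"key": api_key, "url": url, "format": fmt})

print(scrape_api_query("YOUR-KEY", "https://web-scraping.dev/products"))
```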

LlamaIndex

Installation

Install llama-index, llama-index-readers-web, and scrapfly-sdk using pip:

pip install llama-index llama-index-readers-web scrapfly-sdk

Usage

Scrapfly is available in LlamaIndex as a data connector, known as a Reader. This reader gathers web page data into a Document representation, which can be used with the LLM directly. Below is an example of building a RAG system using LlamaIndex and scraped data. See the LlamaIndex use cases for more.

import os

from llama_index.readers.web import ScrapflyReader
from llama_index.core import VectorStoreIndex

# Initiate ScrapflyReader with your Scrapfly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"]
)

# After creating the documents, train them with an LLM
# LlamaIndex uses OpenAI by default; other options can be found in the examples directory:
# https://docs.llamaindex.ai/en/stable/examples/llm/openai/

# Add your OpenAI key (a paid subscription must exist) from: https://platform.openai.com/api-keys/
os.environ['OPENAI_API_KEY'] = "Your OpenAI Key"
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What is the flavor of the dark energy potion?")
print(response)
"The flavor of the dark energy potion is bold cherry cola."

The load_data function accepts a ScrapeConfig object to use the desired Scrapfly API parameters:

from llama_index.readers.web import ScrapflyReader

# Initiate ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions
)

scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code by the headless browser
}

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"],
    scrape_config=scrapfly_scrape_config,  # Pass the scrape config
    scrape_format="markdown",  # The scrape result format, either `markdown` (default) or `text`
)

LangChain

Installation

Install langchain, langchain-community, and scrapfly-sdk using pip:

pip install langchain langchain-community scrapfly-sdk

Usage

Scrapfly is available in LangChain as a document loader, known as a Loader. This loader gathers web page data into a Document representation, which can be used with the LLM after a few operations. Below is an example of building a RAG system with LangChain using scraped data; see the LangChain tutorials for further use cases.

import os

from langchain import hub # pip install langchainhub
from langchain_chroma import Chroma # pip install langchain_chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAIEmbeddings, ChatOpenAI # pip install langchain_openai
from langchain_text_splitters import RecursiveCharacterTextSplitter # pip install langchain_text_splitters
from langchain_community.document_loaders import ScrapflyLoader


scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()

# This example uses OpenAI. For more see: https://python.langchain.com/v0.2/docs/integrations/platforms/
os.environ["OPENAI_API_KEY"] = "Your OpenAI key"

# Create a retriever
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(documents)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

model = ChatOpenAI()
prompt = hub.pull("rlm/rag-prompt")

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

response = rag_chain.invoke("What is the flavor of the dark energy potion?")
print(response)
"The flavor of the Dark Energy Potion is bold cherry cola."

To use the full Scrapfly features with LangChain, pass a ScrapeConfig object to the ScrapflyLoader:

from langchain_community.document_loaders import ScrapflyLoader

scrapfly_scrape_config = {
    "asp": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool",  # Select a proxy pool (datacenter or residential)
    "country": "us",  # Select a proxy location
    "auto_scroll": True,  # Auto scroll the page
    "js": "",  # Execute custom JavaScript code by the headless browser
}

scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",  # Get your API key from https://www.scrapfly.io/
    continue_on_failure=True,  # Ignore unprocessable web pages and log their exceptions
    scrape_config=scrapfly_scrape_config,  # Pass the scrape_config object
    scrape_format="markdown",  # The scrape result format, either `markdown` (default) or `text`
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()
print(documents)

Get Your API Key

You can create a free account on Scrapfly to get your API Key.

Migration

Migrate from 0.7.x to 0.8

The asyncio-pool dependency has been dropped.

scrapfly.concurrent_scrape is now an async generator. If the concurrency is None or not defined, the max concurrency allowed by your current subscription is used.

    async for result in scrapfly.concurrent_scrape(concurrency=10, scrape_configs=[ScrapeConfig(...), ...]):
        print(result)
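A fuller sketch of driving the generator with asyncio (assuming a valid API key; the URLs and concurrency value are placeholders):

```python
import asyncio

def scrape_many(api_key: str, urls: list) -> None:
    from scrapfly import ScrapflyClient, ScrapeConfig  # imported lazily: requires scrapfly-sdk

    async def run() -> None:
        client = ScrapflyClient(key=api_key)
        configs = [ScrapeConfig(url=url) for url in urls]
        # concurrency=None (the default) uses the maximum allowed by your subscription
        async for result in client.concurrent_scrape(scrape_configs=configs, concurrency=10):
            print(result)

    asyncio.run(run())
```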

The brotli argument is deprecated and will be removed in the next minor release. In most cases it offers no size benefit over gzip while using more CPU.

What's new

0.8.x

  • Better error logging
  • Async improvements for concurrent scraping with asyncio
  • Scrapy media pipelines are now supported out of the box

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapfly-sdk-0.8.19.tar.gz (37.3 kB)

Uploaded Source

Built Distribution

scrapfly_sdk-0.8.19-py3-none-any.whl (40.4 kB)

Uploaded Python 3

File details

Details for the file scrapfly-sdk-0.8.19.tar.gz.

File metadata

  • Download URL: scrapfly-sdk-0.8.19.tar.gz
  • Upload date:
  • Size: 37.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.6

File hashes

Hashes for scrapfly-sdk-0.8.19.tar.gz
  • SHA256: fec3f83116a3b0270ce8574abbf166400d7a437101718b5537eef93193b2cf28
  • MD5: 54ec48cb12b1f7ee259a0fe8fcc6ef1a
  • BLAKE2b-256: 1447095d5c01a6e4f605d73c09289f9f630177b53ff5bf0eda02ac3bfae90c0e


File details

Details for the file scrapfly_sdk-0.8.19-py3-none-any.whl.

File metadata

File hashes

Hashes for scrapfly_sdk-0.8.19-py3-none-any.whl
  • SHA256: 7bb8fa10503a02f2f5981ccb4bd765b910be63a4fb9fd7a1d59c98b72d4ea29c
  • MD5: 7155db5129d9b697d44202752ff8d49f
  • BLAKE2b-256: 17f42a84419ea000c5c76c88c878ed9b08fe2e24bfa48be085586e144230c631

