LLM integration for Scrapy

Scrapy-LLM

LLM integration for Scrapy, implemented as a downloader middleware.

Installation

pip install scrapy-llm

Usage

# settings.py

# set the response model to use for extracting data to a pydantic model (required)
# or set it as an attribute on the spider class as response_model
LLM_RESPONSE_MODEL = 'scraper.models.ResponseModel'

# enable the middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_llm.handler.LlmExtractorMiddleware': 543,
    # ... other middlewares
}

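Then access the extracted data in the spider callback, under the llm_extracted_data key of the request's meta.

# spider.py
from typing import Any, Dict, Optional

def parse(self, response):
    # None if the middleware stored nothing for this request
    extracted_data: Optional[Dict[str, Any]] = response.request.meta.get('llm_extracted_data')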
    ...
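
The README does not say what the middleware stores when extraction fails or is skipped, so a callback may want to guard against a missing key. The following is a minimal sketch under that assumption: get_extracted_data is a hypothetical helper, not part of scrapy-llm, and SimpleNamespace objects stand in for a real Scrapy response.

```python
from types import SimpleNamespace
from typing import Any, Dict


def get_extracted_data(response: Any) -> Dict[str, Any]:
    """Return the data stored under 'llm_extracted_data', or {} if absent.

    Hypothetical helper for illustration; not part of scrapy-llm.
    """
    data = response.request.meta.get("llm_extracted_data")
    return data if isinstance(data, dict) else {}


# Stand-in objects that mimic the response.request.meta attribute chain.
fake_response = SimpleNamespace(
    request=SimpleNamespace(meta={"llm_extracted_data": {"name": "John Doe"}})
)
empty_response = SimpleNamespace(request=SimpleNamespace(meta={}))

print(get_extracted_data(fake_response))   # {'name': 'John Doe'}
print(get_extracted_data(empty_response))  # {}
```

Inside a spider, the same helper could be called at the top of parse so the rest of the callback can assume a dict.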

Examples

The examples directory contains a sample Scrapy project that uses the middleware to extract capacity data from university websites.

To run the example project, export your OpenAI API key as an environment variable and adjust any other settings you want to change.

export OPENAI_API_KEY=<your-api-key>

Then run the example project with the following commands:

cd examples
scrapy crawl generic -a urls_file=urls.csv

Add more URLs to the urls.csv file to extract data from more websites.

Configuration

Every aspect of the middleware is configured in settings.py, except the API key, which should be set as the OPENAI_API_KEY environment variable, as described in the OpenAI API documentation.
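
Because the key lives in the environment rather than in settings.py, a missing key would otherwise only surface once a request reaches the API. The sketch below shows one way to fail early; require_api_key is a hypothetical helper, not something scrapy-llm provides.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the OpenAI API key from the environment or raise early.

    Hypothetical helper for illustration; not part of scrapy-llm.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the crawl"
        )
    return key
```

Calling require_api_key() once at startup, for example in the spider's constructor, is one possible place for such a check.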

LLM_RESPONSE_MODEL

  • type: str
  • required: True

The pydantic model to use for extracting data from the web page text.

LLM_RESPONSE_MODEL = 'scraper.models.ResponseModel'

This setting can also be set as the response_model attribute on the spider class. In that case, assign the class directly instead of a string path to the class.

class MySpider(scrapy.Spider):
    response_model = ResponseModel
    ...
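
To illustrate the difference between the two forms, the sketch below resolves either a dotted string path or a class object to a class, using only the standard library. It is not the middleware's actual loading code, and resolve_response_model is a hypothetical name.

```python
from importlib import import_module
from typing import Union


def resolve_response_model(value: Union[str, type]) -> type:
    """Accept a class or a dotted path such as 'package.module.ClassName'.

    Hypothetical helper for illustration; not part of scrapy-llm.
    """
    if isinstance(value, type):
        # Spider attribute form: the class was assigned directly.
        return value
    # Settings form: split 'package.module.ClassName' at the last dot.
    module_path, _, class_name = value.rpartition(".")
    if not module_path:
        raise ValueError(
            f"expected a dotted path like 'package.module.Class', got {value!r}"
        )
    return getattr(import_module(module_path), class_name)
```

With the string form the module is imported only when the path is resolved, while the attribute form requires the spider module to import the class itself.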

LLM_UNWRAP_NESTED

  • type: bool
  • required: False
  • default: True

Whether to unwrap nested models in the extracted data.

LLM_UNWRAP_NESTED = True

For example, if the following models are used,

from pydantic import BaseModel

class ContactInfo(BaseModel):
    phone: str

class Person(BaseModel):
    name: str
    contact_info: ContactInfo

the extracted data will be unwrapped to

{
    "name": "John Doe",
    "phone": "1234567890"
}

Without unwrapping, the data will be

{
    "name": "John Doe",
    "contact_info": {
        "phone": "1234567890"
    }
}
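
The transformation between the two shapes above can be sketched as a recursive flattening of nested dictionaries. This is an illustration of the idea only, not the middleware's implementation; unwrap_nested is a hypothetical function, and the README does not say how key collisions between levels are handled (here, the later value wins).

```python
from typing import Any, Dict


def unwrap_nested(data: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten nested dicts into a single level (illustrative sketch)."""
    flat: Dict[str, Any] = {}
    for key, value in data.items():
        if isinstance(value, dict):
            # Merge the nested model's fields into the parent level.
            flat.update(unwrap_nested(value))
        else:
            flat[key] = value
    return flat


nested = {"name": "John Doe", "contact_info": {"phone": "1234567890"}}
print(unwrap_nested(nested))  # {'name': 'John Doe', 'phone': '1234567890'}
```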

LLM_API_BASE

Base URL for the OpenAI-compatible API.

LLM_API_BASE = 'https://api.openai.com/v1'
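
Since the setting accepts any OpenAI-compatible endpoint, it can presumably be pointed at a self-hosted server together with a matching LLM_MODEL. The fragment below is a sketch with hypothetical values: the URL and model name are placeholders, and whether such a server still expects OPENAI_API_KEY depends on the server.

```python
# settings.py
# Hypothetical values: replace the URL and model name with those of your server.
LLM_API_BASE = "http://localhost:8000/v1"
LLM_MODEL = "my-local-model"
```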

LLM_MODEL

  • type: str
  • required: False
  • default: "gpt-4-turbo"

The language model to use for extracting data from the web page text.

LLM_MODEL = 'gpt-4-turbo'

LLM_MODEL_TEMPERATURE

  • type: float
  • required: False
  • default: 0.0001

The sampling temperature to use for the language model.

LLM_MODEL_TEMPERATURE = 0.0001

LLM_SYSTEM_MESSAGE

  • type: str
  • required: False
  • default: You are a data extraction expert, your role is to extract data from the given text according to the provided schema. make sure your output is a valid JSON object.

The system message to use for the language model.

LLM_SYSTEM_MESSAGE = '...'
