
LLM integration for Scrapy

Project description

Scrapy-LLM

LLM integration for Scrapy, implemented as a downloader middleware.


Installation

pip install scrapy-llm

Usage

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy_llm.handler.LlmExtractorMiddleware': 543,
    ...
}

Then access the extracted data through the meta of the request attached to the response:

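# spider.py
from typing import Any, Dict

def parse(self, response):
    # the middleware stores its output under the 'llm_extracted_data' meta key
    extracted_data: Dict[str, Any] = response.request.meta.get('llm_extracted_data')
    ...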
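Because `meta.get(...)` returns `None` when the key is absent, a spider may want to guard against that before yielding an item. The following is a minimal sketch, assuming the middleware stores a plain dict under `llm_extracted_data` as shown above. The helper name `item_from_meta` and the added `url` field are hypothetical and not part of scrapy-llm.

```python
from typing import Any, Dict, Optional


def item_from_meta(meta: Dict[str, Any], url: str) -> Optional[Dict[str, Any]]:
    """Build an item from the middleware output, or return None if nothing was extracted.

    Hypothetical helper for illustration; not part of scrapy-llm.
    """
    extracted = meta.get("llm_extracted_data")
    if not extracted:
        # Key missing or empty: nothing to yield for this page.
        return None
    # Copy so the dict stored on the request is not mutated.
    item = dict(extracted)
    item["url"] = url
    return item
```

Inside `parse`, this could be called as `item = item_from_meta(response.request.meta, response.url)`, yielding the item only when it is not `None`.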

Configuration

All aspects of the middleware can be configured in settings.py, except the API key, which should be set as the environment variable OPENAI_API_KEY, as described in the OpenAI API documentation.
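Since the key comes from the environment rather than from settings.py, it can help to fail early when it is missing instead of discovering it on the first request. A small sketch using only the standard library; `require_api_key` is a hypothetical helper, and whether scrapy-llm performs a similar check itself is not stated here.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return OPENAI_API_KEY from the environment or raise a clear error.

    Hypothetical helper for illustration; not part of scrapy-llm.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before starting the crawl"
        )
    return key
```

Calling `require_api_key()` at the top of a run script would stop the crawl before any page is downloaded if the variable is absent.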

LLM_RESPONSE_MODEL

  • type: str
  • required: True

The response model used to extract data from the web page text, given as a string path to the model class.

LLM_RESPONSE_MODEL = 'scraper.models.ResponseModel'

This setting can also be set as an attribute on the spider class itself. In that case, use the class directly instead of a string path to the class.

class MySpider(scrapy.Spider):
    response_model = ResponseModel
    ...
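A string path such as 'scraper.models.ResponseModel' has to be turned into a class at some point. The sketch below shows the usual way to do that with the standard library. It is an assumption about the general technique, not a description of how scrapy-llm resolves the path internally, and `load_by_path` is a hypothetical name.

```python
from importlib import import_module
from typing import Any


def load_by_path(path: str) -> Any:
    """Import 'package.module.Name' and return the object called Name.

    Hypothetical helper for illustration; not part of scrapy-llm.
    """
    module_path, _, attr_name = path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'package.module.Name', got {path!r}")
    module = import_module(module_path)
    # Raises AttributeError if the module has no such attribute.
    return getattr(module, attr_name)
```

With this approach the model's module must be importable from the process that runs the crawl, which is why the path in settings.py is written relative to the project package.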

LLM_UNWRAP_NESTED

  • type: bool
  • required: False
  • default: True

Whether to unwrap nested models in the extracted data.

LLM_UNWRAP_NESTED = True

For example, if the following models are used:

class ContactInfo(BaseModel):
    phone: str

class Person(BaseModel):
    name: str
    contact_info: ContactInfo

the extracted data will be unwrapped to:

{
    "name": "John Doe",
    "phone": "1234567890"
}

Without unwrapping, the data will be:

{
    "name": "John Doe",
    "contact_info": {
        "phone": "1234567890"
    }
}
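The two shapes above differ only in whether nested objects are lifted to the top level. A minimal sketch of that flattening on plain dicts, assuming nested field names do not clash with outer ones; how scrapy-llm itself handles name collisions is not stated here, and `unwrap_nested` is a hypothetical function name.

```python
from typing import Any, Dict


def unwrap_nested(data: Dict[str, Any]) -> Dict[str, Any]:
    """Lift the fields of nested dicts to the top level, dropping the wrapper key.

    Hypothetical helper for illustration; not part of scrapy-llm.
    """
    flat: Dict[str, Any] = {}
    for key, value in data.items():
        if isinstance(value, dict):
            # Recurse so deeper levels are flattened as well.
            flat.update(unwrap_nested(value))
        else:
            flat[key] = value
    return flat


person = {"name": "John Doe", "contact_info": {"phone": "1234567890"}}
print(unwrap_nested(person))  # {'name': 'John Doe', 'phone': '1234567890'}
```

In this sketch a later key silently overwrites an earlier one with the same name, so models with repeated field names across nesting levels are better left wrapped.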

LLM_API_BASE

Base URL of the OpenAI-compatible API.

LLM_API_BASE = 'https://api.openai.com/v1'

LLM_MODEL

  • type: str
  • required: False
  • default: "gpt-4-turbo"

The language model to use for extracting data from the web page text.

LLM_MODEL = 'gpt-4-turbo'

LLM_MODEL_TEMPERATURE

  • type: float
  • required: False
  • default: 0.0001

The temperature to use for the language model.

LLM_MODEL_TEMPERATURE = 0.0001

LLM_SYSTEM_MESSAGE

  • type: str
  • required: False
  • default: You are a data extraction expert, your role is to extract data from the given text according to the provided schema. make sure your output is a valid JSON object.

The system message to use for the language model.

LLM_SYSTEM_MESSAGE = '...'
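Putting the options together, a complete settings.py might look like the sketch below. Every value shown is either the documented default or the example value from the sections above. The model path 'scraper.models.ResponseModel' is the illustrative path used earlier and should point at a model in your own project.

```python
# settings.py (sketch combining the options documented above)

DOWNLOADER_MIDDLEWARES = {
    "scrapy_llm.handler.LlmExtractorMiddleware": 543,
}

# Required: string path to the response model class (example path).
LLM_RESPONSE_MODEL = "scraper.models.ResponseModel"

# Optional settings, shown with their documented defaults or example values.
LLM_UNWRAP_NESTED = True
LLM_API_BASE = "https://api.openai.com/v1"
LLM_MODEL = "gpt-4-turbo"
LLM_MODEL_TEMPERATURE = 0.0001
LLM_SYSTEM_MESSAGE = (
    "You are a data extraction expert, your role is to extract data from "
    "the given text according to the provided schema. "
    "make sure your output is a valid JSON object."
)
```

The API key is deliberately absent from this file, since it is read from the OPENAI_API_KEY environment variable.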

