Scrapy-LLM
LLM integration for Scrapy, implemented as a downloader middleware.
Installation
pip install scrapy-llm
Usage
# settings.py

# Set the response model used for extracting data (required). This must be a
# Pydantic model referenced by its import path, or set as a `response_model`
# attribute on the spider class (see Configuration below).
LLM_RESPONSE_MODEL = 'scraper.models.ResponseModel'

# Enable the middleware.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_llm.handler.LlmExtractorMiddleware': 543,
    ...
}
Then access the extracted data from the response object:
# spider.py
from typing import Any, Dict

def parse(self, response):
    extracted_data: Dict[str, Any] = response.request.meta.get('llm_extracted_data')
    ...
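For type-safe access, the plain dict stored in the request meta can be re-validated into the configured response model. A minimal sketch, where `JobPosting` and the example data are hypothetical stand-ins (the middleware's meta key follows the snippet above):

```python
from typing import Any, Dict

from pydantic import BaseModel


# Hypothetical response model; the field names and types define
# the schema the LLM is asked to extract.
class JobPosting(BaseModel):
    title: str
    company: str


# Stand-in for response.request.meta.get('llm_extracted_data'),
# which the middleware populates with a plain dict.
extracted_data: Dict[str, Any] = {"title": "Data Engineer", "company": "Acme Corp"}

# Re-validate the dict into the model for attribute access and type checks.
job = JobPosting(**extracted_data)
print(job.title)  # Data Engineer
```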
Examples
The examples directory contains a sample Scrapy project that uses the middleware to extract capacity data from university websites.
To run the example project, export your OpenAI API key as an environment variable, along with any other settings you want to change:
export OPENAI_API_KEY=<your-api-key>
Then run the example spider:
cd examples
scrapy crawl generic -a urls_file=urls.csv
Add more URLs to urls.csv to extract data from more websites.
Configuration
All aspects of the middleware can be configured in the settings.py file, except the API key, which should be set as the OPENAI_API_KEY environment variable per the OpenAI API documentation.
LLM_RESPONSE_MODEL
- type: str
- required: True
The response model to use for extracting data from the web page text.
LLM_RESPONSE_MODEL = 'scraper.models.ResponseModel'
This setting can also be set as an attribute on the spider class itself; in that case, the model class should be assigned directly instead of a string path:
class MySpider(scrapy.Spider):
    response_model = ResponseModel
    ...
LLM_UNWRAP_NESTED
- type: bool
- required: False
- default: True
Whether to unwrap (flatten) nested models in the extracted data.
LLM_UNWRAP_NESTED = True
For example, given the following models:
from pydantic import BaseModel

class ContactInfo(BaseModel):
    phone: str

class Person(BaseModel):
    name: str
    contact_info: ContactInfo

the extracted data will be unwrapped to
{
"name": "John Doe",
"phone": "1234567890"
}
Without unwrapping, the data will be
{
"name": "John Doe",
"contact_info": {
"phone": "1234567890"
}
}
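Conceptually, unwrapping amounts to recursively merging nested dict fields into the top level. A rough stdlib-only sketch of that behavior (`unwrap_nested` is an illustrative helper, not the middleware's actual implementation):

```python
def unwrap_nested(data: dict) -> dict:
    """Recursively merge nested dict values into the top level."""
    flat: dict = {}
    for key, value in data.items():
        if isinstance(value, dict):
            # Lift the nested model's fields up to the top level.
            flat.update(unwrap_nested(value))
        else:
            flat[key] = value
    return flat


person = {"name": "John Doe", "contact_info": {"phone": "1234567890"}}
print(unwrap_nested(person))  # {'name': 'John Doe', 'phone': '1234567890'}
```

Note that under a scheme like this, a field name shared between a parent and a nested model would overwrite one or the other, so unique field names across models are safest when unwrapping is enabled.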
LLM_API_BASE
- type: str
- required: False
- default: https://api.openai.com/v1
Base URL for the OpenAI-compatible API.
LLM_API_BASE = 'https://api.openai.com/v1'
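Because only an OpenAI-compatible endpoint is required, this setting can in principle point at a self-hosted server instead. A hypothetical settings.py fragment targeting Ollama's OpenAI-compatible endpoint (the URL and model name here are illustrative assumptions, not defaults of this package):

```python
# settings.py -- hypothetical local setup against an OpenAI-compatible server
LLM_API_BASE = 'http://localhost:11434/v1'  # e.g. Ollama's OpenAI-compatible endpoint
LLM_MODEL = 'llama3'                        # must match a model served at that endpoint
```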
LLM_MODEL
- type: str
- required: False
- default: "gpt-4-turbo"
The language model to use for extracting data from the web page text.
LLM_MODEL = 'gpt-4-turbo'
LLM_MODEL_TEMPERATURE
- type: float
- required: False
- default: 0.0001
The sampling temperature for the language model; the near-zero default keeps extraction close to deterministic.
LLM_MODEL_TEMPERATURE = 0.0001
LLM_SYSTEM_MESSAGE
- type: str
- required: False
- default: You are a data extraction expert, your role is to extract data from the given text according to the provided schema. make sure your output is a valid JSON object.
The system message to use for the language model.
LLM_SYSTEM_MESSAGE = '...'