LLM integration for Scrapy

These details have not been verified by PyPI

Project description

Scrapy-LLM

LLM integration for scrapy as a middleware. Extract any data from the web using your own predefined schema with your own preferred language model.

[ GitHub Workflow Status ](

Features

Extract data from web page text using a language model.
Define a schema for the extracted data using pydantic models.
Validate the extracted data against the defined schema.
Seamlessly integrate with any API compatible with the OpenAI API specification.
Use any language model deployed on an API compatible with the OpenAI API specification.

Installation

pip install scrapy-llm

Usage

# settings.py

# set the response model to use for extracting data to a pydantic model (required)
# or set it as an attribute on the spider class as response_model
LLM_RESPONSE_MODEL = 'scraper.models.ResponseModel'

# enable the middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_llm.handler.LlmExtractorMiddleware': 543,
    ...
}

then access extracted data from the response object.

# spider.py
def parse(self, response):
    extracted_data: Dict[str, Any] = response.request.meta.get('llm_extracted_data')
    ...

Examples

the examples directory contains a sample scrapy project that uses the middleware to extract capacity data from university websites.

to run the example project, export your openai api key as an environment variable, in addition to any other settings you want to change.

export OPENAI_API_KEY=<your-api-key>

then run the example project using the following command

cd examples
scrapy crawl generic -a urls_file=urls.csv

add more urls to the urls.csv file to extract data from more websites.

Configuration

All aspects of the middleware can be configured using the settings.py file except the API key which should be set as the environment variable OPENAI_API_KEY according to the openai api documentation here.

`LLM_RESPONSE_MODEL`

type: str
required: True

the response model to use for extracting data from the web page text.

RESPONSE_MODEL = 'scraper.models.ResponseModel'

this setting can also be set as an attribute on the spider class itself, in that case the class should be used directly instead of a string path to the class.

class MySpider(scrapy.Spider):
    response_model = ResponseModel
    ...

`LLM_UNWRAP_NESTED`

type: bool
required: False
default: True

whether to unwrap nested models in the extracted data.

LLM_UNWRAP_NESTED = True

for example if the following model is used

class ContactInfo(BaseModel):
    phone: str

class Person(BaseModel):
    name: str
    contact_info: ContactInfo

the extracted data will be unwrapped to

{
    "name": "John Doe",
    "phone": "1234567890"
}

without unwrapping the data will be

{
    "name": "John Doe",
    "contact_info": {
        "phone": "1234567890"
    }
}

`LLM_API_BASE`

type: str
required: False
default: https://api.openai.com/v1

base url for the openai compatible api.

LLM_API_BASE = 'https://api.openai.com/v1'

`LLM_MODEL`

type: str
required: False
default: "gpt-4-turbo"

the language model to use for extracting data from the web page text.

LLM_MODEL = 'gpt-4-turbo'

`LLM_MODEL_TEMPERATURE`

type: float
required: False
default: 0.0001

the temperature to use for the language model.

LLM_MODEL_TEMPERATURE = 0.0001

`LLM_SYSTEM_MESSAGE`

type: str
required: False
default: You are a data extraction expert, your role is to extract data from the given text according to the provided schema. make sure your output is a valid JSON object.

the system message to use for the language model.

LLM_SYSTEM_MESSAGE = '...'

Under the hood

Under the hood, scrapy-llm utilizes two libraries to facilitate data extraction from web page text. The first library is Instructor, which uses pydantic to define a schema for the extracted data. This schema is then used to validate the extracted data and ensure that it conforms to the desired structure. By defining a schema for the extracted data, Instructor provides a clear and consistent way to organize and process the extracted information.

The second library is LiteLLM, which enables seamless integration between instructor and any API compatible with the OpenAI API specification. LiteLLM allows using any language model as long as it is deployed on an API compatible with the OpenAI API specification. This flexibility makes it easy to switch between different language models and experiment with different configurations to find the best model for a given task.

By combining the functionalities of Instructor and LiteLLM, scrapy-llm becomes a robust tool for extracting data from web page text. Whether it's scraping a single page or crawling an entire website, scrapy-llm offers a reliable and adaptable solution for all data extraction needs.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.1.20

Feb 10, 2025

0.1.19

Oct 2, 2024

0.1.18

Aug 1, 2024

0.1.17

Jul 30, 2024

0.1.16

Jul 30, 2024

0.1.13

Jul 22, 2024

0.1.12

Jul 21, 2024

0.1.11

Jul 21, 2024

This version

0.1.10

Jul 21, 2024

0.1.9

Jul 21, 2024

0.1.8

Jul 21, 2024

0.1.7

Jul 21, 2024

0.1.6

Jul 21, 2024

0.1.5

Jul 21, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapy_llm-0.1.10.tar.gz (11.5 kB view details)

Uploaded Jul 21, 2024 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

scrapy_llm-0.1.10-py3-none-any.whl (7.7 kB view details)

Uploaded Jul 21, 2024 Python 3

File details

Details for the file scrapy_llm-0.1.10.tar.gz.

File metadata

Download URL: scrapy_llm-0.1.10.tar.gz
Upload date: Jul 21, 2024
Size: 11.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for scrapy_llm-0.1.10.tar.gz
Algorithm	Hash digest
SHA256	`4de17d735a523b97d06800406fa9303161635e97997f4fd600672c415286fd4d`
MD5	`9d98f71e0175f4a247c88670287fae81`
BLAKE2b-256	`e2c5cb9d880363a8f4b97e8450f7c4c990e26cd12b20e00ca0d75e093e04574b`

See more details on using hashes here.

File details

Details for the file scrapy_llm-0.1.10-py3-none-any.whl.

File metadata

Download URL: scrapy_llm-0.1.10-py3-none-any.whl
Upload date: Jul 21, 2024
Size: 7.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for scrapy_llm-0.1.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`84af63acaf8f5f45e2e18a70d056ecd432df80756203c847e8162d4911494f3d`
MD5	`2f942bdd1f7bfb9c9442e09ccf5ffa77`
BLAKE2b-256	`650b6fcdbbebdc17b5e3815bc3878359e5c76ad58c6b2867512cb2e40c29e45a`

See more details on using hashes here.

scrapy-llm 0.1.10

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

Scrapy-LLM

Features

Installation

Usage

Examples

Configuration

`LLM_RESPONSE_MODEL`

`LLM_UNWRAP_NESTED`

`LLM_API_BASE`

`LLM_MODEL`

`LLM_MODEL_TEMPERATURE`

`LLM_SYSTEM_MESSAGE`

Under the hood

License

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes