
Experimental library for leveraging GPT for web scraping.

Project description

scrapeghost

An experiment in using GPT-4 to scrape websites.

Caution: Use at your own risk; a single call can cost around $0.36 on larger pages at current rates.

See the examples directory for current usage.

License

Currently licensed under the Hippocratic License 3.0; see LICENSE.md for details.

Usage

You will need an OpenAI API key with access to the GPT-4 API. Configure your key and organization as you otherwise would via the openai library:

import os
import openai

openai.organization = os.getenv("OPENAI_API_ORG")
openai.api_key = os.getenv("OPENAI_API_KEY")

Basics

The SchemaScraper class is the main interface for building automatic scrapers.

To build a scraper, you provide a schema that describes the data you want to collect.

>>> from scrapeghost import SchemaScraper
>>> scrape_legislators = SchemaScraper(
    schema={
        "name": "string",
        "url": "url",
        "district": "string",
        "party": "string",
        "photo_url": "url",
        "offices": [{"name": "string", "address": "string", "phone": "string"}],
    }
)

There's no pre-defined format for the schema; the GPT models do a good job of figuring out what you want, and you can use whatever values you like to provide hints.
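
Since the format is free-form, you can also embed hints directly in the schema's values. As an illustrative sketch (this scraper and its fields are made up, not taken from the examples directory):

>>> scrape_events = SchemaScraper(
    schema={
        "title": "string",
        "date": "date in YYYY-MM-DD format",
        "location": "string, city and state only",
        "tags": ["string"],
    }
)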

You can then call the scraper with a URL to scrape:

>>> scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071")
{'name': 'Emanuel "Chris" Welch',
 'url': 'https://www.ilga.gov/house/Rep.asp?MemberID=3071',
 'district': '7th', 'party': 'D', 
 'photo_url': 'https://www.ilga.gov/images/members/{5D419B94-66B4-4F3B-86F1-BFF37B3FA55C}.jpg',
  'offices': [
    {'name': 'Springfield Office', 'address': '300 Capitol Building, Springfield, IL 62706', 'phone': '(217) 782-5350'},
    {'name': 'District Office', 'address': '10055 W. Roosevelt Rd., Suite E, Westchester, IL 60154', 'phone': '(708) 450-1000'}
   ]}

That's it.

Command Line Usage

If you've installed the package (e.g. with pipx), you can use the scrapeghost command line tool to experiment.

scrapeghost https://www.ncleg.gov/Members/Biography/S/436  \
  --schema "{'first_name': 'str', 'last_name': 'str',
             'photo_url': 'url', 'offices': [] }"  \
  --gpt4

{'first_name': 'Gale',
 'last_name': 'Adcock',
 'photo_url': 'https://www.ncleg.gov/Members/MemberImage/S/436/Low',
 'offices': [
    {'address': '16 West Jones Street, Rm. 1104',
     'city': 'Raleigh', 'state': 'NC', 'zip': '27601',
     'phone': '(919) 715-3036',
     'email': 'Gale.Adcock@ncleg.gov',
     'legislative_assistant': 'Elizabeth Sharpe',
     'legislative_assistant_email': 'Elizabeth.Sharpe@ncleg.gov'
    }
  ]
}
 Usage: scrapeghost [OPTIONS] URL                                                                                               
                                                                                                                                
╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────╮
│ *    url      TEXT  [default: None] [required]                                                    │
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────╮
│ --xpath                         TEXT     XPath selector to narrow the scrape [default: None]      │
│ --css                           TEXT     CSS selector to narrow the scrape [default: None]        │
│ --schema                        TEXT     Schema to use for scraping [default: None]               │
│ --schema-file                   PATH     Path to schema.json file [default: None]                 │
│ --gpt4             --no-gpt4             Use GPT-4 instead of GPT-3.5-turbo [default: no-gpt4]    │
│ --verbose      -v               INTEGER  Verbosity level 0-2 [default: 0]                         │
│ --help                                   Show this message and exit.                              │
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯

Features

Selectors

The main limitation you'll run into is the token limit. Depending on the model you're using, you're limited to 4096 or 8192 tokens per call. Billing is also based on tokens sent and received.

One strategy to deal with this is to provide a CSS or XPath selector to the scraper. This will pre-filter the HTML that is sent to the server, keeping you under the limit and saving you money.

Pass the css or xpath arguments to the scraper to use a selector:

>>> scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071", xpath="//table[1]")
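
A CSS selector works the same way; the selector below is only illustrative and should be adjusted to the markup of the page you're scraping:

>>> scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071", css="table")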

SchemaScraper Options

  • model - The GPT model to use; defaults to gpt-4, can also be gpt-3.5-turbo.
  • list_mode - If True, the scraper will return a list of objects instead of a single object. (Alters the prompts and some behavior.)
  • split_length - If set, the scraper will split the page into multiple calls, each no longer than this many tokens. (Only works with list_mode and requires passing a css or xpath selector when scraping.)
  • model_params - A dictionary of parameters to pass to the underlying GPT model.
  • extra_instructions - Additional instructions to pass to the GPT model.
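
As a rough sketch, a scraper that overrides a few of these options might be constructed like this (the schema and parameter values are illustrative, not taken from the package's examples):

>>> scrape_bills = SchemaScraper(
    schema={"number": "string", "title": "string", "sponsor": "string"},
    model="gpt-3.5-turbo",            # cheaper model with a smaller context window
    model_params={"temperature": 0},  # passed through to the underlying GPT call
)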

Auto-splitting

It's worth mentioning how split_length works because it allows for some interesting possibilities but can also become quite expensive.

If you pass split_length to the scraper, it assumes the page is made of multiple similar sections and will try to split the page into multiple calls.

When you call the scrape function of an auto-splitting-enabled scraper, you are required to pass a css or xpath selector to the function. The resulting nodes will be combined into chunks no bigger than split_length tokens, sent to the API, and the responses stitched back together.

This seems to work well for long lists of similar items, though whether it is worth the many calls is questionable.
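
As a rough sketch of what that can look like (the URL, schema, selector, and split_length value below are placeholders, not a working example):

>>> scrape_players = SchemaScraper(
    schema={"name": "string", "team": "string"},
    list_mode=True,     # required for auto-splitting; returns a list of objects
    split_length=2048,  # combine selected nodes into chunks of at most 2048 tokens
)
>>> scrape_players("https://example.com/player-index", css="li.player")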

Look at examples/cbb.py for an example of an 800+ item page that is split into many calls.

Changelog

0.2.0 - 2023-03-18

  • Add list mode, auto-splitting, and pagination support.
  • Improve xpath and css handling.
  • Improve prompt for GPT 3.5.
  • Make it possible to alter parameters when calling scrape.
  • Logging & error handling.
  • Command line interface.
  • See blog post for details: https://jamesturk.net/posts/scraping-with-gpt-4-part-2/

0.1.0 - 2023-03-17

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapeghost-0.2.0.tar.gz (12.7 kB)


Built Distribution

scrapeghost-0.2.0-py3-none-any.whl (14.3 kB)


File details

Details for the file scrapeghost-0.2.0.tar.gz.

File metadata

  • Download URL: scrapeghost-0.2.0.tar.gz
  • Upload date:
  • Size: 12.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.0 CPython/3.10.9 Darwin/21.6.0

File hashes

Hashes for scrapeghost-0.2.0.tar.gz

  • SHA256: e475fa5282bf8e1a57cfcf0bee564f7e2d9712ce763d7aba6c278e1977c50399
  • MD5: ec00d6083a85c4cc9d75860540e6172d
  • BLAKE2b-256: 4ab5363f9210c8aaaa5c79498566c4ad4b0feaade9131a075097420625b3f021


File details

Details for the file scrapeghost-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: scrapeghost-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 14.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.0 CPython/3.10.9 Darwin/21.6.0

File hashes

Hashes for scrapeghost-0.2.0-py3-none-any.whl

  • SHA256: f1d4a05650711c7ba01a5b8fe4ba516d33f3b1058db87b47c3a91ed5abd6b04b
  • MD5: d139aa61920ff4e18d1088c7ff6a2ae5
  • BLAKE2b-256: 318ccee414b607ec1dab0d26775d5b4e8d03a577d324c9360f305efa98718db6

