A Python library to extract structured data from web pages using LLMs.
LLMs Web Scraper
LLMs Web Scraper is a Python library that extracts structured data from web pages using a generative AI model. Traditional web scraping relies on fixed selectors, which break when a website changes its markup. Keeping those selectors current is an ongoing maintenance burden for developers and data analysts.
LLMs Web Scraper instead gives the model the page content together with your instructions and lets it identify and extract the relevant data. Because extraction does not depend on fixed selectors, it can keep working when a page's structure changes, without constant manual adjustments.
Key Features
1. Multi-Model Support
- Use various language models for structured data extraction:
  - Gemini (Google Generative AI): Powerful for extracting and analyzing large-scale content.
  - OpenAI (ChatGPT): Supports models like GPT-4 and GPT-3.5 for natural language understanding.
  - Groq: Integrates Groq models for specialized data extraction tasks.
  - Ollama: Runs locally, suitable for privacy-focused use cases without relying on external APIs.
2. Structured Data Extraction
- Uses advanced language models to extract specific data from web pages based on user instructions.
- Customizable Instructions: Users can provide detailed prompts like:
  Example 01:

    Extract the relevant data from the following HTML and format it as a JSON object.
    The data to extract includes the title (h1), the paragraphs (p), and the link (a) with its URL and text.
    Example:
    {
      "title": "title",
      "paragraphs": [
        "This paragraph contains...",
        "second paragraph..."
      ],
      "link": {
        "url": "https://example.com/",
        "text": "More information..."
      }
    }

  Example 02:

    Extract the following information:
    1. Titles of all blog posts on the page.
    2. Author names for each blog post.
    3. Publication dates of each blog post.
    Please provide the extracted information in a structured JSON format.
    Expecting property name enclosed in double quotes and values in string format.
    Example:
    {
      "blog_posts": [
        { "title": "Blog Post 1", "author": "Author 1", "publication_date": "2022-01-01" },
        { "title": "Blog Post 2", "author": "Author 2", "publication_date": "2022-01-02" }
      ]
    }
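Hand-written JSON examples in a prompt are easy to get subtly wrong (a stray trailing comma, single quotes). One way to avoid that is to generate the example from a Python object. The sketch below is illustrative only: `build_instructions` is a hypothetical helper, not part of LLMsWebScraper, and it simply produces a string you could pass as `instructions`.

```python
import json


def build_instructions(task, example):
    """Combine a task description with a valid JSON example of the desired output."""
    # json.dumps guarantees the example embedded in the prompt is well-formed JSON.
    example_json = json.dumps(example, indent=2)
    return (
        f"{task}\n"
        "Please provide the extracted information in a structured JSON format.\n"
        f"Example:\n{example_json}"
    )


instructions = build_instructions(
    "Extract the title (h1) and the paragraphs (p) from the following HTML.",
    {"title": "title", "paragraphs": ["This paragraph contains...", "second paragraph..."]},
)
print(instructions)
```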
3. Retry Logic
- Built-in retry mechanism ensures resilience:
  - Retries model invocations up to 3 times in case of failures.
  - Uses exponential backoff to avoid overloading servers.
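The library handles retries internally, so you do not need to write this yourself. For readers unfamiliar with the pattern, here is a minimal sketch of "up to 3 attempts with exponential backoff". The function name `invoke_with_retry`, the 1-second base delay, and the doubling factor are assumptions for illustration, not the library's actual implementation.

```python
import time


def invoke_with_retry(call, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call() up to max_attempts times, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

With the defaults, a call that keeps failing is attempted three times, with waits of 1 and 2 seconds between attempts, and the final exception is re-raised.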
4. JSON Parsing
- Automatically extracts and parses structured JSON data from AI model outputs.
- Ensures users get clean, machine-readable results.
data = scraper.toJSON(url, instructions)
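Model replies often wrap the JSON in extra text or Markdown code fences, which is why a parsing step is needed. The sketch below shows one common way to pull the first JSON object out of such a reply using only the standard library. `extract_json` is a hypothetical helper for illustration; how `toJSON` does this internally is not documented here.

```python
import json


def extract_json(text):
    """Return the first JSON object found in text, or None if nothing parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it (closing fences, trailing prose).
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None


reply = 'Sure! Here is the result: {"title": "Hello", "tags": ["a", "b"]} Let me know if you need more.'
print(extract_json(reply))
```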
5. Save Data to File
- Provides functionality to save extracted data to JSON files:
    data = scraper.toJSON(url, instructions)
    scraper.toFile(data, "output/extracted_data.json")
- Useful for saving results for future use, analysis, or sharing.
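If you want more control over the output (for example, indentation or non-ASCII characters), you can also write the returned data yourself. This is a standard-library sketch, assuming the extracted data is JSON-serializable. `save_json` is a hypothetical helper and does not describe how `toFile` is implemented.

```python
import json
from pathlib import Path


def save_json(data, path):
    """Write data to path as pretty-printed UTF-8 JSON, creating parent folders."""
    target = Path(path)
    # Create "output/" (or any other missing parent directory) if needed.
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=2, ensure_ascii=False)
    return target
```

Calling `save_json(data, "output/extracted_data.json")` writes the file and returns its path.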
6. Flexible Model Configuration
- Model Name Selection: Choose specific models like "gpt-4o-mini" for OpenAI or "gemini-2.0-flash-exp" for Gemini.
- Custom API Keys: Use different API keys for multiple platforms.
- Temperature Control: Adjust model randomness for predictable or creative outputs.
Examples:
    import os
    from LLMsWebScraper import LLMsWebScraper

    scraper = LLMsWebScraper(model_type="gemini", model_name="gemini-2.0-flash-exp", api_key=os.getenv("Gemini_API_KEY"))
    # scraper = LLMsWebScraper(model_type="groq", model_name="llama-3.3-70b-versatile", api_key=os.getenv("Groq_API_KEY"))
    # scraper = LLMsWebScraper(model_type="openai", model_name="gpt-4o-mini", api_key=os.getenv("OpenAI_API_KEY"))
    # scraper = LLMsWebScraper(model_type="ollama", model_name="llama3.2", base_url="http://localhost:11434", api_key="")

    # To use an API that is compatible with the OpenAI API, set model_type to "other" (do not change this value):
    # scraper = LLMsWebScraper(model_type="other", model_name="your-model", base_url="if-have-url", api_key="your-key")
7. Local Model Support (Ollama)
- Works with local language models via Ollama, which eliminates dependency on external APIs.
- Perfect for secure and private data extraction.
8. Advanced Logging
- Detailed logging for every step:
  - Successful webpage fetching.
  - Errors during HTML processing or model invocation.
  - JSON parsing errors.
- Useful for debugging and monitoring.
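To see these messages in your own script, you typically need to configure logging once at startup. The sketch below assumes the library emits its messages through Python's standard `logging` module, which the usage example's `import logging` suggests but which is not confirmed here. `configure_logging` is a hypothetical helper.

```python
import logging


def configure_logging(level=logging.INFO):
    """Send log records at `level` and above to the console with timestamps."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )


configure_logging(logging.DEBUG)
logging.info("Logging configured")
```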
Example Use Cases
- Content Scraping:
  - Extract the main content of an article, blog, or news page.
  - Identify and collect headings, subheadings, and text.
- Data Extraction for Research:
  - Extract tables, product descriptions, or customer reviews from e-commerce websites.
  - Collect structured data for analysis or training machine learning models.
- Knowledge Graphs:
  - Scrape and structure data from various sources to build knowledge graphs.
- Privacy-Friendly Data Processing:
  - Use Ollama for private, local processing without sending data to the cloud. (Groq is a hosted API, so page content sent to it leaves your machine.)
How to Use
- Install the library with pip:

    pip install LLMsWebScraper

- Test the installed library: once it is installed, you can import and use it in your Python projects just like any other library. Create a new Python file or open a Python REPL and try it.

For example:
    from LLMsWebScraper import LLMsWebScraper
    import os
    from dotenv import load_dotenv  # provided by the python-dotenv package
    import logging

    # Load environment variables from a .env file
    load_dotenv()

    # Initialize the scraper
    scraper = LLMsWebScraper(model_type="gemini", model_name="gemini-2.0-flash-exp", api_key=os.getenv("Gemini_API_KEY"))
    # scraper = LLMsWebScraper(model_type="groq", model_name="llama-3.3-70b-versatile", api_key=os.getenv("Groq_API_KEY"))
    # scraper = LLMsWebScraper(model_type="openai", model_name="gpt-4o-mini", api_key=os.getenv("OpenAI_API_KEY"))
    # scraper = LLMsWebScraper(model_type="ollama", model_name="llama3.2", base_url="http://localhost:11434", api_key="")

    # Define instructions
    instructions = """
    Extract the following information:
    1. Titles of all blog posts on the page.
    2. Author names for each blog post.
    3. Publication dates of each blog post.
    Please provide the extracted information in a structured JSON format.
    Expecting property name enclosed in double quotes and values in string format.
    Example:
    {
      "blog_posts": [
        { "title": "Blog Post 1", "author": "Author 1", "publication_date": "2022-01-01" },
        { "title": "Blog Post 2", "author": "Author 2", "publication_date": "2022-01-02" }
      ]
    }
    """

    # URL of the webpage to scrape
    url = "https://chirpy.cotes.page/"

    # Extract data
    blog_data = scraper.toJSON(url, instructions)

    # Print the data
    print(blog_data)

    # Save the data as a JSON file if anything was extracted
    if blog_data:
        scraper.toFile(blog_data, "output/data.json")
    else:
        logging.warning("No blog data to save.")
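Once the data has been extracted, you will usually want to work with it rather than just print it. The sketch below assumes `toJSON` returns a Python dict shaped like the example in the instructions, or a falsy value when nothing was extracted, as the `if blog_data:` check above suggests. `summarize_posts` is a hypothetical helper, and the sample dict stands in for real output.

```python
def summarize_posts(blog_data):
    """Return one 'date | author | title' line per post; tolerate missing data."""
    posts = (blog_data or {}).get("blog_posts", [])
    return [
        f'{post.get("publication_date", "?")} | {post.get("author", "?")} | {post.get("title", "?")}'
        for post in posts
    ]


# Stand-in for the value returned by scraper.toJSON(url, instructions).
sample = {
    "blog_posts": [
        {"title": "Blog Post 1", "author": "Author 1", "publication_date": "2022-01-01"},
        {"title": "Blog Post 2", "author": "Author 2", "publication_date": "2022-01-02"},
    ]
}
for line in summarize_posts(sample):
    print(line)
```

Using `.get()` with defaults keeps the loop from failing when the model omits a field for some posts.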
Consider:
Supported Models When Using the Groq API
The following models are available for use with the Groq API key. Please note the intended usage and stability of each model:
| Model Type | Model Name | Notes |
|---|---|---|
| Production | llama-3.3-70b-versatile | Stable for production use |
| Preview | llama-3.3-70b-specdec | Intended for evaluation, may be discontinued |
| Preview | llama-3.2-1b-preview | Intended for evaluation, may be discontinued |
| Preview | llama-3.2-3b-preview | Intended for evaluation, may be discontinued |
| Preview | llama-3.2-11b-vision-preview | Intended for evaluation, may be discontinued, includes vision capabilities |
| Preview | llama-3.2-90b-vision-preview | Intended for evaluation, may be discontinued, includes vision capabilities |
Usage Guidelines
- Production Model: Use llama-3.3-70b-versatile for stable and reliable performance in production environments.
- Preview Models: The preview models are primarily for evaluation purposes. They may be subject to discontinuation, so use them with caution in critical applications.
Make sure to select the appropriate model based on your project requirements and stability needs.
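One way to enforce these guidelines in code is to keep the table as a lookup and fall back to the production model unless preview models are explicitly allowed. The sketch below is illustrative only: `GROQ_MODELS` and `pick_groq_model` are hypothetical names, and the model list mirrors the table above, which may go out of date as Groq retires preview models.

```python
# Stability level of each model, copied from the table above.
GROQ_MODELS = {
    "llama-3.3-70b-versatile": "production",
    "llama-3.3-70b-specdec": "preview",
    "llama-3.2-1b-preview": "preview",
    "llama-3.2-3b-preview": "preview",
    "llama-3.2-11b-vision-preview": "preview",
    "llama-3.2-90b-vision-preview": "preview",
}
DEFAULT_GROQ_MODEL = "llama-3.3-70b-versatile"


def pick_groq_model(requested, allow_preview=False):
    """Return `requested`, or the production default if it is a disallowed preview model."""
    stability = GROQ_MODELS.get(requested)
    if stability is None:
        raise ValueError(f"Unknown Groq model: {requested}")
    if stability == "preview" and not allow_preview:
        return DEFAULT_GROQ_MODEL
    return requested
```

The returned name can then be passed as `model_name` together with `model_type="groq"`.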
License
This pip library is available under the GPLv3 License.
Contact
- Author: KSDeshappriya
- Email: ksdeshappriya.official@gmail.com
Contribution
If you find any bugs or want to suggest improvements, feel free to open an issue or pull request on the GitHub repository.