A Python library to extract structured data from web pages using LLMs.

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

KSDeshappriya

These details have not been verified by PyPI

License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

LLMs Web Scraper

LLMs Web Scraper is a Python library that extracts structured data from web pages using a generative AI model. Traditional web scrapers rely on fixed selectors, which break whenever a website changes its markup. This leaves developers and data analysts with a recurring maintenance burden.

Instead of selectors, LLMs Web Scraper gives a language model the page together with natural-language instructions describing the data to extract. Because the model works from the page content rather than from its exact structure, extraction can keep working after layout changes without manual adjustments to the scraper.

Key Features

1. Multi-Model Support

Use various language models for structured data extraction:
- Gemini (Google Generative AI): Powerful for extracting and analyzing large-scale content.
- OpenAI (ChatGPT): Supports models like GPT-4 and GPT-3.5 for natural language understanding.
- Groq: Integrates Groq models for specialized data extraction tasks.
- Ollama: Runs locally, suitable for privacy-focused use cases without relying on external APIs.

2. Structured Data Extraction

Uses advanced language models to extract specific data from web pages based on user instructions.

Customizable Instructions: Users can provide detailed prompts like:

Example 01:

    Extract the relevant data from the following HTML and format it as a JSON object. 
    The data to extract includes the title (h1), the paragraphs (p), and the link (a) 
    with its URL and text.

    Example:
    {
        "title": "title",
        "paragraphs": [
            "This paragraph contains...",
            "second paragraph..."
        ],
        "link": {
            "url": "https://example.com/",
            "text": "More information..."
        }
    }

Example 02:

    Extract the following information:
    1. Titles of all blog posts on the page.
    2. Author names for each blog post.
    3. Publication dates of each blog post.

    Please provide the extracted information in a structured JSON format.
    Expecting property name enclosed in double quotes and values in string format.
    Example:
    {
        "blog_posts": [
            {
                "title": "Blog Post 1",
                "author": "Author 1",
                "publication_date": "2022-01-01"
            },
            {
                "title": "Blog Post 2",
                "author": "Author 2",
                "publication_date": "2022-01-02"
            }
        ]
    }

3. Retry Logic

Built-in retry mechanism ensures resilience:
- Retries model invocations up to 3 times in case of failures.
- Uses exponential backoff to avoid overloading servers.
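
The retry behaviour described above can be pictured with a small sketch. This is not the library's internal code: `retry_with_backoff`, `base_delay`, and the injectable `sleep` parameter are hypothetical names chosen for illustration, and only the limit of 3 attempts comes from the description above.

```python
import time


def retry_with_backoff(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The wait doubles after every failed attempt: 1s, 2s, 4s, ...
            sleep(base_delay * (2 ** attempt))
```

Doubling the delay gives a struggling server progressively more time to recover, which is why backoff is usually preferred over retrying immediately.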

4. JSON Parsing

Automatically extracts and parses structured JSON data from AI model outputs.
Ensures users get clean, machine-readable results.
```
data = scraper.toJSON(url, instructions)
```
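
Model replies often wrap JSON in a Markdown fence or surround it with extra prose, so some extraction step is needed before parsing. The following is a minimal sketch of one way to do that with the standard library. It is an assumption about the general technique, not the code behind `toJSON`, and `extract_json` is a hypothetical helper name.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker
FENCED_JSON = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(\{.*?\})\s*" + re.escape(FENCE),
    re.DOTALL,
)


def extract_json(text):
    """Return the first JSON object found in text, or None if nothing parses."""
    match = FENCED_JSON.search(text)
    if match:
        # Prefer the contents of a fenced block if the model used one.
        candidate = match.group(1)
    else:
        # Otherwise take everything from the first "{" to the last "}".
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            return None
        candidate = text[start:end + 1]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

Returning `None` instead of raising lets the caller decide whether to retry the model call or report the failure.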

5. Save Data to File

Provides functionality to save extracted data to JSON files:

```
data = scraper.toJSON(url, instructions)
scraper.toFile(data, "output/extracted_data.json")
```

Useful for saving results for future use, analysis, or sharing.
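
For readers who want to control the output format themselves, here is a rough standard-library equivalent. It is only a sketch: `save_json` is a hypothetical helper, and whether `toFile` creates missing folders or uses the same indentation is not stated here, so this version makes its own choices.

```python
import json
import os


def save_json(data, path):
    """Write data to path as indented UTF-8 JSON, creating parent folders."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-English text readable in the file.
        json.dump(data, f, indent=2, ensure_ascii=False)
```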

6. Flexible Model Configuration

- Model Name Selection: Choose specific models like "gpt-4o-mini" for OpenAI or "gemini-2.0-flash-exp" for Gemini.
- Custom API Keys: Use different API keys for multiple platforms.
- Temperature Control: Adjust model randomness for predictable or creative outputs.

Examples:

```
import os
from LLMsWebScraper import LLMsWebScraper

scraper = LLMsWebScraper(model_type="gemini", model_name="gemini-2.0-flash-exp", api_key=os.getenv("Gemini_API_KEY"))
# scraper = LLMsWebScraper(model_type="groq", model_name="llama-3.3-70b-versatile", api_key=os.getenv("Groq_API_KEY"))
# scraper = LLMsWebScraper(model_type="openai", model_name="gpt-4o-mini", api_key=os.getenv("OpenAI_API_KEY"))
# scraper = LLMsWebScraper(model_type="ollama", model_name="llama3.2", base_url="http://localhost:11434", api_key="")

# To use another API that is compatible with the OpenAI API, keep model_type="other" exactly as written:
# scraper = LLMsWebScraper(model_type="other", model_name="your-model", base_url="if-have-url", api_key="your-key")
```
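
If you switch between backends often, the settings can be kept in one table instead of in commented-out lines. This is a sketch only: `BACKENDS` and `scraper_kwargs` are hypothetical names, while the model names, environment variable names, and constructor parameters are copied from the examples above. Unlike `os.getenv`, it falls back to an empty string when a key is missing.

```python
import os

# Default settings per backend, copied from the examples above.
BACKENDS = {
    "gemini": {"model_name": "gemini-2.0-flash-exp", "key_env": "Gemini_API_KEY"},
    "groq": {"model_name": "llama-3.3-70b-versatile", "key_env": "Groq_API_KEY"},
    "openai": {"model_name": "gpt-4o-mini", "key_env": "OpenAI_API_KEY"},
    "ollama": {"model_name": "llama3.2", "base_url": "http://localhost:11434"},
}


def scraper_kwargs(model_type, env=os.environ):
    """Build keyword arguments for LLMsWebScraper(...) for one backend."""
    if model_type not in BACKENDS:
        raise ValueError(f"unknown model_type: {model_type!r}")
    settings = dict(BACKENDS[model_type])
    key_env = settings.pop("key_env", None)
    # Ollama runs locally, so the examples above pass an empty API key.
    api_key = env.get(key_env, "") if key_env else ""
    return {"model_type": model_type, "api_key": api_key, **settings}
```

With this in place, a scraper would be created with `LLMsWebScraper(**scraper_kwargs("groq"))`.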

7. Local Model Support (Ollama)

Works with local language models via Ollama, which eliminates dependency on external APIs.
Perfect for secure and private data extraction.

8. Advanced Logging

Detailed logging for every step:
- Successful webpage fetching.
- Errors during HTML processing or model invocation.
- JSON parsing errors.
Useful for debugging and monitoring.
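
To see log messages in your own script, Python's logging usually has to be configured first. The sketch below assumes the library logs through the standard `logging` module, which this page does not state explicitly. The logger name `scraper_demo` is hypothetical.

```python
import logging

# Show INFO-level messages and above, with a timestamp and the logger name.
# force=True (Python 3.8+) replaces any handlers configured earlier.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    force=True,
)

logger = logging.getLogger("scraper_demo")  # hypothetical logger name
logger.info("Logging is configured; scraper messages should now be visible.")
```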

Example Use Cases

Content Scraping:
- Extract the main content of an article, blog, or news page.
- Identify and collect headings, subheadings, and text.
Data Extraction for Research:
- Extract tables, product descriptions, or customer reviews from e-commerce websites.
- Collect structured data for analysis or training machine learning models.
Knowledge Graphs:
- Scrape and structure data from various sources to build knowledge graphs.
Privacy-Friendly Data Processing:
- Use Ollama for private, local processing without sending data to the cloud. (Groq is accessed through an API key, so data sent to it leaves your machine.)

How to Use

Install the library with pip:
```
pip install LLMsWebScraper
```

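Test the installed library: once installation finishes, you can import and use it in your Python projects like any other library.

Create a Python file to test it: create a new Python file, or open a Python REPL, and try the library. The example below reads API keys from a `.env` file with `load_dotenv` from the python-dotenv package; install that package with pip if it is not already available.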

For example:

```
import logging
import os

from dotenv import load_dotenv
from LLMsWebScraper import LLMsWebScraper

# Load environment variables (API keys) from a .env file
load_dotenv()

# Initialize the scraper
scraper = LLMsWebScraper(model_type="gemini", model_name="gemini-2.0-flash-exp", api_key=os.getenv("Gemini_API_KEY"))
# scraper = LLMsWebScraper(model_type="groq", model_name="llama-3.3-70b-versatile", api_key=os.getenv("Groq_API_KEY"))
# scraper = LLMsWebScraper(model_type="openai", model_name="gpt-4o-mini", api_key=os.getenv("OpenAI_API_KEY"))
# scraper = LLMsWebScraper(model_type="ollama", model_name="llama3.2", base_url="http://localhost:11434", api_key="")

# Define instructions
instructions = """
Extract the following information:
1. Titles of all blog posts on the page.
2. Author names for each blog post.
3. Publication dates of each blog post.

Please provide the extracted information in a structured JSON format.
Expecting property name enclosed in double quotes and values in string format.
Example:
{
    "blog_posts": [
        {
            "title": "Blog Post 1",
            "author": "Author 1",
            "publication_date": "2022-01-01"
        },
        {
            "title": "Blog Post 2",
            "author": "Author 2",
            "publication_date": "2022-01-02"
        }
    ]
}
"""

# URL of the webpage to scrape
url = "https://chirpy.cotes.page/"

# Extract data
blog_data = scraper.toJSON(url, instructions)

# Print the data
print(blog_data)

# Save the data as a JSON file, if anything was extracted
if blog_data:
    scraper.toFile(blog_data, "output/data.json")
else:
    logging.warning("No blog data to save.")
```
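
Once the data is extracted, you will usually want to walk through it. The sketch below assumes `toJSON` returns a Python dict shaped like the example in the instructions; the model may return something else, so missing keys are handled defensively. `list_posts` is a hypothetical helper.

```python
def list_posts(blog_data):
    """Return 'title by author (date)' lines from a blog_posts result."""
    if not isinstance(blog_data, dict):
        return []
    lines = []
    for post in blog_data.get("blog_posts", []):
        if not isinstance(post, dict):
            continue  # skip entries that are not objects
        title = post.get("title", "untitled")
        author = post.get("author", "unknown")
        date = post.get("publication_date", "n/a")
        lines.append(f"{title} by {author} ({date})")
    return lines
```

Calling `print("\n".join(list_posts(blog_data)))` then gives a quick, readable check of what was extracted.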

Consider:

Supported Models When Using the Groq API

The following models are available for use with the Groq API key. Please note the intended usage and stability of each model:

| Model Type | Model Name | Notes |
|---|---|---|
| Production | `llama-3.3-70b-versatile` | Stable for production use |
| Preview | `llama-3.3-70b-specdec` | Intended for evaluation, may be discontinued |
| Preview | `llama-3.2-1b-preview` | Intended for evaluation, may be discontinued |
| Preview | `llama-3.2-3b-preview` | Intended for evaluation, may be discontinued |
| Preview | `llama-3.2-11b-vision-preview` | Intended for evaluation, may be discontinued, includes vision capabilities |
| Preview | `llama-3.2-90b-vision-preview` | Intended for evaluation, may be discontinued, includes vision capabilities |

Usage Guidelines

Production Model: Use llama-3.3-70b-versatile for stable and reliable performance in production environments.
Preview Models: The preview models are primarily for evaluation purposes. They may be subject to discontinuation, so use them with caution in critical applications.

Make sure to select the appropriate model based on your project requirements and stability needs.
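
One way to apply these guidelines in code is to warn whenever a preview model is selected. This is a sketch under assumptions: `GROQ_MODELS` and `check_groq_model` are hypothetical names, and the tiers are copied from the table above, which Groq may change over time.

```python
import warnings

# Tiers copied from the table above; treat this list as a snapshot.
GROQ_MODELS = {
    "llama-3.3-70b-versatile": "production",
    "llama-3.3-70b-specdec": "preview",
    "llama-3.2-1b-preview": "preview",
    "llama-3.2-3b-preview": "preview",
    "llama-3.2-11b-vision-preview": "preview",
    "llama-3.2-90b-vision-preview": "preview",
}


def check_groq_model(model_name):
    """Return the model's tier, warning if it is a preview model."""
    tier = GROQ_MODELS.get(model_name, "unknown")
    if tier == "preview":
        warnings.warn(
            f"{model_name} is a preview model and may be discontinued",
            stacklevel=2,
        )
    return tier
```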

License

This pip library is available under the GPLv3 License.

Contact

Author: KSDeshappriya
Email: ksdeshappriya.official@gmail.com

Contribution

If you find any bugs or want to suggest improvements, feel free to open an issue or pull request on the GitHub repository.

Release history

- 1.0.3 (this version): Dec 30, 2024
- 1.0.2: Dec 26, 2024
- 1.0.1: Dec 26, 2024
- 1.0.0: Dec 25, 2024

Download files

Source Distribution

llmswebscraper-1.0.3.tar.gz (11.7 kB)

Uploaded Dec 30, 2024 Source

Built Distribution

LLMsWebScraper-1.0.3-py3-none-any.whl (12.4 kB)

Uploaded Dec 30, 2024 Python 3

File details

Details for the file llmswebscraper-1.0.3.tar.gz.

File metadata

Download URL: llmswebscraper-1.0.3.tar.gz
Upload date: Dec 30, 2024
Size: 11.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for llmswebscraper-1.0.3.tar.gz
Algorithm	Hash digest
SHA256	`6e31e774db2e220185c9127904a96c280bcc710dd04ca7f4038c386a0145b99c`
MD5	`9f51f74ec241f8cfce45d93c890f91b4`
BLAKE2b-256	`9931604778006c0f6b258d60805c6d6904daa76aeebb605ad72e1d6da5a4a8bc`
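
A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. The sketch below uses a hypothetical helper, `sha256_of`, and assumes the archive has already been downloaded into the current directory under the file name shown above.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


EXPECTED = "6e31e774db2e220185c9127904a96c280bcc710dd04ca7f4038c386a0145b99c"
# After downloading the file, compare the two values:
# sha256_of("llmswebscraper-1.0.3.tar.gz") == EXPECTED
```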

Provenance

The following attestation bundles were made for llmswebscraper-1.0.3.tar.gz:

Publisher: python-publish.yml on KSDeshappriya/LLMsWebScraper-pip

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmswebscraper-1.0.3.tar.gz
- Subject digest: 6e31e774db2e220185c9127904a96c280bcc710dd04ca7f4038c386a0145b99c
- Sigstore transparency entry: 158296146
- Sigstore integration time: Dec 30, 2024
Source repository:
- Permalink: KSDeshappriya/LLMsWebScraper-pip@4e616044ff6e102e5ca5ebbfee3e3839793a329d
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/KSDeshappriya
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@4e616044ff6e102e5ca5ebbfee3e3839793a329d
- Trigger Event: release

File details

Details for the file LLMsWebScraper-1.0.3-py3-none-any.whl.

File metadata

Download URL: LLMsWebScraper-1.0.3-py3-none-any.whl
Upload date: Dec 30, 2024
Size: 12.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for LLMsWebScraper-1.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e9850b72ac2bbb6829b65be8995597d14d5df8f0b5560f54d37e666c58a1ae94`
MD5	`54e00fa77dea06b0340d1c59f83a59d2`
BLAKE2b-256	`4cb60775ecba6d4e248b01d99b53318540fcb0755fd7316f077ce2bd88062aa4`

Provenance

The following attestation bundles were made for LLMsWebScraper-1.0.3-py3-none-any.whl:

Publisher: python-publish.yml on KSDeshappriya/LLMsWebScraper-pip

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmswebscraper-1.0.3-py3-none-any.whl
- Subject digest: e9850b72ac2bbb6829b65be8995597d14d5df8f0b5560f54d37e666c58a1ae94
- Sigstore transparency entry: 158296150
- Sigstore integration time: Dec 30, 2024
Source repository:
- Permalink: KSDeshappriya/LLMsWebScraper-pip@4e616044ff6e102e5ca5ebbfee3e3839793a329d
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/KSDeshappriya
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@4e616044ff6e102e5ca5ebbfee3e3839793a329d
- Trigger Event: release
