A Python library to extract structured data from web pages using LLMs.

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

KSDeshappriya

These details have not been verified by PyPI

License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

LLMs Web Scraper

LLMs Web Scraper is a Python library that uses a generative AI model to extract structured data from web pages.

Features

Fetch HTML content from web pages.
Extract structured data using instructions and an LLM.
Save extracted data to a JSON file.
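
The three features above form a fetch, extract, save pipeline. The following is a minimal sketch of that flow using only the standard library; it is not the library's own implementation. The names `fetch_html`, `extract`, `save_json`, and `fake_llm` are hypothetical, and `fake_llm` is a stand-in that returns a fixed reply instead of calling a real model.

```python
import json
import tempfile
import urllib.request
from pathlib import Path
from typing import Callable


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract(html: str, instructions: str, ask_llm: Callable[[str], str]) -> dict:
    """Build a prompt, send it to the model, and parse the JSON reply."""
    prompt = f"{instructions}\n\nHTML:\n{html}"
    return json.loads(ask_llm(prompt))


def save_json(data: dict, path: str) -> None:
    """Write the data to a JSON file, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(data, indent=2), encoding="utf-8")


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns a fixed reply."""
    return '{"blog_posts": [{"title": "Blog Post 1"}]}'


# Use an inline HTML string so the sketch runs without network access;
# fetch_html(url) would supply this string for a real page.
html = "<html><body><h1>Blog Post 1</h1></body></html>"
data = extract(html, "Extract the titles of all blog posts.", fake_llm)
out_path = Path(tempfile.mkdtemp()) / "output" / "data.json"
save_json(data, str(out_path))
print(data)
```

Passing the model call in as a function keeps the sketch independent of any particular provider.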

How to Use

Install the library: run the following pip command.
```
pip install LLMsWebScraper
```
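
To confirm that the install succeeded, the installed version can be read from the package metadata. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the distribution name is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version("LLMsWebScraper")
print(found if found else "LLMsWebScraper is not installed in this environment")
```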

Test the installed library: once installed, you can import and use it in your Python projects like any other library.

Create a Python file to test it: create a new Python file or open a Python REPL to try the library.

For example:

```
import logging
import os

from dotenv import load_dotenv  # provided by the python-dotenv package
from LLMsWebScraper import LLMsWebScraper

# Load environment variables (API keys) from a local .env file
load_dotenv()

# Initialize the scraper; uncomment one of the alternatives to use another provider
scraper = LLMsWebScraper(model_type="gemini", model_name="gemini-2.0-flash-exp", api_key=os.getenv("Gemini_API_KEY"))
# scraper = LLMsWebScraper(model_type="groq", model_name="llama3-8b-8192", api_key=os.getenv("Groq_API_KEY"))
# scraper = LLMsWebScraper(model_type="openai", model_name="gpt-4o-mini", api_key=os.getenv("OpenAI_API_KEY"))
# scraper = LLMsWebScraper(model_type="ollama", model_name="llama3.2", base_url="http://localhost:11434", api_key="")

# Define the extraction instructions sent to the model
instructions = """
Extract the following information:
1. Titles of all blog posts on the page.
2. Author names for each blog post.
3. Publication dates of each blog post.

Please provide the extracted information in a structured JSON format.
Property names must be enclosed in double quotes and all values must be strings.
Example:
{
    "blog_posts": [
        {
            "title": "Blog Post 1",
            "author": "Author 1",
            "publication_date": "2022-01-01"
        },
        {
            "title": "Blog Post 2",
            "author": "Author 2",
            "publication_date": "2022-01-02"
        }
    ]
}
"""

# URL of the web page to scrape
url = "https://chirpy.cotes.page/"

# Extract the data
blog_data = scraper.toJSON(url, instructions)

# Print the data
print(blog_data)

# Save the data to a JSON file if anything was extracted
if blog_data:
    scraper.toFile(blog_data, "output/data.json")
else:
    logging.warning("No blog data to save.")
```
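
Because the model's reply may not always follow the requested shape, it can help to check the result before saving it. The sketch below assumes the data follows the `blog_posts` example from the instructions above and accepts either a dict or a JSON string, since the return type of `toJSON` is not stated here. The helper `valid_blog_posts` is hypothetical and not part of the library.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = ("title", "author", "publication_date")


def valid_blog_posts(result: Any) -> List[Dict[str, str]]:
    """Return only the entries that carry all three fields as strings."""
    if isinstance(result, str):
        try:
            result = json.loads(result)
        except json.JSONDecodeError:
            return []
    if not isinstance(result, dict):
        return []
    posts = result.get("blog_posts", [])
    if not isinstance(posts, list):
        return []
    return [
        post for post in posts
        if isinstance(post, dict)
        and all(isinstance(post.get(key), str) for key in REQUIRED_KEYS)
    ]


# Sample data: the second entry has a missing author and is filtered out
sample = {
    "blog_posts": [
        {"title": "Blog Post 1", "author": "Author 1", "publication_date": "2022-01-01"},
        {"title": "Blog Post 2", "author": None, "publication_date": "2022-01-02"},
    ]
}
print(valid_blog_posts(sample))
```

In the example above, `valid_blog_posts(blog_data)` could be called before `scraper.toFile` to skip incomplete entries.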

Release history

- 1.0.3: Dec 30, 2024
- 1.0.2: Dec 26, 2024 (this version)
- 1.0.1: Dec 26, 2024
- 1.0.0: Dec 25, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llmswebscraper-1.0.2.tar.gz (8.0 kB)

Uploaded Dec 26, 2024 Source

Built Distribution

LLMsWebScraper-1.0.2-py3-none-any.whl (8.9 kB)

Uploaded Dec 26, 2024 Python 3

File details

Details for the file llmswebscraper-1.0.2.tar.gz.

File metadata

Download URL: llmswebscraper-1.0.2.tar.gz
Upload date: Dec 26, 2024
Size: 8.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for llmswebscraper-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`2f396ddaab50484b05708abb87b48700cf48084e54e661ca09d0a69a911af480`
MD5	`86efdc78db29167f5ee3c9b94d2aa0d6`
BLAKE2b-256	`12eb153b5f992b51da828114605f9caa391766f9c8b99e188175c559a1666960`
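
A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using the standard library's `hashlib`; the helper names `sha256_of` and `matches_expected` are hypothetical, and the local path is an assumption about where the archive was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for llmswebscraper-1.0.2.tar.gz
EXPECTED_SHA256 = "2f396ddaab50484b05708abb87b48700cf48084e54e661ca09d0a69a911af480"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file's digest with the expected hex digest."""
    return sha256_of(path) == expected.lower()


archive = Path("llmswebscraper-1.0.2.tar.gz")  # assumed download location
if archive.exists():
    print("SHA256 OK" if matches_expected(str(archive)) else "SHA256 mismatch")
else:
    print(f"{archive} not found; download it first")
```

The same check applies to the wheel file by passing its path and its own SHA256 digest as `expected`.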

Provenance

The following attestation bundles were made for llmswebscraper-1.0.2.tar.gz:

Publisher: python-publish.yml on KSDeshappriya/LLMsWebScraper-pip

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmswebscraper-1.0.2.tar.gz
- Subject digest: 2f396ddaab50484b05708abb87b48700cf48084e54e661ca09d0a69a911af480
- Sigstore transparency entry: 157727481
- Sigstore integration time: Dec 26, 2024
Source repository:
- Permalink: KSDeshappriya/LLMsWebScraper-pip@76cbcb9f25b2ba002e29eef7c87c7bee41dfe7c2
- Branch / Tag: refs/tags/v1.0.2
- Owner: https://github.com/KSDeshappriya
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@76cbcb9f25b2ba002e29eef7c87c7bee41dfe7c2
- Trigger Event: release

File details

Details for the file LLMsWebScraper-1.0.2-py3-none-any.whl.

File metadata

Download URL: LLMsWebScraper-1.0.2-py3-none-any.whl
Upload date: Dec 26, 2024
Size: 8.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for LLMsWebScraper-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`96ee83c98c8511dd8d458fe95d367971d415ed823dc7ac55f6a70fec77e6c4dd`
MD5	`a399690590a49c237428b0b7e7942781`
BLAKE2b-256	`57da6d6d337ddefcf8852d8421a2c5bfedb3740f090925e9f2eb784cc6ea542d`

Provenance

The following attestation bundles were made for LLMsWebScraper-1.0.2-py3-none-any.whl:

Publisher: python-publish.yml on KSDeshappriya/LLMsWebScraper-pip

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmswebscraper-1.0.2-py3-none-any.whl
- Subject digest: 96ee83c98c8511dd8d458fe95d367971d415ed823dc7ac55f6a70fec77e6c4dd
- Sigstore transparency entry: 157727482
- Sigstore integration time: Dec 26, 2024
Source repository:
- Permalink: KSDeshappriya/LLMsWebScraper-pip@76cbcb9f25b2ba002e29eef7c87c7bee41dfe7c2
- Branch / Tag: refs/tags/v1.0.2
- Owner: https://github.com/KSDeshappriya
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@76cbcb9f25b2ba002e29eef7c87c7bee41dfe7c2
- Trigger Event: release

LLMsWebScraper 1.0.2
