A Python library to extract structured data from web pages using LLMs.

Project description

LLMs Web Scraper

LLMs Web Scraper is a Python library that uses a generative AI model to extract structured data from web pages.

Features

  • Fetch HTML content from web pages.
  • Extract structured data using instructions and an LLM.
  • Save extracted data to a JSON file.
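
The three features above amount to a fetch, extract, save pipeline. The following is a minimal sketch of that idea using only the standard library; it is not the library's actual implementation. The function names and the `call_llm` parameter are hypothetical: `call_llm` stands in for whatever client sends a prompt to a model and returns its text reply, which is assumed here to be valid JSON.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its body decoded as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract(html: str, instructions: str, call_llm: Callable[[str], str]) -> dict:
    """Send instructions plus HTML to the model and parse its JSON reply.

    `call_llm` is a hypothetical callable (prompt -> reply text).
    Raises json.JSONDecodeError if the reply is not valid JSON.
    """
    prompt = f"{instructions}\n\nHTML:\n{html}"
    return json.loads(call_llm(prompt))


def save_json(data: dict, path: str) -> None:
    """Write data to a JSON file, creating parent directories if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(data, indent=2), encoding="utf-8")
```

Passing the model call in as a parameter keeps the sketch independent of any particular LLM provider and makes it easy to try out with a stub function.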

How to Use

  1. Install the Library: Use pip to install the package.

    pip install LLMsWebScraper
    
  2. Test the Installed Library: Once the library is installed, you can import and use it in your Python projects like any other library.

    Create a new Python file or open a Python REPL to try it out. The example below also needs the python-dotenv package, which it uses to load the API key from a .env file.

    For example:

    import logging
    import os
    
    from dotenv import load_dotenv
    from LLMsWebScraper import LLMsWebScraper
    
    # Load environment variables
    load_dotenv()
    
    # Initialize the scraper with the API key read from the environment
    scraper = LLMsWebScraper(api_key=os.getenv("Gemini_API_KEY"))
    
    # Define instructions
    instructions = """
    Extract the following information:
    1. Titles of all blog posts on the page.
    2. Author names for each blog post.
    3. Publication dates of each blog post.
    
    Please provide the extracted information in a structured JSON format.
    Enclose every property name in double quotes and give every value as a string.
    Example:
    {
    "blog_posts": [
            {
                "title": "Blog Post 1",
                "author": "Author 1",
                "publication_date": "2022-01-01"
            },
            {
                "title": "Blog Post 2",
                "author": "Author 2",
                "publication_date": "2022-01-02"
            }
        ]
    }
    """
    
    # URL of the webpage to scrape
    url = "https://chirpy.cotes.page/"
    
    # Extract data
    blog_data = scraper.toJSON(url, instructions)
    
    # Print the data
    print(blog_data)
    
    # Save the data to a JSON file if anything was extracted
    if blog_data:
        scraper.toFile(blog_data, "output/data.json")
    else:
        logging.warning("No blog data to save.")
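
Two things commonly go wrong in a script like the one above: the API key is missing from the environment, or the model returns data that does not match the requested shape. The helpers below are a hedged sketch of defensive checks for both cases. They are not part of LLMsWebScraper; `require_env` and `valid_posts` are hypothetical names, and the sketch assumes `toJSON` returns a dict shaped like the example in the instructions (or something falsy on failure), which this page does not confirm.

```python
import os
from typing import Any

# Field names taken from the example JSON in the instructions.
REQUIRED_FIELDS = ("title", "author", "publication_date")


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


def valid_posts(data: Any) -> list[dict]:
    """Return only the posts whose required fields are all present as strings.

    Anything that does not look like {"blog_posts": [...]} yields an empty list.
    """
    if not isinstance(data, dict):
        return []
    posts = data.get("blog_posts")
    if not isinstance(posts, list):
        return []
    return [
        post
        for post in posts
        if isinstance(post, dict)
        and all(isinstance(post.get(field), str) for field in REQUIRED_FIELDS)
    ]
```

In the example script, `require_env("Gemini_API_KEY")` could replace the bare `os.getenv` call, and `valid_posts(blog_data)` could be applied before saving, so that a missing key or a malformed reply is reported early rather than written to disk.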
    

Download files

Source Distribution

llmswebscraper-1.0.1.tar.gz (4.4 kB, Source)

Built Distribution

LLMsWebScraper-1.0.1-py3-none-any.whl (5.2 kB, Python 3)

File details

Details for the file llmswebscraper-1.0.1.tar.gz.

File metadata

  • Download URL: llmswebscraper-1.0.1.tar.gz
  • Upload date:
  • Size: 4.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for llmswebscraper-1.0.1.tar.gz
Algorithm    Hash digest
SHA256       066d412b3d74cf508175aefd610a314b6f6d4eb1e3ce8d92531612b4b280556d
MD5          d3ee5f738b3e718cc9244ed41dfe7d8c
BLAKE2b-256  554cbd74d5d6dc474e87f834d320cdbb496de805c1cb4b2f520dec4d3140af25
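
A published digest is only useful if it is compared against the file actually downloaded. The sketch below shows one way to do that with Python's standard `hashlib` module; the function names are illustrative, and the file path passed in is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, the call would be `matches_published_hash("llmswebscraper-1.0.1.tar.gz", "066d412b3d74cf508175aefd610a314b6f6d4eb1e3ce8d92531612b4b280556d")`, assuming the archive sits in the current directory. The same check works for the wheel with its own SHA256 value.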

Provenance

The following attestation bundles were made for llmswebscraper-1.0.1.tar.gz:

Publisher: python-publish.yml on KSDeshappriya/LLMsWebScraper-pip

File details

Details for the file LLMsWebScraper-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: LLMsWebScraper-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 5.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for LLMsWebScraper-1.0.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       7a10f2db172c98bb51a257d3950220ba9c84711bb5660e2b987c6d8b34bef3fb
MD5          c54c9d14132865dec18563a1bd911623
BLAKE2b-256  94655c98bb9755fccfc39c7c77c12661acf5dcb0c84422edb5bad7592de82193

Provenance

The following attestation bundles were made for LLMsWebScraper-1.0.1-py3-none-any.whl:

Publisher: python-publish.yml on KSDeshappriya/LLMsWebScraper-pip
