A Python library to extract structured data from web pages using LLMs.

Project description

LLMs Web Scraper

LLMs Web Scraper is a Python library that uses a generative AI model to extract structured data from web pages.

Features

  • Fetch HTML content from web pages.
  • Extract structured data using instructions and an LLM.
  • Save extracted data to a JSON file.

How to Use

  1. Install the library: use the pip install command.

    pip install LLMsWebScraper
    
  2. Test the installed library: once it is installed, you can import and use it in your Python projects like any other library.

    Create a new Python file, or open a Python REPL, to try the library.

    For example:

    from LLMsWebScraper import LLMsWebScraper  
    import os
    from dotenv import load_dotenv
    import logging
    
    # Load environment variables
    load_dotenv()
    
    # Initialize the scraper
    scraper = LLMsWebScraper(model_type="gemini", model_name="gemini-2.0-flash-exp", api_key=os.getenv("Gemini_API_KEY"))
    # scraper = LLMsWebScraper(model_type="groq", model_name="llama3-8b-8192", api_key=os.getenv("Groq_API_KEY"))
    # scraper = LLMsWebScraper(model_type="openai", model_name="gpt-4o-mini", api_key=os.getenv("OpenAI_API_KEY"))
    # scraper = LLMsWebScraper(model_type="ollama", model_name="llama3.2", base_url="http://localhost:11434", api_key="")
    
    
    # Define instructions
    instructions = """
    Extract the following information:
    1. Titles of all blog posts on the page.
    2. Author names for each blog post.
    3. Publication dates of each blog post.
    
    Please provide the extracted information in a structured JSON format.
    Use double-quoted property names and give every value as a string.
    Example:
    {
    "blog_posts": [
            {
                "title": "Blog Post 1",
                "author": "Author 1",
                "publication_date": "2022-01-01"
            },
            {
                "title": "Blog Post 2",
                "author": "Author 2",
                "publication_date": "2022-01-02"
            }
        ]
    }
    """
    
    # URL of the webpage to scrape
    url = "https://chirpy.cotes.page/"
    
    # Extract data
    blog_data = scraper.toJSON(url, instructions)
    
    # Print the data
    print(blog_data)
    
    # Save the data to a JSON file if needed
    if blog_data:
        scraper.toFile(blog_data, "output/data.json")
    else:
        logging.warning("No blog data to save.")
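
Because the data comes from an LLM, it can help to check its shape before saving or using it. The sketch below is not part of LLMsWebScraper: `normalize_posts` and `REQUIRED_KEYS` are hypothetical names, and since the return type of `toJSON` is not stated here, the helper assumes the result may be either a parsed dict or a JSON string in the format requested by the instructions above.

```python
import json

# Keys requested in the example instructions above.
REQUIRED_KEYS = ("title", "author", "publication_date")


def normalize_posts(result):
    """Return the well-formed entries of result["blog_posts"] as a list of dicts.

    Assumption: `result` is a dict or a JSON string; anything else yields [].
    """
    if isinstance(result, str):
        try:
            result = json.loads(result)
        except json.JSONDecodeError:
            return []
    if not isinstance(result, dict):
        return []
    posts = result.get("blog_posts", [])
    if not isinstance(posts, list):
        return []
    # Keep only entries that are dicts with a string value for every required key.
    return [
        post
        for post in posts
        if isinstance(post, dict)
        and all(isinstance(post.get(key), str) for key in REQUIRED_KEYS)
    ]
```

With the example above, `normalize_posts(blog_data)` would return only the posts that carry all three fields, and an empty list if the model's answer could not be parsed.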
    

Download files

Source Distribution

llmswebscraper-1.0.2.tar.gz (8.0 kB)

Built Distribution

LLMsWebScraper-1.0.2-py3-none-any.whl (8.9 kB)

File details

Details for the file llmswebscraper-1.0.2.tar.gz.

File metadata

  • Download URL: llmswebscraper-1.0.2.tar.gz
  • Size: 8.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for llmswebscraper-1.0.2.tar.gz
Algorithm    Hash digest
SHA256       2f396ddaab50484b05708abb87b48700cf48084e54e661ca09d0a69a911af480
MD5          86efdc78db29167f5ee3c9b94d2aa0d6
BLAKE2b-256  12eb153b5f992b51da828114605f9caa391766f9c8b99e188175c559a1666960
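
A downloaded file can be checked against the SHA256 digest listed above using only the Python standard library. This is a minimal sketch: the helper names `sha256_of` and `matches_expected` are illustrative, and the local path is an assumption about where the archive was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for llmswebscraper-1.0.2.tar.gz.
EXPECTED_SHA256 = "2f396ddaab50484b05708abb87b48700cf48084e54e661ca09d0a69a911af480"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals `expected`."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location of the downloaded archive; adjust as needed.
    archive = Path("llmswebscraper-1.0.2.tar.gz")
    if archive.exists():
        print("OK" if matches_expected(archive) else "MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use constant regardless of file size, which matters little for an 8.0 kB archive but makes the helper reusable for larger files.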

Provenance

The following attestation bundles were made for llmswebscraper-1.0.2.tar.gz:

Publisher: python-publish.yml on KSDeshappriya/LLMsWebScraper-pip

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file LLMsWebScraper-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: LLMsWebScraper-1.0.2-py3-none-any.whl
  • Size: 8.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for LLMsWebScraper-1.0.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       96ee83c98c8511dd8d458fe95d367971d415ed823dc7ac55f6a70fec77e6c4dd
MD5          a399690590a49c237428b0b7e7942781
BLAKE2b-256  57da6d6d337ddefcf8852d8421a2c5bfedb3740f090925e9f2eb784cc6ea542d

Provenance

The following attestation bundles were made for LLMsWebScraper-1.0.2-py3-none-any.whl:

Publisher: python-publish.yml on KSDeshappriya/LLMsWebScraper-pip
