A Python library to extract structured data from web pages using LLMs.

Project description

LLMs Web Scraper

LLMs Web Scraper is a Python library that uses a generative AI model to extract structured data from web pages.

Features

  • Fetch HTML content from web pages.
  • Extract structured data using instructions and an LLM.
  • Save extracted data to a JSON file.
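
The three features above can be pictured as a small fetch, extract, save pipeline. The following is only a minimal sketch using the standard library, not the library's actual implementation: the function names (`fetch_html`, `html_to_text`, `extract`, `save_json`) and the `ask_model` callable that stands in for the LLM are all hypothetical and exist purely for illustration.

```python
import json
import urllib.request
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it (hypothetical helper, not the library's API)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Reduce HTML to plain text so the prompt sent to the model stays small."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def extract(html: str, instructions: str, ask_model: Callable[[str], str]) -> dict:
    """Combine instructions and page text, then parse the model's JSON reply."""
    prompt = f"{instructions.strip()}\n\nPage content:\n{html_to_text(html)}"
    return json.loads(ask_model(prompt))


def save_json(data: dict, path: str) -> None:
    """Write the data as pretty-printed JSON, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client, and makes it easy to try the flow with a stub before spending API calls.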

How to Use

  1. Install the library with pip:

    pip install LLMsWebScraper
    
  2. Test the installed library. Once installed, it can be imported and used in your Python projects like any other library.

    Create a new Python file or open a Python REPL to try the library.

    For example:

    import logging
    import os

    from dotenv import load_dotenv
    from LLMsWebScraper import LLMsWebScraper
    
    # Load environment variables
    load_dotenv()
    
    # Initialize the scraper
    scraper = LLMsWebScraper(api_key=os.getenv("Gemini_API_KEY"))
    
    # Define instructions
    instructions = """
    Extract the following information:
    1. Titles of all blog posts on the page.
    2. Author names for each blog post.
    3. Publication dates of each blog post.
    
    Please provide the extracted information in a structured JSON format.
    Expecting property name enclosed in double quotes and values in string format.
    Example:
    {
    "blog_posts": [
            {
                "title": "Blog Post 1",
                "author": "Author 1",
                "publication_date": "2022-01-01"
            },
            {
                "title": "Blog Post 2",
                "author": "Author 2",
                "publication_date": "2022-01-02"
            }
        ]
    }
    """
    
    # URL of the webpage to scrape
    url = "https://chirpy.cotes.page/"
    
    # Extract data
    blog_data = scraper.toJSON(url, instructions)
    
    # Print the data
    print(blog_data)
    
    # Optionally save the data to a JSON file
    if blog_data:
        scraper.toFile(blog_data, "output/data.json")
    else:
        logging.warning("No blog data to save.")
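
Because the JSON is produced by a language model, it can be worth checking its shape before saving or using it. The sketch below assumes that `toJSON` returns a Python dict laid out like the example in the instructions (a `blog_posts` list whose items carry `title`, `author`, and `publication_date` strings); the return type is not documented here, so treat that as an assumption. The helper name `check_blog_posts` is hypothetical.

```python
from datetime import date

# Field names taken from the example JSON in the instructions above.
REQUIRED_FIELDS = ("title", "author", "publication_date")


def check_blog_posts(data) -> list[str]:
    """Return a list of problems found in `data`; an empty list means none were found."""
    posts = data.get("blog_posts") if isinstance(data, dict) else None
    if not isinstance(posts, list):
        return ["top-level 'blog_posts' list is missing"]

    problems = []
    for index, post in enumerate(posts):
        if not isinstance(post, dict):
            problems.append(f"post {index}: not a JSON object")
            continue
        for field in REQUIRED_FIELDS:
            value = post.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"post {index}: '{field}' is missing or not a string")
        published = post.get("publication_date")
        if isinstance(published, str) and published.strip():
            try:
                # The example uses ISO dates such as "2022-01-01".
                date.fromisoformat(published.strip())
            except ValueError:
                problems.append(f"post {index}: 'publication_date' is not YYYY-MM-DD")
    return problems
```

A caller could run `check_blog_posts(blog_data)` after extraction and log each returned message, saving the file only when the list is empty.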
    

Download files

Download the file for your platform.

Source Distribution

llmswebscraper-1.0.0.tar.gz (3.9 kB)

Uploaded Source

Built Distribution

LLMsWebScraper-1.0.0-py3-none-any.whl (4.7 kB)

Uploaded Python 3

File details

Details for the file llmswebscraper-1.0.0.tar.gz.

File metadata

  • Download URL: llmswebscraper-1.0.0.tar.gz
  • Upload date:
  • Size: 3.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.7

File hashes

Hashes for llmswebscraper-1.0.0.tar.gz:

  • SHA256: a62abe7328f16075d8333b1e9b3625ba831ba1f9c8eab9dae742eb6345ef0f09
  • MD5: 8e4789ac0d30ea9ed3f05b510cd45a2e
  • BLAKE2b-256: e538539853aab28c3bdd42280964e0ec60823955b1dd5134704be6651c25430e
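
To confirm that a downloaded file matches a published digest, its hash can be recomputed locally. This is a minimal sketch using Python's standard `hashlib`; the helper names `sha256_of` and `matches_published_hash` are illustrative only, and the file path in the usage note assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published one, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, this could be called as `matches_published_hash("llmswebscraper-1.0.0.tar.gz", "a62abe7328f16075d8333b1e9b3625ba831ba1f9c8eab9dae742eb6345ef0f09")`, which should return `True` for an intact download.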

File details

Details for the file LLMsWebScraper-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: LLMsWebScraper-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 4.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.7

File hashes

Hashes for LLMsWebScraper-1.0.0-py3-none-any.whl:

  • SHA256: 349e573837af07416430e965245cdc72578259e36e9cae5445ae51350d958cd4
  • MD5: 7d82ac6ea27c20a2500fc16f1fb8587e
  • BLAKE2b-256: d32ea545446bec18867ceb568d7aea56cab4b88736c7b8945b1cd4bcbb6cf668
