A Python library to extract structured data from web pages using LLMs.
Project description
LLMs Web Scraper
LLMs Web Scraper is a Python library that uses a generative AI model to extract structured data from web pages.
Features
- Fetch HTML content from web pages.
- Extract structured data using instructions and an LLM.
- Save extracted data to a JSON file.
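The three features map onto a simple fetch, prompt, save pipeline. The sketch below illustrates that flow using only the Python standard library. It is not the library's actual implementation: `fetch_html`, `build_prompt`, `parse_llm_json`, and `save_json` are hypothetical names, and the model call itself is left out.

```python
import json
import urllib.request
from pathlib import Path


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def build_prompt(instructions: str, html: str) -> str:
    """Combine the extraction instructions with the page content."""
    return f"{instructions.strip()}\n\nHTML:\n{html}"


def parse_llm_json(reply: str) -> dict:
    """Parse the outermost JSON object found in a model reply.

    Models often wrap JSON in extra text, so only the span from the
    first "{" to the last "}" is parsed.
    """
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in the reply")
    return json.loads(reply[start:end + 1])


def save_json(data: dict, path: str) -> None:
    """Write data to a JSON file, creating parent directories if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(data, indent=2), encoding="utf-8")
```

In this sketch the reply string would come from whichever model client is in use; the parsing step exists because a model may surround the JSON with explanatory text.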
How to Use
- Install the library: use the pip install command.

    pip install LLMsWebScraper
- Test the installed library: once it is installed, you can import and use it in your Python projects like any other library. Create a new Python file or open a Python REPL, for example:
    import logging
    import os

    from dotenv import load_dotenv
    from LLMsWebScraper import LLMsWebScraper

    # Load environment variables from a .env file
    load_dotenv()

    # Initialize the scraper with the API key
    scraper = LLMsWebScraper(api_key=os.getenv("Gemini_API_KEY"))

    # Define what to extract and the expected JSON shape
    instructions = """
    Extract the following information:
    1. Titles of all blog posts on the page.
    2. Author names for each blog post.
    3. Publication dates of each blog post.

    Please provide the extracted information in a structured JSON format.
    Expecting property name enclosed in double quotes and values in string format.
    Example:
    {
      "blog_posts": [
        {
          "title": "Blog Post 1",
          "author": "Author 1",
          "publication_date": "2022-01-01"
        },
        {
          "title": "Blog Post 2",
          "author": "Author 2",
          "publication_date": "2022-01-02"
        }
      ]
    }
    """

    # URL of the web page to scrape
    url = "https://chirpy.cotes.page/"

    # Extract the data
    blog_data = scraper.toJSON(url, instructions)

    # Print the data
    print(blog_data)

    # Save the data to a JSON file if anything was extracted
    if blog_data:
        scraper.toFile(blog_data, "output/data.json")
    else:
        logging.warning("No blog data to save.")
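The example passes `os.getenv("Gemini_API_KEY")` directly to the constructor, and `os.getenv` returns `None` when the variable is unset. A small guard can fail earlier with a clearer message. `require_env` below is a hypothetical helper, not part of LLMsWebScraper, and how the library itself reacts to a missing key is not documented here.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value
```

With this helper, the scraper could be created as `LLMsWebScraper(api_key=require_env("Gemini_API_KEY"))` after `load_dotenv()` has run.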
Download files
File details
Details for the file llmswebscraper-1.0.0.tar.gz.
File metadata
- Download URL: llmswebscraper-1.0.0.tar.gz
- Upload date:
- Size: 3.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a62abe7328f16075d8333b1e9b3625ba831ba1f9c8eab9dae742eb6345ef0f09 |
| MD5 | 8e4789ac0d30ea9ed3f05b510cd45a2e |
| BLAKE2b-256 | e538539853aab28c3bdd42280964e0ec60823955b1dd5134704be6651c25430e |
File details
Details for the file LLMsWebScraper-1.0.0-py3-none-any.whl.
File metadata
- Download URL: LLMsWebScraper-1.0.0-py3-none-any.whl
- Upload date:
- Size: 4.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 349e573837af07416430e965245cdc72578259e36e9cae5445ae51350d958cd4 |
| MD5 | 7d82ac6ea27c20a2500fc16f1fb8587e |
| BLAKE2b-256 | d32ea545446bec18867ceb568d7aea56cab4b88736c7b8945b1cd4bcbb6cf668 |
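The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using the standard `hashlib` module; `sha256_of_file` and `matches_published_hash` are hypothetical helper names, and the local file path is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published value, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("llmswebscraper-1.0.0.tar.gz", "a62abe7328f16075d8333b1e9b3625ba831ba1f9c8eab9dae742eb6345ef0f09")` should return `True` for an unmodified copy of the source distribution.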