AI HTML Parser
AI-Powered HTML Parser
Installation
Prerequisites
- Python 3.8 or higher
- Required Libraries:
  - requests
  - bs4 (BeautifulSoup)
Steps
- Clone the repository:
    git clone https://github.com/pythonshik/ai-html-parser.git
    cd ai-html-parser
- Install dependencies:
    pip install -r requirements.txt
- Set up your API key for Google Gemini:
  - Create a folder named AI in the root directory.
  - Add your API key to a file named gemini_api_key inside the AI folder.
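To check that the key file is in place before running the parser, a small helper along these lines may be useful. It is only a sketch: load_api_key is a hypothetical name, not part of the package, and the assumption that the file holds the bare key as plain text is mine.

```python
from pathlib import Path


def load_api_key(path: str = "AI/gemini_api_key") -> str:
    """Read the Gemini API key from the file created in the setup steps.

    Hypothetical helper: assumes the file contains only the key as plain text.
    """
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(f"API key file not found: {key_file}")
    # Strip the trailing newline that most editors add.
    key = key_file.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"API key file is empty: {key_file}")
    return key
```

Calling load_api_key() from the repository root should return the key, or raise a clear error if the AI folder or the file is missing.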
Usage
Example
- Import the AIparser class:
    from AIparse import AIparser
- Initialize the parser with a URL:
    element = AIparser("https://www.youtube.com/@PythonShik")
- Parse specific elements:
    for i in ["number of videos", "number of subscribers"]:
        parsed_data = element.parse(i)
        print(f"{parsed_data['explain']}: {parsed_data['value']}")
- Output example:
    {
        "value": "96",
        "explain": "Number of subscribers",
        "result": "96 subscribers"
    }
Overview
This project is an AI-powered HTML parser designed to extract specific data from web pages using Google Gemini's text generation API. The parser processes the HTML source code of a webpage, identifies specific elements, and returns the desired information in a structured JSON format.
Key Features
- AI Integration: Utilizes Google Gemini for intelligent text analysis.
- HTML Parsing: Extracts and processes HTML elements using BeautifulSoup.
- Customizable Instructions: Supports user-defined parsing instructions.
- JSON Output: Provides clear and structured results in JSON format.
How It Works
- User Input: Provide a URL and the target element to parse.
- HTML Fetching: The tool fetches the HTML source code of the webpage.
- AI Analysis: The HTML source and target element are sent to the AI for processing.
- JSON Output: The AI generates a structured response containing the extracted information.
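The prompt-building and JSON-output steps above can be sketched roughly as follows. This is not the package's actual code: build_prompt, extract_json, and run_parse are hypothetical names, the prompt wording is invented for illustration, and the model call is passed in as a plain callable so the sketch runs without an API key.

```python
import json
import re


def build_prompt(html: str, target: str) -> str:
    """Combine the page source and the requested element into one prompt.

    The wording here is an assumption, not the project's real instruction text.
    """
    return (
        "Find the following element in the HTML and answer with a JSON object "
        'containing the keys "value", "explain" and "result".\n'
        f"Element: {target}\n"
        f"HTML:\n{html}"
    )


def extract_json(reply: str) -> dict:
    """Parse the JSON object embedded in a model reply.

    Takes the span from the first '{' to the last '}', so surrounding
    chatter or code fences in the reply are ignored.
    """
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))


def run_parse(html: str, target: str, generate) -> dict:
    """Run prompt -> model -> JSON; `generate` is any callable str -> str."""
    return extract_json(generate(build_prompt(html, target)))
```

In real use, html would come from something like requests.get(url).text and generate would wrap the Gemini call; both are left out here so the flow can be tried offline with a stub function.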
File Descriptions
1. BASE.py
The core class for interacting with Google Gemini's text generation API.
- Features:
  - API key management.
  - Methods for adding and managing conversation history.
  - Text generation using the generate() method.
- Key Methods:
  - history_add(role, content): Adds messages to the conversation history.
  - generate(): Sends data to the Gemini API and retrieves the generated text.
  - export_history(filename): Saves conversation history to a file.
  - import_history(filename): Loads conversation history from a file.
  - clear_history(filename): Clears the conversation history.
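To make the history methods concrete, here is a minimal stand-in that mirrors the method names listed above. The internals are assumptions: the real class may store messages in a different shape or file format, and what clear_history does with its filename argument is a guess (here it overwrites the file with an empty history).

```python
import json


class History:
    """Hypothetical stand-in for the conversation-history part of BASE.py."""

    def __init__(self):
        self.messages = []

    def history_add(self, role, content):
        # Each message is stored as a role/content pair (assumed shape).
        self.messages.append({"role": role, "content": content})

    def export_history(self, filename):
        # JSON is an assumed file format.
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(self.messages, f, ensure_ascii=False, indent=2)

    def import_history(self, filename):
        with open(filename, encoding="utf-8") as f:
            self.messages = json.load(f)

    def clear_history(self, filename=None):
        # Assumption: clearing also resets the saved file when a name is given.
        self.messages = []
        if filename is not None:
            self.export_history(filename)
```

Exporting after a session and importing at the start of the next one would let a conversation continue across runs.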
2. prompts.py
Defines the instruction format for AI tasks.
- Key Class: Instructions
  - first_instruction: Provides a detailed guide for parsing HTML elements and formatting the response.
3. main.py
The main entry point for the application.
- Features:
  - Manages the parsing process using AIparser.
  - Configures and interacts with the Gen class for AI communication.
  - Outputs results for specific elements like "number of subscribers" or "number of videos".
- Key Methods:
  - AIparser.__init__: Initializes the parser with a URL and target element.
  - AIparser.parse(element): Parses the given element and retrieves AI-generated results.
Target Audience
This tool is ideal for:
- Marketers and Analysts: For monitoring trends, gathering competitor data, and extracting insights.
- Small and Medium Businesses: To automate tasks like market monitoring or customer review aggregation.
- SEO Specialists: To analyze site content, keywords, and metadata.
- Developers and Freelancers: To speed up the execution of client parsing tasks.
- Journalists and Bloggers: To gather data for articles and posts effortlessly.
Limitations
- Speed: A single parse can take up to 45 seconds because of AI generation time.
- Dependencies: Requires an active internet connection and a valid API key.
- Scalability: Not optimized for high-frequency requests.
Potential Use Cases
- Monitoring changes on web pages.
- Extracting market research data.
- Analyzing competitors' content.
- Automating reporting tasks.
Future Improvements
- Optimize performance with batch processing and caching.
- Add support for local AI models to reduce dependency on external APIs.
- Expand parsing capabilities to include other data formats like JSON and XML.
- Develop a user-friendly interface (e.g., Telegram bot or web app).
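As an illustration of the caching idea in the first item, one possible approach is to wrap the parse call and remember results per (URL, element) pair, so repeated requests skip the slow AI call. CachedParser is a hypothetical name and this in-memory dictionary is just one option, not a planned design.

```python
class CachedParser:
    """Hypothetical wrapper that memoizes parse results in memory."""

    def __init__(self, parse_fn):
        # parse_fn is any callable (url, element) -> dict.
        self._parse_fn = parse_fn
        self._cache = {}

    def parse(self, url, element):
        key = (url, element)
        if key not in self._cache:
            # Only the first request for a given pair reaches the AI.
            self._cache[key] = self._parse_fn(url, element)
        return self._cache[key]
```

It could be wired up as CachedParser(lambda url, element: AIparser(url).parse(element)). Cached entries never expire in this sketch, so it does not suit pages whose content changes often.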
Contributing
Feel free to contribute to the project by submitting issues or pull requests.
License
This project is licensed under the MIT License.
File details
Details for the file ai_html_parse-0.1.0.tar.gz.
File metadata
- Download URL: ai_html_parse-0.1.0.tar.gz
- Upload date:
- Size: 4.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4472864deffa6411917c7a2600f84049b012a7072b842c20cec3fd6bafdd914a |
| MD5 | f1c79790ae755dcfd2931c79ff390761 |
| BLAKE2b-256 | 92d023341fc59415780a017b839a671f9906374da3b3ea64c33738e38e0d6c10 |
File details
Details for the file ai_html_parse-0.1.0-py3-none-any.whl.
File metadata
- Download URL: ai_html_parse-0.1.0-py3-none-any.whl
- Upload date:
- Size: 4.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 097a02d744a2dd72ee06cfe8cf4502880ebb08ed433af761cf42f3d5b4499ba0 |
| MD5 | 43b127ba002903f7d018cc1a69175465 |
| BLAKE2b-256 | 382a29cebf9c8f2b450aae02b98e105b20c9b8bf915b58b837ea96649c22ae7a |