AI HTML Parser

AI-Powered HTML Parser

Installation

Prerequisites

  • Python 3.8 or higher
  • Required Libraries:
    • requests
    • bs4 (BeautifulSoup, published on PyPI as beautifulsoup4)

Steps

  1. Clone the repository:
    git clone https://github.com/pythonshik/ai-html-parser.git
    cd ai-html-parser
    
  2. Install dependencies:
    pip install -r requirements.txt
    
  3. Set up your API key for Google Gemini:
    • Create a folder named AI in the root directory.
    • Add your API key to a file named gemini_api_key inside the AI folder.
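
The key file from step 3 can also be created from Python. This is a minimal sketch, assuming the key is stored as plain text in a file with no extension, as the step names it. `save_api_key` is a hypothetical helper and not part of the package.

```python
from pathlib import Path


def save_api_key(key: str, root: Path = Path(".")) -> Path:
    """Write the key to <root>/AI/gemini_api_key and return the file path."""
    folder = root / "AI"
    folder.mkdir(parents=True, exist_ok=True)  # create the AI folder if it is missing
    key_file = folder / "gemini_api_key"
    # Strip stray whitespace so a pasted trailing newline does not end up in the key.
    key_file.write_text(key.strip(), encoding="utf-8")
    return key_file
```

Calling `save_api_key("YOUR_KEY")` from the repository root produces the layout described above. Keep the AI folder out of version control so the key is not committed.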

Usage

Example

  1. Import the AIparser class:
    from AIparse import AIparser
    
  2. Initialize the parser with a URL:
    element = AIparser("https://www.youtube.com/@PythonShik")
    
  3. Parse specific elements:
    for i in ["number of videos", "number of subscribers"]:
        parsed_data = element.parse(i)
        print(f"{parsed_data['explain']}: {parsed_data['value']}")
    
  4. Example of the dictionary returned by parse():
    {
      "value": "96",
      "explain": "Number of subscribers",
      "result": "96 subscribers"
    }
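
The "value" field in the example is a string, so callers that need a number have to convert it themselves. This is a sketch, assuming the returned dictionary has the three keys shown above. `value_as_int` is a hypothetical helper and not part of the package.

```python
import re


def value_as_int(parsed_data: dict) -> int:
    """Convert the 'value' field to an int, accepting digits with commas or spaces."""
    text = str(parsed_data["value"]).strip()
    # Reject anything that is not a plain number (for example "1.2K") instead of guessing.
    if not re.fullmatch(r"\d[\d,\s]*", text):
        raise ValueError(f"cannot convert value to int: {text!r}")
    return int(re.sub(r"[,\s]", "", text))


sample = {"value": "96", "explain": "Number of subscribers", "result": "96 subscribers"}
print(value_as_int(sample))  # 96
```

Abbreviated counts such as "1.2K" raise a ValueError here on purpose, since their exact value cannot be recovered.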
    

Overview

This project is an AI-powered HTML parser designed to extract specific data from web pages using Google Gemini's text generation API. The parser processes the HTML source code of a webpage, identifies specific elements, and returns the desired information in a structured JSON format.

Key Features

  • AI Integration: Utilizes Google Gemini for intelligent text analysis.
  • HTML Parsing: Extracts and processes HTML elements using BeautifulSoup.
  • Customizable Instructions: Supports user-defined parsing instructions.
  • JSON Output: Provides clear and structured results in JSON format.

How It Works

  1. User Input: Provide a URL and the target element to parse.
  2. HTML Fetching: The tool fetches the HTML source code of the webpage.
  3. AI Analysis: The HTML source and target element are sent to the AI for processing.
  4. JSON Output: The AI generates a structured response containing the extracted information.
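
The steps above can be sketched as a small pipeline. This illustrates the flow under stated assumptions and is not the package's actual code: `build_prompt`, `run_pipeline`, and the `generate` callable are hypothetical names, and the prompt wording is invented. The model call is replaced with a stub and the HTML is passed in directly, so the example runs offline without an API key.

```python
import json
from typing import Callable


def build_prompt(html: str, target: str) -> str:
    """Combine the target description and the page source (wording is illustrative)."""
    return (
        f"Find '{target}' in the HTML below. "
        'Reply with JSON only, using the keys "value", "explain" and "result".\n\n'
        f"{html}"
    )


def run_pipeline(html: str, target: str, generate: Callable[[str], str]) -> dict:
    """Send the prompt to the model and decode its JSON reply."""
    reply = generate(build_prompt(html, target))
    # A reply may wrap the JSON in extra text, so keep only the outermost {...} span.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("model reply contains no JSON object")
    return json.loads(reply[start:end + 1])


def fake_generate(prompt: str) -> str:
    """Stand-in for the real model call; always returns a fixed reply."""
    return 'Here you go: {"value": "96", "explain": "Number of subscribers", "result": "96 subscribers"}'


page = "<html><body><span id='subs'>96 subscribers</span></body></html>"
data = run_pipeline(page, "number of subscribers", fake_generate)
print(data["result"])  # 96 subscribers
```

Passing the model call in as a function keeps the fetch, prompt, and decode steps testable without network access.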

File Descriptions

1. BASE.py

The core class for interacting with Google Gemini's text generation API.

  • Features:
    • API key management.
    • Methods for adding and managing conversation history.
    • Text generation using the generate() method.
  • Key Methods:
    • history_add(role, content): Adds messages to the conversation history.
    • generate(): Sends data to the Gemini API and retrieves the generated text.
    • export_history(filename): Saves conversation history to a file.
    • import_history(filename): Loads conversation history from a file.
    • clear_history(filename): Clears the conversation history.
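
To make the history methods concrete, here is a minimal sketch of how such a class could store and persist messages. It assumes the history is a list of role/content dictionaries saved as JSON. The class name `History` and the storage format are assumptions, and `generate()` is left out because it depends on the Gemini API. In this sketch, passing a filename to `clear_history` also overwrites that file with an empty history, which is a guess at what the parameter is for.

```python
import json
from typing import Optional


class History:
    """Hypothetical stand-in for the history handling of the class in BASE.py."""

    def __init__(self) -> None:
        self.messages: list = []

    def history_add(self, role: str, content: str) -> None:
        """Append one message to the in-memory conversation."""
        self.messages.append({"role": role, "content": content})

    def export_history(self, filename: str) -> None:
        """Save the conversation to a JSON file."""
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(self.messages, f, ensure_ascii=False, indent=2)

    def import_history(self, filename: str) -> None:
        """Replace the in-memory conversation with the contents of a JSON file."""
        with open(filename, encoding="utf-8") as f:
            self.messages = json.load(f)

    def clear_history(self, filename: Optional[str] = None) -> None:
        """Empty the conversation and, if a filename is given, the saved copy too."""
        self.messages = []
        if filename is not None:
            self.export_history(filename)
```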

2. prompts.py

Defines the instruction format for AI tasks.

  • Key Class: Instructions
    • first_instruction: Provides a detailed guide for parsing HTML elements and formatting the response.

3. main.py

The main entry point for the application.

  • Features:
    • Manages the parsing process using AIparser.
    • Configures and interacts with the Gen class for AI communication.
    • Outputs results for specific elements like "number of subscribers" or "number of videos".
  • Key Methods:
    • AIparser.__init__: Initializes the parser with a URL.
    • AIparser.parse(element): Parses the given element and retrieves AI-generated results.

Target Audience

This tool is ideal for:

  • Marketers and Analysts: For monitoring trends, gathering competitor data, and extracting insights.
  • Small and Medium Businesses: To automate tasks like market monitoring or customer review aggregation.
  • SEO Specialists: To analyze site content, keywords, and metadata.
  • Developers and Freelancers: To speed up the execution of client parsing tasks.
  • Journalists and Bloggers: To gather data for articles and posts effortlessly.

Limitations

  • Speed: A single request can take up to 45 seconds because of AI generation time.
  • Dependencies: Requires an active internet connection and a valid API key.
  • Scalability: Not optimized for high-frequency requests.

Potential Use Cases

  • Monitoring changes on web pages.
  • Extracting market research data.
  • Analyzing competitors' content.
  • Automating reporting tasks.

Future Improvements

  • Optimize performance with batch processing and caching.
  • Add support for local AI models to reduce dependency on external APIs.
  • Expand parsing capabilities to include other data formats like JSON and XML.
  • Develop a user-friendly interface (e.g., Telegram bot or web app).
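
The caching idea from the first item could look like the following. This is a sketch only: `CachedParser` is a hypothetical wrapper, and `parse_fn` stands in for any callable that takes a URL and an element description, such as a function that builds an AIparser and calls parse. The one-hour default lifetime is an arbitrary choice.

```python
import time
from typing import Callable, Dict, Tuple


class CachedParser:
    """Reuse earlier results for the same (url, element) pair until they expire."""

    def __init__(self, parse_fn: Callable[[str, str], dict], ttl_seconds: float = 3600.0) -> None:
        self.parse_fn = parse_fn
        self.ttl_seconds = ttl_seconds
        self._cache: Dict[Tuple[str, str], Tuple[float, dict]] = {}

    def parse(self, url: str, element: str) -> dict:
        key = (url, element)
        now = time.monotonic()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self.ttl_seconds:
            return hit[1]  # fresh cached result, so no AI call is made
        result = self.parse_fn(url, element)
        self._cache[key] = (now, result)
        return result
```

Repeated requests for the same page and element then cost one AI call per lifetime window instead of one per request.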

Contributing

Feel free to contribute to the project by submitting issues or pull requests.


License

This project is licensed under the MIT License.
