The GoogleIt Python package offers a versatile set of tools for querying Google search results, downloading content, preprocessing text, converting HTML to PDF, and leveraging Google Palm 2 and Gemini language models for natural language processing tasks.

GoogleIt Python Package

The GoogleIt package provides a set of tools for querying Google search results, retrieving URLs, downloading content, preprocessing text, extracting domain names from URLs, combining PDF files, and extracting relevant content based on cosine similarity.

Installation

pip install GoogleIt

Usage

from GoogleIt.googleit import GoogleIt

# Create an instance of the GoogleIt class
google_it = GoogleIt(api_key='your_api_key_here')

# Perform a query and retrieve information
query = "How does photosynthesis work?"
response = google_it.get(query=query, urls_count=5)
print(response)

Modules

  1. converter.py Documentation - Provides functionality for converting HTML or websites to PDF.

  2. models.py Documentation - Wrapper classes for interacting with the Google Palm 2 and Gemini language models.

  3. text_processor.py Documentation - Functions for processing text documents, including converting PDF to DOCX, reading paragraphs from DOCX, dividing paragraphs into chunks, and extracting text and paragraphs from PDF.

  4. googleit.py Documentation - Main module encapsulating the GoogleIt class, which provides functionality for querying, retrieving URLs, downloading content, preprocessing text, and more.

converter.py Documentation

GoogleIt Converter Module

This module provides functionality to convert HTML files or websites into PDF format using Selenium.

Usage:

- Import the module: `from GoogleIt import converter`
- Call the `convert` function with appropriate parameters.

Example:

from GoogleIt import converter

# Render the page and save it as a PDF, waiting up to 5 seconds
converter.convert(source='https://example.com', target='output.pdf', timeout=5)

Functions:

- `convert(source: str, target: str, timeout: int = 2, print_options: dict = {}) -> None`:
    Converts a given HTML file or website into PDF.

    Parameters:
        - `source` (str): Source HTML file or website link.
        - `target` (str): Target location to save the PDF.
        - `timeout` (int, optional): Timeout in seconds. Default is set to 2 seconds.
        - `print_options` (dict, optional): Options for PDF printing. Refer to https://vanilla.aslushnikov.com/?Page.printToPDF for available options.

    Raises:
        - Exception: If an error occurs during PDF conversion.

Note:

This module relies on the Selenium library and requires a compatible WebDriver (e.g., ChromeDriver) to be installed.
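
The `print_options` dictionary can be prepared ahead of the call. The sketch below builds one for an A4 page using field names from the Chrome DevTools `Page.printToPDF` command (sizes and margins are in inches). The helper name `a4_print_options` is hypothetical, and it is an assumption that `convert` forwards the dictionary to the browser unchanged.

```python
from typing import Any, Dict


def a4_print_options(landscape: bool = False, margin_in: float = 0.4) -> Dict[str, Any]:
    """Build a Page.printToPDF options dict for an A4 page (hypothetical helper)."""
    return {
        "landscape": landscape,
        "printBackground": True,   # include CSS backgrounds in the PDF
        "paperWidth": 8.27,        # A4 width in inches
        "paperHeight": 11.69,      # A4 height in inches
        "marginTop": margin_in,
        "marginBottom": margin_in,
        "marginLeft": margin_in,
        "marginRight": margin_in,
    }


options = a4_print_options(landscape=True)

# Assumed usage with the documented signature (requires GoogleIt and a WebDriver):
# from GoogleIt import converter
# converter.convert(source='https://example.com', target='output.pdf',
#                   timeout=5, print_options=options)
```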

models.py Documentation

This module provides wrapper classes for interacting with Google's language models, including Palm 2 and Gemini.

Palm2Model Class:


This class serves as a wrapper for the Google Palm 2 language model.

Attributes:

  • model: The initialized Palm 2 language model.

Methods:

  • __init__(self, api_key: str) -> None: Initializes the Palm2Model instance and the underlying Palm 2 language model using the provided API key.

    Parameters:

    • api_key (str): The API key for authentication.

  • make_prompt(self, query: str, relevant_passage: str) -> str: Generates a prompt for the Palm 2 language model.

    Parameters:

    • query (str): The user's question.
    • relevant_passage (str): The relevant passage for context.

    Returns: str: The formatted prompt for the language model.

  • redraft_response(self, query: str, response: str) -> str: Redrafts the response generated by the Palm 2 language model.

    Parameters:

    • query (str): The user's question.
    • response (str): The generated response.

    Returns: str: The redrafted response.

  • query(self, document: str, question: str) -> str: Queries the Palm 2 language model for an answer.

    Parameters:

    • document (str): The reference document for context.
    • question (str): The user's question.

    Returns: str: The generated answer from the language model.

GeminiModel Class:


This class serves as a wrapper for the Google Gemini language model.

Attributes:

  • model: The initialized Gemini language model.

Methods:

  • __init__(self, api_key: str) -> None: Initializes the GeminiModel instance and the underlying Gemini language model using the provided API key.

    Parameters:

    • api_key (str): The API key for authentication.

  • make_prompt(self, query: str, relevant_passage: str) -> str: Generates a prompt for the Gemini language model.

    Parameters:

    • query (str): The user's question.
    • relevant_passage (str): The relevant passage for context.

    Returns: str: The formatted prompt for the language model.

  • redraft_response(self, query: str, response: str) -> str: Redrafts the response generated by the Gemini language model.

    Parameters:

    • query (str): The user's question.
    • response (str): The generated response.

    Returns: str: The redrafted response.

  • query(self, document: str, question: str) -> str: Queries the Gemini language model for an answer.

    Parameters:

    • document (str): The reference document for context.
    • question (str): The user's question.

    Returns: str: The generated answer from the language model.
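
The exact prompt text used by `make_prompt` is not documented here. As a rough illustration of what combining a question with a context passage can look like, the sketch below uses a hypothetical function `build_prompt` and a made-up template; neither is taken from the package.

```python
def build_prompt(query: str, relevant_passage: str) -> str:
    """Combine a question and a context passage into one prompt (illustrative only)."""
    # Collapse newlines and repeated spaces so the passage stays on one line.
    passage = " ".join(relevant_passage.split())
    return (
        "Answer the question using only the passage below. "
        "If the passage does not contain the answer, say so.\n\n"
        f"QUESTION: {query}\n"
        f"PASSAGE: {passage}\n"
        "ANSWER:"
    )


prompt = build_prompt(
    "How does photosynthesis work?",
    "Plants  convert\nlight into chemical energy.",
)
print(prompt)
```

Keeping the passage on a single line and ending with an explicit answer cue is one common way to make the model's task unambiguous; the package may format its prompts differently.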

text_processor.py Documentation

GoogleIt Text Processor Module

This module provides functions for processing text documents, including converting PDF to DOCX, reading paragraphs from DOCX, dividing paragraphs into chunks, and extracting text and paragraphs from PDF.

Usage:

- Import the module: `from GoogleIt import text_processor`
- Use the provided functions for text processing tasks.

Example:

from GoogleIt import text_processor

pdf_path = "path/to/input.pdf"
docx_path = "path/to/output.docx"

# Convert PDF to DOCX
text_processor.pdf_to_docx(pdf_file=pdf_path, docx_file=docx_path)

# Read paragraphs from DOCX
paragraphs = text_processor.read_document_paragraphs(filename=docx_path)

# Divide paragraphs into chunks
chunked_paragraphs = text_processor.get_chunks(paragraphs=paragraphs, chunk_size=10, overlap_size=2)

# Extract text and paragraphs from PDF
pdf_text, pdf_paragraphs = text_processor.extract_text_from_pdf(pdf_path=pdf_path, docx_path=docx_path)

Functions:

- `pdf_to_docx(pdf_file: str, docx_file: str) -> None`:
    Converts a PDF file to a DOCX file.

- `read_document_paragraphs(filename: str) -> List[str]`:
    Reads paragraphs from a document (DOCX file).

- `get_chunks(paragraphs: List[str], chunk_size: int = 10, overlap_size: int = 2) -> List[str]`:
    Divides a list of paragraphs into chunks.

- `extract_text_from_pdf(pdf_path: str, docx_path: str = "converted_document.docx") -> Tuple[str, List[str]]`:
    Extracts text and paragraphs from a PDF file.

    Returns a tuple containing the extracted text and a list of paragraphs.

Note:

- `get_chunks` takes the list of paragraphs as its first argument, for example the list returned by `read_document_paragraphs`.
- The module's source file ends with an example demonstrating the use of the `extract_text_from_pdf` function.
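
To make the `chunk_size` and `overlap_size` parameters more concrete, here is a minimal sketch of overlapping chunking. It assumes that both values count paragraphs and that each chunk is the paragraphs joined by newlines; the function name `chunk_paragraphs` is hypothetical, and the package's `get_chunks` may behave differently.

```python
from typing import List


def chunk_paragraphs(paragraphs: List[str], chunk_size: int = 10,
                     overlap_size: int = 2) -> List[str]:
    """Group paragraphs into chunks that share `overlap_size` paragraphs."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in [0, chunk_size)")

    step = chunk_size - overlap_size  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(paragraphs), step):
        window = paragraphs[start:start + chunk_size]
        chunks.append("\n".join(window))
        # Stop once the window has reached the end of the list.
        if start + chunk_size >= len(paragraphs):
            break
    return chunks


demo = chunk_paragraphs([f"p{i}" for i in range(5)], chunk_size=3, overlap_size=1)
print(demo)
```

With five paragraphs, a chunk size of 3, and an overlap of 1, the window advances two paragraphs at a time, so paragraph `p2` appears in both chunks.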

googleit.py Documentation

GoogleIt Module

This module provides the GoogleIt class, which encapsulates functionality for performing queries, retrieving top URLs from Google search results, downloading content from URLs, preprocessing text, extracting domain names from URLs, combining PDF files, and extracting relevant content based on cosine similarity.

Usage:

- Import the module: `from GoogleIt.googleit import GoogleIt`
- Create an instance of the `GoogleIt` class with a valid API key.
- Use the provided methods for various tasks.

Example:

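from GoogleIt.googleit import GoogleIt

# Select the language model explicitly ("Palm2" is the default)
google_it = GoogleIt(api_key='your_api_key_here', model="Palm2")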
query = "How does photosynthesis work?"
response = google_it.get(query=query, urls_count=5)
print(response)

Classes:

- `GoogleIt`:
    - A class that provides functionality for querying, retrieving URLs, downloading content, preprocessing text, and more.
    - Methods:
        - `__init__(self, api_key: str, model: str = "Palm2") -> None`: Initializes the `GoogleIt` instance with the provided API key and a specified language model.
        - `save_url_to_pdf(self, url: str, pdf_path: str) -> None`: Downloads content from a URL and saves it as a PDF file.
        - `preprocess_text(self, text: str) -> str`: Preprocesses text by converting it to lowercase, tokenizing, and removing stopwords and punctuation.
        - `get_domain_name(self, url: str) -> str`: Extracts the domain name from a given URL.
        - `get_top_urls(self, query: str, urls_count: int = 5) -> Tuple[list[str], list[str]]`: Retrieves top URLs from Google search results based on a given query.
        - `combine_pdf(self, folder_path: str) -> str`: Combines multiple PDF files into a single merged PDF.
        - `extract_relevant_content(self, input_text: str, main_document: str, threshold: float = 0.2) -> str`: Extracts relevant content from the input text based on cosine similarity.
        - `with_document(self, query: str, google_doc: str, pdf_path: str) -> str`: Processes a query using a provided PDF document and a Google document.
        - `without_document(self, query: str, paragraphs: list[str]) -> str`: Processes a query without a provided PDF document.
        - `get(self, query: str, pdf_path: str | None = None, urls_count: int = 5) -> str`: Main function to retrieve information based on a query, optionally using a PDF document.
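
Several of the helper methods above describe standard techniques that can be approximated with the standard library alone. The sketch below is not the package's implementation: the function names (`normalize_text`, `cosine_similarity`, `filter_relevant`, `domain_of`), the tiny stopword list, and the use of raw term counts as vectors are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import List
from urllib.parse import urlparse

# Tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "by", "it", "how", "does"}


def normalize_text(text: str) -> str:
    """Lowercase, keep alphanumeric tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the term-count vectors of two strings."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def filter_relevant(query: str, paragraphs: List[str], threshold: float = 0.2) -> List[str]:
    """Keep paragraphs whose similarity to the query meets the threshold."""
    q = normalize_text(query)
    return [p for p in paragraphs
            if cosine_similarity(q, normalize_text(p)) >= threshold]


def domain_of(url: str) -> str:
    """Return the network location (host) part of a URL."""
    return urlparse(url).netloc


docs = [
    "Photosynthesis converts light into chemical energy.",
    "The stock market closed higher today.",
]
kept = filter_relevant("How does photosynthesis work?", docs)
print(kept)
print(domain_of("https://www.example.com/path?q=1"))
```

Raising `threshold` keeps only paragraphs that share more of their vocabulary with the query; lowering it admits looser matches.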

Attributes:

- `model` (GoogleIt attribute): An instance of the model class for natural language processing.

Note:

This module requires the `Palm2Model` and `GeminiModel` classes from the `models` module for natural language processing.

Note:

  • Replace 'your_api_key_here' with your actual Google API key.

You can get the Google API key from https://makersuite.google.com/app/apikey.

Download files

Source Distribution: GoogleIt-2.0.2.tar.gz (17.4 kB)

Built Distribution: GoogleIt-2.0.2-py3-none-any.whl (16.7 kB, Python 3)
