
The GoogleIt Python package offers a versatile set of tools for querying Google search results, downloading content, preprocessing text, converting HTML to PDF, and leveraging Google Palm 2 and Gemini language models for natural language processing tasks.

GoogleIt Python Package

The GoogleIt package provides a set of tools for querying Google search results, retrieving URLs, downloading content, preprocessing text, extracting domain names from URLs, combining PDF files, and extracting relevant content based on cosine similarity.

Installation

pip install GoogleIt

Usage

from GoogleIt.googleit import GoogleIt

# Create an instance of the GoogleIt class
google_it = GoogleIt(api_key='your_api_key_here')

# Perform a query and retrieve information
query = "How does photosynthesis work?"
response = google_it.get(query=query, urls_count=5)
print(response)

Modules

  1. converter.py Documentation - Provides functionality for converting HTML or websites to PDF.

  2. models.py Documentation - Wrapper classes for interacting with the Google Palm 2 and Gemini language models.

  3. text_processor.py Documentation - Functions for processing text documents, including converting PDF to DOCX, reading paragraphs from DOCX, dividing paragraphs into chunks, and extracting text and paragraphs from PDF.

  4. googleit.py Documentation - Main module encapsulating the GoogleIt class, which provides functionality for querying, retrieving URLs, downloading content, preprocessing text, and more.

converter.py Documentation

GoogleIt Converter Module

This module provides functionality to convert HTML files or websites into PDF format using Selenium.

Usage:

- Import the module: `from GoogleIt import converter`
- Call the `convert` function with appropriate parameters.

Example:

from GoogleIt import converter

converter.convert(source='https://example.com', target='output.pdf', timeout=5)

Functions:

- `convert(source: str, target: str, timeout: int = 2, print_options: dict = {}) -> None`:
    Converts a given HTML file or website into PDF.

    Parameters:
        - `source` (str): Source HTML file or website link.
        - `target` (str): Target location to save the PDF.
        - `timeout` (int, optional): Timeout in seconds. Default is set to 2 seconds.
        - `print_options` (dict, optional): Options for PDF printing. Refer to https://vanilla.aslushnikov.com/?Page.printToPDF for available options.

    Raises:
        - Exception: If an error occurs during PDF conversion.

Note:

This module relies on the Selenium library and requires a compatible WebDriver (e.g., ChromeDriver) to be installed.
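
As an illustration of `print_options`, the sketch below builds an options dictionary using parameter names from the Chrome DevTools `Page.printToPDF` command (`landscape`, `printBackground`, `scale`). The helper `build_print_options` is hypothetical and not part of GoogleIt; the `convert` call is left commented out because it needs GoogleIt, Selenium, and a WebDriver.

```python
def build_print_options(landscape: bool = False,
                        print_background: bool = True,
                        scale: float = 1.0) -> dict:
    """Return a dict of Page.printToPDF options (hypothetical helper)."""
    return {
        "landscape": landscape,               # page orientation
        "printBackground": print_background,  # include background graphics
        "scale": scale,                       # rendering scale, 1.0 = 100%
    }


options = build_print_options(landscape=True)
print(options)

# Not run here, since it needs GoogleIt, Selenium, and a WebDriver:
# from GoogleIt import converter
# converter.convert(source="https://example.com", target="output.pdf",
#                   timeout=5, print_options=options)
```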

models.py Documentation

This module provides wrapper classes for interacting with Google's language models, including Palm 2 and Gemini.

Palm2Model Class:


This class serves as a wrapper for the Google Palm 2 language model.

Attributes:

  • model: The initialized Palm 2 language model.

Methods:

  • __init__(self, api_key: str) -> None: Initializes the Palm2Model instance and its Palm 2 language model using the provided API key.

    Parameters:

    • api_key (str): The API key for authentication.

  • make_prompt(self, query: str, relevant_passage: str) -> str: Generates a prompt for the Palm 2 language model.

    Parameters:

    • query (str): The user's question.
    • relevant_passage (str): The relevant passage for context.

    Returns: str: The formatted prompt for the language model.

  • redraft_response(self, query: str, response: str) -> str: Redrafts the response generated by the Palm 2 language model.

    Parameters:

    • query (str): The user's question.
    • response (str): The generated response.

    Returns: str: The redrafted response.

  • query(self, document: str, question: str) -> str: Queries the Palm 2 language model for an answer.

    Parameters:

    • document (str): The reference document for context.
    • question (str): The user's question.

    Returns: str: The generated answer from the language model.

GeminiModel Class:


This class serves as a wrapper for the Google Gemini language model.

Attributes:

  • model: The initialized Gemini language model.

Methods:

  • __init__(self, api_key: str) -> None: Initializes the GeminiModel instance and its Gemini language model using the provided API key.

    Parameters:

    • api_key (str): The API key for authentication.

  • make_prompt(self, query: str, relevant_passage: str) -> str: Generates a prompt for the Gemini language model.

    Parameters:

    • query (str): The user's question.
    • relevant_passage (str): The relevant passage for context.

    Returns: str: The formatted prompt for the language model.

  • redraft_response(self, query: str, response: str) -> str: Redrafts the response generated by the Gemini language model.

    Parameters:

    • query (str): The user's question.
    • response (str): The generated response.

    Returns: str: The redrafted response.

  • query(self, document: str, question: str) -> str: Queries the Gemini language model for an answer.

    Parameters:

    • document (str): The reference document for context.
    • question (str): The user's question.

    Returns: str: The generated answer from the language model.
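
The `make_prompt` methods combine a question with a context passage. A minimal sketch of that idea follows, written as a standalone function. The template wording and the whitespace-collapsing step are assumptions for illustration, not the package's actual prompt.

```python
def make_prompt(query: str, relevant_passage: str) -> str:
    """Format a question and its context passage into one prompt string."""
    # Collapse runs of whitespace so a multi-line passage fits on one line
    # (an assumed cleanup step, not confirmed by the package docs).
    passage = " ".join(relevant_passage.split())
    return (
        "Answer the question using only the passage below.\n"
        f"QUESTION: {query}\n"
        f"PASSAGE: {passage}\n"
        "ANSWER:"
    )


prompt = make_prompt(
    query="How does photosynthesis work?",
    relevant_passage="Plants convert light,\nwater and CO2 into glucose.",
)
print(prompt)
```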

text_processor.py Documentation

GoogleIt Text Processor Module

This module provides functions for processing text documents, including converting PDF to DOCX, reading paragraphs from DOCX, dividing paragraphs into chunks, and extracting text and paragraphs from PDF.

Usage:

- Import the module: `from GoogleIt import text_processor`
- Use the provided functions for text processing tasks.

Example:

from GoogleIt import text_processor

pdf_path = "path/to/input.pdf"
docx_path = "path/to/output.docx"

# Convert PDF to DOCX
text_processor.pdf_to_docx(pdf_file=pdf_path, docx_file=docx_path)

# Read paragraphs from DOCX
paragraphs = text_processor.read_document_paragraphs(filename=docx_path)

# Divide paragraphs into chunks
chunked_paragraphs = text_processor.get_chunks(paragraphs=paragraphs, chunk_size=10, overlap_size=2)

# Extract text and paragraphs from PDF
pdf_text, pdf_paragraphs = text_processor.extract_text_from_pdf(pdf_path=pdf_path, docx_path=docx_path)

Functions:

- `pdf_to_docx(pdf_file: str, docx_file: str) -> None`:
    Converts a PDF file to a DOCX file.

- `read_document_paragraphs(filename: str) -> List[str]`:
    Reads paragraphs from a document (DOCX file).

- `get_chunks(paragraphs: List[str], chunk_size: int = 10, overlap_size: int = 2) -> List[str]`:
    Divides a list of paragraphs into chunks.

- `extract_text_from_pdf(pdf_path: str, docx_path: str = "converted_document.docx") -> Tuple[str, List[str]]`:
    Extracts text and paragraphs from a PDF file.

    Returns a tuple containing the extracted text and a list of paragraphs.

Note:

- `get_chunks` expects the list of paragraphs as its first argument, for example the output of `read_document_paragraphs`.
- The module includes an example at the end demonstrating the use of the `extract_text_from_pdf` function.
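
To show how `chunk_size` and `overlap_size` can interact, here is a minimal sketch of overlapping chunking. It assumes chunks are counted in paragraphs and joined with newlines; the package's `get_chunks` may measure size differently. The name `get_chunks_sketch` is hypothetical.

```python
from typing import List


def get_chunks_sketch(paragraphs: List[str], chunk_size: int = 10,
                      overlap_size: int = 2) -> List[str]:
    """Group paragraphs into overlapping chunks, each joined into one string."""
    if chunk_size <= overlap_size:
        raise ValueError("chunk_size must be larger than overlap_size")
    step = chunk_size - overlap_size  # how far each window advances
    chunks = []
    for start in range(0, len(paragraphs), step):
        chunks.append("\n".join(paragraphs[start:start + chunk_size]))
        if start + chunk_size >= len(paragraphs):
            break  # this window already reached the last paragraph
    return chunks


paragraphs = [f"p{i}" for i in range(7)]
chunks = get_chunks_sketch(paragraphs, chunk_size=4, overlap_size=1)
print(chunks)
```

With seven paragraphs, a chunk size of 4, and an overlap of 1, the second chunk starts at the paragraph that ended the first one.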

googleit.py Documentation

GoogleIt Module

This module provides the GoogleIt class, which encapsulates functionality for performing queries, retrieving top URLs from Google search results, downloading content from URLs, preprocessing text, extracting domain names from URLs, combining PDF files, and extracting relevant content based on cosine similarity.

Usage:

- Import the module: `from GoogleIt.googleit import GoogleIt`
- Create an instance of the `GoogleIt` class with a valid API key.
- Use the provided methods for various tasks.

Example:

from GoogleIt.googleit import GoogleIt

google_it = GoogleIt(api_key='your_api_key_here', model="Palm2")
query = "How does photosynthesis work?"
response = google_it.get(query=query, urls_count=5)
print(response)

Classes:

- `GoogleIt`:
    - A class that provides functionality for querying, retrieving URLs, downloading content, preprocessing text, and more.
    - Methods:
        - `__init__(self, api_key: str, model: str = "Palm2") -> None`: Initializes the `GoogleIt` instance with the provided API key and a specified language model.
        - `save_url_to_pdf(self, url: str, pdf_path: str) -> None`: Downloads content from a URL and saves it as a PDF file.
        - `preprocess_text(self, text: str) -> str`: Preprocesses text by converting it to lowercase, tokenizing, and removing stopwords and punctuation.
        - `get_domain_name(self, url: str) -> str`: Extracts the domain name from a given URL.
        - `get_top_urls(self, query: str, urls_count: int = 5) -> Tuple[list[str], list[str]]`: Retrieves top URLs from Google search results based on a given query.
        - `combine_pdf(self, folder_path: str) -> str`: Combines multiple PDF files into a single merged PDF.
        - `extract_relevant_content(self, input_text: str, main_document: str, threshold: float = 0.2) -> str`: Extracts relevant content from the input text based on cosine similarity.
        - `with_document(self, query: str, google_doc: str, pdf_path: str) -> str`: Processes a query using a provided PDF document and a Google document.
        - `without_document(self, query: str, paragraphs: list[str]) -> str`: Processes a query without a provided PDF document.
        - `get(self, query: str, pdf_path: str | None = None, urls_count: int = 5) -> str`: Main function to retrieve information based on a query, optionally using a PDF document.
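
The preprocessing and cosine-similarity filtering described for `preprocess_text` and `extract_relevant_content` can be sketched with the standard library alone. This is an illustration under assumptions, not the package's implementation: the stopword list is a tiny placeholder, similarity is computed on raw term counts, and filtering is done sentence by sentence.

```python
import math
import string
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "into", "it"}


def preprocess_text(text: str) -> str:
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(tok for tok in cleaned.split() if tok not in STOPWORDS)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the term-count vectors of two strings."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extract_relevant_content(input_text: str, main_document: str,
                             threshold: float = 0.2) -> str:
    """Keep the sentences of input_text that are similar to main_document."""
    reference = preprocess_text(main_document)
    kept = [
        sentence.strip()
        for sentence in input_text.split(".")
        if sentence.strip()
        and cosine_similarity(preprocess_text(sentence), reference) >= threshold
    ]
    return ". ".join(kept)


doc = "Photosynthesis converts light into chemical energy in plants."
text = "Plants use light for photosynthesis. The stock market fell today."
print(extract_relevant_content(text, doc))
```

Only the first sentence shares terms with the reference document, so the unrelated sentence falls below the 0.2 threshold and is dropped.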

Attributes:

- `model` (GoogleIt attribute): An instance of the model class for natural language processing.

Note:

This module requires the `Palm2Model` and `GeminiModel` classes from the `models` module for natural language processing.
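
One way the `model` argument could map to a model class is a small registry lookup, sketched below. The stub classes stand in for `Palm2Model` and `GeminiModel` so the example runs without network access. The `"Gemini"` key and the `select_model` helper are assumptions, since the documentation only shows `"Palm2"`.

```python
class Palm2Stub:
    """Stand-in for Palm2Model; the real class talks to the Palm 2 API."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key


class GeminiStub:
    """Stand-in for GeminiModel; the real class talks to the Gemini API."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key


# Assumed mapping from the `model` string to a wrapper class.
MODEL_REGISTRY = {"Palm2": Palm2Stub, "Gemini": GeminiStub}


def select_model(api_key: str, model: str = "Palm2"):
    """Instantiate the wrapper class registered under the given name."""
    try:
        model_cls = MODEL_REGISTRY[model]
    except KeyError:
        known = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(f"Unknown model {model!r}; expected one of: {known}") from None
    return model_cls(api_key)


chosen = select_model("your_api_key_here", model="Gemini")
print(type(chosen).__name__)
```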

Note:

  • Replace 'your_api_key_here' with your actual Google API key.

You can get a Google API key from https://makersuite.google.com/app/apikey.

