
SERPEngine

A production-grade search module to find links through various search engines.

  • Uses the Google Custom Search API
  • Made for production use; you need API keys
  • Includes various filters, including an LLM-based one, so you can filter links by domain, metadata, and more

Installation

  1. Install the package:

    pip install serpengine
    
  2. Usage:

    from serpengine import SERPEngine

    # Initialize the searcher
    serpengine = SERPEngine()

    result_data = serpengine.collect(
        query="best food in USA",
        num_urls=5,
        search_sources=["google_search_via_api"],
        regex_based_link_validation=False,
        allow_links_forwarding_to_files=False,
        output_format="json"  # or "linksearch"
    )
    print(result_data)

Getting Google Credentials

  1. Create or select a Google Cloud project:

    Go to the Google Cloud Console and create a new project (or select an existing one).

  2. Enable the Custom Search API:

    In the Cloud Console, navigate to APIs & Services > Library, search for "Custom Search API", click on it, and press the "Enable" button.

  3. Create credentials (API key):

    Once the API is enabled, go to APIs & Services > Credentials in the sidebar, click "Create Credentials", and choose "API key". A dialog will display your new API key; copy this key.

Getting the Custom Search Engine ID (GOOGLE_CSE_ID)

This ID tells the API which search engine configuration to use.

  1. Visit the Google Custom Search Engine (CSE) site.

  2. Create a custom search engine:

    Click "Add" or "New Search Engine". In the "Sites to search" field, enter a specific website (if you want to restrict the search) or a placeholder like *.com to allow broader searches. Fill in the other required fields (such as a name for your search engine) and click "Create".

  3. Retrieve your CSE ID:

    In your search engine's Control Panel, look for the "Search engine ID" (often labeled cx). It will be a string of characters. Copy this ID.
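Once you have both credentials, you can sanity-check them against the Custom Search JSON API directly, before involving serpengine at all. The helper below is a hypothetical illustration (not part of serpengine); it only builds the request URL from your key and CSE ID.

```python
import os
from urllib.parse import urlencode

# Hypothetical helper (not part of serpengine): build the Custom Search
# JSON API request URL from the credentials gathered above.
def build_search_url(api_key: str, cse_id: str, query: str, num: int = 5) -> str:
    params = urlencode({"key": api_key, "cx": cse_id, "q": query, "num": num})
    return "https://www.googleapis.com/customsearch/v1?" + params

url = build_search_url(
    os.environ.get("GOOGLE_API_KEY", "demo-key"),
    os.environ.get("GOOGLE_CSE_ID", "demo-cx"),
    "best food in USA",
)
print(url)
```

Fetching this URL (for example with requests.get) should return a JSON payload with an "items" list if your key and CSE ID are valid.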

Parameters

  • query (str): The search query.
  • validation_conditions (Dict, optional): Additional validation rules for filtering links.
  • num_urls (int): Number of links to retrieve.
  • search_sources (List[str]): Search sources to use (e.g., "google_search_via_api", "google_search_via_request_module").
  • allowed_countries (List[str], optional): List of country codes to allow.
  • forbidden_countries (List[str], optional): List of country codes to forbid.
  • allowed_domains (List[str], optional): List of domains to allow.
  • forbidden_domains (List[str], optional): List of domains to block.
  • filter_llm (bool, optional): Whether to use AI-based filtering.
  • output_format (str): Output format, either "json" or "linksearch".

Output

  • JSON Format:

    {
        "operation_result": {
            "total_time": 1.234,
            "errors": []
        },
        "results": [
            {
                "link": "https://digikey.com/product1",
                "metadata": "",
                "title": ""
            },
            ...
        ]
    }
    
  • LinkSearch Objects:

    A list of LinkSearch objects with attributes link, metadata, and title.
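The two output formats carry the same fields. As a rough sketch (the real class in serpengine may differ), a LinkSearch record with the documented attributes can be mirrored by a small dataclass, and the JSON format above is then just the dict form of those records plus the operation result:

```python
from dataclasses import dataclass, asdict

# Minimal sketch of the documented LinkSearch attributes (link, metadata,
# title); serpengine's actual dataclass may differ.
@dataclass
class LinkSearch:
    link: str
    metadata: str = ""
    title: str = ""

results = [LinkSearch(link="https://digikey.com/product1")]
payload = {
    "operation_result": {"total_time": 1.234, "errors": []},
    "results": [asdict(r) for r in results],
}
print(payload)
```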

Features

  • Search Modules:

    • Simple Google Search Module: Scrapes Google search results directly from the HTML.
    • Google Search API Module: Utilizes the Google Custom Search API for fetching search results.
  • Filters:

    • Allowed Domains:

      • Description: Restricts search results to specified domains. For example, setting allowed_domains=["digikey.com"] ensures only links from Digi-Key are collected.
    • Keyword Match Based Link Validation:

      • Description: Ensures that the collected links contain specific keywords. For instance, keyword_match_based_link_validation=["STM32"] filters out any links that do not include the keyword "STM32".
    • Allowed Countries (Optional):

      • Description: Filters links based on the top-level domain (TLD) to include only those from specified countries.
    • Forbidden Countries (Optional):

      • Description: Excludes links from specified countries based on their TLD.
    • Additional Validation Conditions:

      • Description: Allows for custom validation logic to further filter links based on user-defined criteria.
  • Output Formats:

    • JSON: Provides a structured dictionary containing operation results and the list of collected links.
    • LinkSearch Objects: Returns a list of LinkSearch dataclass instances for flexible manipulation within Python.
  • Error Handling and Logging:

    • Captures and logs errors encountered during the search and filtering processes, facilitating easier debugging and maintenance.
  • Extensibility:

    • Designed to be easily extendable, allowing integration of additional search sources or more sophisticated filtering mechanisms as needed.
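To make the filter semantics concrete, here is a hedged, standalone sketch of the domain, keyword, and country-TLD rules described above. The function name and signature are illustrative only; serpengine's internal implementation may differ.

```python
from urllib.parse import urlparse

# Illustrative re-implementation of the documented filter rules
# (allowed domains, keyword match, forbidden country TLDs).
def passes_filters(url, allowed_domains=None, keywords=None, forbidden_tlds=None):
    host = urlparse(url).netloc.lower()
    # Allowed domains: keep only links whose host matches a listed domain.
    if allowed_domains and not any(
        host == d or host.endswith("." + d) for d in allowed_domains
    ):
        return False
    # Keyword match: the link must contain at least one keyword.
    if keywords and not any(k.lower() in url.lower() for k in keywords):
        return False
    # Forbidden countries: reject links under a forbidden country TLD.
    if forbidden_tlds and any(host.endswith("." + tld) for tld in forbidden_tlds):
        return False
    return True

links = [
    "https://www.digikey.com/en/products/detail/STM32F4",
    "https://example.ru/stm32-clone",
]
kept = [u for u in links if passes_filters(
    u, allowed_domains=["digikey.com"], keywords=["STM32"])]
print(kept)  # only the Digi-Key STM32 link survives
```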

Requirements

Ensure you have the following dependencies installed. They are listed in the requirements.txt file:

requests>=2.25.1
python-dotenv>=0.19.0
beautifulsoup4>=4.9.3

You can install them via:

pip install -r requirements.txt

Configuration

Before using the Link Search Agent, set up your environment variables:

  1. Create a .env File:

    GOOGLE_API_KEY=your_google_api_key
    GOOGLE_CSE_ID=your_custom_search_engine_id
    
  2. Ensure the .env file is in the root directory or the directory where the script runs.
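The .env file is a plain key=value file. For illustration only, the snippet below hand-parses such a file into a dict; in practice python-dotenv (already listed in requirements.txt) does this for you via load_dotenv().

```python
from pathlib import Path

# Write a sample .env file with the two variables the module expects.
env_path = Path(".env")
env_path.write_text(
    "GOOGLE_API_KEY=your_google_api_key\n"
    "GOOGLE_CSE_ID=your_custom_search_engine_id\n"
)

# Hand-parse it (illustration only; python-dotenv handles this normally).
config = {}
for line in env_path.read_text().splitlines():
    if line.strip() and not line.startswith("#"):
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()

print(config)
```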
