Skip to main content

Tool that uses Perplexity and OpenAI to search with SERP and filter for relevant URLs.

Project description

ai_url_aggregator

Note: This is a small experimental library, provided as-is.

ai_url_aggregator is a Python tool that leverages Perplexity and OpenAI to search the internet for relevant URLs, filter and deduplicate them, check their availability, and then select the most important ones based on GPT analysis.


Features

  1. Search Across Models
    Uses Perplexity’s sonar-reasoning model to query the internet for URLs related to your prompt.
  2. Clean & Filter
    • Prefers https:// links when both http:// and https:// are found for the same domain.
    • Removes duplicates by collecting results into a set.
  3. Online Check
    • Verifies each URL’s availability (status codes 200 or 403) using requests.
  4. Relevance Ranking
    • Uses an OpenAI model to select the most important websites from the deduplicated list of online URLs.

DeepWiki Docs: https://deepwiki.com/carlosplanchon/ai_url_aggregator


Installation

1. Install via PyPI

uv add ai_url_aggregator

2. Set Environment Variables

You must provide your Perplexity and OpenAI API keys:

export PERPLEXITY_API_KEY="PERPLEXITY_API_KEY"
export OPENAI_API_KEY="OPENAI_API_KEY"

Replace "PERPLEXITY_API_KEY" and "OPENAI_API_KEY" with your actual API keys.

3. (Optional) Install from Source

  1. Clone or Download this repository.
  2. Install Dependencies:
    uv sync
    
    This ensures all required libraries (like openai, requests, etc.) are installed.

How It Works

  1. query_models(query: str) -> list[str]

    • Sends a query to Perplexity’s sonar-reasoning model.
    • Parses the Perplexity output with an OpenAI model into a structured list of URLs.
  2. keep_https(urls: list[str]) -> list[str]

    • Selects https:// versions of URLs when duplicates exist, else keeps http://.
  3. execute_query_multiple_times(query: str, num_runs: int) -> list[str]

    • Runs the query multiple times to gather more URLs.
    • Deduplicates results using a set.
  4. check_urls_online(urls: list[str]) -> list[str]

    • Pings each URL to see if it’s reachable (status 200 or 403).
  5. search_for_web_urls(query: str, num_runs: int) -> list[str]

    • Brings all the above together:
      1. Executes a query multiple times.
      2. Prefers HTTPS versions of each domain.
      3. Verifies URL reachability.
      4. Returns a final list of online, deduplicated URLs.
  6. get_top_relevant_websites(website_urls: list[str]) -> list[Website]

    • Uses an OpenAI model to select the most relevant (important) websites from the final list of URLs.

Usage Example

Once installed and your environment variables are set, you can do:

import prettyprinter
from ai_url_aggregator import (
    search_for_web_urls,
    get_top_relevant_websites
)

# Optional: install prettyprinter extras for nicer output
prettyprinter.install_extras()

# Example query:
query = "Give me a list of all the real state agencies in Uruguay."

# Step 1: Get a cleaned, deduplicated, and verified list of URLs
online_urls = search_for_web_urls(query=query)

print("--- Online URLs ---")
prettyprinter.cpprint(online_urls)

# Step 2: Get the most important websites from the final list
most_important_websites = get_top_relevant_websites(website_urls=online_urls)

print("--- Most Important Websites ---")
prettyprinter.cpprint(most_important_websites)

Result (Main real state agencies in Uruguay):

[
    'https://www.infocasas.com.uy',
    'https://www.casasweb.com.uy',
    'https://www.mercadolibre.com.uy/inmuebles',
    'https://www.uruguayinmobiliarias.com'
]

License

This project is distributed under the MIT License. See LICENSE for more information.


All suggestions and improvements are welcome!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ai_url_aggregator-0.2.tar.gz (5.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ai_url_aggregator-0.2-py3-none-any.whl (5.7 kB view details)

Uploaded Python 3

File details

Details for the file ai_url_aggregator-0.2.tar.gz.

File metadata

  • Download URL: ai_url_aggregator-0.2.tar.gz
  • Upload date:
  • Size: 5.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for ai_url_aggregator-0.2.tar.gz
Algorithm Hash digest
SHA256 02364f49165a08cd26755724d7e60bf3f55a85b573aa846b115d9a7f324b371c
MD5 e8bcd2a1b4e62fd76bcf6ef1fa9a513e
BLAKE2b-256 114c58cd4678197164d4f4b0fdff32940008e502eff074a3bac37210de6b54e8

See more details on using hashes here.

File details

Details for the file ai_url_aggregator-0.2-py3-none-any.whl.

File metadata

File hashes

Hashes for ai_url_aggregator-0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 dbbdd2b57276ca44f531f98e6b89232fda508969d5bc3c43db0e12fd83fd0d61
MD5 d80fe26278b58c3f3364ef8efb49f992
BLAKE2b-256 4af1724b01fb2a0e86639a3f62ec0f445b1bccd5c4e537c6a41cf3666406382c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page