# ai_url_aggregator

A tool that uses Perplexity and OpenAI to search via SERP and filter for relevant URLs.

Note: This is a small experimental library, provided as-is.

ai_url_aggregator is a Python tool that leverages Perplexity and OpenAI to search the internet for relevant URLs, filter and deduplicate them, check their availability, and then select the most important ones based on GPT analysis.
## Features

- **Search Across Models**: Uses Perplexity's `sonar-reasoning` model to query the internet for URLs related to your prompt.
- **Clean & Filter**:
  - Prefers `https://` links when both `http://` and `https://` are found for the same domain.
  - Removes duplicates by collecting results into a `set`.
- **Online Check**: Verifies each URL's availability (status codes `200` or `403`) using `requests`.
- **Relevance Ranking**: Uses an OpenAI model to select the most important websites from the deduplicated list of online URLs.
DeepWiki Docs: https://deepwiki.com/carlosplanchon/ai_url_aggregator
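As an illustration of the clean-and-filter idea, here is a minimal, hypothetical sketch (not the library's actual code) that prefers the `https://` variant of each address and deduplicates by keying on the scheme-less URL:

```python
# Hypothetical sketch of the "Clean & Filter" step; the real library's
# implementation may differ.

def prefer_https_and_dedupe(urls: list[str]) -> list[str]:
    """Keep one URL per address, preferring the https:// variant."""
    seen: dict[str, str] = {}
    for url in urls:
        # Key on the URL without its scheme so http/https variants collide.
        key = url.removeprefix("https://").removeprefix("http://")
        if key not in seen or url.startswith("https://"):
            seen[key] = url
    return list(seen.values())


urls = [
    "http://example.com",
    "https://example.com",
    "http://only-http.org",
]
print(prefer_https_and_dedupe(urls))
# ['https://example.com', 'http://only-http.org']
```

Because Python dicts preserve insertion order, the result keeps the original ordering while still collapsing scheme duplicates.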
## Installation

1. **Install via PyPI**

   ```shell
   uv add ai_url_aggregator
   ```

2. **Set Environment Variables**

   You must provide your Perplexity and OpenAI API keys:

   ```shell
   export PERPLEXITY_API_KEY="PERPLEXITY_API_KEY"
   export OPENAI_API_KEY="OPENAI_API_KEY"
   ```

   Replace "PERPLEXITY_API_KEY" and "OPENAI_API_KEY" with your actual API keys.

3. **(Optional) Install from Source**

   - Clone or download this repository.
   - Install dependencies:

     ```shell
     uv sync
     ```

   This ensures all required libraries (like `openai`, `requests`, etc.) are installed.
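Before running the tool, it can help to confirm that both keys are actually visible to Python. The helper below is a small illustrative check, not part of the library:

```python
import os


def missing_api_keys(env=os.environ) -> list[str]:
    """Return the names of required API keys that are not set or are empty."""
    required = ("PERPLEXITY_API_KEY", "OPENAI_API_KEY")
    return [name for name in required if not env.get(name)]


missing = missing_api_keys()
if missing:
    print("Missing keys:", ", ".join(missing))
else:
    print("API keys found.")
```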
## How It Works

- `query_models(query: str) -> list[str]`
  - Sends a query to Perplexity's `sonar-reasoning` model.
  - Parses the Perplexity output with an OpenAI model into a structured list of URLs.
- `keep_https(urls: list[str]) -> list[str]`
  - Selects `https://` versions of URLs when duplicates exist, else keeps `http://`.
- `execute_query_multiple_times(query: str, num_runs: int) -> list[str]`
  - Runs the query multiple times to gather more URLs.
  - Deduplicates results using a `set`.
- `check_urls_online(urls: list[str]) -> list[str]`
  - Pings each URL to see if it's reachable (status `200` or `403`).
- `search_for_web_urls(query: str, num_runs: int) -> list[str]`
  - Brings all the above together:
    - Executes the query multiple times.
    - Prefers HTTPS versions of each domain.
    - Verifies URL reachability.
    - Returns a final list of online, deduplicated URLs.
- `get_top_relevant_websites(website_urls: list[str]) -> list[Website]`
  - Uses an OpenAI model to select the most relevant (important) websites from the final list of URLs.
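The online check described above could look roughly like the sketch below. The function name, the injectable `fetch` parameter, and the timeout value are assumptions for illustration, not the library's actual signature:

```python
import requests


def urls_that_respond(
    urls: list[str],
    fetch=requests.get,   # injectable for testing; an assumption, not the library's API
    timeout: float = 5.0,
) -> list[str]:
    """Return the URLs that answer with HTTP 200 or 403."""
    online: list[str] = []
    for url in urls:
        try:
            response = fetch(url, timeout=timeout)
        except requests.RequestException:
            continue  # DNS failure, refused connection, timeout, ...
        if response.status_code in (200, 403):
            online.append(url)
    return online
```

Treating `403` as "online" matches the behavior listed above: a Forbidden response still proves the host is up, even though it refuses anonymous requests.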
## Usage Example

Once installed and your environment variables are set, you can do:

```python
import prettyprinter

from ai_url_aggregator import (
    search_for_web_urls,
    get_top_relevant_websites
)

# Optional: install prettyprinter extras for nicer output
prettyprinter.install_extras()

# Example query:
query = "Give me a list of all the real estate agencies in Uruguay."

# Step 1: Get a cleaned, deduplicated, and verified list of URLs
online_urls = search_for_web_urls(query=query)
print("--- Online URLs ---")
prettyprinter.cpprint(online_urls)

# Step 2: Get the most important websites from the final list
most_important_websites = get_top_relevant_websites(website_urls=online_urls)
print("--- Most Important Websites ---")
prettyprinter.cpprint(most_important_websites)
```

Result (main real estate agencies in Uruguay):

```python
[
    'https://www.infocasas.com.uy',
    'https://www.casasweb.com.uy',
    'https://www.mercadolibre.com.uy/inmuebles',
    'https://www.uruguayinmobiliarias.com'
]
```
## License

This project is distributed under the MIT License. See LICENSE for more information.
All suggestions and improvements are welcome!
## File details

Details for the file ai_url_aggregator-0.2.tar.gz.

### File metadata

- Download URL: ai_url_aggregator-0.2.tar.gz
- Upload date:
- Size: 5.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `02364f49165a08cd26755724d7e60bf3f55a85b573aa846b115d9a7f324b371c` |
| MD5 | `e8bcd2a1b4e62fd76bcf6ef1fa9a513e` |
| BLAKE2b-256 | `114c58cd4678197164d4f4b0fdff32940008e502eff074a3bac37210de6b54e8` |
## File details

Details for the file ai_url_aggregator-0.2-py3-none-any.whl.

### File metadata

- Download URL: ai_url_aggregator-0.2-py3-none-any.whl
- Upload date:
- Size: 5.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `dbbdd2b57276ca44f531f98e6b89232fda508969d5bc3c43db0e12fd83fd0d61` |
| MD5 | `d80fe26278b58c3f3364ef8efb49f992` |
| BLAKE2b-256 | `4af1724b01fb2a0e86639a3f62ec0f445b1bccd5c4e537c6a41cf3666406382c` |