Safarnama is a versatile web crawler that explores websites, cleans HTML, and uses a language model to generate summaries and tags—writing its own digital story. Use it via CLI or in Python projects.
Project description
Safarnama
Safarnama is named after the famous travel book Safarnama by Nasir Khusraw, a renowned Persian traveler, philosopher, and writer. Inspired by his journeys, Safarnama traverses the web, cleans up pages, and gathers useful information.
In addition to standard web crawling and content processing, Safarnama now includes a search ability. You can perform searches (e.g., "open source tech books"), retrieve links from search results via SearxNG instances, and process those links with the crawler to, for example, download PDFs.
Features
- Web Crawling: Begins at a base URL and explores linked pages up to a specified depth.
- Content Processing: Cleans HTML by removing scripts, styles, comments, and extraneous elements.
- LLM Integration: Summarizes page content and extracts key tags via a Language Model endpoint.
- Search Integration: Uses SearxNG to search for queries, fetch links, and feed them into the crawler.
- Data Storage: Persists URL data, summaries, and tags in a SQLite database.
- Sitemap Generation: Optionally generates an XML sitemap of crawled URLs.
Installation
Install Safarnama via pip:
pip install safarnama
Usage (Command-Line)
Initialize the configuration file (interactive or quiet mode):
safarnama init
Start the crawler:
safarnama start
Test the LLM endpoint:
safarnama test_llm
New: Perform a search and process the results:
safarnama search "open source tech books"
This command uses SearxNG instances to search for the query, extracts links from the search results, and then feeds those links to the crawler for further processing (e.g., downloading PDFs).
Sample Configuration File (config.yaml)
base_url: "https://www.techbend.io"
max_depth: 2
delay: 1
db_path: "techbend.db"
verbose: true
save: true
log_file: "techbend.log"
generate_sitemap: true
binary_extensions:
- ".pdf"
- ".zip"
- ".exe"
- ".tar"
- ".tar.gz"
- ".tgz"
- ".rar"
- ".iso"
- ".bin"
- ".7z"
- ".dmg"
- ".tar.xz"
- ".pkg"
- ".bz2"
accepted_content_types:
- "text/html"
- "application/xhtml+xml"
- "text/plain"
- "text/xml"
- "application/xml"
- "application/json"
llm:
endpoint: "http://localhost:1234/v1/chat/completions"
model: "jinaai.readerlm-v2@q4_k_m"
max_tokens: 16529
temperature: 0.7
llm_prompt_template: "Please summarize the following webpage content and extract a list of relevant tags. Return your answer as JSON with keys 'summary' and 'tags'."
system_prompt: "You are a helpful assistant that summarizes webpages and extracts tags."
Note: The LLM API key is managed via a separate .env file and is not stored in this configuration file.
Programmatic Usage
Safarnama can be used both as a CLI tool and as a library in your Python projects. For example:
from safarnama.config import load_config
from safarnama.db import DBHandler
from safarnama.searcher import SearxNGSearcher
from safarnama.crawler import SiteCrawler
# Load configuration from the YAML file
config = load_config("config.yaml")
# Create a DBHandler instance using the connection string from the config
db = DBHandler(config.get("connection_string", "sqlite:///python.db"))
# Create a SearxNGSearcher instance with a few retries
searcher = SearxNGSearcher(db, retries=2)
# Perform a search query (e.g., "open source tech books")
result = searcher.search("open source tech books")
if result:
instance_used, data = result
# Assume the search results contain a 'results' field with dictionaries that include a 'url'
links = [item["url"] for item in data.get("results", []) if "url" in item]
print(f"Found {len(links)} links.")
# Create a crawler instance and add the search result links for processing
crawler = SiteCrawler(config)
for link in links:
crawler.add_url(link, 0)
# Start crawling and retrieve visited URLs
visited_urls = crawler.crawl()
print(f"Crawled {len(visited_urls)} URLs from search results.")
# Optionally, generate an XML sitemap of the crawled URLs
sitemap_tree = crawler.generate_sitemap(visited_urls)
sitemap_tree.write("sitemap.xml", encoding="utf-8", xml_declaration=True)
crawler.close()
else:
print("No healthy instance available to perform the search.")
searcher.close()
db.close()
Contributing
Contributions are welcome! Please:
- Fork the repository.
- Create a new branch for your feature or bugfix.
- Write your code and tests.
- Run tests to ensure everything works.
- Submit a pull request with a clear description.
License
This project is licensed under the MIT License. See the LICENSE file for details.
✨ Contributors
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file safarnama-0.2.1.tar.gz.
File metadata
- Download URL: safarnama-0.2.1.tar.gz
- Upload date:
- Size: 34.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
bada58d2fe685e0b9be8b119f2524e9a2721737857ebb3ca006c8a7c5ae40aef
|
|
| MD5 |
cb9f9d05b16e6dc6e8e8e1995b3faa4e
|
|
| BLAKE2b-256 |
39f31edd2812661442e1fd4f36b2cc8c0265a358c5e7eb47fae1056e789e5f7f
|
File details
Details for the file safarnama-0.2.1-py3-none-any.whl.
File metadata
- Download URL: safarnama-0.2.1-py3-none-any.whl
- Upload date:
- Size: 16.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
17545583be55144f13a6e6ede56effdf057beed9f7a5092ac15ec5fc81e641ab
|
|
| MD5 |
8f1b8d585bff73a76988ceba82752a0e
|
|
| BLAKE2b-256 |
b68217ab36f4e23b82652e68f0078624518f1b43bc7e6b6b344bf082f3fc8641
|