Skip to main content

Safarnama is a versatile web crawler that explores websites, cleans HTML, and uses a language model to generate summaries and tags—writing its own digital story. Use it via CLI or in Python projects.

Project description

Safarnama

Safarnama is named after the famous travel book Safarnama by Nasir Khusraw, a renowned Persian traveler, philosopher, and writer. Inspired by his journeys, Safarnama traverses the web, cleans up pages, and gathers useful information.

In addition to standard web crawling and content processing, Safarnama now includes a search ability. You can perform searches (e.g., "open source tech books"), retrieve links from search results via Se​arxNG instances, and process those links with the crawler to, for example, download PDFs.

Features

  • Web Crawling: Begins at a base URL and explores linked pages up to a specified depth.
  • Content Processing: Cleans HTML by removing scripts, styles, comments, and extraneous elements.
  • LLM Integration: Summarizes page content and extracts key tags via a Language Model endpoint.
  • Search Integration: Uses Se​arxNG to search for queries, fetch links, and feed them into the crawler.
  • Data Storage: Persists URL data, summaries, and tags in a SQLite database.
  • Sitemap Generation: Optionally generates an XML sitemap of crawled URLs.

Installation

Install Safarnama via pip:

pip install safarnama

Usage (Command-Line)

Initialize the configuration file (interactive or quiet mode):

safarnama init

Start the crawler:

safarnama start

Test the LLM endpoint:

safarnama test_llm

New: Perform a search and process the results:

safarnama search "open source tech books"

This command uses Se​arxNG instances to search for the query, extracts links from the search results, and then feeds those links to the crawler for further processing (e.g., downloading PDFs).

Sample Configuration File (config.yaml)

base_url: "https://www.techbend.io"
max_depth: 2
delay: 1
db_path: "techbend.db"
verbose: true
save: true
log_file: "techbend.log"
generate_sitemap: true
binary_extensions:
  - ".pdf"
  - ".zip"
  - ".exe"
  - ".tar"
  - ".tar.gz"
  - ".tgz"
  - ".rar"
  - ".iso"
  - ".bin"
  - ".7z"
  - ".dmg"
  - ".tar.xz"
  - ".pkg"
  - ".bz2"
accepted_content_types:
  - "text/html"
  - "application/xhtml+xml"
  - "text/plain"
  - "text/xml"
  - "application/xml"
  - "application/json"
llm:
  endpoint: "http://localhost:1234/v1/chat/completions"
  model: "jinaai.readerlm-v2@q4_k_m"
  max_tokens: 16529
  temperature: 0.7
  llm_prompt_template: "Please summarize the following webpage content and extract a list of relevant tags. Return your answer as JSON with keys 'summary' and 'tags'."
  system_prompt: "You are a helpful assistant that summarizes webpages and extracts tags."

Note: The LLM API key is managed via a separate .env file and is not stored in this configuration file.

Programmatic Usage

Safarnama can be used both as a CLI tool and as a library in your Python projects. For example:

from safarnama.config import load_config
from safarnama.db import DBHandler
from safarnama.searcher import SearxNGSearcher
from safarnama.crawler import SiteCrawler

# Load configuration from the YAML file
config = load_config("config.yaml")

# Create a DBHandler instance using the connection string from the config
db = DBHandler(config.get("connection_string", "sqlite:///python.db"))

# Create a SearxNGSearcher instance with a few retries
searcher = SearxNGSearcher(db, retries=2)

# Perform a search query (e.g., "open source tech books")
result = searcher.search("open source tech books")

if result:
    instance_used, data = result
    # Assume the search results contain a 'results' field with dictionaries that include a 'url'
    links = [item["url"] for item in data.get("results", []) if "url" in item]
    print(f"Found {len(links)} links.")

    # Create a crawler instance and add the search result links for processing
    crawler = SiteCrawler(config)
    for link in links:
        crawler.add_url(link, 0)

    # Start crawling and retrieve visited URLs
    visited_urls = crawler.crawl()
    print(f"Crawled {len(visited_urls)} URLs from search results.")

    # Optionally, generate an XML sitemap of the crawled URLs
    sitemap_tree = crawler.generate_sitemap(visited_urls)
    sitemap_tree.write("sitemap.xml", encoding="utf-8", xml_declaration=True)

    crawler.close()
else:
    print("No healthy instance available to perform the search.")

searcher.close()
db.close()

Contributing

Contributions are welcome! Please:

  1. Fork the repository.
  2. Create a new branch for your feature or bugfix.
  3. Write your code and tests.
  4. Run tests to ensure everything works.
  5. Submit a pull request with a clear description.

License

This project is licensed under the MIT License. See the LICENSE file for details.

✨ Contributors

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

safarnama-0.2.1.tar.gz (34.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

safarnama-0.2.1-py3-none-any.whl (16.2 kB view details)

Uploaded Python 3

File details

Details for the file safarnama-0.2.1.tar.gz.

File metadata

  • Download URL: safarnama-0.2.1.tar.gz
  • Upload date:
  • Size: 34.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.3

File hashes

Hashes for safarnama-0.2.1.tar.gz
Algorithm Hash digest
SHA256 bada58d2fe685e0b9be8b119f2524e9a2721737857ebb3ca006c8a7c5ae40aef
MD5 cb9f9d05b16e6dc6e8e8e1995b3faa4e
BLAKE2b-256 39f31edd2812661442e1fd4f36b2cc8c0265a358c5e7eb47fae1056e789e5f7f

See more details on using hashes here.

File details

Details for the file safarnama-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: safarnama-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 16.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.3

File hashes

Hashes for safarnama-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 17545583be55144f13a6e6ede56effdf057beed9f7a5092ac15ec5fc81e641ab
MD5 8f1b8d585bff73a76988ceba82752a0e
BLAKE2b-256 b68217ab36f4e23b82652e68f0078624518f1b43bc7e6b6b344bf082f3fc8641

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page