Skip to main content

Safarnama is a versatile web crawler that explores websites, cleans HTML, and uses a language model to generate summaries and tags—writing its own digital story. Use it via CLI or in Python projects.

Project description

Safarnama

Safarnama is named after the famous travel book Safarnama by Nasir Khusraw, a renowned Persian traveler, philosopher, and writer. Inspired by his journeys, Safarnama traverses the web, cleans up pages, and gathers useful information.

In addition to standard web crawling and content processing, Safarnama now includes a search ability. You can perform searches (e.g., "open source tech books"), retrieve links from search results via Se​arxNG instances, and process those links with the crawler to, for example, download PDFs.

Features

  • Web Crawling: Begins at a base URL and explores linked pages up to a specified depth.
  • Content Processing: Cleans HTML by removing scripts, styles, comments, and extraneous elements.
  • LLM Integration: Summarizes page content and extracts key tags via a Language Model endpoint.
  • Search Integration: Uses Se​arxNG to search for queries, fetch links, and feed them into the crawler.
  • Data Storage: Persists URL data, summaries, and tags in a SQLite database.
  • Sitemap Generation: Optionally generates an XML sitemap of crawled URLs.

Installation

Install Safarnama via pip:

pip install safarnama

Usage (Command-Line)

Initialize the configuration file (interactive or quiet mode):

safarnama init

Start the crawler:

safarnama start

Test the LLM endpoint:

safarnama test_llm

New: Perform a search and process the results:

safarnama search "open source tech books"

This command uses Se​arxNG instances to search for the query, extracts links from the search results, and then feeds those links to the crawler for further processing (e.g., downloading PDFs).

Sample Configuration File (config.yaml)

base_url: "https://www.techbend.io"
max_depth: 2
delay: 1
db_path: "techbend.db"
verbose: true
save: true
log_file: "techbend.log"
generate_sitemap: true
binary_extensions:
  - ".pdf"
  - ".zip"
  - ".exe"
  - ".tar"
  - ".tar.gz"
  - ".tgz"
  - ".rar"
  - ".iso"
  - ".bin"
  - ".7z"
  - ".dmg"
  - ".tar.xz"
  - ".pkg"
  - ".bz2"
accepted_content_types:
  - "text/html"
  - "application/xhtml+xml"
  - "text/plain"
  - "text/xml"
  - "application/xml"
  - "application/json"
llm:
  endpoint: "http://localhost:1234/v1/chat/completions"
  model: "jinaai.readerlm-v2@q4_k_m"
  max_tokens: 16529
  temperature: 0.7
  llm_prompt_template: "Please summarize the following webpage content and extract a list of relevant tags. Return your answer as JSON with keys 'summary' and 'tags'."
  system_prompt: "You are a helpful assistant that summarizes webpages and extracts tags."

Note: The LLM API key is managed via a separate .env file and is not stored in this configuration file.

Programmatic Usage

Safarnama can be used both as a CLI tool and as a library in your Python projects. For example:

from safarnama.config import load_config
from safarnama.db import DBHandler
from safarnama.searcher import SearxNGSearcher
from safarnama.crawler import SiteCrawler

# Load configuration from the YAML file
config = load_config("config.yaml")

# Create a DBHandler instance using the connection string from the config
db = DBHandler(config.get("connection_string", "sqlite:///python.db"))

# Create a SearxNGSearcher instance with a few retries
searcher = SearxNGSearcher(db, retries=2)

# Perform a search query (e.g., "open source tech books")
result = searcher.search("open source tech books")

if result:
    instance_used, data = result
    # Assume the search results contain a 'results' field with dictionaries that include a 'url'
    links = [item["url"] for item in data.get("results", []) if "url" in item]
    print(f"Found {len(links)} links.")

    # Create a crawler instance and add the search result links for processing
    crawler = SiteCrawler(config)
    for link in links:
        crawler.add_url(link, 0)

    # Start crawling and retrieve visited URLs
    visited_urls = crawler.crawl()
    print(f"Crawled {len(visited_urls)} URLs from search results.")

    # Optionally, generate an XML sitemap of the crawled URLs
    sitemap_tree = crawler.generate_sitemap(visited_urls)
    sitemap_tree.write("sitemap.xml", encoding="utf-8", xml_declaration=True)

    crawler.close()
else:
    print("No healthy instance available to perform the search.")

searcher.close()
db.close()

Contributing

Contributions are welcome! Please:

  1. Fork the repository.
  2. Create a new branch for your feature or bugfix.
  3. Write your code and tests.
  4. Run tests to ensure everything works.
  5. Submit a pull request with a clear description.

License

This project is licensed under the MIT License. See the LICENSE file for details.

✨ Contributors

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

safarnama-0.2.2.tar.gz (34.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

safarnama-0.2.2-py3-none-any.whl (16.2 kB view details)

Uploaded Python 3

File details

Details for the file safarnama-0.2.2.tar.gz.

File metadata

  • Download URL: safarnama-0.2.2.tar.gz
  • Upload date:
  • Size: 34.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.3

File hashes

Hashes for safarnama-0.2.2.tar.gz
Algorithm Hash digest
SHA256 c72e8c0e6ac7840b7ea858606f32d05339351fee7ba0db530a3e42cfe1fcdc89
MD5 f04be022768f5f3a71c5840428768024
BLAKE2b-256 bd139a9d34e4679b20d9b2acb91e965b4f60a681947b3c979f8ea05c4cb44e57

See more details on using hashes here.

File details

Details for the file safarnama-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: safarnama-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 16.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.3

File hashes

Hashes for safarnama-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 a8f5223c18357635e71d2232f1974fefe68d00599ed01521d2b1b70d593d9a07
MD5 5555390176e7679f58d5b6e02f672fc0
BLAKE2b-256 a2d7f77b9901f60fefb8b249b594387caf6236bfeb8bf9927ab40e0b83190da3

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page