Safarnama is a versatile web crawler that explores websites, cleans HTML, and uses a language model to generate summaries and tags—writing its own digital story. Use it via CLI or in Python projects.

Project description

Safarnama

Safarnama is named after the famous travel book Safarnama by Nasir Khusraw, a renowned Persian traveler, philosopher, and writer. Inspired by his journeys, Safarnama traverses the web, cleans up pages, and gathers useful information.

In addition to standard web crawling and content processing, Safarnama now includes a search ability. You can perform searches (e.g., "open source tech books"), retrieve links from search results via SearxNG instances, and process those links with the crawler to, for example, download PDFs.

Features

Web Crawling: Begins at a base URL and explores linked pages up to a specified depth.
Content Processing: Cleans HTML by removing scripts, styles, comments, and extraneous elements.
LLM Integration: Summarizes page content and extracts key tags via a Language Model endpoint.
Search Integration: Uses SearxNG to search for queries, fetch links, and feed them into the crawler.
Data Storage: Persists URL data, summaries, and tags in a SQLite database.
Sitemap Generation: Optionally generates an XML sitemap of crawled URLs.

Installation

Install Safarnama via pip:

pip install safarnama

Usage (Command-Line)

Initialize the configuration file (interactive or quiet mode):

safarnama init

Start the crawler:

safarnama start

Test the LLM endpoint:

safarnama test_llm

New: Perform a search and process the results:

safarnama search "open source tech books"

This command uses SearxNG instances to search for the query, extracts links from the search results, and then feeds those links to the crawler for further processing (e.g., downloading PDFs).

Sample Configuration File (`config.yaml`)

base_url: "https://www.techbend.io"
max_depth: 2
delay: 1
db_path: "techbend.db"
verbose: true
save: true
log_file: "techbend.log"
generate_sitemap: true
binary_extensions:
  - ".pdf"
  - ".zip"
  - ".exe"
  - ".tar"
  - ".tar.gz"
  - ".tgz"
  - ".rar"
  - ".iso"
  - ".bin"
  - ".7z"
  - ".dmg"
  - ".tar.xz"
  - ".pkg"
  - ".bz2"
accepted_content_types:
  - "text/html"
  - "application/xhtml+xml"
  - "text/plain"
  - "text/xml"
  - "application/xml"
  - "application/json"
llm:
  endpoint: "http://localhost:1234/v1/chat/completions"
  model: "jinaai.readerlm-v2@q4_k_m"
  max_tokens: 16529
  temperature: 0.7
  llm_prompt_template: "Please summarize the following webpage content and extract a list of relevant tags. Return your answer as JSON with keys 'summary' and 'tags'."
  system_prompt: "You are a helpful assistant that summarizes webpages and extracts tags."

Note: The LLM API key is managed via a separate .env file and is not stored in this configuration file.

Programmatic Usage

Safarnama can be used both as a CLI tool and as a library in your Python projects. For example:

from safarnama.config import load_config
from safarnama.db import DBHandler
from safarnama.searcher import SearxNGSearcher
from safarnama.crawler import SiteCrawler

# Load configuration from the YAML file
config = load_config("config.yaml")

# Create a DBHandler instance using the connection string from the config
db = DBHandler(config.get("connection_string", "sqlite:///python.db"))

# Create a SearxNGSearcher instance with a few retries
searcher = SearxNGSearcher(db, retries=2)

# Perform a search query (e.g., "open source tech books")
result = searcher.search("open source tech books")

if result:
    instance_used, data = result
    # Assume the search results contain a 'results' field with dictionaries that include a 'url'
    links = [item["url"] for item in data.get("results", []) if "url" in item]
    print(f"Found {len(links)} links.")

    # Create a crawler instance and add the search result links for processing
    crawler = SiteCrawler(config)
    for link in links:
        crawler.add_url(link, 0)

    # Start crawling and retrieve visited URLs
    visited_urls = crawler.crawl()
    print(f"Crawled {len(visited_urls)} URLs from search results.")

    # Optionally, generate an XML sitemap of the crawled URLs
    sitemap_tree = crawler.generate_sitemap(visited_urls)
    sitemap_tree.write("sitemap.xml", encoding="utf-8", xml_declaration=True)

    crawler.close()
else:
    print("No healthy instance available to perform the search.")

searcher.close()
db.close()

Contributing

Contributions are welcome! Please:

Fork the repository.
Create a new branch for your feature or bugfix.
Write your code and tests.
Run tests to ensure everything works.
Submit a pull request with a clear description.

License

This project is licensed under the MIT License. See the LICENSE file for details.

✨ Contributors

Project details

Release history Release notifications | RSS feed

This version

0.2.2

Mar 4, 2025

0.2.1

Mar 4, 2025

0.2.0

Mar 4, 2025

0.1.4

Feb 26, 2025

0.1.3

Feb 26, 2025

0.1.2

Feb 26, 2025

0.1.1 yanked

Feb 26, 2025

Reason this release was yanked:

fixing commandline

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

safarnama-0.2.2.tar.gz (34.0 kB view details)

Uploaded Mar 4, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

safarnama-0.2.2-py3-none-any.whl (16.2 kB view details)

Uploaded Mar 4, 2025 Python 3

File details

Details for the file safarnama-0.2.2.tar.gz.

File metadata

Download URL: safarnama-0.2.2.tar.gz
Upload date: Mar 4, 2025
Size: 34.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.3

File hashes

Hashes for safarnama-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`c72e8c0e6ac7840b7ea858606f32d05339351fee7ba0db530a3e42cfe1fcdc89`
MD5	`f04be022768f5f3a71c5840428768024`
BLAKE2b-256	`bd139a9d34e4679b20d9b2acb91e965b4f60a681947b3c979f8ea05c4cb44e57`

See more details on using hashes here.

File details

Details for the file safarnama-0.2.2-py3-none-any.whl.

File metadata

Download URL: safarnama-0.2.2-py3-none-any.whl
Upload date: Mar 4, 2025
Size: 16.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.3

File hashes

Hashes for safarnama-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a8f5223c18357635e71d2232f1974fefe68d00599ed01521d2b1b70d593d9a07`
MD5	`5555390176e7679f58d5b6e02f672fc0`
BLAKE2b-256	`a2d7f77b9901f60fefb8b249b594387caf6236bfeb8bf9927ab40e0b83190da3`

See more details on using hashes here.

Safarnama 0.2.2

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

Safarnama

Features

Installation

Usage (Command-Line)

Sample Configuration File (`config.yaml`)

Programmatic Usage

Contributing

License

✨ Contributors

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Safarnama 0.2.2

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

Safarnama

Features

Installation

Usage (Command-Line)

Sample Configuration File (config.yaml)

Programmatic Usage

Contributing

License

✨ Contributors

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Sample Configuration File (`config.yaml`)