Simple sitemap builder

A simple sitemap builder

The sitemap builder follows links from a website, staying within the given domain name. The final result is a simple sitemap derived from the links visited. The crawler accepts and processes only URLs with the http or https scheme.
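The filtering rule described above can be sketched with the standard library. This is a minimal illustration and not the tool's actual code: the function name is_crawlable and the choice to compare exact hostnames are assumptions made for this example.

```python
from urllib.parse import urljoin, urlparse

# Only these schemes are accepted, as stated in the description.
ALLOWED_SCHEMES = {"http", "https"}


def is_crawlable(link: str, base_url: str) -> bool:
    """Return True if `link` resolves to an http(s) URL on the hostname of `base_url`.

    Hypothetical helper: relative links are resolved against `base_url` first.
    """
    absolute = urlparse(urljoin(base_url, link))
    base = urlparse(base_url)
    return absolute.scheme in ALLOWED_SCHEMES and absolute.hostname == base.hostname


if __name__ == "__main__":
    for candidate in ["/about", "https://example.org/x", "mailto:help@monzo.com"]:
        print(candidate, is_crawlable(candidate, "https://monzo.com"))
```

Comparing exact hostnames means a link to a subdomain would be rejected in this sketch.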

Installation and usage

Run the following command to install the tool:

pip install -U sitemapbuilder

To run the sitemap builder:

sitemapbuilder -u 'https://monzo.com' -o test_monzo.dot

Some websites have strong protection against crawlers, and the tool will not work for them, for example:

sitemapbuilder -u 'https://bloomberg.com' -o test_bloomberg.dot

Highlights

  1. Generate a Graphviz .dot file showing directed links between pages. The .dot file can then be rendered to PNG, PDF and other image or document formats.

  2. Have configurable decay (maximum depth) to avoid abuse.

  3. Visit web links within the same hostname by default.

  4. Use 5 threads by default and time out after 10 seconds.

  5. Time out after 5 seconds when fetching a URL.

  6. Handle timeout exceptions when querying a website.

  7. Send an HTTP HEAD request and verify that Content-Type is text/html and charset is either UTF-8 or US-ASCII.

  8. Have a map of visited URLs to avoid revisiting them.

  9. Follow HTTP redirects.
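Several of the highlights above (the .dot output, the depth limit, the map of visited URLs and the Content-Type check) can be sketched in a few lines. This is a single-threaded illustration under stated assumptions, not the tool's implementation: the names crawl, to_dot, is_supported_html and the get_links callback are hypothetical, and rejecting a Content-Type with no charset is a choice made for this example.

```python
from collections import deque
from email.message import Message
from typing import Callable, Iterable, List, Tuple

Edge = Tuple[str, str]


def is_supported_html(content_type: str) -> bool:
    """Check a Content-Type header value for text/html with UTF-8 or US-ASCII charset."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None if absent.
    charset = msg.get_content_charset()
    return msg.get_content_type() == "text/html" and charset in {"utf-8", "us-ascii"}


def crawl(seed: str, get_links: Callable[[str], Iterable[str]], max_depth: int = 2) -> List[Edge]:
    """Breadth-first crawl returning the directed (source, target) links found.

    `get_links` is a hypothetical callback that returns the links on a page.
    """
    visited = {seed}            # URLs already queued, so none is fetched twice
    edges: List[Edge] = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue            # pages at the depth limit are not expanded
        for target in get_links(url):
            edges.append((url, target))
            if target not in visited:
                visited.add(target)
                queue.append((target, depth + 1))
    return edges


def to_dot(edges: Iterable[Edge]) -> str:
    """Render edges as a Graphviz digraph (assumes URLs contain no double quotes)."""
    lines = ["digraph sitemap {"]
    lines.extend(f'  "{source}" -> "{target}";' for source, target in edges)
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # An in-memory stand-in for a website, used instead of real HTTP requests.
    site = {"/": ["/a", "/b"], "/a": ["/", "/c"], "/c": ["/d"]}
    print(to_dot(crawl("/", lambda url: site.get(url, []), max_depth=2)))
```

With max_depth=2 the page /c is recorded as a link target but is never expanded, so /d does not appear in the output.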

Upcoming features

  • Configure the number of threads and the timeout via command-line arguments.

  • Allow web links from all subdomains.

  • Allow web links from a list of domains.

  • Allow web links matching a pattern.

  • Add an option for hierarchical sitemap instead of directed graph.

  • Use PriorityQueue instead of Queue to process links with higher decay first.

  • Fine-grained info, warn and error logging.

  • Pass seed links from a file.

  • Save to and resume from a DB/persistent data source.

  • Faster concurrency and better performance with asyncio.
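As an illustration of the planned PriorityQueue change listed above, the sketch below orders links so that those with higher decay come out first. It assumes that decay is the remaining depth budget of a link, which the description does not spell out, and the function name order_by_decay is hypothetical.

```python
from queue import PriorityQueue
from typing import Iterable, List, Tuple


def order_by_decay(links: Iterable[Tuple[str, int]]) -> List[str]:
    """Return URLs from (url, decay) pairs, highest decay first."""
    pending: PriorityQueue = PriorityQueue()
    for order, (url, decay) in enumerate(links):
        # PriorityQueue returns the smallest item first, so decay is negated.
        # `order` keeps insertion order among links with equal decay.
        pending.put((-decay, order, url))
    ordered = []
    while not pending.empty():
        _, _, url = pending.get()
        ordered.append(url)
    return ordered


if __name__ == "__main__":
    print(order_by_decay([("/deep", 1), ("/home", 3), ("/about", 2)]))
```

A plain Queue would return these links in insertion order instead.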

Download files

Source Distribution

sitemapbuilder-0.0.7.tar.gz (6.1 kB)

Uploaded Source

File details

Details for the file sitemapbuilder-0.0.7.tar.gz.

File metadata

  • Download URL: sitemapbuilder-0.0.7.tar.gz
  • Upload date:
  • Size: 6.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: Python-urllib/3.6

File hashes

Hashes for sitemapbuilder-0.0.7.tar.gz
Algorithm Hash digest
SHA256 e749f0336d4707ce2007d14caf07f90efdebc4ffefadbfddc593898acfd085c0
MD5 be7e603d126eb38fac3557f840094e0a
BLAKE2b-256 799c5218276f3476c6d9d77f3737930015435b3b9d2464bfe00f99500436a477
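A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using the standard hashlib module; the function name sha256_of and the local file path are assumptions for this example.

```python
import hashlib

# SHA256 digest listed for sitemapbuilder-0.0.7.tar.gz.
EXPECTED_SHA256 = "e749f0336d4707ce2007d14caf07f90efdebc4ffefadbfddc593898acfd085c0"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumes the archive was downloaded to the current directory.
    actual = sha256_of("sitemapbuilder-0.0.7.tar.gz")
    print("match" if actual == EXPECTED_SHA256 else "MISMATCH")
```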
