web crawler and sitemap generator.

Sitemap generator

installing

pip install sitemap-generator

requirements

asyncio (part of the standard library on Python >= 3.7)
aiofile
aiohttp

example

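import sys
import logging
from pysitemap import crawler

if __name__ == '__main__':
    # Optional, Windows only: pass --iocp to use the proactor (IOCP) event loop.
    if '--iocp' in sys.argv:
        from asyncio import events, windows_events
        sys.argv.remove('--iocp')
        logging.info('using iocp')
        el = windows_events.ProactorEventLoop()
        events.set_event_loop(el)

    # To take the start URL from the command line instead:
    # root_url = sys.argv[1]
    root_url = 'https://www.haikson.com'
    # Crawl the site starting at root_url and write the result to sitemap.xml.
    crawler(root_url, out_file='sitemap.xml')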
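
After a run it can be handy to check what ended up in the output file. The sketch below is not part of pysitemap; it assumes the generated file follows the standard sitemaps.org layout (a `urlset` root with `url/loc` children), which is not confirmed here. `sitemap_urls` and the sample data are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (assumed to be used by the output).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> values found in a sitemap document given as a string."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample standing in for the contents of sitemap.xml.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/page/</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE):
        print(url)
```

To inspect a real run, read the file first, for example `sitemap_urls(open('sitemap.xml', encoding='utf-8').read())`.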

TODO

  • Big sites with more than 100K pages will use more than 100 MB of memory. Move the queue and done lists into a database. Write Queue and Done backend classes based on:
      • Lists
      • SQLite database
      • Redis
  • Write an API so users can plug in their own backends
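
One way the backend idea from the list above could look is sketched here. This is an illustration only, not the package's actual classes: `CrawlBackend`, `ListBackend`, `SQLiteBackend`, and all method names are hypothetical. The SQLite variant also offers a batched `add_many`, on the general assumption that committing once per batch is cheaper than committing once per URL.

```python
import sqlite3
from abc import ABC, abstractmethod
from collections import deque
from typing import Iterable, Optional


class CrawlBackend(ABC):
    """Hypothetical interface: a todo queue plus a record of URLs already seen."""

    @abstractmethod
    def add(self, url: str) -> bool:
        """Queue url if it was never seen; return True if it was queued."""

    @abstractmethod
    def pop(self) -> Optional[str]:
        """Return the next pending URL and mark it done, or None if empty."""

    @abstractmethod
    def done_count(self) -> int:
        """Return how many URLs have been handed out so far."""


class ListBackend(CrawlBackend):
    """In-memory variant: fast, but memory grows with the number of pages."""

    def __init__(self) -> None:
        self._todo = deque()
        self._seen = set()
        self._done = 0

    def add(self, url: str) -> bool:
        if url in self._seen:
            return False
        self._seen.add(url)
        self._todo.append(url)
        return True

    def pop(self) -> Optional[str]:
        if not self._todo:
            return None
        self._done += 1
        return self._todo.popleft()

    def done_count(self) -> int:
        return self._done


class SQLiteBackend(CrawlBackend):
    """SQLite variant: one table holds both the queue and the done flag."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "url TEXT UNIQUE NOT NULL, "
            "done INTEGER NOT NULL DEFAULT 0)"
        )
        self._db.commit()

    def add(self, url: str) -> bool:
        cur = self._db.execute(
            "INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,)
        )
        self._db.commit()
        return cur.rowcount == 1  # 0 when the URL already existed

    def add_many(self, urls: Iterable[str]) -> None:
        # A single transaction for the whole batch instead of one commit per URL.
        with self._db:
            self._db.executemany(
                "INSERT OR IGNORE INTO urls (url) VALUES (?)",
                ((u,) for u in urls),
            )

    def pop(self) -> Optional[str]:
        row = self._db.execute(
            "SELECT id, url FROM urls WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self._db.execute("UPDATE urls SET done = 1 WHERE id = ?", (row[0],))
        self._db.commit()
        return row[1]

    def done_count(self) -> int:
        return self._db.execute(
            "SELECT COUNT(*) FROM urls WHERE done = 1"
        ).fetchone()[0]
```

With a shared interface like this, a crawler would only call `add`, `pop`, and `done_count`, so a Redis-backed or user-written class could be swapped in without touching the crawl loop.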

changelog

v. 0.9.2

  • todo queue and done list backends
  • added a very slow SQLite backend for the todo queue and done list (writing 1000 URLs takes 3 minutes)
  • tests for the sqlite_todo backend

v. 0.9.1

  • extended readme
  • docstrings and code comments

v. 0.9.0

  • from this version on, the package supports only Python >= 3.7
  • all functions were rewritten, but the API is unchanged. If you use this package, just update it, install the requirements, and run the process
  • all requests are made asynchronously
