Web crawler and sitemap generator.

Sitemap generator

installing

pip install sitemap-generator

requirements

asyncio (included in the standard library on Python 3.7 and later)
aiofile
aiohttp

example

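import sys
import logging
from pysitemap import crawler

if __name__ == '__main__':
    # Show INFO messages; without this, logging.info() prints nothing.
    logging.basicConfig(level=logging.INFO)

    # Optional, Windows only: pass --iocp to use the proactor (IOCP) event loop.
    if '--iocp' in sys.argv:
        from asyncio import events, windows_events
        sys.argv.remove('--iocp')
        logging.info('using iocp')
        el = windows_events.ProactorEventLoop()
        events.set_event_loop(el)

    # To take the start URL from the command line instead, use:
    # root_url = sys.argv[1]
    root_url = 'https://www.haikson.com'
    # Crawl the site and write the result to sitemap.xml.
    crawler(root_url, out_file='sitemap.xml')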
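
For orientation, the sketch below shows one way to write a list of URLs in the sitemaps.org XML format using only the standard library. It is not how pysitemap builds its output: the function name build_sitemap is hypothetical, and the file that crawler() writes may contain additional fields.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: Iterable[str]) -> bytes:
    """Return a minimal sitemap document (hypothetical helper)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    # De-duplicate and sort so the output is stable between runs.
    for url in sorted(set(urls)):
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(urlset, encoding="utf-8")


if __name__ == "__main__":
    pages = ["https://example.com/", "https://example.com/about"]
    print(build_sitemap(pages).decode("utf-8"))
```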

TODO

  • Sites with more than 100K pages will use more than 100 MB of memory. Move the queue and done lists into a database.
  • Write Queue and Done backend classes based on:
      • lists
      • SQLite database
      • Redis
  • Write an API that lets users add their own backends
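
The backend idea above can be sketched as two interchangeable classes with the same four methods. This is only an illustration: the class names, method names, and table layout are assumptions, not the package's actual backend API.

```python
import sqlite3
from typing import Iterable, List, Optional


class ListBackend:
    """Hypothetical in-memory backend: everything is kept in RAM."""

    def __init__(self) -> None:
        self._todo = []      # URLs waiting to be crawled
        self._seen = set()   # every URL ever queued, to avoid duplicates
        self._done = set()   # URLs already crawled

    def add(self, urls: Iterable[str]) -> None:
        for url in urls:
            if url not in self._seen:
                self._seen.add(url)
                self._todo.append(url)

    def pop(self) -> Optional[str]:
        return self._todo.pop() if self._todo else None

    def mark_done(self, url: str) -> None:
        self._done.add(url)

    def done(self) -> List[str]:
        return sorted(self._done)


class SQLiteBackend:
    """Hypothetical SQLite backend exposing the same methods."""

    # state: 0 = todo, 1 = in progress, 2 = done
    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "url TEXT PRIMARY KEY, state INTEGER NOT NULL DEFAULT 0)"
        )

    def add(self, urls: Iterable[str]) -> None:
        # One transaction per batch; INSERT OR IGNORE skips known URLs.
        with self._db:
            self._db.executemany(
                "INSERT OR IGNORE INTO urls (url) VALUES (?)",
                ((u,) for u in urls),
            )

    def pop(self) -> Optional[str]:
        with self._db:
            row = self._db.execute(
                "SELECT url FROM urls WHERE state = 0 LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            self._db.execute("UPDATE urls SET state = 1 WHERE url = ?", row)
            return row[0]

    def mark_done(self, url: str) -> None:
        with self._db:
            self._db.execute("UPDATE urls SET state = 2 WHERE url = ?", (url,))

    def done(self) -> List[str]:
        rows = self._db.execute(
            "SELECT url FROM urls WHERE state = 2 ORDER BY url"
        )
        return [r[0] for r in rows]
```

Because both classes expose the same methods, a crawler written against add, pop, mark_done, and done could switch between them without other changes; a Redis backend would follow the same shape.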

changelog

v. 0.9.2

  • todo queue and done list backends
  • added a very slow SQLite backend for the todo queue and done lists (writing 1000 URLs takes 3 minutes)
  • tests for sqlite_todo backend

v. 0.9.1

  • extended readme
  • docstrings and code comments

v. 0.9.0

  • from this version on, the package supports only Python >= 3.7
  • all functions were rewritten, but the API is unchanged. If you use this package, just update it, install the requirements, and run your process
  • all requests are made asynchronously

Files for sitemap-generator, version 0.9.4

  • sitemap_generator-0.9.4-py3-none-any.whl (14.6 kB), file type: Wheel, Python version: py3
  • sitemap-generator-0.9.4.tar.gz (8.8 kB), file type: Source, Python version: None
