web crawler and sitemap generator.

Sitemap generator

installing

pip install sitemap-generator

requirements

asyncio
aiofile
aiohttp

example

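import sys
import logging
from pysitemap import crawler
from pysitemap.parsers.lxml_parser import Parser

if __name__ == '__main__':
    # Without a configured handler, logging.info() below prints nothing.
    logging.basicConfig(level=logging.INFO)

    # Optional, Windows only: pass --iocp to use the proactor (IOCP) event loop.
    if '--iocp' in sys.argv:
        from asyncio import events, windows_events
        sys.argv.remove('--iocp')
        logging.info('using iocp')
        el = windows_events.ProactorEventLoop()
        events.set_event_loop(el)

    # root_url = sys.argv[1]
    root_url = 'https://www.haikson.com'
    crawler(
        root_url,
        out_file='debug/sitemap.xml',           # where the sitemap is written
        exclude_urls=[".pdf", ".jpg", ".zip"],  # patterns for URLs to leave out
        # options for the HTTP requests; in aiohttp, ssl=False skips
        # certificate verification
        http_request_options={"ssl": False},
        parser=Parser
    )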

TODO

  • Big sites with more than 100K pages will use more than 100 MB of memory. Move the queue and done lists into a database. Write Queue and Done backend classes based on:

      • Lists

      • SQLite database

      • Redis

  • Write an API that lets users plug in their own backends
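
The backend idea in the TODO list could look roughly like the sketch below. It is only an illustration of the direction, not code from pysitemap: the class names `ListBackend` and `SQLiteBackend` and the methods `add`, `pop`, `mark_done`, and `done_count` are all hypothetical. Both classes expose the same four methods, so a crawler could switch from the in-memory version to the SQLite version without other changes.

```python
import sqlite3
from collections import deque
from typing import Optional


class ListBackend:
    """Hypothetical in-memory backend: everything lives in RAM."""

    def __init__(self) -> None:
        self.queue = deque()  # URLs waiting to be crawled
        self.seen = set()     # every URL ever added, for de-duplication
        self.done = set()     # URLs already crawled

    def add(self, url: str) -> bool:
        """Queue a URL; return False if it was already known."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def pop(self) -> Optional[str]:
        """Return the next queued URL, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None

    def mark_done(self, url: str) -> None:
        self.done.add(url)

    def done_count(self) -> int:
        return len(self.done)


class SQLiteBackend:
    """Hypothetical SQLite backend: same methods, state kept in a table."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "url TEXT UNIQUE NOT NULL, "
            "state TEXT NOT NULL DEFAULT 'queued')"
        )
        self.conn.commit()

    def add(self, url: str) -> bool:
        # The UNIQUE constraint does the de-duplication.
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,)
        )
        self.conn.commit()
        return cur.rowcount == 1

    def pop(self) -> Optional[str]:
        row = self.conn.execute(
            "SELECT id, url FROM urls WHERE state = 'queued' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Mark the URL as in progress so it is not handed out twice.
        self.conn.execute(
            "UPDATE urls SET state = 'active' WHERE id = ?", (row[0],)
        )
        self.conn.commit()
        return row[1]

    def mark_done(self, url: str) -> None:
        self.conn.execute(
            "UPDATE urls SET state = 'done' WHERE url = ?", (url,)
        )
        self.conn.commit()

    def done_count(self) -> int:
        return self.conn.execute(
            "SELECT COUNT(*) FROM urls WHERE state = 'done'"
        ).fetchone()[0]
```

Passing a file path instead of `":memory:"` to `SQLiteBackend` would keep the queue on disk, which is the point of the TODO item. A Redis backend could follow the same four-method shape.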

changelog
