Web crawler and sitemap generator.

Project description

Sitemap generator

installing

pip install sitemap-generator

requirements

asyncio
aiofile
aiohttp

example

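import sys
import logging
from pysitemap import crawler
from pysitemap.parsers.lxml_parser import Parser

if __name__ == '__main__':
    # Configure logging so the info message below is actually emitted.
    logging.basicConfig(level=logging.INFO)

    # Optional: pass --iocp on Windows to use the proactor (IOCP) event loop.
    if '--iocp' in sys.argv:
        from asyncio import events, windows_events
        sys.argv.remove('--iocp')
        logging.info('using iocp')
        el = windows_events.ProactorEventLoop()
        events.set_event_loop(el)

    # root_url = sys.argv[1]
    root_url = 'https://www.haikson.com'
    # exclude_urls lists URL patterns to skip; "ssl": False turns off
    # certificate verification for the HTTP requests.
    crawler(
        root_url, out_file='debug/sitemap.xml', exclude_urls=[".pdf", ".jpg", ".zip"],
        http_request_options={"ssl": False}, parser=Parser
    )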

TODO

  • Big sites with more than 100K pages will use more than 100 MB of memory. Move the queue and done lists into a database. Write Queue and Done backend classes based on:

      • Lists

      • SQLite database

      • Redis

  • Write an API so that users can extend the crawler with their own backends
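
The backend idea in this TODO could look roughly like the sketch below. It is only an illustration of the plan, not code from this package: the class names (CrawlBackend, ListBackend, SQLiteBackend) and their methods are hypothetical, and the Redis variant is left out. Both backends expose the same three methods, so a crawler written against the base class would not need to know where the queue and done lists live.

```python
import sqlite3
from abc import ABC, abstractmethod
from collections import deque
from typing import Optional


class CrawlBackend(ABC):
    """Hypothetical interface for storing queued and finished URLs."""

    @abstractmethod
    def push(self, url: str) -> bool:
        """Queue url unless it was seen before; return True if queued."""

    @abstractmethod
    def pop(self) -> Optional[str]:
        """Return the next queued URL and mark it done, or None if empty."""

    @abstractmethod
    def done_count(self) -> int:
        """Number of URLs that have been handed out so far."""


class ListBackend(CrawlBackend):
    """In-memory variant: everything lives in Python containers."""

    def __init__(self) -> None:
        self._queue = deque()   # URLs waiting to be crawled
        self._seen = set()      # every URL ever pushed
        self._done = []         # URLs already handed out

    def push(self, url: str) -> bool:
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self) -> Optional[str]:
        if not self._queue:
            return None
        url = self._queue.popleft()
        self._done.append(url)
        return url

    def done_count(self) -> int:
        return len(self._done)


class SQLiteBackend(CrawlBackend):
    """SQLite variant: one table, with a flag separating queue from done."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "url TEXT UNIQUE NOT NULL, "
            "done INTEGER NOT NULL DEFAULT 0)"
        )

    def push(self, url: str) -> bool:
        seen = self._db.execute(
            "SELECT 1 FROM urls WHERE url = ?", (url,)
        ).fetchone()
        if seen is not None:
            return False
        self._db.execute("INSERT INTO urls (url) VALUES (?)", (url,))
        self._db.commit()
        return True

    def pop(self) -> Optional[str]:
        row = self._db.execute(
            "SELECT id, url FROM urls WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self._db.execute("UPDATE urls SET done = 1 WHERE id = ?", (row[0],))
        self._db.commit()
        return row[1]

    def done_count(self) -> int:
        return self._db.execute(
            "SELECT COUNT(*) FROM urls WHERE done = 1"
        ).fetchone()[0]
```

With a file path instead of ":memory:", the SQLite variant keeps the URL lists on disk rather than in process memory, which is the point of the TODO item. A user-defined backend would subclass CrawlBackend in the same way.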

changelog

Download files

Source Distribution

sitemap-generator-0.9.13.tar.gz (13.5 kB)

Uploaded Source

Built Distribution

sitemap_generator-0.9.13-py3-none-any.whl (15.9 kB)

Uploaded Python 3

File details

Details for the file sitemap-generator-0.9.13.tar.gz.

File metadata

  • Download URL: sitemap-generator-0.9.13.tar.gz
  • Upload date:
  • Size: 13.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for sitemap-generator-0.9.13.tar.gz

Algorithm    Hash digest
SHA256       62ed54b45e7d3c3380a10bc877f3f213a4b13ac188d168da7d7aae10902c9327
MD5          c1e13d2fc27e433f344217e84de5f148
BLAKE2b-256  7cd9a67678449c608eba9ad2eebe9c4189d18cb529f1a3f889abc246b2666631
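
A downloaded file can be checked against the SHA256 digest listed above with Python's standard hashlib module. This is a minimal sketch: the helper names (sha256_of, matches) are made up for illustration, and the file path in the usage note assumes the archive was saved to the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare the file's digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches("sitemap-generator-0.9.13.tar.gz", "62ed54b45e7d3c3380a10bc877f3f213a4b13ac188d168da7d7aae10902c9327") should return True for an intact download of the source distribution.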

File details

Details for the file sitemap_generator-0.9.13-py3-none-any.whl.

File hashes

Hashes for sitemap_generator-0.9.13-py3-none-any.whl

Algorithm    Hash digest
SHA256       1eb690631895f5940269747f08e966c3fcc0efebd8a3d934ce2549aaceef0885
MD5          07cab60bcb0733c510eec8bf3c2282a9
BLAKE2b-256  4448855480617478c341732174421c69e9350fdec7efd7d6df8d203ee89fb14c
