web crawler and sitemap generator.

Project description

Sitemap generator

installing

pip install sitemap-generator

requirements

asyncio
aiofile
aiohttp

example

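import sys
import logging
from pysitemap import crawler
from pysitemap.parsers.lxml_parser import Parser

if __name__ == '__main__':
    # Configure logging so the info message below is actually emitted.
    logging.basicConfig(level=logging.INFO)

    # Optional: pass --iocp on Windows to use the proactor (IOCP) event loop.
    if '--iocp' in sys.argv:
        from asyncio import events, windows_events
        sys.argv.remove('--iocp')
        logging.info('using iocp')
        el = windows_events.ProactorEventLoop()
        events.set_event_loop(el)

    # root_url = sys.argv[1]
    root_url = 'https://www.haikson.com'
    # Crawl the site and write the sitemap to out_file; links matching
    # exclude_urls are skipped. http_request_options is presumably forwarded
    # to aiohttp, where ssl=False turns off certificate verification.
    crawler(
        root_url, out_file='debug/sitemap.xml', exclude_urls=[".pdf", ".jpg", ".zip"],
        http_request_options={"ssl": False}, parser=Parser
    )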

TODO

  • Big sites with more than 100K pages will use more than 100 MB of memory. Move the queue and done lists into a database. Write Queue and Done backend classes based on:

      • Lists

      • SQLite database

      • Redis

  • Write an API for extending with user-defined backends
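
The backend idea in the TODO list could look roughly like the sketch below. It is only an illustration: CrawlBackend, MemoryBackend and SQLiteBackend are hypothetical names, not part of pysitemap, and the interface (add, pop, done_count) is an assumption about what a crawler would need. A Redis backend is not shown, but it could follow the same interface.

```python
import sqlite3
from abc import ABC, abstractmethod
from collections import deque


class CrawlBackend(ABC):
    """Hypothetical storage interface for the crawl queue and the done list."""

    @abstractmethod
    def add(self, url):
        """Queue a URL unless it was seen before; return True if it was added."""

    @abstractmethod
    def pop(self):
        """Return the next queued URL and mark it done, or None when empty."""

    @abstractmethod
    def done_count(self):
        """Return the number of URLs already handed out by pop()."""


class MemoryBackend(CrawlBackend):
    """In-memory variant (the "Lists" option): fast, but grows with the site."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()  # every URL that is queued or done
        self._done = 0

    def add(self, url):
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self):
        if not self._queue:
            return None
        self._done += 1
        return self._queue.popleft()

    def done_count(self):
        return self._done


class SQLiteBackend(CrawlBackend):
    """SQLite variant: URLs live in a table, so memory use stays flat."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "url TEXT UNIQUE NOT NULL, "
            "done INTEGER NOT NULL DEFAULT 0)"
        )

    def add(self, url):
        # The UNIQUE constraint makes duplicate inserts a no-op.
        cur = self._db.execute(
            "INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,)
        )
        self._db.commit()
        return cur.rowcount == 1

    def pop(self):
        row = self._db.execute(
            "SELECT id, url FROM urls WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self._db.execute("UPDATE urls SET done = 1 WHERE id = ?", (row[0],))
        self._db.commit()
        return row[1]

    def done_count(self):
        cur = self._db.execute("SELECT COUNT(*) FROM urls WHERE done = 1")
        return cur.fetchone()[0]
```

Because both classes share one interface, a crawler written against CrawlBackend could switch from memory to disk by passing a file path to SQLiteBackend, which is the kind of user-extensible API the last TODO item asks for.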

Download files

Source Distribution

sitemap-generator-0.9.10.win-amd64.zip (28.3 kB)

Uploaded Source

Built Distribution

sitemap_generator-0.9.10-py3-none-any.whl (14.6 kB)

Uploaded Python 3

File details

Details for the file sitemap-generator-0.9.10.win-amd64.zip.

File hashes

Hashes for sitemap-generator-0.9.10.win-amd64.zip
Algorithm    Hash digest
SHA256       6495c6f5cd0ebc556ed6e770452f91399270b83d89589a8421e4edc8a23ff246
MD5          aef6e76a9be3c1dea10a9590ef4047c3
BLAKE2b-256  9fe48efaf653d767f505cda78b95463e9defcfa9564d06242fc3ae38d9993f8a

File details

Details for the file sitemap_generator-0.9.10-py3-none-any.whl.

File hashes

Hashes for sitemap_generator-0.9.10-py3-none-any.whl
Algorithm    Hash digest
SHA256       c3c0de27503018cd6476bca4bb55318ce6a207cae1b654f8df6fb674470ae5fd
MD5          19ad0b4d6fa593be21f9c79d0090b78f
BLAKE2b-256  3e3405a2e748b6cca4a9c821628f9e0886f73d097b3570b8064d8c2ed89f24cb
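
A downloaded file can be checked against the SHA256 digests listed above with the standard library. The sketch below is a minimal example; the helper names sha256_of and matches_published_hash are made up for illustration, and it assumes the file has already been downloaded to a local path.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_published_hash("sitemap_generator-0.9.10-py3-none-any.whl", "c3c0de27503018cd6476bca4bb55318ce6a207cae1b654f8df6fb674470ae5fd") should return True for an unmodified copy of the wheel.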
