Web crawler and sitemap generator.

Sitemap generator

installing

pip install sitemap-generator

requirements

asyncio
aiofile
aiohttp

example

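import sys
import logging
from pysitemap import crawler

if __name__ == '__main__':
    # Show logging.info() messages (the root logger defaults to WARNING).
    logging.basicConfig(level=logging.INFO)

    if '--iocp' in sys.argv:
        # Windows only: use the IOCP-based proactor event loop.
        from asyncio import events, windows_events
        sys.argv.remove('--iocp')
        logging.info('using iocp')
        el = windows_events.ProactorEventLoop()
        events.set_event_loop(el)

    # root_url = sys.argv[1]
    root_url = 'https://www.haikson.com'
    # Crawl the site starting at root_url and write the result to sitemap.xml.
    crawler(root_url, out_file='sitemap.xml')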
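
For orientation, here is a minimal sketch of the kind of file a sitemap generator produces, following the public sitemaps.org 0.9 format. It does not use pysitemap and is not a description of what the package writes internally; the function name build_sitemap and the example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document (bytes) listing each URL once, sorted.

    Hypothetical helper for illustration; not part of pysitemap.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in sorted(set(urls)):
        # Each page becomes <url><loc>...</loc></url>.
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    # Prepend the XML declaration by hand so the sketch also works on Python 3.7.
    header = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return header + ET.tostring(urlset, encoding="utf-8")


if __name__ == "__main__":
    pages = ["https://example.com/about", "https://example.com/"]
    print(build_sitemap(pages).decode("utf-8"))
```

Duplicate URLs are collapsed with set() because a crawler typically reaches the same page through several links.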

TODO

  • Big sites with more than 100K pages will use more than 100 MB of memory. Move the queue and done lists into a database. Write Queue and Done backend classes based on:

      • Lists

      • SQLite database

      • Redis

  • Write an API so that users can plug in their own backends
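
The backend idea in this list could be sketched roughly as follows. This is an assumption about one possible shape, not the package's actual API: the class names TodoDoneBackend, MemoryBackend and SQLiteBackend, their methods, and the table layout are all hypothetical.

```python
import sqlite3
from abc import ABC, abstractmethod
from collections import deque


class TodoDoneBackend(ABC):
    """Hypothetical interface for todo-queue / done-list storage."""

    @abstractmethod
    def add_todo(self, urls):
        """Queue URLs that have not been seen before."""

    @abstractmethod
    def pop_todo(self):
        """Return the next pending URL, or None if the queue is empty."""

    @abstractmethod
    def mark_done(self, url):
        """Record that a URL has been crawled."""

    @abstractmethod
    def is_done(self, url):
        """Return True if the URL has been crawled."""


class MemoryBackend(TodoDoneBackend):
    """List-based variant: fast, but everything lives in RAM."""

    def __init__(self):
        self._todo, self._seen, self._done = deque(), set(), set()

    def add_todo(self, urls):
        for url in urls:
            if url not in self._seen:
                self._seen.add(url)
                self._todo.append(url)

    def pop_todo(self):
        return self._todo.popleft() if self._todo else None

    def mark_done(self, url):
        self._seen.add(url)
        self._done.add(url)

    def is_done(self, url):
        return url in self._done


class SQLiteBackend(TodoDoneBackend):
    """SQLite variant. state: 0 = pending, 1 = handed out, 2 = done."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        with self._db:
            self._db.execute(
                "CREATE TABLE IF NOT EXISTS urls "
                "(url TEXT PRIMARY KEY, state INTEGER NOT NULL DEFAULT 0)"
            )

    def add_todo(self, urls):
        # One transaction per batch; known URLs are skipped by OR IGNORE.
        with self._db:
            self._db.executemany(
                "INSERT OR IGNORE INTO urls (url) VALUES (?)",
                ((u,) for u in urls),
            )

    def pop_todo(self):
        row = self._db.execute(
            "SELECT url FROM urls WHERE state = 0 ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with self._db:
            self._db.execute("UPDATE urls SET state = 1 WHERE url = ?", row)
        return row[0]

    def mark_done(self, url):
        with self._db:
            self._db.execute(
                "INSERT OR IGNORE INTO urls (url, state) VALUES (?, 2)", (url,)
            )
            self._db.execute("UPDATE urls SET state = 2 WHERE url = ?", (url,))

    def is_done(self, url):
        row = self._db.execute(
            "SELECT 1 FROM urls WHERE url = ? AND state = 2", (url,)
        ).fetchone()
        return row is not None
```

With a shared interface like this, a crawler loop would only call add_todo, pop_todo, mark_done and is_done, so a Redis-backed or user-supplied class could be swapped in without touching the crawl logic.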

changelog

v. 0.9.2

  • added todo queue and done list backends

  • added a very slow SQLite backend for the todo queue and done list (writing 1000 URLs takes about 3 minutes)

  • added tests for the sqlite_todo backend

v. 0.9.1

  • extended readme

  • docstrings and code comments

v. 0.9.0

  • starting with this version, the package supports only Python >= 3.7

  • all functions were rewritten, but the API is unchanged. If you use this package, just update it, install the requirements, and run it as before

  • all requests now run asynchronously

