Web crawler and sitemap generator.

Sitemap generator

installing

pip install sitemap-generator

requirements

asyncio (part of the standard library on Python 3.7+)
aiofile
aiohttp

example

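import sys
import logging
from pysitemap import crawler

if __name__ == '__main__':
    # Without a configured handler, logging.info() prints nothing.
    logging.basicConfig(level=logging.INFO)

    if '--iocp' in sys.argv:
        # Windows only: switch asyncio to the IOCP-based proactor event loop.
        from asyncio import events, windows_events
        sys.argv.remove('--iocp')
        logging.info('using iocp')
        el = windows_events.ProactorEventLoop()
        events.set_event_loop(el)

    # To take the start URL from the command line instead:
    # root_url = sys.argv[1]
    root_url = 'https://www.haikson.com'
    # Crawl the site and write the result to sitemap.xml.
    crawler(root_url, out_file='sitemap.xml')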
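
The commented-out `sys.argv[1]` line hints at taking the start URL from the command line. A minimal sketch of that idea with the standard library follows. The `parse_args` helper and the `--out-file` option are illustrative assumptions and are not part of pysitemap; only `--iocp` and the `crawler(root_url, out_file=...)` call come from the example above.

```python
import argparse
from urllib.parse import urlparse


def parse_args(argv):
    """Parse crawl options from a list of command-line arguments.

    Hypothetical helper for illustration; pysitemap does not ship it.
    """
    parser = argparse.ArgumentParser(description="Generate a sitemap for a site.")
    parser.add_argument("root_url", help="URL the crawl starts from")
    parser.add_argument("--out-file", default="sitemap.xml",
                        help="path of the sitemap file to write")
    parser.add_argument("--iocp", action="store_true",
                        help="use the Windows proactor event loop")
    args = parser.parse_args(argv)

    # Reject anything that is not an absolute http(s) URL before crawling.
    parts = urlparse(args.root_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        parser.error("root_url must be an absolute http(s) URL")
    return args
```

With this in place, the script body could call `args = parse_args(sys.argv[1:])` and then `crawler(args.root_url, out_file=args.out_file)`.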

TODO

  • Big sites with more than 100K pages will use more than 100 MB of memory. Move the queue and done lists into a database. Write Queue and Done backend classes based on:

      • Lists

      • SQLite database

      • Redis

  • Write an API so users can plug in their own backends
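
The backend idea in the TODO items above could be sketched roughly as follows. This is not the package's actual code: the class names, method names, and table layout are all assumptions made for illustration. Both classes expose the same four methods, so a crawler could swap an in-memory backend for a database-backed one without other changes.

```python
import sqlite3
from collections import deque


class ListBackend:
    """In-memory todo queue and done set (hypothetical interface)."""

    def __init__(self):
        self._todo = deque()
        self._seen = set()   # every URL ever queued, used to skip duplicates
        self._done = set()

    def add(self, url):
        if url not in self._seen:
            self._seen.add(url)
            self._todo.append(url)

    def pop(self):
        # Returns the oldest queued URL, or None when the queue is empty.
        return self._todo.popleft() if self._todo else None

    def mark_done(self, url):
        self._done.add(url)

    def done_count(self):
        return len(self._done)


class SQLiteBackend:
    """Same interface, with all state kept in one SQLite table."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        # state: 0 = todo, 1 = handed out by pop(), 2 = done
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "url TEXT UNIQUE NOT NULL, "
            "state INTEGER NOT NULL DEFAULT 0)"
        )

    def add(self, url):
        # INSERT OR IGNORE skips URLs that are already in the table.
        with self._db:
            self._db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))

    def pop(self):
        with self._db:
            row = self._db.execute(
                "SELECT id, url FROM urls WHERE state = 0 ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            self._db.execute("UPDATE urls SET state = 1 WHERE id = ?", (row[0],))
            return row[1]

    def mark_done(self, url):
        with self._db:
            self._db.execute("UPDATE urls SET state = 2 WHERE url = ?", (url,))

    def done_count(self):
        row = self._db.execute("SELECT COUNT(*) FROM urls WHERE state = 2").fetchone()
        return row[0]
```

A Redis backend could follow the same four-method shape. Note that this sketch commits once per call, which keeps the code short but is slow for file-based databases; batching several URLs into one transaction would reduce that cost.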

changelog

v. 0.9.2

  • added todo queue and done list backends

  • added a very slow SQLite backend for the todo queue and done list (writing 1,000 URLs takes about 3 minutes)

  • added tests for the sqlite_todo backend

v. 0.9.1

  • extended the README

  • added docstrings and code comments

v. 0.9.0

  • starting with this version, the package supports only Python >= 3.7

  • all functions were rewritten, but the API is unchanged. If you already use this package, just update it, install the requirements, and run it as before

  • all requests now run asynchronously

Download files

Source Distribution

sitemap-generator-0.9.2.tar.gz (8.8 kB)

Uploaded Source

Built Distribution

sitemap_generator-0.9.2-py3-none-any.whl (14.6 kB)

Uploaded Python 3

File details

Details for the file sitemap-generator-0.9.2.tar.gz.

File metadata

  • Download URL: sitemap-generator-0.9.2.tar.gz
  • Upload date:
  • Size: 8.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.8.1

File hashes

Hashes for sitemap-generator-0.9.2.tar.gz
Algorithm    Hash digest
SHA256       4fc6a2b68c17446eeee5df57e45b25767188700936ac197fd2443f8378337f8d
MD5          5680aea1affb134e5fac933f975de62f
BLAKE2b-256  110cb226006d0c8f79f743a36ef292b2feff71c952024c792d3e4219bcf4c4e3

File details

Details for the file sitemap_generator-0.9.2-py3-none-any.whl.

File metadata

  • Download URL: sitemap_generator-0.9.2-py3-none-any.whl
  • Upload date:
  • Size: 14.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.8.1

File hashes

Hashes for sitemap_generator-0.9.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       961accfc74e63eabd1a10b6b0eba1059672d6f668a1c187057473dee167ab85c
MD5          eb0622422c5e17fa2cb3a24a4e3d35d1
BLAKE2b-256  e299cfad76e6ff820fe32625daa6174bbd28509a9da5866b1bfcdb56504c634a
