Web crawler and sitemap generator.

Sitemap generator

installing

pip install sitemap-generator

requirements

asyncio (included in the Python standard library)
aiofile
aiohttp

example

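import sys
import logging
from pysitemap import crawler

if __name__ == '__main__':
    # Show INFO-level messages such as the 'using iocp' notice below.
    logging.basicConfig(level=logging.INFO)

    if '--iocp' in sys.argv:
        # Windows only: switch to the IOCP-based proactor event loop.
        from asyncio import events, windows_events
        sys.argv.remove('--iocp')
        logging.info('using iocp')
        el = windows_events.ProactorEventLoop()
        events.set_event_loop(el)

    # root_url = sys.argv[1]
    root_url = 'https://www.haikson.com'
    # Skip every URL that contains one of the exclude_urls substrings.
    crawler(root_url, out_file='sitemap.xml', exclude_urls=[".pdf", ".jpg", ".zip"])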
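
For orientation, the kind of file that out_file points to can be sketched with the standard library alone. This is not pysitemap's own writer: build_sitemap is a hypothetical helper, and the only thing assumed about the output is the public sitemaps.org format (a urlset element containing one url/loc pair per page).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document for the given URLs (hypothetical helper)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    # Drop duplicates and sort so the output is stable between runs.
    for url in sorted(set(urls)):
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        "https://www.haikson.com/",
        "https://www.haikson.com/about",
    ]))
```

The real package may add further elements per URL; the sketch only covers the required ones.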

TODO

  • Big sites with more than 100K pages will use more than 100 MB of memory. Move the queue and done lists into a database. Write Queue and Done backend classes based on:

      • Lists

      • SQLite database

      • Redis

  • Write an API so that users can extend the crawler with their own backends
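
One possible shape for such backends is sketched below. It is only an illustration of the idea in this TODO, not the package's API: the class names ListBackend and SQLiteBackend, the methods add, pop, mark_done and is_done, and the urls table layout are all assumptions made for the example. Both classes expose the same methods, so a crawler written against one could be handed the other.

```python
import sqlite3


class ListBackend:
    """In-memory backend (hypothetical): fast, but grows with the site."""

    def __init__(self):
        self._todo = []       # URLs waiting to be fetched, oldest first
        self._known = set()   # every URL ever added, used to reject duplicates
        self._done = set()    # URLs that have been crawled

    def add(self, url):
        if url not in self._known:
            self._known.add(url)
            self._todo.append(url)

    def pop(self):
        # Return the oldest waiting URL, or None when nothing is left.
        return self._todo.pop(0) if self._todo else None

    def mark_done(self, url):
        self._done.add(url)

    def is_done(self, url):
        return url in self._done


class SQLiteBackend:
    """Same interface (hypothetical), with the state kept in SQLite."""

    TODO, BUSY, DONE = 0, 1, 2

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "url TEXT UNIQUE NOT NULL, "
            "state INTEGER NOT NULL DEFAULT 0)"
        )

    def add(self, url):
        # The UNIQUE constraint plus OR IGNORE rejects duplicates.
        self._db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        self._db.commit()

    def pop(self):
        row = self._db.execute(
            "SELECT id, url FROM urls WHERE state = ? ORDER BY id LIMIT 1",
            (self.TODO,),
        ).fetchone()
        if row is None:
            return None
        # Mark the URL as in progress so it is not handed out twice.
        self._db.execute(
            "UPDATE urls SET state = ? WHERE id = ?", (self.BUSY, row[0])
        )
        self._db.commit()
        return row[1]

    def mark_done(self, url):
        self._db.execute(
            "UPDATE urls SET state = ? WHERE url = ?", (self.DONE, url)
        )
        self._db.commit()

    def is_done(self, url):
        row = self._db.execute(
            "SELECT 1 FROM urls WHERE url = ? AND state = ?", (url, self.DONE)
        ).fetchone()
        return row is not None
```

Committing after every write keeps the sketch simple but is slow; batching several writes into one transaction is the usual way to speed up an SQLite-backed queue.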

changelog

v. 0.9.8

  • new exclude_urls parameter for pysitemap.crawler

  • Crawler. exclude_urls parameter.

    A URL is skipped if it contains any substring from exclude_urls. The default value is an empty list.

  • Crawler. set_exclude_url method.
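
The exclude_urls check described in this release can be pictured as a plain substring test. The function below is a minimal sketch of that rule, not the package's implementation; the name is_excluded is made up for the example.

```python
def is_excluded(url, exclude_urls=()):
    """Return True if url contains any of the exclude_urls substrings.

    Hypothetical helper: it mirrors the rule stated above, where an
    empty exclude_urls excludes nothing.
    """
    return any(fragment in url for fragment in exclude_urls)


if __name__ == "__main__":
    patterns = [".pdf", ".jpg", ".zip"]
    for candidate in ("https://www.haikson.com/docs/price.pdf",
                      "https://www.haikson.com/about"):
        print(candidate, is_excluded(candidate, patterns))
```

Because the test is a substring match and not an extension match, a pattern such as ".pdf" also excludes URLs where it appears in the middle of the path or in the query string.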

v. 0.9.2

  • todo queue and done list backends

  • created a very slow sqlite backend for the todo queue and done lists (writing 1000 URLs takes 3 minutes)

  • tests for sqlite_todo backend

v. 0.9.1

  • extended readme

  • docstrings and code comments

v. 0.9.0

  • starting with this version, the package supports only Python >=3.7

  • all functions were rewritten, but the API is unchanged. If you use this package, just update it, install the requirements, and run your process as before

  • all requests are now made asynchronously
