Simple Python3 module to crawl a website and extract URLs

Project description

MIT licensed

Installation

Using pip:

pip3 install sitecrawl

sitecrawl --help

Or build from sources:

# Clone project
git clone https://github.com/gabfl/sitecrawl && cd sitecrawl

# Installation
pip3 install .

Usage

CLI

sitecrawl --url http://www.gab.lc --depth 3

# Add --verbose for verbose mode

Output:

* Found 4 internal URLs
  http://www.gab.lc
  http://www.gab.lc/articles
  http://www.gab.lc/contact
  http://www.gab.lc/about

* Found 8 external URLs
  https://gpgtools.org/
  http://en.wikipedia.org/wiki/GNU_General_Public_License
  http://en.wikipedia.org/wiki/Pretty_Good_Privacy
  http://en.wikipedia.org/wiki/GNU_Privacy_Guard
  https://www.gpgtools.org
  https://www.google.com/#hl=en&q=install+gpg+windows
  http://www.gnupg.org/gph/en/manual/x135.html
  http://keys.gnupg.net

* Skipped 0 URLs
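
The output above separates internal URLs from external ones. The source does not describe how sitecrawl makes that decision, so the following is only a minimal sketch of one plausible rule: a URL counts as internal when its host matches the host of the base URL. The function name `classify_urls` and the sample list are hypothetical and not part of sitecrawl.

```python
from urllib.parse import urlparse


def classify_urls(base_url, urls):
    """Split urls into (internal, external) lists.

    Assumption: a URL is internal when its host equals the host of
    base_url. sitecrawl itself may apply a different rule.
    """
    base_host = urlparse(base_url).netloc.lower()
    internal, external = [], []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host == base_host:
            internal.append(url)
        else:
            external.append(url)
    return internal, external


# Sample URLs taken from the CLI output above
sample = [
    'http://www.gab.lc/articles',
    'http://www.gab.lc/contact',
    'https://gpgtools.org/',
    'http://keys.gnupg.net',
]

internal, external = classify_urls('http://www.gab.lc', sample)
print('Internal:', internal)
print('External:', external)
```

Comparing full hostnames treats subdomains as external sites; a crawler that should follow subdomains would need a looser comparison.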

As a module

Basic example:

from sitecrawl import crawl

# Set the URL to start crawling from
crawl.base_url = 'https://www.github.com'
# Crawl the site with a depth of 2
crawl.deep_crawl(depth=2)

# Print the URLs collected during the crawl
print('Internal URLs:', crawl.get_internal_urls())
print('External URLs:', crawl.get_external_urls())
print('Skipped URLs:', crawl.get_skipped_urls())

A more detailed example is available in example.py.
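
Once a crawl has finished, the collected URLs can be processed with the standard library. The sketch below groups a list of URLs by host. It uses a hard-coded list in place of the result of `crawl.get_external_urls()`, on the assumption that this call returns an iterable of URL strings (the source does not state the return type). The helper `group_by_host` is hypothetical and not part of sitecrawl.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_host(urls):
    """Return a dict mapping each host to the list of URLs found on it."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)
    return dict(groups)


# Stand-in for crawl.get_external_urls(); assumed to be a list of strings
external_urls = [
    'https://gpgtools.org/',
    'http://en.wikipedia.org/wiki/GNU_General_Public_License',
    'http://en.wikipedia.org/wiki/Pretty_Good_Privacy',
    'https://www.gpgtools.org',
]

grouped = group_by_host(external_urls)
for host in sorted(grouped):
    print(host, len(grouped[host]))
```

Note that `gpgtools.org` and `www.gpgtools.org` end up in separate groups, because the hosts are compared as plain strings.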

