Simple Python3 module to crawl a website and extract URLs

sitecrawl crawls a website to a given depth and reports the internal, external, and skipped URLs it finds.

Installation

Using pip:

pip3 install sitecrawl

sitecrawl --help

Or install from source:

# Clone project
git clone https://github.com/gabfl/sitecrawl && cd sitecrawl

# Installation
pip3 install .

Usage

CLI

sitecrawl --url http://www.gab.lc --depth 3

# Add --verbose for verbose mode

Example output:

* Found 4 internal URLs
  http://www.gab.lc
  http://www.gab.lc/articles
  http://www.gab.lc/contact
  http://www.gab.lc/about

* Found 8 external URLs
  https://gpgtools.org/
  http://en.wikipedia.org/wiki/GNU_General_Public_License
  http://en.wikipedia.org/wiki/Pretty_Good_Privacy
  http://en.wikipedia.org/wiki/GNU_Privacy_Guard
  https://www.gpgtools.org
  https://www.google.com/#hl=en&q=install+gpg+windows
  http://www.gnupg.org/gph/en/manual/x135.html
  http://keys.gnupg.net

* Skipped 0 URLs
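
If the report needs to be consumed by another script, it can be parsed back into lists. The following is a minimal sketch, assuming the non-verbose output format shown above; `parse_report` is a hypothetical helper and is not part of sitecrawl.

```python
import re

# Matches the section headers of the report shown above, for example
# "* Found 4 internal URLs" or "* Skipped 0 URLs".
HEADER = re.compile(r"\* (?:Found \d+ (internal|external)|Skipped \d+) URLs")


def parse_report(text):
    """Split a sitecrawl report into internal, external and skipped URLs.

    Hypothetical helper: it assumes every non-empty line under a header
    is a single URL, as in the sample output above.
    """
    sections = {"internal": [], "external": [], "skipped": []}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER.fullmatch(line)
        if header:
            # group(1) is None for the "Skipped" header
            current = header.group(1) or "skipped"
        elif current is not None:
            sections[current].append(line)
    return sections
```

The report text could come from a file or from captured standard output. Verbose mode may print additional lines that this sketch does not handle.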

As a module

Basic example:

from sitecrawl import crawl

crawl.base_url = 'https://www.github.com'
crawl.deep_crawl(depth=2)

print('Internal URLs:', crawl.get_internal_urls())
print('External URLs:', crawl.get_external_urls())
print('Skipped URLs:', crawl.get_skipped_urls())

A more detailed example is available in example.py.
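
Once the URLs are collected, they can be post-processed with the standard library. The sketch below groups URLs by host. It assumes that `crawl.get_external_urls()` returns an iterable of URL strings, which the basic example suggests but does not confirm; `group_by_host` is a hypothetical helper, and the sample list is taken from the CLI output above so the snippet runs without a network connection.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Group URL strings by host (hypothetical helper, not part of sitecrawl)."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).netloc].append(url)
    return dict(groups)


# Stand-in for crawl.get_external_urls(), assumed to yield URL strings.
external_urls = [
    "https://gpgtools.org/",
    "http://en.wikipedia.org/wiki/GNU_General_Public_License",
    "http://en.wikipedia.org/wiki/Pretty_Good_Privacy",
    "http://en.wikipedia.org/wiki/GNU_Privacy_Guard",
    "https://www.gpgtools.org",
]

for host, urls in sorted(group_by_host(external_urls).items()):
    print(host, len(urls))
```

Note that hosts are compared as plain strings, so `gpgtools.org` and `www.gpgtools.org` end up in separate groups.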
