Simple Python 3 module to crawl a website and extract URLs

Installation

Using pip:

pip3 install sitecrawl

sitecrawl --help

Or install from source:

# Clone project
git clone https://github.com/gabfl/sitecrawl && cd sitecrawl

# Installation
pip3 install .

Usage

CLI

sitecrawl --url http://www.gab.lc --depth 3

# Add --verbose for verbose mode

Example output:

* Found 4 internal URLs
  http://www.gab.lc
  http://www.gab.lc/articles
  http://www.gab.lc/contact
  http://www.gab.lc/about

* Found 8 external URLs
  https://gpgtools.org/
  http://en.wikipedia.org/wiki/GNU_General_Public_License
  http://en.wikipedia.org/wiki/Pretty_Good_Privacy
  http://en.wikipedia.org/wiki/GNU_Privacy_Guard
  https://www.gpgtools.org
  https://www.google.com/#hl=en&q=install+gpg+windows
  http://www.gnupg.org/gph/en/manual/x135.html
  http://keys.gnupg.net

* Skipped 0 URLs

As a module

Basic example:

from sitecrawl import crawl

crawl.base_url = 'https://www.github.com'
crawl.deep_crawl(depth=2)

print('Internal URLs:', crawl.get_internal_urls())
print('External URLs:', crawl.get_external_urls())
print('Skipped URLs:', crawl.get_skipped_urls())

A more detailed example is available in example.py.
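
Once the three lists are available, they can be post-processed like any other Python data. The sketch below builds a small JSON report from them. `build_report` is a hypothetical helper, not part of sitecrawl, and it assumes the getters shown above return iterables of URL strings. Sample data is used here so the snippet runs without a network connection.

```python
import json


def build_report(internal, external, skipped):
    """Return a JSON string with de-duplicated, sorted URL lists and counts.

    Hypothetical helper for illustration; it only assumes each argument
    is an iterable of URL strings.
    """
    groups = {
        'internal': sorted(set(internal)),
        'external': sorted(set(external)),
        'skipped': sorted(set(skipped)),
    }
    report = {
        name: {'count': len(urls), 'urls': urls}
        for name, urls in groups.items()
    }
    return json.dumps(report, indent=2)


# Sample data standing in for the results of a real crawl.
print(build_report(
    internal=['http://www.gab.lc', 'http://www.gab.lc/about', 'http://www.gab.lc'],
    external=['https://gpgtools.org/'],
    skipped=[],
))
```

After a real crawl, the same function could be called as `build_report(crawl.get_internal_urls(), crawl.get_external_urls(), crawl.get_skipped_urls())`, provided the return types match the assumption above.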
