
Simple Python3 module to crawl a website and extract URLs


Installation

Using pip:

pip3 install sitecrawl

sitecrawl --help

Or install from source:

# Clone project
git clone https://github.com/gabfl/sitecrawl && cd sitecrawl

# Installation
pip3 install .

Usage

CLI

sitecrawl --url http://www.gab.lc --depth 3

# Add --verbose for verbose mode

Output:

* Found 4 internal URLs
  http://www.gab.lc
  http://www.gab.lc/articles
  http://www.gab.lc/contact
  http://www.gab.lc/about

* Found 8 external URLs
  https://gpgtools.org/
  http://en.wikipedia.org/wiki/GNU_General_Public_License
  http://en.wikipedia.org/wiki/Pretty_Good_Privacy
  http://en.wikipedia.org/wiki/GNU_Privacy_Guard
  https://www.gpgtools.org
  https://www.google.com/#hl=en&q=install+gpg+windows
  http://www.gnupg.org/gph/en/manual/x135.html
  http://keys.gnupg.net

* Skipped 0 URLs
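Under the hood, deciding whether a discovered link counts as internal or external usually comes down to resolving it against the base URL and comparing hostnames. A minimal sketch of that idea (illustrative only, not sitecrawl's actual implementation):

```python
from urllib.parse import urljoin, urlparse

def classify(base_url, link):
    """Classify a link found on a page as 'internal' or 'external'
    relative to base_url. Relative links are resolved against the base."""
    absolute = urljoin(base_url, link)
    if urlparse(absolute).netloc == urlparse(base_url).netloc:
        return 'internal', absolute
    return 'external', absolute

print(classify('http://www.gab.lc', '/articles'))
# ('internal', 'http://www.gab.lc/articles')
print(classify('http://www.gab.lc', 'https://gpgtools.org/'))
# ('external', 'https://gpgtools.org/')
```

A real crawler would also normalize fragments and trailing slashes before comparing, which is one source of the "skipped" count.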

As a module

Basic example:

from sitecrawl import crawl

crawl.base_url = 'https://www.github.com'
crawl.deep_crawl(depth=2)

print('Internal URLs:', crawl.get_internal_urls())
print('External URLs:', crawl.get_external_urls())
print('Skipped URLs:', crawl.get_skipped_urls())

A more detailed example is available in example.py.
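A depth-limited crawl like `deep_crawl(depth=2)` is essentially a breadth-first traversal of the site's link graph, stopping a fixed number of hops from the start page. A self-contained sketch over an in-memory link map (hypothetical data and function names, not sitecrawl's code):

```python
from collections import deque

def bfs_crawl(links, start, depth):
    """Breadth-first traversal up to `depth` hops from `start`.
    `links` maps each URL to the list of URLs it references."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, d = queue.popleft()
        if d >= depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return seen

site = {
    'http://example.com': ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a': ['http://example.com/c'],
}
print(sorted(bfs_crawl(site, 'http://example.com', 2)))
```

In a real crawler the link map is built on the fly by fetching each page and extracting its anchors; the `seen` set is what prevents revisiting pages that link to each other.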

Download files

Source distribution: sitecrawl-1.0.1.tar.gz (5.4 kB)

Built distribution: sitecrawl-1.0.1-py2.py3-none-any.whl (5.7 kB, Python 2 and 3)
