Simple Python3 module to crawl a website and extract URLs

MIT licensed.

Installation

Using pip:

pip3 install sitecrawl

sitecrawl --help

Or build from sources:

# Clone project
git clone https://github.com/gabfl/sitecrawl && cd sitecrawl

# Installation
pip3 install .

Usage

CLI

sitecrawl --url https://www.yahoo.com/ --depth 2 --max 4 --verbose

Output:

* Found 4 internal URLs
  https://www.yahoo.com
  https://www.yahoo.com/entertainment
  https://www.yahoo.com/lifestyle
  https://www.yahoo.com/plus

* Found 5 external URLs
  https://mail.yahoo.com/
  https://news.yahoo.com/
  https://finance.yahoo.com/
  https://sports.yahoo.com/
  https://shopping.yahoo.com/

* Skipped 0 URLs
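
The sample output above suggests that a URL counts as internal when its host matches the host of the start URL, which is why `mail.yahoo.com` is listed as external to `www.yahoo.com`. The sketch below illustrates that rule with the standard library only. It is an assumption drawn from the output, not sitecrawl's actual implementation, and `is_internal` and `split_urls` are hypothetical helper names.

```python
from urllib.parse import urlparse


def is_internal(url, base_url):
    """Return True when url has the same host as base_url (assumed rule)."""
    return urlparse(url).netloc.lower() == urlparse(base_url).netloc.lower()


def split_urls(urls, base_url):
    """Split urls into (internal, external) lists, preserving input order."""
    internal, external = [], []
    for url in urls:
        (internal if is_internal(url, base_url) else external).append(url)
    return internal, external


if __name__ == '__main__':
    # Sample URLs taken from the CLI output above.
    found = [
        'https://www.yahoo.com/entertainment',
        'https://mail.yahoo.com/',
        'https://www.yahoo.com/plus',
        'https://news.yahoo.com/',
    ]
    internal, external = split_urls(found, 'https://www.yahoo.com/')
    print('Internal:', internal)
    print('External:', external)
```

A strict host comparison treats every subdomain as external. A crawler that wanted `news.yahoo.com` to count as internal would need to compare registered domains instead.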

As a module

Basic example:

from sitecrawl import crawl

# Set the start URL, then follow links up to the given depth
crawl.base_url = 'https://www.yahoo.com'
crawl.deep_crawl(depth=2)

# Print the URLs collected during the crawl
print('Internal URLs:', crawl.get_internal_urls())
print('External URLs:', crawl.get_external_urls())
print('Skipped URLs:', crawl.get_skipped_urls())

A more detailed example is available in example.py.
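
Once a crawl has finished, the collected URLs can be post-processed with ordinary Python. As a minimal sketch, the snippet below groups a list of URLs by hostname. `group_by_host` is a hypothetical helper, not part of sitecrawl, and the sketch assumes that `crawl.get_external_urls()` returns an iterable of URL strings, as the printed output suggests. Stand-in data is used here so the snippet runs without network access.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_host(urls):
    """Map each hostname to the sorted list of URLs found under it."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).hostname].append(url)
    return {host: sorted(found) for host, found in groups.items()}


if __name__ == '__main__':
    # Stand-in data copied from the CLI example; in real use this could be
    # the result of crawl.get_external_urls() (assumed to be iterable).
    external = [
        'https://mail.yahoo.com/',
        'https://news.yahoo.com/',
        'https://finance.yahoo.com/',
        'https://sports.yahoo.com/',
        'https://shopping.yahoo.com/',
    ]
    for host, urls in group_by_host(external).items():
        print(host, len(urls))
```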
