sitecrawl: a simple Python 3 module to crawl a website and extract URLs

Installation

Using pip:

pip3 install sitecrawl

sitecrawl --help

Or build from sources:

# Clone project
git clone https://github.com/gabfl/sitecrawl && cd sitecrawl

# Installation
pip3 install .

Usage

CLI

sitecrawl --url https://www.yahoo.com/ --depth 2 --max 4 --verbose

Example output:

* Found 4 internal URLs
  https://www.yahoo.com
  https://www.yahoo.com/entertainment
  https://www.yahoo.com/lifestyle
  https://www.yahoo.com/plus

* Found 5 external URLs
  https://mail.yahoo.com/
  https://news.yahoo.com/
  https://finance.yahoo.com/
  https://sports.yahoo.com/
  https://shopping.yahoo.com/

* Skipped 0 URLs
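
The internal/external split in the output above can be illustrated with a short sketch. This is not sitecrawl's actual implementation: it assumes a URL counts as internal when its hostname matches the base URL's hostname exactly, and `classify_urls` is a hypothetical helper that is not part of the package.

```python
from urllib.parse import urlparse


def classify_urls(base_url, urls):
    """Split urls into (internal, external) lists by comparing hostnames.

    Assumption: "internal" means the same host as base_url.
    The rules sitecrawl really applies may differ.
    """
    base_host = urlparse(base_url).netloc.lower()
    internal, external = [], []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host == base_host:
            internal.append(url)
        else:
            external.append(url)
    return internal, external


internal, external = classify_urls(
    'https://www.yahoo.com',
    [
        'https://www.yahoo.com/entertainment',
        'https://mail.yahoo.com/',
        'https://www.yahoo.com/plus',
        'https://news.yahoo.com/',
    ],
)
print('Internal URLs:', internal)
print('External URLs:', external)
```

Under this assumption, subdomains such as mail.yahoo.com are treated as external, which matches the sample output above.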

As a module

Basic example:

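from sitecrawl import crawl

# Set the site to crawl
crawl.base_url = 'https://www.yahoo.com'
# Crawl the site, following links up to the given depth
crawl.deep_crawl(depth=2)

# Print the URLs collected during the crawl
print('Internal URLs:', crawl.get_internal_urls())
print('External URLs:', crawl.get_external_urls())
print('Skipped URLs:', crawl.get_skipped_urls())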

A more detailed example is available in example.py.
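
Once a crawl has finished, the collected URLs can be processed like any other Python collection. The sketch below groups URLs by hostname. It assumes the getters return an iterable of URL strings (not confirmed here), it uses sample data in place of a live crawl, and `group_by_host` is a hypothetical helper, not part of sitecrawl.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_host(urls):
    """Return a dict mapping each hostname to the list of URLs on that host."""
    grouped = defaultdict(list)
    for url in urls:
        grouped[urlparse(url).netloc].append(url)
    return dict(grouped)


# Sample data standing in for the result of crawl.get_external_urls()
sample_urls = [
    'https://mail.yahoo.com/',
    'https://news.yahoo.com/',
    'https://news.yahoo.com/world',
    'https://finance.yahoo.com/',
]

grouped = group_by_host(sample_urls)
for host in sorted(grouped):
    print(host, len(grouped[host]))
```

With a real crawl, `sample_urls` could be replaced by the value returned from one of the getters, provided it is iterable.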

Download files

Source Distribution

sitecrawl-1.0.5.tar.gz (5.4 kB)

Built Distribution

sitecrawl-1.0.5-py2.py3-none-any.whl (6.0 kB, Python 2 and Python 3)
