SimpleCrawler

  • SimpleCrawler crawls a website either from the command line or from Python code

Install

  • pip install SimpleCrawler

OR

  • git clone https://github.com/jackwardell/SimpleCrawler.git
  • cd SimpleCrawler
  • python3 -m venv venv
  • source venv/bin/activate
  • pip install --upgrade pip
  • pip install -r requirements.txt
  • pip install -e .
  • pytest
  • crawl https://www.example.com

Rules:

This crawler will:

  • Only crawl pages with the text/html MIME type
  • Only crawl pages that return a 200 OK HTTP status
  • Read /robots.txt and obey it by default (override with -d / --disobey-robots)
  • Send a User-Agent header, PyWebCrawler by default (change with -u / --user-agent)
  • Ignore ?query=strings and #fragments by default (keep them with -wq / --with-query and -wf / --with-fragment)
  • Get links ONLY from the href attribute of <a> tags
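
As a rough illustration of the last three rules, here is a minimal sketch using only the Python standard library. It is not SimpleCrawler's own implementation: the names AnchorHrefParser, normalise, extract_links and allowed are hypothetical helpers invented for this example, and the robots.txt content is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


class AnchorHrefParser(HTMLParser):
    """Collect href values from <a> tags only (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # <link href=...>, <img src=...> and so on are ignored
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def normalise(base_url, href, with_query=False, with_fragment=False):
    """Resolve href against base_url; drop query and fragment unless asked to keep them."""
    parts = urlsplit(urljoin(base_url, href))
    query = parts.query if with_query else ""
    fragment = parts.fragment if with_fragment else ""
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))


def extract_links(base_url, html):
    """Return the sorted, de-duplicated, normalised links found in <a href> attributes."""
    parser = AnchorHrefParser()
    parser.feed(html)
    return sorted({normalise(base_url, href) for href in parser.hrefs})


def allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules supplied as a string (no network access)."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    return robots.can_fetch(user_agent, url)


page = """
<link href="/style.css" rel="stylesheet">
<a href="/about?ref=home#team">About</a>
<a href="contact">Contact</a>
<a>No href</a>
"""
links = extract_links("https://www.example.com/docs/", page)
print(links)
# ['https://www.example.com/about', 'https://www.example.com/docs/contact']

# Made-up robots.txt rules, for illustration only.
rules = "User-agent: *\nDisallow: /private/\n"
print(allowed(rules, "PyWebCrawler", "https://www.example.com/private/x"))  # False
print(allowed(rules, "PyWebCrawler", "https://www.example.com/about"))      # True
```

The with_query and with_fragment keyword arguments only mirror the idea behind the command-line flags of the same name; how the crawler applies them internally is not shown here.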

Todo:

Use

  • From the command line, run crawl <url>, e.g. crawl https://www.google.com
$ crawl --help
Usage: crawl [OPTIONS] URL

Options:
  -u, --user-agent TEXT
  -w, --max-workers INTEGER
  -t, --timeout INTEGER
  -h, --check-head
  -d, --disobey-robots
  -wq, --with-query
  -wf, --with-fragment
  --debug / --no-debug
  --help                     Show this message and exit.

OR from code

from simple_crawler import Crawler

crawler = Crawler()
found_links = crawler.crawl('https://www.example.com/')
