
Tiny Web Crawler


A simple and efficient web crawler for Python.

Features

  • Recursively crawls web pages and extracts links, starting from a root URL
  • Supports concurrent workers and a configurable delay between requests
  • Handles both relative and absolute URLs
  • Designed for simplicity, making it easy to use and extend for various web crawling tasks

Installation

Install using pip:

pip install tiny-web-crawler

Usage

from tiny_web_crawler import Spider
from tiny_web_crawler import SpiderSettings

settings = SpiderSettings(
    root_url = 'http://github.com',
    max_links = 2
)

spider = Spider(settings)
spider.start()


# Configure workers and delay (defaults: delay=0.5 seconds, verbose=True)
# To disable the delay entirely, set delay=0

settings = SpiderSettings(
    root_url = 'https://github.com',
    max_links = 5,
    max_workers = 5,
    delay = 1,
    verbose = False
)

spider = Spider(settings)
spider.start()

Output Format

Crawled output sample for https://github.com

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}
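The output is a nested dictionary mapping each crawled page to the links found on it. A minimal sketch of walking a result in this shape (the sample data below is hard-coded for illustration; real results come from the spider):

```python
import json

# Sample crawl output in the documented shape: {page_url: {"urls": [...]}}
sample = json.loads("""
{
    "http://github.com": {
        "urls": ["http://github.com/", "https://githubuniverse.com/"]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": ["https://github.com/solutions/ci-cd/", "https://githubuniverse.com/"]
    }
}
""")

# Flatten to the set of every distinct URL discovered across all crawled pages
all_links = {url for page in sample.values() for url in page["urls"]}
print(sorted(all_links))
```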

Contributing

Thank you for considering contributing.

  • If you are a first-time contributor, you can pick a good-first-issue and get started.
  • Please feel free to ask questions.
  • Before starting work on an issue, please get it assigned to you so that multiple people do not end up working on the same issue.
  • We are working toward our first major release. Please check this issue and see if anything interests you.

Dev setup

  • Install poetry on your system: pipx install poetry
  • Clone your fork of the repo
  • Create a venv or use poetry shell
  • Run poetry install --with dev
  • Run pre-commit install
  • Run pre-commit install --hook-type pre-push

Before raising a PR, please make sure you have covered these checks:

  • An issue exists or has been created that the PR addresses
  • Tests are written for the changes
  • All lint and test checks pass

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tiny_web_crawler-0.5.0.tar.gz (8.8 kB)

Uploaded Source

Built Distribution

tiny_web_crawler-0.5.0-py3-none-any.whl (10.7 kB)

Uploaded Python 3

File details

Details for the file tiny_web_crawler-0.5.0.tar.gz.

File metadata

  • Download URL: tiny_web_crawler-0.5.0.tar.gz
  • Upload date:
  • Size: 8.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for tiny_web_crawler-0.5.0.tar.gz
Algorithm Hash digest
SHA256 9c75bc6ecfe8e81480647811ed615bec3de381ef01f93023ca8bc6b770de4820
MD5 7d87706c5713473e6228fb2c550e7e47
BLAKE2b-256 a3367cc4cf7870c4f8315168180871239da91fc88cf4c84f9d349f37a3031066

See more details on using hashes here.
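To check a downloaded artifact against the digests listed above, you can compute the hash locally. A generic sketch using only Python's standard library (the filename and expected digest in the comment are the ones from this release):

```python
import hashlib

def file_sha256(path: str) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Example: compare against the published digest for the sdist
# expected = "9c75bc6ecfe8e81480647811ed615bec3de381ef01f93023ca8bc6b770de4820"
# assert file_sha256("tiny_web_crawler-0.5.0.tar.gz") == expected
```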

File details

Details for the file tiny_web_crawler-0.5.0-py3-none-any.whl.

File metadata

File hashes

Hashes for tiny_web_crawler-0.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4fd176aa2d8595ca1a7e2e94770cf92de18a5a8983db7eecb629c9dd34b45f84
MD5 345171dd443af121c4ad13575486aa81
BLAKE2b-256 8d335f82de03623a7e97117e3667ea455702aa9a3e86a23be214384b429d30d3

See more details on using hashes here.
