
Tiny Web Crawler


A simple and efficient web crawler for Python.

Features

  • Recursively crawl web pages and extract links, starting from a root URL
  • Concurrent workers and custom delay
  • Handle relative and absolute URLs
  • Designed with simplicity in mind, making it easy to use and extend for various web crawling tasks

Installation

Install using pip:

pip install tiny-web-crawler

Usage

from tiny_web_crawler import Spider, SpiderSettings

settings = SpiderSettings(
    root_url = 'http://github.com',
    max_links = 2
)

spider = Spider(settings)
spider.start()


# Set workers and delay (defaults: delay is 0.5 sec and verbose is True)
# If you do not want a delay, set delay=0

settings = SpiderSettings(
    root_url = 'https://github.com',
    max_links = 5,
    max_workers = 5,
    delay = 1,
    verbose = False
)

spider = Spider(settings)
spider.start()

Output Format

Crawled output sample for https://github.com

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}
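
The output is a mapping from each crawled page to a dictionary with a "urls" key listing the links found on that page. The sketch below only illustrates walking a structure of that shape; crawl_output is a hypothetical stand-in for the sample above, not an attribute of Spider.

# A minimal sketch of iterating over output shaped like the sample above.
# `crawl_output` is a hypothetical stand-in, not part of the Spider API.
crawl_output = {
    "http://github.com": {
        "urls": ["http://github.com/", "https://githubuniverse.com/"]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": ["https://github.com/solutions/ci-cd/"]
    }
}

for page, data in crawl_output.items():
    print(f"{page}: {len(data['urls'])} link(s) found")
    for url in data["urls"]:
        print(f"  -> {url}")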

Contributing

Thank you for considering contributing.

  • If you are a first-time contributor, you can pick a good-first-issue and get started.
  • Please feel free to ask questions.
  • Before starting work on an issue, please get it assigned to you so that we can avoid multiple people working on the same issue.
  • We are working toward our first major release. Please check this issue and see if anything interests you.

Dev setup

  • Install Poetry on your system: pipx install poetry
  • Clone the repo you forked
  • Create a venv or use poetry shell
  • Run poetry install --with dev
  • Run pre-commit install
  • Run pre-commit install --hook-type pre-push

Before raising a PR, please make sure you have these checks covered:

  • An issue exists or has been created that addresses the PR
  • Tests are written for the changes
  • All lint checks and tests pass
