Tiny Web Crawler
A simple and efficient web crawler for Python.
Features
- Crawl web pages and extract links starting from a root URL recursively
- Concurrent workers and custom delay
- Handle relative and absolute URLs
- Designed with simplicity in mind, making it easy to use and extend for various web crawling tasks
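Handling relative and absolute URLs usually means resolving every extracted href against the page it was found on. The snippet below is a minimal sketch of that idea using only the standard library. It is not taken from the tiny-web-crawler source, and the helper name resolve_link is hypothetical.

```python
from typing import Optional
from urllib.parse import urldefrag, urljoin, urlparse


def resolve_link(page_url: str, href: str) -> Optional[str]:
    """Return an absolute http(s) URL for an href found on page_url, or None.

    Hypothetical helper for illustration; not part of tiny_web_crawler.
    """
    # Relative hrefs are resolved against the page; absolute ones pass through.
    absolute = urljoin(page_url, href)
    # Drop "#fragment" anchors so the same page is not counted twice.
    absolute, _fragment = urldefrag(absolute)
    # Skip non-web schemes such as mailto: or javascript:.
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


print(resolve_link("https://github.com/solutions/", "ci-cd"))
print(resolve_link("https://github.com/solutions/ci-cd", "/pricing#plans"))
```

Relying on urljoin keeps the relative-path rules (leading slashes, trailing slashes, parent directories) in one well-tested place instead of in hand-written string handling.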
Installation
Install using pip:
pip install tiny-web-crawler
Usage
from tiny_web_crawler import Spider
from tiny_web_crawler import SpiderSettings

settings = SpiderSettings(
    root_url='http://github.com',
    max_links=2
)
spider = Spider(settings)
spider.start()

# Set workers and delay (default: delay is 0.5 sec and verbose is True)
# If you do not want delay, set delay=0
settings = SpiderSettings(
    root_url='https://github.com',
    max_links=5,
    max_workers=5,
    delay=1,
    verbose=False
)
spider = Spider(settings)
spider.start()
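To give a feel for how a worker count and a per-request delay can interact, here is a small standalone sketch built on concurrent.futures. It is an assumption about the general technique, not the library's actual implementation; crawl_batch and fetch_links are hypothetical names, and fetch_links stands in for whatever downloads a page and extracts its links.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def crawl_batch(
    urls: Iterable[str],
    fetch_links: Callable[[str], List[str]],
    max_workers: int = 5,
    delay: float = 0.5,
) -> Dict[str, List[str]]:
    """Fetch several pages concurrently, pausing after each request.

    Hypothetical sketch; not part of tiny_web_crawler.
    """
    def worker(url: str):
        links = fetch_links(url)
        time.sleep(delay)  # each worker waits before it picks up the next URL
        return url, links

    # At most max_workers requests are in flight at any moment.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(worker, urls))


def fake_fetch(url: str) -> List[str]:
    # Stand-in for a real HTTP request plus link extraction.
    return [url + "/about"]


result = crawl_batch(["https://example.com"], fake_fetch, max_workers=2, delay=0)
print(result)
```

In this sketch the delay applies per worker, so raising max_workers increases the overall request rate even when delay stays the same.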
Output Format
Crawled output sample for https://github.com
{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}
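Once the crawl result is available as a Python dict (for example after loading it with json.load), a common next step is to gather every discovered link into one set. The sketch below assumes only that links live in lists stored under a "urls" key, as in the sample above; collect_urls is a hypothetical helper, not part of the package.

```python
from typing import Any, Set


def collect_urls(node: Any) -> Set[str]:
    """Gather every string found in a "urls" list anywhere in the result.

    Hypothetical helper for illustration; not part of tiny_web_crawler.
    """
    found: Set[str] = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "urls" and isinstance(value, list):
                # "..." is only the elision marker used in the sample output.
                found.update(u for u in value if isinstance(u, str) and u != "...")
            else:
                # Descend into per-page entries, however deeply they are nested.
                found |= collect_urls(value)
    return found


sample = {
    "http://github.com": {
        "urls": ["http://github.com/", "https://githubuniverse.com/", "..."]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": ["https://github.com/solutions/ci-cd/", "https://githubuniverse.com/", "..."]
    },
}

for url in sorted(collect_urls(sample)):
    print(url)
```

Walking the structure recursively avoids depending on exactly how the page entries are nested.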
Contributing
Thank you for considering contributing.
- If you are a first-time contributor, you can pick a good-first-issue and get started.
- Please feel free to ask questions.
- Before starting work on an issue, please get it assigned to you so that multiple people do not end up working on the same issue.
- We are working toward our first major release. Please check this issue and see if anything interests you.
Dev setup
- Install poetry on your system: pipx install poetry
- Clone the repo you forked
- Create a venv or use: poetry shell
- Run: poetry install --with dev
- Run: pre-commit install
- Run: pre-commit install --hook-type pre-push
Before raising a PR, please make sure you have these checks covered:
- An issue exists, or has been created, that the PR addresses
- Tests are written for the changes
- All lint checks and tests pass