A simple and efficient web crawler in Python.
DataCrawl 🕸
Features
- Crawl web pages and extract links starting from a root URL recursively
- Concurrent workers and custom delay
- Handle relative and absolute URLs
- Designed with simplicity in mind, making it easy to use and extend for various web crawling tasks
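To make the "relative and absolute URLs" feature concrete, here is a minimal sketch of how a crawler can resolve links against the page they were found on, using only the standard library. It is not DataCrawl's actual implementation; the function name `normalize_link` is hypothetical and chosen for illustration.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def normalize_link(base_url: str, href: str) -> Optional[str]:
    """Resolve href against base_url; return None for non-HTTP(S) links.

    Hypothetical helper for illustration, not part of the datacrawl API.
    """
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    absolute = urljoin(base_url, href)
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None  # e.g. mailto: or javascript: links
    # Drop the fragment, since it only points inside the same page.
    return parsed._replace(fragment="").geturl()


if __name__ == "__main__":
    print(normalize_link("https://github.com/solutions/", "ci-cd"))
    print(normalize_link("https://github.com/a/b", "/features"))
    print(normalize_link("https://github.com", "https://githubuniverse.com/"))
```

Resolving every link to an absolute form before storing it lets the crawler compare URLs directly when deciding whether a page has already been visited.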
Installation
Install using pip:
pip install datacrawl
Usage
from datacrawl import CrawlSettings, Datacrawl

settings = CrawlSettings(
    root_url='http://github.com',
    max_links=2,
)
spider = Datacrawl(settings)
spider.start()

# Set workers and delay (default: delay is 0.5 sec and verbose is True)
# If you do not want delay, set delay=0
settings = CrawlSettings(
    root_url='https://github.com',
    max_links=5,
    max_workers=5,
    delay=1,
    verbose=False,
)
spider = Datacrawl(settings)
spider.start()
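For readers wondering what settings such as `max_workers` and `delay` typically control, the following is a generic sketch of concurrent fetching with a per-request pause, built on `concurrent.futures`. It is an assumption about the general technique, not a description of DataCrawl's internals; `fetch_all` and the `fetch` callable are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], List[str]],
    max_workers: int = 5,
    delay: float = 0.5,
) -> Dict[str, List[str]]:
    """Run fetch(url) for each URL in a thread pool.

    Hypothetical helper: `fetch` is any callable that returns the links
    found on a page. Each call sleeps `delay` seconds first.
    """

    def polite_fetch(url: str) -> List[str]:
        if delay > 0:
            time.sleep(delay)  # pause before issuing the request
        return fetch(url)

    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with url_list.
        results = list(pool.map(polite_fetch, url_list))
    return dict(zip(url_list, results))
```

In this sketch up to `max_workers` requests are in flight at once, and setting `delay=0` skips the pause entirely.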
Output Format
Crawled output sample for https://github.com
{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ],
        "https://github.com/solutions/ci-cd": {
            "urls": [
                "https://github.com/solutions/ci-cd/",
                "https://githubuniverse.com/",
                "..."
            ]
        }
    }
}
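Once the output is available as a Python dictionary, a small helper can gather every discovered link. The sketch below walks the structure recursively and collects the contents of each "urls" list, so it does not depend on how deeply the page entries are nested. The function name `collect_urls` is hypothetical, and the `sample` dictionary is a shortened copy of the sample above with the "..." placeholders left out.

```python
from typing import Any, Set


def collect_urls(node: Any) -> Set[str]:
    """Recursively gather every string found in any "urls" list."""
    found: Set[str] = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "urls" and isinstance(value, list):
                found.update(u for u in value if isinstance(u, str))
            else:
                # Any other key is treated as a page entry to descend into.
                found |= collect_urls(value)
    return found


# Shortened copy of the sample output, for illustration only.
sample = {
    "http://github.com": {
        "urls": ["http://github.com/", "https://githubuniverse.com/"],
        "https://github.com/solutions/ci-cd": {
            "urls": [
                "https://github.com/solutions/ci-cd/",
                "https://githubuniverse.com/",
            ],
        },
    },
}

if __name__ == "__main__":
    for url in sorted(collect_urls(sample)):
        print(url)
```

Using a set removes links that appear on more than one page, such as https://githubuniverse.com/ in the sample.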
Contributing
Thank you for considering contributing.
- If you are a first-time contributor, you can pick a good-first-issue and get started.
- Please feel free to ask questions.
- Before starting work on an issue, please get it assigned to you so that multiple people do not end up working on the same issue.
- We are working toward our first major release. Please check this issue and see if anything interests you.
Dev setup
- Install poetry on your system: pipx install poetry
- Clone the repo you forked
- Create a venv or use: poetry shell
- Run: poetry install --with dev
- Run: pre-commit install
- Run: pre-commit install --hook-type pre-push
Before raising a PR, please make sure these checks are covered:
- An issue that the PR addresses exists or has been created
- Tests are written for the changes
- All lint checks and tests pass
Download files
Source Distribution
datacrawl-0.6.1.tar.gz (9.9 kB)
Built Distribution
datacrawl-0.6.1-py3-none-any.whl (14.8 kB)
File details
Details for the file datacrawl-0.6.1.tar.gz.
File metadata
- Download URL: datacrawl-0.6.1.tar.gz
- Upload date:
- Size: 9.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.4 Darwin/23.4.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 97bc8e173842a8c1381f6717695d484562faed54cbeb4cef03b6ef3bb0982ccd
MD5 | d18e9147964e6bdcebff84f847135283
BLAKE2b-256 | 0bfb189f7c6e497dd671d6a1fbae3558ee41b05a1bd0badd00c554f7687a4e94
File details
Details for the file datacrawl-0.6.1-py3-none-any.whl.
File metadata
- Download URL: datacrawl-0.6.1-py3-none-any.whl
- Upload date:
- Size: 14.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.3 CPython/3.12.4 Darwin/23.4.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6a340afa134121e794a9e1f348203b080bd9754d0eac65eecbbc847b504f8ea9
MD5 | 9ccedeb407154125ccb28f5526ed2571
BLAKE2b-256 | 173f02a9e4bf9cb2c932a068afb9a203b71418839d612bd29d5b20e2c1d48ca7
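The published digests can be used to check a manually downloaded file. The following is a minimal sketch using the standard `hashlib` module; the function names `sha256_of` and `matches_published_hash` are hypothetical, and the local file path is whatever location you saved the download to.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, assuming the wheel was saved in the current directory, `matches_published_hash(Path("datacrawl-0.6.1-py3-none-any.whl"), "6a340afa134121e794a9e1f348203b080bd9754d0eac65eecbbc847b504f8ea9")` should return True for an intact download.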