
DataCrawl 🕸


A simple and efficient web crawler for Python.

Features

  • Recursively crawls web pages and extracts links, starting from a root URL
  • Configurable concurrent workers and request delay
  • Handles both relative and absolute URLs
  • Designed with simplicity in mind, making it easy to use and extend for various web crawling tasks
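The core idea behind the crawler — fetch a page, extract its anchor links, and resolve relative URLs against the page's base — can be sketched with the standard library alone. This is an illustrative sketch, not Datacrawl's actual implementation:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # urljoin handles both relative and absolute hrefs
                    self.links.append(urljoin(self.base_url, value))

html = '<a href="/about">About</a> <a href="https://example.org/">Ext</a>'
parser = LinkExtractor("https://example.com/index.html")
parser.feed(html)
print(parser.links)
# ['https://example.com/about', 'https://example.org/']
```

A real crawler repeats this for every extracted link, which is where the worker pool and delay settings below come in.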

Installation

Install using pip:

pip install datacrawl

Usage

from datacrawl import Datacrawl
from datacrawl import CrawlSettings

settings = CrawlSettings(
    root_url='http://github.com',
    max_links=2
)

spider = Datacrawl(settings)
spider.start()


# Set workers and delay (default: delay is 0.5 sec and verbose is True)
# If you do not want delay, set delay=0

settings = CrawlSettings(
    root_url='https://github.com',
    max_links=5,
    max_workers=5,
    delay=1,
    verbose=False
)

spider = Datacrawl(settings)
spider.start()

Output Format

Crawled output sample for https://github.com

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}
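Because the output is a plain mapping from each crawled page to the links found on it, post-processing is straightforward. A small sketch, where the `result` dict mirrors the sample format above:

```python
from collections import Counter

# Sample crawl result in the format shown above
result = {
    "http://github.com": {
        "urls": ["http://github.com/", "https://githubuniverse.com/"],
    },
    "https://github.com/solutions/ci-cd": {
        "urls": ["https://github.com/solutions/ci-cd/", "https://githubuniverse.com/"],
    },
}

# Count how often each link was discovered across all crawled pages
link_counts = Counter(
    url for page in result.values() for url in page["urls"]
)
print(link_counts["https://githubuniverse.com/"])  # 2
print(len(result))  # number of pages crawled: 2
```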

Contributing

Thank you for considering contributing.

  • If you are a first-time contributor, you can pick a good-first-issue and get started.
  • Please feel free to ask questions.
  • Before starting work on an issue, please get it assigned to you so that multiple people don't end up working on the same issue.
  • We are working on doing our first major release. Please check this issue and see if anything interests you.

Dev setup

  • Install poetry on your system: pipx install poetry
  • Clone the repo you forked
  • Create a venv or use poetry shell
  • Run poetry install --with dev
  • Run pre-commit install
  • Run pre-commit install --hook-type pre-push

Before raising a PR, please make sure you have these checks covered:

  • An issue exists or has been created that the PR addresses
  • Tests are written for the changes
  • All lint checks and tests pass

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datacrawl-0.6.1.tar.gz (9.9 kB)


Built Distribution

datacrawl-0.6.1-py3-none-any.whl (14.8 kB)


File details

Details for the file datacrawl-0.6.1.tar.gz.

File metadata

  • Download URL: datacrawl-0.6.1.tar.gz
  • Upload date:
  • Size: 9.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.4 Darwin/23.4.0

File hashes

Hashes for datacrawl-0.6.1.tar.gz:

  • SHA256: 97bc8e173842a8c1381f6717695d484562faed54cbeb4cef03b6ef3bb0982ccd
  • MD5: d18e9147964e6bdcebff84f847135283
  • BLAKE2b-256: 0bfb189f7c6e497dd671d6a1fbae3558ee41b05a1bd0badd00c554f7687a4e94

See more details on using hashes here.
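To check a downloaded file against the published digest, you can compute its SHA256 locally with the standard library. This is a generic verification sketch; the expected value is the SHA256 digest listed above:

```python
import hashlib

def sha256_of(path, chunk_size=8192):
    """Compute the SHA256 hex digest of a file, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "97bc8e173842a8c1381f6717695d484562faed54cbeb4cef03b6ef3bb0982ccd"
# A download is intact if:
#   sha256_of("datacrawl-0.6.1.tar.gz") == expected
```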

File details

Details for the file datacrawl-0.6.1-py3-none-any.whl.

File metadata

  • Download URL: datacrawl-0.6.1-py3-none-any.whl
  • Upload date:
  • Size: 14.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.4 Darwin/23.4.0

File hashes

Hashes for datacrawl-0.6.1-py3-none-any.whl:

  • SHA256: 6a340afa134121e794a9e1f348203b080bd9754d0eac65eecbbc847b504f8ea9
  • MD5: 9ccedeb407154125ccb28f5526ed2571
  • BLAKE2b-256: 173f02a9e4bf9cb2c932a068afb9a203b71418839d612bd29d5b20e2c1d48ca7

See more details on using hashes here.
