
A web crawler that explores websites using graph traversal algorithms.

Project description

PySiteCrawler - A Simple Web Crawling Library

PySiteCrawler is a Python library for web crawling and data extraction, offering a simple and efficient way to explore web pages, extract text content, and manage links during a crawl. It currently provides two traversal strategies, with more planned for future updates. All scraped data is stored in .txt files for easy access and analysis.

Features

  • Breadth-First Search Crawling: Seamlessly traverse websites using a breadth-first search strategy.

  • Depth-First Search Crawling: Efficiently explore websites using a depth-first search strategy (a sketch contrasting the two traversal orders follows this list).

  • Text Extraction: Extract text content and titles from HTML pages for further analysis.

  • Headless Browsing: Use either GeckoDriver or ChromeDriver for headless browsing.
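
The difference between the two strategies is the order in which discovered links are visited: breadth-first explores all pages at one depth before moving deeper, while depth-first follows each chain of links as far as it goes before backtracking. As a rough illustration only (this is not PySiteCrawler's actual implementation, and get_links is a hypothetical helper that returns the links found on a page), both orders can be expressed with the same loop:

from collections import deque

def traverse(start_url, get_links, max_depth, bfs=True):
    # Illustrative sketch: bfs=True pops from the front (FIFO queue),
    # bfs=False pops from the back (LIFO stack).
    frontier = deque([(start_url, 0)])
    visited = {start_url}
    while frontier:
        url, depth = frontier.popleft() if bfs else frontier.pop()
        print(url)  # a real crawler would scrape the page here
        if depth < max_depth:
            for link in get_links(url):
                if link not in visited:
                    visited.add(link)
                    frontier.append((link, depth + 1))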

Prerequisites

Before using PySiteCrawler, ensure that you have the following prerequisites in place:

  • Python: PySiteCrawler requires Python 3.6 or higher. You can download the latest version from the official Python website (python.org).

  • WebDriver Setup:

    • GeckoDriver: For Firefox browser automation, download the latest GeckoDriver from the mozilla/geckodriver releases page on GitHub and make sure it is available on your system's PATH.
    • ChromeDriver: For Chrome browser automation, download the latest ChromeDriver from the official ChromeDriver downloads page and make sure it is available on your system's PATH.

Installation

You can easily install PySiteCrawler using pip:

pip install PySiteCrawler
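
To confirm the installation succeeded, you can try importing one of the crawler classes using the import path shown in the Usage section below; no output means the import worked:

python -c "from PySiteCrawler.crawler.bfs_web_crawler import BFSWebCrawler"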

Classes and Functions

BFSWebCrawler

The BFSWebCrawler class provides the following methods:

  • __init__(base_url, geckodriver_path=None, chromedriver_path=None, max_depth=None, headless=False): Initialize the BFSWebCrawler instance.
  • crawl(): Perform a breadth-first search crawl on the specified website.

DFSWebCrawler

The DFSWebCrawler class provides the following methods:

  • __init__(base_url, geckodriver_path=None, chromedriver_path=None, max_depth=None, headless=False): Initialize the DFSWebCrawler instance.
  • crawl(): Perform a depth-first search crawl on the specified website.

Usage

Here's a quick example of how to use PySiteCrawler to perform a breadth-first search crawl on a website:

from PySiteCrawler.crawler.bfs_web_crawler import BFSWebCrawler

# Initialize a BFSWebCrawler
crawler = BFSWebCrawler("https://example.com", max_depth=2,
                         geckodriver_path=r"path/to/geckodriver")
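# crawl() visits each page breadth-first and saves the scraped text to .txt files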
crawler.crawl()

You can instead pass the chromedriver_path parameter during initialization to crawl with ChromeDriver. (GeckoDriver is recommended, as ChromeDriver can fail to load some websites correctly in headless mode.)

from PySiteCrawler.crawler.dfs_web_crawler import DFSWebCrawler

# Initialize a DFSWebCrawler
crawler = DFSWebCrawler("https://example.com", max_depth=2,
                         chromedriver_path=r"path/to/chromedriver")
crawler.crawl()

Parameters

  • base_url: The starting URL for web crawling.
  • max_depth (optional): The maximum depth of crawling. Default is None (no limit).
  • geckodriver_path (optional): Path to the GeckoDriver executable for Firefox. Default is None.
  • chromedriver_path (optional): Path to the ChromeDriver executable for Chrome. Default is None.
  • headless (optional): If True, the browser runs in headless mode (no GUI display); if False, the browser GUI is visible. Default is False (as shown in the __init__ signatures above).

Note: The base_url parameter and either geckodriver_path or chromedriver_path are required for PySiteCrawler to work correctly. Whichever driver path you provide determines the browser that is used: geckodriver_path selects GeckoDriver, chromedriver_path selects ChromeDriver. GeckoDriver is recommended, as ChromeDriver may fail to load some websites correctly in headless mode.
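
For example, a headless crawl with GeckoDriver, using only the parameters documented above, might look like this:

from PySiteCrawler.crawler.bfs_web_crawler import BFSWebCrawler

# headless=True runs Firefox without opening a visible browser window
crawler = BFSWebCrawler("https://example.com", max_depth=1,
                        geckodriver_path=r"path/to/geckodriver",
                        headless=True)
crawler.crawl()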

Contribution

Contributions are welcome! If you encounter any issues or have suggestions for improvements, please feel free to open an issue or submit a pull request.



Download files

Download the file for your platform.

Source Distribution

PySiteCrawler-0.1.2.tar.gz (4.7 kB)


Built Distribution


PySiteCrawler-0.1.2-py3-none-any.whl (6.0 kB)


File details

Details for the file PySiteCrawler-0.1.2.tar.gz.

File metadata

  • Download URL: PySiteCrawler-0.1.2.tar.gz
  • Upload date:
  • Size: 4.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.6

File hashes

Hashes for PySiteCrawler-0.1.2.tar.gz
Algorithm Hash digest
SHA256 e2dfe6eb95706b8e4a019e2d6525af46e5483881984ea91b3f6c606d29e8ea46
MD5 6ac43709e5e2c07c34fe32d14f9580e2
BLAKE2b-256 a0952c6c021f1f78b03f9df1549051d42fb0b2f1b310351b8772aaa2c5c4ea1e
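
To check a downloaded archive against the SHA256 digest above, you can compute it locally, for example:

python -c "import hashlib; print(hashlib.sha256(open('PySiteCrawler-0.1.2.tar.gz', 'rb').read()).hexdigest())"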


File details

Details for the file PySiteCrawler-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: PySiteCrawler-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 6.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.6

File hashes

Hashes for PySiteCrawler-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 42fd2914f123926a6c75d9c9e19711107a07f4455193b42ee14f5fda1a683a53
MD5 ba71dcbe4e162756a9fdebb4d6951b3e
BLAKE2b-256 44167187e4b9d50061b1ed99fb9baf96a9732676931e59b5617a02b341b2ebe6

