Project description

Python Web Scraper

This script uses the Playwright library to scrape websites and build a knowledgebase from their pages.

Main Function: `scrape_website(config: Config)`

Parameters

config: A Config object containing the URLs to scrape, the output file name, and a limit on the number of pages to crawl (max_pages_to_crawl).
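As a rough illustration of the shape such an object might take, here is a minimal sketch. The class below is a stand-in, not the package's own definition: the field names `urls` and `output_file_name` and all default values are assumptions, and only `max_pages_to_crawl` is named in this description.

```python
from dataclasses import dataclass, field


@dataclass
class Config:
    """Hypothetical stand-in for the package's Config object."""

    # Site URLs whose sitemaps will be crawled (field name assumed).
    urls: list[str] = field(default_factory=list)
    # Name of the JSON output file (field name and default assumed).
    output_file_name: str = "knowledgebase.json"
    # Upper bound on pages crawled; the name comes from the Workflow section,
    # the default value is an assumption.
    max_pages_to_crawl: int = 10


config = Config(
    urls=["https://example.com"],
    output_file_name="example.json",
    max_pages_to_crawl=5,
)
```

An object like this would then be passed to `scrape_website(config)`.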

Workflow

Launch Chromium Browser: Uses Playwright to start a new browser instance.
URL Iteration: For each URL in the Config object:
- Sitemap Processing:
  - Navigate to the site's sitemap.xml.
  - If the sitemap is not found, raise NoSitemapError; otherwise extract the URLs it lists.
- Page Processing: For each URL in the sitemap:
  - Stop if max_pages_to_crawl is reached.
  - Navigate to the URL and extract the page content.
  - Create a WebPage object with URL and content.
  - Clean content and add to Knowledgebase object.
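The steps above can be sketched without a browser by injecting the page fetcher. This is a sketch under stated assumptions, not the package's code: the `fetch` callable stands in for Playwright navigation, `extract_sitemap_urls` and `crawl` are hypothetical helper names, `NoSitemapError` and `WebPage` are redefined locally as stand-ins, and "cleaning" is reduced to collapsing whitespace.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Optional


class NoSitemapError(Exception):
    """Stand-in: raised when a site has no readable sitemap.xml."""


@dataclass
class WebPage:
    """Stand-in for the WebPage object: a URL plus its page content."""

    url: str
    content: str


def extract_sitemap_urls(sitemap_xml: Optional[str]) -> list[str]:
    """Return the text of every <loc> element in a sitemap document."""
    if not sitemap_xml:
        raise NoSitemapError("sitemap.xml not found")
    try:
        root = ET.fromstring(sitemap_xml)
    except ET.ParseError as exc:
        raise NoSitemapError("sitemap.xml is not valid XML") from exc
    # Sitemap elements are namespaced, so compare the local tag name only.
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text and el.text.strip()
    ]


def crawl(
    site_url: str,
    fetch: Callable[[str], Optional[str]],
    max_pages_to_crawl: int,
) -> list[WebPage]:
    """Fetch the sitemap, then up to max_pages_to_crawl of the pages it lists."""
    sitemap_xml = fetch(site_url.rstrip("/") + "/sitemap.xml")
    pages: list[WebPage] = []
    for url in extract_sitemap_urls(sitemap_xml):
        if len(pages) >= max_pages_to_crawl:
            break  # page limit reached
        content = fetch(url) or ""
        # Assumed cleaning step: collapse runs of whitespace.
        pages.append(WebPage(url=url, content=" ".join(content.split())))
    return pages


# Usage with an in-memory "site" instead of a real browser.
fake_site = {
    "https://example.com/sitemap.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/a</loc></url>"
        "<url><loc>https://example.com/b</loc></url>"
        "<url><loc>https://example.com/c</loc></url>"
        "</urlset>"
    ),
    "https://example.com/a": "Hello   world\n",
    "https://example.com/b": "  Second page ",
    "https://example.com/c": "Third page",
}
pages = crawl("https://example.com", fake_site.get, max_pages_to_crawl=2)
```

Passing the fetcher in as a parameter keeps the sitemap parsing and the page limit testable on their own; in a real run the callable would wrap a Playwright page that navigates to the URL and returns its content.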

Output Generation

Knowledgebase to JSON: Writes the Knowledgebase object to a JSON file.
Unique Filenames: Uses get_output_filename(file_name: str) to ensure unique file names.
Browser Closure: Closes the browser and proceeds to the next URL.
Error Handling: If a NoSitemapError is raised, it is printed to the console.
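The output step can be illustrated as follows. This is only a sketch: the description confirms the signature `get_output_filename(file_name: str)` but not how uniqueness is achieved, so the numbered-suffix scheme here is an assumption, as are the `Knowledgebase.pages` field, the JSON layout, and the helper name `write_knowledgebase`.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class WebPage:
    """Stand-in for the WebPage object."""

    url: str
    content: str


@dataclass
class Knowledgebase:
    """Stand-in for the Knowledgebase object (field name assumed)."""

    pages: list[WebPage] = field(default_factory=list)


def get_output_filename(file_name: str) -> str:
    """Return file_name, or a numbered variant if that path already exists.

    The "_1", "_2", ... suffix scheme is an assumption for illustration.
    """
    path = Path(file_name)
    candidate = path
    counter = 1
    while candidate.exists():
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        counter += 1
    return str(candidate)


def write_knowledgebase(kb: Knowledgebase, file_name: str) -> str:
    """Serialize the knowledgebase to JSON under a file name not yet in use."""
    out = get_output_filename(file_name)
    with open(out, "w", encoding="utf-8") as fh:
        # asdict() converts the nested dataclasses into plain dicts and lists.
        json.dump(asdict(kb), fh, ensure_ascii=False, indent=2)
    return out
```

With this scheme, writing twice to the same name leaves the first file untouched and puts the second run's output in a new file next to it.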


Release history

0.1.6 (Jan 22, 2024)
0.1.5 (Jan 22, 2024)
0.1.4 (Jan 9, 2024)
0.1.3 (Jan 9, 2024, this version)
0.1.2 (Jan 9, 2024)
0.1.1 (Jan 9, 2024)
0.1.0 (Jan 9, 2024)

Download files

Source Distribution

tedfulk_kb_pycrawler-0.1.3.tar.gz (2.7 kB), uploaded Jan 9, 2024

Built Distribution

tedfulk_kb_pycrawler-0.1.3-py3-none-any.whl (3.7 kB), uploaded Jan 9, 2024, Python 3

Hashes for tedfulk_kb_pycrawler-0.1.3.tar.gz

Algorithm	Hash digest
SHA256	`776adf7e4b87d16e7d9842eab751ff5b5798c222acdcdfa800c598a2057e389d`
MD5	`16358be80815902b239d6eeeca6133a6`
BLAKE2b-256	`b670a1d57e57a9479ea58419f01dd776e88220f8ea34fff2ccfeead2e92dc068`

Hashes for tedfulk_kb_pycrawler-0.1.3-py3-none-any.whl

Algorithm	Hash digest
SHA256	`25d6b27e76e948b5fb2f1fc305ddd5d88f6112d1ba3a628441ada0ca3fd2b02a`
MD5	`af46ad3623ca4045938156afc73e1a78`
BLAKE2b-256	`eba0f5b380cfe3cd398db3db8bde74667efc47fb1697187c6a198cfb5b620ba4`
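A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. The helper names below are hypothetical, and the local file path in the usage note is a placeholder for wherever the download was saved.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("tedfulk_kb_pycrawler-0.1.3.tar.gz", "776adf7e4b87d16e7d9842eab751ff5b5798c222acdcdfa800c598a2057e389d")` should return True for an intact copy of the source distribution.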

tedfulk-kb-pycrawler 0.1.3
