Project description
Python Web Scraper
This script uses the Playwright library to scrape websites and generate a knowledgebase.
Main Function: scrape_website(config: Config)
Parameters
config: An object containing URLs to scrape, output file name, and a limit on pages to crawl.
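As a rough illustration of the three pieces of information listed above, the configuration could be modelled as a small dataclass. This is a sketch only: the field names `urls` and `output_file_name` and all default values are assumptions, and only `max_pages_to_crawl` is named in this description. The package's real `Config` class may be defined differently.

```python
from dataclasses import dataclass, field


@dataclass
class Config:
    """Hypothetical stand-in for the package's Config object."""

    # Site root URLs to scrape (assumed field name).
    urls: list[str] = field(default_factory=list)
    # Name of the JSON output file (assumed field name and default).
    output_file_name: str = "knowledgebase.json"
    # Upper bound on pages crawled per site (default value is an assumption).
    max_pages_to_crawl: int = 10


config = Config(urls=["https://example.com"], max_pages_to_crawl=5)
print(config)
```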
Workflow
- Launch Chromium Browser: Uses Playwright to start a new browser instance.
- URL Iteration: For each URL in the Config object:
  - Sitemap Processing:
    - Navigate to sitemap.xml.
    - Raise NoSitemapError if not found, else extract URLs.
  - Page Processing: For each URL in the sitemap:
    - Stop if max_pages_to_crawl is reached.
    - Navigate to the URL and extract the page content.
    - Create a WebPage object with URL and content.
    - Clean content and add to Knowledgebase object.
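The control flow of the workflow above can be sketched without a browser by passing in the page fetcher as a function. This is not the package's code: `crawl_site`, `extract_sitemap_urls`, the `fetch` parameter, and the whitespace-based cleaning are all assumptions made for illustration. In the real script, Playwright calls such as `page.goto(url)` followed by `page.content()` would presumably play the role of `fetch`.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable

# XML namespace used by standard sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


class NoSitemapError(Exception):
    """Raised when a site has no usable sitemap.xml."""


@dataclass
class WebPage:
    url: str
    content: str


def extract_sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL from a sitemap document."""
    try:
        root = ET.fromstring(sitemap_xml)
    except ET.ParseError as exc:
        raise NoSitemapError("sitemap.xml missing or not valid XML") from exc
    urls = [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
    if not urls:
        raise NoSitemapError("sitemap.xml contains no <loc> entries")
    return urls


def crawl_site(
    base_url: str, fetch: Callable[[str], str], max_pages_to_crawl: int
) -> list[WebPage]:
    """Fetch the sitemap, then visit pages until the limit is reached."""
    sitemap_xml = fetch(base_url.rstrip("/") + "/sitemap.xml")
    pages: list[WebPage] = []
    for url in extract_sitemap_urls(sitemap_xml):
        if len(pages) >= max_pages_to_crawl:
            break
        # Collapsing whitespace stands in for the package's content cleaning.
        content = " ".join(fetch(url).split())
        pages.append(WebPage(url=url, content=content))
    return pages


# Usage with an in-memory fake site instead of a real browser.
fake_site = {
    "https://example.com/sitemap.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/a</loc></url>"
        "<url><loc>https://example.com/b</loc></url>"
        "<url><loc>https://example.com/c</loc></url>"
        "</urlset>"
    ),
    "https://example.com/a": "Page   A\n text",
    "https://example.com/b": "Page B",
    "https://example.com/c": "Page C",
}
pages = crawl_site("https://example.com", fake_site.__getitem__, max_pages_to_crawl=2)
print([p.url for p in pages])
```

Injecting `fetch` keeps the limit check and sitemap parsing testable on their own; the third URL is never fetched because the limit of two pages is reached first.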
Output Generation
- Knowledgebase to JSON: Writes the Knowledgebase object to a JSON file.
- Unique Filenames: Uses get_output_filename(file_name: str) to ensure unique file names.
- Browser Closure: Closes the browser and proceeds to the next URL.
- Error Handling: Prints NoSitemapError to console if encountered.
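One plausible way to get unique output names is to append a numeric suffix when the target file already exists. The sketch below assumes that strategy; the description only gives the signature `get_output_filename(file_name: str)`, so the suffix scheme, the `write_knowledgebase` helper, and the list-of-dicts layout of the JSON are hypothetical.

```python
import json
from pathlib import Path


def get_output_filename(file_name: str) -> str:
    """Return file_name, or a numbered variant if that file already exists."""
    path = Path(file_name)
    candidate = path
    counter = 1
    while candidate.exists():
        # kb.json -> kb_1.json -> kb_2.json ... (assumed naming scheme)
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        counter += 1
    return str(candidate)


def write_knowledgebase(pages: list[dict], file_name: str) -> str:
    """Write pages to a fresh JSON file and return the path actually used."""
    out = get_output_filename(file_name)
    with open(out, "w", encoding="utf-8") as fh:
        json.dump(pages, fh, indent=2, ensure_ascii=False)
    return out
```

Calling `write_knowledgebase` twice with the same `file_name` leaves the first file untouched and writes the second run to a new, numbered file.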
File details
Details for the file tedfulk_kb_pycrawler-0.1.3.tar.gz.
File metadata
- Download URL: tedfulk_kb_pycrawler-0.1.3.tar.gz
- Upload date:
- Size: 2.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.12.1 Darwin/23.2.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 776adf7e4b87d16e7d9842eab751ff5b5798c222acdcdfa800c598a2057e389d |
| MD5 | 16358be80815902b239d6eeeca6133a6 |
| BLAKE2b-256 | b670a1d57e57a9479ea58419f01dd776e88220f8ea34fff2ccfeead2e92dc068 |
File details
Details for the file tedfulk_kb_pycrawler-0.1.3-py3-none-any.whl.
File metadata
- Download URL: tedfulk_kb_pycrawler-0.1.3-py3-none-any.whl
- Upload date:
- Size: 3.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.12.1 Darwin/23.2.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 25d6b27e76e948b5fb2f1fc305ddd5d88f6112d1ba3a628441ada0ca3fd2b02a |
| MD5 | af46ad3623ca4045938156afc73e1a78 |
| BLAKE2b-256 | eba0f5b380cfe3cd398db3db8bde74667efc47fb1697187c6a198cfb5b620ba4 |