A comprehensive web content scraping tool for text, images, audio and video
Web Content Scraper - README
Overview
The Web Content Scraper is a comprehensive Python tool for extracting and processing various types of web content including text, images, audio, video, and tabular data. The package is organized into three main classes with distinct functionalities:
- html: Web content extraction and parsing
- run: Direct content downloading
- show: Content display and playback
Features
1. html Class
Text Extraction
html.txt(mode, url, content_class, next_page_class, content_tag='div', next_page_tag='div', base_url='https:/', link_index=None)
- mode: Extraction mode ('br' or 'p')
- url: Starting URL
- content_class: Class of content container
- next_page_class: Class of next page link container
- content_tag: HTML tag of content (default 'div')
- next_page_tag: HTML tag containing next page link (default 'div')
- base_url: Base URL for relative links (default 'https:/')
- link_index: Index of tag if multiple exist (default None)
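To make the roles of content_class, next_page_class and the base URL concrete, here is a minimal sketch of the general technique: collect the text inside one container and resolve the first link inside another. It is not the package's implementation. The names PageParser and parse_page are hypothetical, the sketch uses the standard library's html.parser instead of BeautifulSoup so it runs without extra installs, and treating <br> as a line break is only an assumption about what the 'br' mode might do.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PageParser(HTMLParser):
    """Hypothetical helper: gather text from one container and the first
    link from another. Assumes reasonably well-formed HTML (every non-void
    tag is closed)."""

    def __init__(self, content_tag, content_class, next_tag, next_class):
        super().__init__()
        self.content = (content_tag, content_class)
        self.next = (next_tag, next_class)
        self.content_depth = 0  # > 0 while inside the content container
        self.next_depth = 0     # > 0 while inside the next-page container
        self.text_parts = []
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag in VOID_TAGS:
            # Assumption: a <br> inside the content becomes a newline.
            if tag == "br" and self.content_depth:
                self.text_parts.append("\n")
            return
        if self.content_depth:
            self.content_depth += 1
        elif tag == self.content[0] and self.content[1] in classes:
            self.content_depth = 1
        if self.next_depth:
            self.next_depth += 1
            if tag == "a" and self.next_href is None:
                self.next_href = attrs.get("href")
        elif tag == self.next[0] and self.next[1] in classes:
            self.next_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.content_depth:
            self.content_depth -= 1
        if self.next_depth:
            self.next_depth -= 1

    def handle_data(self, data):
        if self.content_depth:
            self.text_parts.append(data)


def parse_page(html_text, page_url, content_class, next_page_class,
               content_tag="div", next_page_tag="div"):
    """Return (text, next_url); next_url is None when no link is found."""
    parser = PageParser(content_tag, content_class,
                        next_page_tag, next_page_class)
    parser.feed(html_text)
    text = "".join(parser.text_parts).strip()
    # urljoin turns a relative href such as "/page2" into an absolute URL.
    next_url = urljoin(page_url, parser.next_href) if parser.next_href else None
    return text, next_url


# Offline demonstration on an inline snippet (no network access needed).
sample = ('<div class="article-content">Line one<br>Line two</div>'
          '<nav class="pagination"><a href="/page2">Next</a></nav>')
print(parse_page(sample, "https://example.com/page1",
                 "article-content", "pagination", "div", "nav"))
```

A multi-page crawl would call a function like this in a loop, following next_url until it is None.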
Image Downloading
html.img(url, container_class=None, url_prefix=None)
- url: Target webpage URL
- container_class: Class of image container (optional)
- url_prefix: URL prefix for relative image paths (optional)
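The url_prefix parameter matters when a page lists images by relative path. The sketch below shows one plausible way to resolve an image's src attribute. The function name resolve_image_url is hypothetical, and the rule that url_prefix takes priority over the page URL for relative paths is an assumption, not documented behavior of html.img.

```python
from urllib.parse import urljoin, urlparse


def resolve_image_url(src, page_url, url_prefix=None):
    """Hypothetical helper: turn an <img src> value into an absolute URL."""
    # Already absolute (http, https, data, ...): use as-is.
    if urlparse(src).scheme:
        return src
    # Protocol-relative ("//host/path"): borrow the page's scheme.
    if src.startswith("//"):
        return urlparse(page_url).scheme + ":" + src
    # Assumption: an explicit prefix wins over the page URL.
    if url_prefix:
        return url_prefix.rstrip("/") + "/" + src.lstrip("/")
    # Otherwise resolve relative to the page itself.
    return urljoin(page_url, src)


print(resolve_image_url("/img/a.png", "https://example.com/gallery",
                        "https://cdn.example.com"))
```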
Audio Downloading
html.audio(url, container_class=None)
- url: Target webpage URL
- container_class: Class of audio container (optional)
Table Extraction
html.table(url, sort_order=None, sort_column='')
- url: Webpage URL containing table
- sort_order: None/True/False for no sort/ascending/descending
- sort_column: Column name to sort by
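The three-valued sort_order can be illustrated on plain Python data. This is only a sketch of the None/True/False convention described above: sort_rows is a hypothetical name, rows are modeled as a list of dicts, and the package itself may well sort with pandas instead.

```python
def sort_rows(rows, sort_order=None, sort_column=""):
    """Hypothetical helper mirroring the sort_order convention:
    None = keep original order, True = ascending, False = descending."""
    if sort_order is None or not sort_column:
        return list(rows)
    return sorted(rows, key=lambda row: row[sort_column],
                  reverse=not sort_order)


rows = [{"name": "b", "score": 2}, {"name": "a", "score": 1}]
print(sort_rows(rows, sort_order=True, sort_column="score"))
```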
2. run Class
Direct Downloads
run.music(url, output_name='1')
run.video(url, output_name='1', url_prefix=None)
run.txt(url, output_name='1')
run.table(url, sort_order=None, sort_column='')
- url: Direct media URL
- output_name: Output filename (without extension)
- url_prefix: URL prefix for video fragments (optional)
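The README does not say what a "video fragment" is. One common case is an HLS playlist (.m3u8) whose segment paths are relative, and the sketch below assumes that case purely for illustration. The function name fragment_urls and the playlist handling are hypothetical and may not match what run.video actually does.

```python
def fragment_urls(playlist_text, url_prefix=None):
    """Hypothetical helper: list segment URLs from an HLS-style playlist,
    prepending url_prefix to relative entries."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Skip blank lines and playlist directives such as "#EXTINF".
        if not line or line.startswith("#"):
            continue
        if url_prefix and not line.startswith(("http://", "https://")):
            line = url_prefix.rstrip("/") + "/" + line.lstrip("/")
        urls.append(line)
    return urls


playlist = "#EXTM3U\n#EXTINF:10,\nseg0.ts\n#EXTINF:10,\nseg1.ts\n"
print(fragment_urls(playlist, "https://cdn.example.com/video/"))
```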
3. show Class
Content Display
show.txt(mode, filename, start=1, end=1)
show.image(filename)
show.music(filename)
show.video(filename)
- mode: Display mode ('连续' for a sequence of files or '单个' for a single file)
- filename: Base filename (without extension)
- start: First file in sequence
- end: Last file in sequence
Excel Handling Functions
handle_excel(mode='merge')
mode: Operation mode ('merge', 'statistics', or 'duplicate')
Dependencies
- Core: requests>=2.25.0, beautifulsoup4>=4.9.0, lxml>=4.6.0, Pillow>=8.0.0
- Media: audioplayer>=0.7, moviepy>=1.0.0
- Data: pandas>=1.2.0, openpyxl>=3.0.0
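Before running the tool, it can help to confirm which of these packages are present. The following sketch uses the standard library's importlib.metadata to report installed versions. The helper name installed_versions is hypothetical, and the sketch only reports versions rather than comparing them against the minimums listed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the dependency list above.
REQUIRED = ["requests", "beautifulsoup4", "lxml", "Pillow",
            "audioplayer", "moviepy", "pandas", "openpyxl"]


def installed_versions(packages):
    """Hypothetical helper: map each package name to its installed
    version string, or None when the package is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'not installed'}")
```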
Usage Examples
Text Extraction
# Extract text from multiple pages
html.txt('br', 'https://example.com/page1', 'article-content', 'pagination', 'div', 'nav', 'https://example.com', 0)
Image Download
# Download all images from a gallery
html.img('https://example.com/gallery', 'gallery-container', 'https://cdn.example.com')
Play Downloaded Content
# Play the first downloaded audio file
show.music('1')
Notes
- Always check website terms of service before scraping
- Consider adding delays between requests to avoid overloading servers
- The User-Agent header mimics a Chrome browser to reduce blocking
- Error handling is basic - consider adding more specific exception handling
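The notes on request delays, the User-Agent header, and exception handling can be combined into one pattern. The sketch below is an illustration under assumptions, not code from the package: fetch and fetch_all are hypothetical names, the User-Agent string is an example value, and urllib from the standard library stands in for requests so the block has no third-party dependencies.

```python
import time
import urllib.error
import urllib.request

# Example Chrome-like User-Agent; the exact string the package sends is unknown.
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/120.0.0.0 Safari/537.36")


def fetch(url, timeout=10):
    """Download one URL and return the response body as bytes."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, delay_seconds=1.0, fetch_one=fetch, sleep=time.sleep):
    """Fetch URLs one by one, pausing between requests.

    Failed downloads are recorded as None instead of stopping the run.
    fetch_one and sleep are injectable so the loop can be tested offline.
    """
    results = {}
    for index, url in enumerate(urls):
        if index:
            sleep(delay_seconds)  # be polite: wait before each later request
        try:
            results[url] = fetch_one(url)
        except urllib.error.HTTPError as exc:
            # The server answered, but with an error status (404, 403, ...).
            print(f"HTTP {exc.code} for {url}")
            results[url] = None
        except (urllib.error.URLError, TimeoutError) as exc:
            # DNS failure, refused connection, or timeout.
            print(f"Could not reach {url}: {exc}")
            results[url] = None
    return results
```

Calling fetch_all(['https://example.com/page1', 'https://example.com/page2'], delay_seconds=2) would then space the two requests two seconds apart.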
For more detailed examples, see the examples/ directory in the package.
File details
Details for the file requests_ss-0.1.6.tar.gz.
File metadata
- Download URL: requests_ss-0.1.6.tar.gz
- Upload date:
- Size: 5.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 495aee6faec03afac3fe377a019931d18a68002dc25cdb023f01df7ac171cd27 |
| MD5 | ad6dab6d85a520aede07ca3408fd2309 |
| BLAKE2b-256 | 9015dcf6c1d9d1b1078293bb1e4b80da817f84f3316104880463c15fce121896 |
File details
Details for the file requests_ss-0.1.6-py3-none-any.whl.
File metadata
- Download URL: requests_ss-0.1.6-py3-none-any.whl
- Upload date:
- Size: 6.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 018029a43c13ec0482bfd7242ba269c168ec706290cd3f57b57782a77a9c0bf0 |
| MD5 | 08eedff2b918d90cda0b21b021fc6647 |
| BLAKE2b-256 | 0ab195e14850dcb4b5a490d4d6cda9f1f49d78a5c16602d368c33842b647d7dc |