A comprehensive web content scraping tool for text, images, audio and video
Project description
HTML Content Scraper - README
Overview
This Python script provides a set of tools for scraping and processing different types of content from web pages, including text, images, audio, and video. It is organized into three classes: html scrapes content out of pages, run downloads media directly from URLs, and show displays or plays the downloaded content.
Features
1. html Class
- Text Extraction: Extracts text content from HTML elements (div, p, etc.) and saves it to text files.
- Image Downloading: Downloads images from specified HTML elements or all images on a page.
- Audio Downloading: Extracts and downloads audio files from HTML audio elements.
2. run Class
- Direct Downloads: Downloads media files (music, video, text) directly from provided URLs.
- Video Processing: Handles video content with optional URL prefixing.
3. show Class
- Content Display: Displays downloaded content, including text files, images, audio, and video.
Usage
Text Extraction
```python
html.txt('br', url, class_name, next_page_class, tag='div', next_tag='div', base_url='https:/', index=None)
```
- 'br' or 'p': Specifies the extraction mode.
- url: The starting URL to scrape.
- class_name: The class of the element containing the text.
- next_page_class: The class of the element containing the link to the next page.
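The two extraction modes can be illustrated with a small standard-library sketch. This is not the code behind html.txt (which, per the Dependencies section, relies on BeautifulSoup). The names `TextExtractor` and `extract_text` are hypothetical, and the reading that 'br' splits the container text at br tags while 'p' collects the text of each p tag is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text lines from elements matching a tag and class (illustrative only)."""

    def __init__(self, mode, tag, class_name):
        super().__init__()
        self.mode = mode              # 'br' or 'p' (assumed meaning, see lead-in)
        self.tag = tag
        self.class_name = class_name
        self.depth = 0                # nesting depth inside the target element
        self.in_p = False
        self.buffer = []
        self.lines = []

    def _flush(self):
        # Turn the buffered text into one line, skipping empty results.
        text = "".join(self.buffer).strip()
        if text:
            self.lines.append(text)
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1
            if tag == "br" and self.mode == "br":
                self._flush()
            elif tag == "p" and self.mode == "p":
                self.in_p = True
        elif tag == self.tag and self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "p" and self.mode == "p":
            self._flush()
            self.in_p = False
        if tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self._flush()

    def handle_data(self, data):
        if self.depth and (self.mode == "br" or self.in_p):
            self.buffer.append(data)


def extract_text(mode, html_source, class_name, tag="div"):
    """Return the list of text lines found in the matching elements."""
    parser = TextExtractor(mode, tag, class_name)
    parser.feed(html_source)
    parser.close()
    return parser.lines


sample = '<div class="content">Line one<br>Line two<br/>Line three</div>'
print(extract_text("br", sample, "content"))
```

Following the next-page link and writing each page to a text file are left out of the sketch.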
Image Downloading
```python
html.img(url, container_class=None, prefix=None)
```
- url: The URL of the page containing images.
- container_class: Optional. The class of the container element holding the images.
- prefix: Optional. A URL prefix to prepend to image sources.
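As a rough illustration of what the prefix parameter does, the sketch below collects the src attribute of every img tag and prepends the prefix by plain string concatenation. `ImageCollector` and `plan_image_downloads` are hypothetical names, not part of this package. The sequential output names ('1', '2', ...) are an assumption based on the Example section, where show.image('1') displays the first downloaded image.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Record the src attribute of every img tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip img tags without a usable src
                self.sources.append(src)


def plan_image_downloads(html_source, prefix=None):
    """Return (output_name, full_url) pairs; nothing is downloaded here."""
    collector = ImageCollector()
    collector.feed(html_source)
    collector.close()
    return [
        (str(number), (prefix or "") + src)
        for number, src in enumerate(collector.sources, start=1)
    ]


page = '<div class="image-container"><img src="/a.png"><img src="/b.jpg"/></div>'
for name, full_url in plan_image_downloads(page, prefix="https://example.com"):
    print(name, full_url)
```

A prefix is typically needed when a page uses site-relative paths such as /a.png, which cannot be requested on their own.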
Audio Downloading
```python
html.audio(url, container_class=None)
```
- url: The URL of the page containing audio files.
- container_class: Optional. The class of the container element holding the audio files.
Direct Media Download
```python
run.music(url, output_name='1')               # For audio
run.video(url, output_name='1', prefix=None)  # For video
run.txt(url, output_name='1')                 # For text
```
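A direct download of this kind can be sketched with the standard library alone. The function below is an illustration, not this package's implementation (which uses requests). The name `download`, the `suffix` parameter, and the choice of file extension are all assumptions.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, output_name="1", suffix=".mp3", prefix=None):
    """Stream the resource at prefix + url into the file output_name + suffix."""
    full_url = (prefix or "") + url
    target = Path(output_name + suffix)
    # Copy in chunks so large media files are never held in memory at once.
    with urllib.request.urlopen(full_url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Under these assumptions, download('https://example.com/song.mp3', output_name='1') would write 1.mp3 to the current directory.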
Display Content
```python
show.txt('连续', txt, start=1, end=1)  # '连续' ("continuous"): display multiple text files
show.txt('单个', txt)                  # '单个' ("single"): display a single text file
show.image(img)                        # Display an image
show.music(mp3)                        # Play audio
show.video(mp4)                        # Play video
```
Dependencies
- requests: For making HTTP requests.
- bs4 (BeautifulSoup): For parsing HTML.
- lxml: As a parser for BeautifulSoup.
- PIL (Pillow): For image display.
- audioplayer: For audio playback.
- moviepy: For video playback.
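One way to confirm that these modules are importable before running the script is a short check like the one below. It is a sketch rather than part of the package, and `missing_modules` is a hypothetical helper name. The names listed are the import names given above; the corresponding pip packages for bs4 and PIL are beautifulsoup4 and Pillow.

```python
from importlib.util import find_spec

# Import names taken from the dependency list above.
REQUIRED_MODULES = ["requests", "bs4", "lxml", "PIL", "audioplayer", "moviepy"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```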
Notes
- Ensure all dependencies are installed before running the script.
- The script includes error handling with basic try-except blocks.
- User-Agent is set to mimic a Chrome browser to avoid blocking by some websites.
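The last two notes can be pictured with the following sketch, which sends a browser-like User-Agent header and wraps the request in a basic try-except block. It uses urllib from the standard library instead of requests so that it runs without extra installs. The exact User-Agent string and the names `build_request` and `fetch` are assumptions, not values taken from the script.

```python
import urllib.request

# Illustrative Chrome-style User-Agent; the string the script really uses is not documented here.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
}


def build_request(url):
    """Attach the browser-like headers to a request object."""
    return urllib.request.Request(url, headers=HEADERS)


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError) as exc:
        # OSError covers network and HTTP errors; ValueError covers malformed URLs.
        print(f"Request failed for {url}: {exc}")
        return None
```

With requests, the equivalent would be passing the same dictionary as the headers argument of requests.get.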
Example
```python
# Download images from a webpage
html.img('https://example.com/gallery', 'image-container')

# Display the first downloaded image
show.image('1')
```
This tool is useful for scraping and organizing content from web pages efficiently. Adjust parameters as needed for specific use cases.
Download files
File details
Details for the file requests_ss-0.1.5.tar.gz.
File metadata
- Download URL: requests_ss-0.1.5.tar.gz
- Upload date:
- Size: 5.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c462d87e1ea4c80014fa6c9b7752cebe318ff27c693612a98cce72f81726b145 |
| MD5 | 9c93ab9edc4c2b7a72080be1c48a19b7 |
| BLAKE2b-256 | 30ab26c6596ef6fe96a9acdc055f3b01bd70941edc06d7bdc38c1d65cbe580af |
File details
Details for the file requests_ss-0.1.5-py3-none-any.whl.
File metadata
- Download URL: requests_ss-0.1.5-py3-none-any.whl
- Upload date:
- Size: 6.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 33e2327c9aad894fbd40e5b2beeef6e365339ce4a6048a538a0cbe80a6093005 |
| MD5 | d1af5eef381483e87df40a41a131c449 |
| BLAKE2b-256 | 546e9a550a22b400dc87b629d8b430878b5efd236e05f3bb17a2e8f3b7c7f84d |