A comprehensive web content scraping tool for text, images, audio and video

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language

Project description

HTML Content Scraper - README

Overview

This Python script provides a set of tools for scraping and processing different types of content from web pages, including text, images, audio, and video. It's organized into three main classes: `html`, `run`, and `show`, each serving different purposes in the content scraping and display process.

Features

1. `html` Class

Text Extraction: Extracts text content from HTML elements (div, p, etc.) and saves it to text files.
Image Downloading: Downloads images from specified HTML elements or all images on a page.
Audio Downloading: Extracts and downloads audio files from HTML audio elements.

2. `run` Class

Direct Downloads: Downloads media files (music, video, text) directly from provided URLs.
Video Processing: Handles video content with optional URL prefixing.

3. `show` Class

Content Display: Displays downloaded content, including text files, images, audio, and video.

Usage

Text Extraction

html.txt('br', url, class_name, next_page_class, tag='div', next_tag='div', base_url='https:/', index=None)

'br' or 'p': Specifies the extraction mode.
url: The starting URL to scrape.
class_name: The class of the element containing the text.
next_page_class: The class of the element containing the link to the next page.
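
The parameters above can be illustrated with a standard-library sketch. It is not the package's implementation: `PageParser` and `extract_page` are hypothetical names, the next-page link is resolved with `urljoin` against the page URL rather than through `base_url`, and because the difference between the 'br' and 'p' modes is not described here, the sketch only turns `<br>` into newlines. It assumes well-formed HTML with every non-void tag closed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Elements without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PageParser(HTMLParser):
    """Collect text inside one element and the href of a next-page link."""

    def __init__(self, class_name, next_page_class, tag="div", next_tag="div"):
        super().__init__()
        self.class_name = class_name
        self.next_page_class = next_page_class
        self.tag = tag
        self.next_tag = next_tag
        self.text_depth = 0  # > 0 while inside the text container
        self.next_depth = 0  # > 0 while inside the next-page container
        self.chunks = []
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag in VOID_TAGS:
            if tag == "br" and self.text_depth:
                self.chunks.append("\n")  # keep line breaks as newlines
            return
        if self.text_depth:
            self.text_depth += 1
        elif tag == self.tag and self.class_name in classes:
            self.text_depth = 1
        if self.next_depth:
            self.next_depth += 1
        elif tag == self.next_tag and self.next_page_class in classes:
            self.next_depth = 1
        # First <a> inside the next-page container supplies the link.
        if self.next_depth and tag == "a" and self.next_href is None:
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.text_depth:
            self.text_depth -= 1
        if self.next_depth:
            self.next_depth -= 1

    def handle_data(self, data):
        if self.text_depth:
            self.chunks.append(data)


def extract_page(html_text, page_url, class_name, next_page_class,
                 tag="div", next_tag="div"):
    """Return (text, next_url); next_url is None when no link is found."""
    parser = PageParser(class_name, next_page_class, tag, next_tag)
    parser.feed(html_text)
    parser.close()
    text = "".join(parser.chunks).strip()
    next_url = urljoin(page_url, parser.next_href) if parser.next_href else None
    return text, next_url
```

A scraper built this way would call `extract_page` in a loop, fetching `next_url` each time until it comes back as `None`.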

Image Downloading

html.img(url, container_class=None, prefix=None)

url: The URL of the page containing images.
container_class: Optional. The class of the container element holding the images.
prefix: Optional. A URL prefix to prepend to image sources.
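
To make the role of `prefix` concrete, here is a small sketch that collects `src` values from `<img>` tags and prepends a prefix to each. It assumes the prefix is joined by plain string concatenation, which the description above suggests but does not confirm; `ImgCollector` and `image_urls` are hypothetical names, and container filtering is left out for brevity.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip images without a usable src
                self.sources.append(src)


def image_urls(html_text, prefix=None):
    """Return image URLs, prepending `prefix` when one is given (assumed behavior)."""
    collector = ImgCollector()
    collector.feed(html_text)
    collector.close()
    return [(prefix or "") + src for src in collector.sources]
```

A prefix is typically needed when a page uses site-relative paths such as `/img/a.png`, which cannot be downloaded without the scheme and host.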

Audio Downloading

html.audio(url, container_class=None)

url: The URL of the page containing audio files.
container_class: Optional. The class of the container element holding the audio files.

Direct Media Download

run.music(url, output_name='1')  # For audio
run.video(url, output_name='1', prefix=None)  # For video
run.txt(url, output_name='1')  # For text

Display Content

show.txt('连续', txt, start=1, end=1)  # Display multiple text files ('连续' means "continuous")
show.txt('单个', txt)  # Display a single text file ('单个' means "single")
show.image(img)  # Display an image
show.music(mp3)  # Play audio
show.video(mp4)  # Play video

Dependencies

requests: For making HTTP requests.
bs4 (BeautifulSoup): For parsing HTML.
lxml: As a parser for BeautifulSoup.
PIL (Pillow): For image display.
audioplayer: For audio playback.
moviepy: For video playback.
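
One way to confirm these modules are importable before running the script is sketched below. The list uses the import names given above; `missing_modules` is a hypothetical helper, not part of the package. Note that two import names differ from their PyPI distribution names: `bs4` is installed as `beautifulsoup4` and `PIL` as `Pillow`.

```python
from importlib.util import find_spec

# Import names as listed in the Dependencies section.
REQUIRED_MODULES = ["requests", "bs4", "lxml", "PIL", "audioplayer", "moviepy"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]
```

Calling `missing_modules()` returns an empty list when everything is installed; otherwise it names the modules still to be installed.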

Notes

Ensure all dependencies are installed before running the script.
The script includes error handling with basic try-except blocks.
User-Agent is set to mimic a Chrome browser to avoid blocking by some websites.
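
The User-Agent note can be illustrated with a sketch that builds a request without sending it. It uses the standard library's `urllib.request` so that it runs without dependencies; the package itself lists `requests`, where the equivalent would be passing a `headers` dictionary to `requests.get`. The header string here is a hypothetical Chrome-like value, since the exact string the script sends is not shown.

```python
from urllib.request import Request

# Hypothetical Chrome-like value; the package's actual string is not documented here.
CHROME_LIKE_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_request(url, user_agent=CHROME_LIKE_UA):
    """Create a request object carrying a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})
```

Passing the result to `urllib.request.urlopen` would perform the download. Some sites reject the default Python client identifier, which is the situation the note describes.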

Example

# Download images from a webpage
html.img('https://example.com/gallery', 'image-container')

# Display the first downloaded image
show.image('1')

This tool is useful for scraping and organizing content from web pages efficiently. Adjust parameters as needed for specific use cases.

Release history

- 0.2.0: Jun 16, 2025
- 0.1.8: May 27, 2025
- 0.1.6: May 8, 2025
- 0.1.5: May 8, 2025
- 0.1.1 (this version): May 5, 2025

Download files

Source Distribution

requests_ss-0.1.1.tar.gz (4.7 kB)

Uploaded May 5, 2025 Source

Built Distribution

requests_ss-0.1.1-py3-none-any.whl (4.7 kB)

Uploaded May 5, 2025 Python 3

File details

Details for the file requests_ss-0.1.1.tar.gz.

File metadata

Download URL: requests_ss-0.1.1.tar.gz
Upload date: May 5, 2025
Size: 4.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for requests_ss-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`12538687ef76b57d164dade39e08b5670f23f474d467f2037a653ce74d702f27`
MD5	`3b791fdc9f805a9217a0bf754d967509`
BLAKE2b-256	`e52d24444e151ebfae67fda080f3f85fb6b00fb9b145fdb42ae6cac07424c3e1`

File details

Details for the file requests_ss-0.1.1-py3-none-any.whl.

File metadata

Download URL: requests_ss-0.1.1-py3-none-any.whl
Upload date: May 5, 2025
Size: 4.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for requests_ss-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`034b12e35038afa2f0485fc895ea22336deb9460f80630c8bd54aba05bbd387b`
MD5	`92962df9887b4a2c2d0baed3e37e8345`
BLAKE2b-256	`455240892c76bf0060ec7f14b59b5f3a9e8cc2a3e9dc7467b420aafcc89d6f0f`

requests-ss 0.1.1
