A comprehensive web content scraping tool for text, images, audio and video

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language

Project description

Web Content Scraper - README

Overview

The Web Content Scraper is a comprehensive Python tool for extracting and processing various types of web content including text, images, audio, video, and tabular data. The package is organized into three main classes with distinct functionalities:

html: Web content extraction and parsing
run: Direct content downloading
show: Content display and playback

Features

1. `html` Class

Text Extraction

html.txt(mode, url, content_class, next_page_class, content_tag='div', next_page_tag='div', base_url='https:/', link_index=None)

mode: Extraction mode ('br' or 'p')
url: Starting URL
content_class: Class of content container
next_page_class: Class of next page link container
content_tag: HTML tag of content (default 'div')
next_page_tag: HTML tag containing next page link (default 'div')
base_url: Base URL for relative links (default 'https:/')
link_index: Index of tag if multiple exist (default None)
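The README does not spell out what the two `mode` values do. As a rough sketch, assuming 'br' means one line per `<br>`-separated run of text and 'p' means one line per `<p>` element, the difference could look like the code below. `TextExtractor` and `extract_text` are hypothetical names, and the standard library's `html.parser` stands in for the beautifulsoup4/lxml stack the package depends on.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text lines from inside <content_tag class="content_class">."""

    def __init__(self, mode, content_class, content_tag="div"):
        super().__init__()
        self.mode = mode
        self.content_class = content_class
        self.content_tag = content_tag
        self.depth = 0      # nesting depth inside the content container
        self.in_p = 0       # number of open <p> tags (used in 'p' mode)
        self.lines = [""]

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.content_tag:
                self.depth += 1
            elif self.mode == "br" and tag == "br":
                self.lines.append("")       # <br> starts a new line
            elif self.mode == "p" and tag == "p":
                self.lines.append("")       # each <p> is its own line
                self.in_p += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if tag == self.content_tag and self.content_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.content_tag:
            self.depth -= 1
        elif self.depth and tag == "p" and self.in_p:
            self.in_p -= 1

    def handle_data(self, data):
        # 'br' mode keeps all container text; 'p' mode keeps only text in <p>.
        if self.depth and (self.mode == "br" or self.in_p):
            self.lines[-1] += data


def extract_text(page_html, mode, content_class, content_tag="div"):
    """Return the non-empty text lines of the content container."""
    if mode not in ("br", "p"):
        raise ValueError("mode must be 'br' or 'p'")
    parser = TextExtractor(mode, content_class, content_tag)
    parser.feed(page_html)
    parser.close()
    return [line.strip() for line in parser.lines if line.strip()]
```

Under this reading, 'br' suits pages that put the whole article in one container with line breaks, while 'p' suits pages that wrap each paragraph in its own element.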

Image Downloading

html.img(url, container_class=None, url_prefix=None)

url: Target webpage URL
container_class: Class of image container (optional)
url_prefix: URL prefix for relative image paths (optional)
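To illustrate what `url_prefix` is for, here is a minimal sketch of resolving image paths found on a page. It assumes the prefix replaces the page URL as the base for relative `src` values and that absolute URLs are left alone; the package's actual behaviour may differ. `ImageSrcParser` and `image_urls` are hypothetical names, filtering by `container_class` is omitted for brevity, and only the standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(page_html, page_url, url_prefix=None):
    """Return absolute image URLs, resolving relative paths against
    url_prefix when given, otherwise against the page URL."""
    parser = ImageSrcParser()
    parser.feed(page_html)
    parser.close()
    base = url_prefix if url_prefix is not None else page_url
    return [urljoin(base, src) for src in parser.sources]
```

A prefix is useful when a page references images with relative paths but serves them from a different host, such as a CDN.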

Audio Downloading

html.audio(url, container_class=None)

url: Target webpage URL
container_class: Class of audio container (optional)

Table Extraction

html.table(url, sort_order=None, sort_column='')

url: Webpage URL containing table
sort_order: None/True/False for no sort/ascending/descending
sort_column: Column name to sort by
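The three-valued `sort_order` can be sketched as follows, assuming table rows have already been parsed into dictionaries keyed by column name. `sort_rows` is a hypothetical helper written with the standard library only; since the package lists pandas as a dependency, it may well sort with `DataFrame.sort_values` instead.

```python
def sort_rows(rows, sort_order=None, sort_column=""):
    """Return rows unsorted (None), ascending (True), or descending (False)."""
    if sort_order is None:
        return list(rows)           # no sorting requested
    # reverse=True gives descending order, so invert the flag.
    return sorted(rows, key=lambda row: row[sort_column], reverse=not sort_order)
```

A row that lacks `sort_column` raises `KeyError` here, which makes a misspelled column name fail loudly rather than silently returning unsorted data.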

2. `run` Class

Direct Downloads

run.music(url, output_name='1')
run.video(url, output_name='1', url_prefix=None)
run.txt(url, output_name='1')
run.table(url, sort_order=None, sort_column='')

url: Direct media URL
output_name: Output filename (without extension)
url_prefix: URL prefix for video fragments (optional)
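The README does not say which fragment format `run.video` handles. Purely as an assumption, the sketch below treats the input as an HLS (M3U8) playlist, where every non-comment line names one fragment, and shows how a `url_prefix` could turn relative fragment names into downloadable URLs. `fragment_urls` is a hypothetical helper, not part of the package.

```python
from urllib.parse import urljoin


def fragment_urls(playlist_text, playlist_url, url_prefix=None):
    """List fragment URLs from an M3U8-style playlist (assumed format)."""
    base = url_prefix if url_prefix is not None else playlist_url
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                # skip blank lines and playlist tags
        urls.append(urljoin(base, line))
    return urls
```

For a prefix to act as a directory it should end with a slash, for example `https://cdn.example.com/videos/`; otherwise `urljoin` replaces its last path segment.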

3. `show` Class

Content Display

show.txt(mode, filename, start=1, end=1)
show.image(filename)
show.music(filename)
show.video(filename)

mode: Display mode ('连续' for sequence or '单个' for single file)
filename: Base filename (without extension)
start: First file in sequence
end: Last file in sequence

Excel Handling Functions

handle_excel(mode='merge')

mode: Operation mode ('merge', 'statistics', or 'duplicate')

Dependencies

Core:
- requests>=2.25.0
- beautifulsoup4>=4.9.0
- lxml>=4.6.0
- Pillow>=8.0.0
Media:
- audioplayer>=0.7
- moviepy>=1.0.0
Data:
- pandas>=1.2.0
- openpyxl>=3.0.0

Usage Examples

Text Extraction

# Extract text from multiple pages
html.txt('br', 'https://example.com/page1', 'article-content', 'pagination', 'div', 'nav', 'https://example.com', 0)
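For readers wondering how the last four arguments interact, the sketch below follows "next page" links across an in-memory set of pages. It assumes `link_index` selects among the links found inside the next-page container and that `base_url` is used to resolve relative links; the package may interpret them differently. `NextLinkParser`, `next_page_url`, and `crawl` are hypothetical names, and `get_page` is a stand-in for an HTTP fetch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Collect hrefs of <a> tags nested inside <tag class="cls">."""

    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.depth = 0              # nesting depth inside the container
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if tag == self.tag:
                self.depth += 1
            elif tag == "a" and attrs.get("href"):
                self.hrefs.append(attrs["href"])
        elif tag == self.tag and self.cls in (attrs.get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1


def next_page_url(page_html, next_page_class, next_page_tag="div",
                  base_url="https://example.com", link_index=None):
    """Return the absolute URL of the next page, or None if there is none."""
    parser = NextLinkParser(next_page_tag, next_page_class)
    parser.feed(page_html)
    parser.close()
    try:
        href = parser.hrefs[0 if link_index is None else link_index]
    except IndexError:
        return None
    return urljoin(base_url, href)


def crawl(start_url, get_page, max_pages=50, **link_options):
    """Visit pages in order until no next link is found or a URL repeats."""
    visited, url = [], start_url
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        url = next_page_url(get_page(url), **link_options)
    return visited
```

The `max_pages` cap and the repeated-URL check guard against pagination loops, which are easy to hit when the last page links back to the first.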

Image Download

# Download all images from a gallery
html.img('https://example.com/gallery', 'gallery-container', 'https://cdn.example.com')

Play Downloaded Content

# Play the first downloaded audio file
show.music('1')

Notes

Always check a website's terms of service before scraping.
Consider adding delays between requests to avoid overloading servers.
The User-Agent header mimics the Chrome browser to reduce blocking.
Error handling is basic; consider adding more specific exception handling.
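The notes on delays and exception handling can be combined in a small wrapper around whatever fetch function you use. This is a minimal sketch using only the standard library: `fetch` and `fetch_politely` are hypothetical helpers, the User-Agent string is an assumed example rather than the one the package sends, and the delay and retry counts are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request

# Assumed example value; the package's actual User-Agent is not documented here.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36"
}


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def fetch_politely(urls, fetch_one=fetch, delay=1.0, retries=2):
    """Fetch URLs one at a time, pausing between requests and retrying
    network errors. Failed URLs map to None instead of aborting the run."""
    results = {}
    for position, url in enumerate(urls):
        if position:
            time.sleep(delay)               # pause between different URLs
        for attempt in range(retries + 1):
            try:
                results[url] = fetch_one(url)
                break
            except (urllib.error.URLError, TimeoutError):
                if attempt == retries:
                    results[url] = None     # give up on this URL only
                else:
                    time.sleep(delay)       # wait before retrying
    return results
```

Catching only network-related exceptions keeps programming errors visible, and passing the fetch function as a parameter makes the loop easy to exercise without touching the network.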

For more detailed examples, see the examples/ directory in the package.


Release history

- 0.2.0: Jun 16, 2025
- 0.1.8: May 27, 2025
- 0.1.6: May 8, 2025 (this version)
- 0.1.5: May 8, 2025
- 0.1.1: May 5, 2025

Download files

Source Distribution

requests_ss-0.1.6.tar.gz (5.1 kB)

Uploaded May 8, 2025 Source

Built Distribution

requests_ss-0.1.6-py3-none-any.whl (6.8 kB)

Uploaded May 8, 2025 Python 3

File details

Details for the file requests_ss-0.1.6.tar.gz.

File metadata

Download URL: requests_ss-0.1.6.tar.gz
Upload date: May 8, 2025
Size: 5.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for requests_ss-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`495aee6faec03afac3fe377a019931d18a68002dc25cdb023f01df7ac171cd27`
MD5	`ad6dab6d85a520aede07ca3408fd2309`
BLAKE2b-256	`9015dcf6c1d9d1b1078293bb1e4b80da817f84f3316104880463c15fce121896`

File details

Details for the file requests_ss-0.1.6-py3-none-any.whl.

File metadata

Download URL: requests_ss-0.1.6-py3-none-any.whl
Upload date: May 8, 2025
Size: 6.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for requests_ss-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`018029a43c13ec0482bfd7242ba269c168ec706290cd3f57b57782a77a9c0bf0`
MD5	`08eedff2b918d90cda0b21b021fc6647`
BLAKE2b-256	`0ab195e14850dcb4b5a490d4d6cda9f1f49d78a5c16602d368c33842b647d7dc`

requests-ss 0.1.6
