A comprehensive web content scraping tool for text, images, audio and video
Web Scraping and Data Processing Toolkit
Overview
This Python module provides a comprehensive set of tools for web scraping, data extraction, and basic data processing. It includes functionality for handling text, images, audio, video, and tabular data from web sources.
Installation
Ensure you have Python 3.6+ installed, then install the dependencies:
pip install requests beautifulsoup4 lxml pandas Pillow audioplayer moviepy
Module Structure
1. html Class - Web Content Extraction
Methods:
- txts(): Extract text content from web pages with pagination support
- txt(): Basic text extraction from paragraphs or entire pages
- img(): Download images from web pages
- audio(): Extract audio files from web pages
- table(): Extract and process HTML tables
2. run Class - Direct Content Download
Methods:
- music(): Download audio files directly
- video(): Download video content
- txt(): Download and save text content
- table(): Extract and process HTML tables
3. show Class - Content Display
Methods:
- txt(): Display text content from files
- image(): Display downloaded images
- music(): Play audio files
- video(): Preview video files
4. Excel Utilities
handle_excel(): Provides three modes:
- merge: Combine multiple Excel files
- statistics: Generate value counts for specified data
- duplicate: Remove duplicates from Excel data
Usage Examples
Basic Text Extraction
html.txt("https://example.com", mode='p', txt='output')
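The package's internals are not shown here, so the following is only a minimal sketch of what paragraph-mode extraction could look like. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies, and it works on an HTML string rather than a live URL. `ParagraphExtractor` and `extract_paragraphs` are hypothetical names, not part of this package.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph strings
        self._in_p = False    # True while inside a <p> element
        self._buffer = []     # text pieces of the current paragraph

    def _flush(self):
        # Join the collected pieces and normalise whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # also covers pages that never close <p>
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def extract_paragraphs(html_text):
    """Return the text of every paragraph in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html_text)
    parser.close()
    parser._flush()  # keep text from a trailing unclosed <p>
    return parser.paragraphs


sample = "<html><body><h1>Title</h1><p>First <b>bold</b> line.</p><p>Second\n line.</p></body></html>"
paragraphs = extract_paragraphs(sample)
print(paragraphs)
```

In a real scraper the HTML string would come from an HTTP response body, and the resulting list could be written to a text file.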
Image Download
html.img("https://example.com/gallery", img_div_class="gallery")
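As a rough illustration of the idea behind `img_div_class`, the sketch below collects image URLs found inside a `<div>` with a given class and resolves them against the page URL. It assumes the parameter names the CSS class of the container, which the description above does not state explicitly. `ImageCollector` and `collect_image_urls` are hypothetical names, and the actual download step is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Gather <img src> values inside a <div> with a given class (hypothetical helper)."""

    def __init__(self, base_url, div_class):
        super().__init__()
        self.base_url = base_url
        self.div_class = div_class
        self.urls = []
        self._div_depth = 0  # 0 = outside the target container

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Enter the container, or track a nested <div> once inside it.
            if self._div_depth or self.div_class in classes:
                self._div_depth += 1
        elif tag == "img" and self._div_depth and attrs.get("src"):
            # Resolve relative paths against the page URL.
            self.urls.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1


def collect_image_urls(html_text, page_url, div_class):
    """Return absolute image URLs found inside the matching container."""
    collector = ImageCollector(page_url, div_class)
    collector.feed(html_text)
    collector.close()
    return collector.urls


sample = """
<div class="header"><img src="/logo.png"></div>
<div class="gallery wide">
  <div class="item"><img src="a.jpg"></div>
  <img src="/img/b.png" alt="b">
</div>
<img src="outside.gif">
"""
urls = collect_image_urls(sample, "https://example.com/gallery", "gallery")
print(urls)
```

Each collected URL would then be fetched and written to disk in binary mode.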
Table Processing
html.table("https://example.com/data", turn=True, arrange='price')
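The meaning of `turn` and `arrange` is not documented above. The sketch below assumes only that `arrange` names a column to sort by, and makes no guess about `turn`. It parses a simple HTML table with the standard library and sorts the rows; `TableParser`, `table_to_records`, and `sort_records` are hypothetical names, not this package's API.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Read rows and cells from simple, non-nested HTML tables (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # list of rows, each a list of cell strings
        self._row = None   # cells of the row being read
        self._cell = None  # text pieces of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html_text):
    """Turn the table into a list of dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(html_text)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def sort_records(records, column):
    """Sort numerically when every value parses as a number, else as text."""
    try:
        return sorted(records, key=lambda rec: float(rec[column]))
    except ValueError:
        return sorted(records, key=lambda rec: rec[column])


sample = """
<table>
  <tr><th>item</th><th>price</th></tr>
  <tr><td>pen</td><td>12.5</td></tr>
  <tr><td>ink</td><td>3</td></tr>
  <tr><td>pad</td><td>7.25</td></tr>
</table>
"""
records = sort_records(table_to_records(sample), "price")
print([rec["item"] for rec in records])
```

With pandas and lxml installed, `pandas.read_html` followed by `DataFrame.sort_values` covers the same ground in two calls.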
Excel Operations
handle_excel(mode='merge') # Follow interactive prompts
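`handle_excel()` is interactive, so its prompts cannot be reproduced here. The three modes roughly correspond to well-known pandas operations, sketched below on in-memory tables. This is an illustration under that assumption, not the package's implementation; the function names and sample data are hypothetical.

```python
import pandas as pd


def merge_frames(frames):
    """'merge' mode: stack several tables with the same columns into one."""
    return pd.concat(frames, ignore_index=True)


def value_statistics(frame, column):
    """'statistics' mode: count how often each value occurs in a column."""
    return frame[column].value_counts()


def drop_duplicate_rows(frame):
    """'duplicate' mode: remove rows that repeat an earlier row exactly."""
    return frame.drop_duplicates().reset_index(drop=True)


# Stand-ins for sheets that would normally be loaded with pd.read_excel(),
# which needs an Excel engine such as openpyxl.
jan = pd.DataFrame({"city": ["Oslo", "Rome"], "sales": [10, 20]})
feb = pd.DataFrame({"city": ["Rome", "Oslo"], "sales": [20, 15]})

merged = merge_frames([jan, feb])
counts = value_statistics(merged, "city")
unique = drop_duplicate_rows(merged)
print(len(merged), counts["Rome"], len(unique))
```

Results can be written back with `DataFrame.to_excel()`, which likewise needs an Excel engine.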
Features
- User-Agent Spoofing: All requests include browser-like headers
- Pagination Support: Automatically follow "next page" links
- Flexible Content Handling: Works with various HTML structures
- Data Processing: Sort and clean extracted data
- Media Playback: Built-in preview for images, audio and video
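The first two features can be illustrated together. The sketch below attaches a browser-like `User-Agent` header to each request and follows "next" links until none remain. It uses only the standard library; the header value, the regular expression for the link, and the names `build_request`, `fetch`, and `crawl_pages` are assumptions for illustration, not taken from this package.

```python
import re
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Example browser-like header; any current browser string would do.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36",
}

# Matches <a href="...">next</a> or <a href="...">next page</a>, any case.
NEXT_LINK = re.compile(
    r'<a[^>]*\bhref="([^"]+)"[^>]*>\s*next(?: page)?\s*</a>',
    re.IGNORECASE,
)


def build_request(url):
    """Create a request that carries the browser-like headers."""
    return Request(url, headers=BROWSER_HEADERS)


def fetch(url):
    """Download a page and decode it as text."""
    with urlopen(build_request(url), timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_pages(start_url, fetch_page=fetch, max_pages=10):
    """Collect pages by following 'next' links, guarding against loops."""
    pages, seen, url = [], set(), start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html_text = fetch_page(url)
        pages.append(html_text)
        match = NEXT_LINK.search(html_text)
        url = urljoin(url, match.group(1)) if match else None
    return pages


# Offline demonstration: a dict stands in for the network.
fake_site = {
    "https://example.com/list": '<p>one</p><a href="/list?page=2">Next</a>',
    "https://example.com/list?page=2": '<p>two</p><a class="nav" href="list?page=3">next page</a>',
    "https://example.com/list?page=3": "<p>three</p>",
}
pages = crawl_pages("https://example.com/list", fetch_page=fake_site.__getitem__)
print(len(pages))
```

Passing the fetch function as a parameter keeps the pagination logic testable without network access, and `max_pages` together with the `seen` set stops the loop on sites whose "next" links cycle.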
Notes
- Use this tool responsibly and respect website terms of service
- Some methods may require additional error handling for production use
- Media playback features need optional dependencies (Pillow, audioplayer, moviepy)
License
This project is provided as-is without warranty. Users are responsible for complying with all applicable laws and website terms of service when using this tool.
File details
Details for the file requests_ss-0.2.0.tar.gz.
File metadata
- Download URL: requests_ss-0.2.0.tar.gz
- Upload date:
- Size: 4.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dcc16c4a3817723279fceaad7a2f5f44bbb3a7348e395edd0ef94142c97546f3 |
| MD5 | 2741277d13f8f578ba6f54e055cdc385 |
| BLAKE2b-256 | c1c05c33c769e6720fe268b2fc5e58fac6ba459d302531d4c2978a9df39dc3d6 |
File details
Details for the file requests_ss-0.2.0-py3-none-any.whl.
File metadata
- Download URL: requests_ss-0.2.0-py3-none-any.whl
- Upload date:
- Size: 4.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f8defcc9a3f7ff9fc7ed5e44c5940268fad6d95ad66a0970e8c0062d9e983733 |
| MD5 | 091c2a60dfd5203793d2629825dfefad |
| BLAKE2b-256 | c2e494ead4c1335854d701cb187028e9fe64f85e3708d084371ac1195ce6bee3 |