A comprehensive web content scraping tool for text, images, audio and video

Web Scraping and Data Processing Toolkit

Overview

This Python module provides a comprehensive set of tools for web scraping, data extraction, and basic data processing. It includes functionality for handling text, images, audio, video, and tabular data from web sources.

Installation

Make sure Python 3.6 or later is installed, then install the required dependencies:

pip install requests beautifulsoup4 lxml pandas Pillow audioplayer moviepy

Module Structure

1. html Class - Web Content Extraction

Methods:

  • txts(): Extract text content from web pages with pagination support
  • txt(): Basic text extraction from paragraphs or entire pages
  • img(): Download images from web pages
  • audio(): Extract audio files from web pages
  • table(): Extract and process HTML tables

2. run Class - Direct Content Download

Methods:

  • music(): Download audio files directly
  • video(): Download video content
  • txt(): Download and save text content
  • table(): Extract and process HTML tables
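
The direct-download methods listed above presumably fetch a remote file and write it to disk. A minimal sketch of that pattern with requests, assuming the file is streamed in fixed-size chunks; `save_chunks` and `download_file` are hypothetical names used for illustration and are not part of this package's API.

```python
import requests


def save_chunks(chunks, dest_path):
    """Write an iterable of byte chunks to dest_path; return bytes written."""
    written = 0
    with open(dest_path, "wb") as fh:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                fh.write(chunk)
                written += len(chunk)
    return written


def download_file(url, dest_path, timeout=30):
    """Stream a remote file to disk without loading it all into memory."""
    with requests.get(url, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()  # fail loudly on 4xx/5xx responses
        return save_chunks(resp.iter_content(chunk_size=8192), dest_path)


# Hypothetical usage (needs network access):
# download_file("https://example.com/track.mp3", "track.mp3")
```

Streaming keeps memory use flat for large audio and video files, which is why the write loop is separated from the request itself.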

3. show Class - Content Display

Methods:

  • txt(): Display text content from files
  • image(): Display downloaded images
  • music(): Play audio files
  • video(): Preview video files

4. Excel Utilities

  • handle_excel(): Provides three modes:
    • merge: Combine multiple Excel files
    • statistics: Generate value counts for specified data
    • duplicate: Remove duplicates from Excel data

Usage Examples

Basic Text Extraction

html.txt("https://example.com", mode='p', txt='output')
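
What a paragraph-mode extraction might do internally can be sketched with BeautifulSoup. This is an illustration under assumptions, not the package's implementation: it assumes `mode='p'` means "collect the text of every `<p>` element" and any other mode means "whole page", and `extract_text` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def extract_text(html_source, mode="p"):
    """Return text from an HTML string.

    mode="p" joins the text of every <p> element, one per line;
    any other value returns all visible text strings of the document.
    """
    soup = BeautifulSoup(html_source, "html.parser")
    if mode == "p":
        # Collapse internal whitespace so inline tags do not break words apart.
        parts = [" ".join(p.get_text().split()) for p in soup.find_all("p")]
        return "\n".join(part for part in parts if part)
    return "\n".join(soup.stripped_strings)


sample = (
    "<html><body><h1>Title</h1>"
    "<p>First.</p><p>Second <b>bold</b>.</p></body></html>"
)
print(extract_text(sample, mode="p"))
```

In a real run the HTML string would come from an HTTP response body, and the result could be written to a text file.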

Image Download

html.img("https://example.com/gallery", img_div_class="gallery")
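
The `img_div_class` argument suggests that image lookup can be limited to a container `<div>`. A rough sketch of that idea, assuming the argument names a CSS class; `find_image_urls` is a hypothetical helper, and only URL collection is shown here, not the download step.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_image_urls(html_source, page_url, div_class=None):
    """Return absolute image URLs, optionally limited to <div class=div_class>."""
    soup = BeautifulSoup(html_source, "html.parser")
    # Search only inside matching containers when a class is given.
    scopes = soup.find_all("div", class_=div_class) if div_class else [soup]
    urls = []
    for scope in scopes:
        for img in scope.find_all("img", src=True):
            absolute = urljoin(page_url, img["src"])  # resolve relative paths
            if absolute not in urls:  # keep order, drop repeats
                urls.append(absolute)
    return urls
```

Each returned URL could then be fetched and written to disk in binary mode.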

Table Processing

html.table("https://example.com/data", turn=True, arrange='price')
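
The meaning of `turn` and `arrange` is not documented here. The sketch below only illustrates the general idea of turning an HTML table into a pandas DataFrame and sorting it, assuming `arrange` names a column to sort by; `table_to_frame` and `sort_by` are hypothetical names, and the sketch expects the first row to be the header and every row to have the same number of cells.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_frame(html_source, sort_by=None):
    """Parse the first <table> in an HTML string into a DataFrame."""
    soup = BeautifulSoup(html_source, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found")
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    rows = [row for row in rows if row]  # drop rows without cells
    if not rows:
        raise ValueError("table has no rows")
    frame = pd.DataFrame(rows[1:], columns=rows[0])  # first row is the header
    if sort_by is not None:
        # Sort numerically when every value in the column parses as a number.
        numeric = pd.to_numeric(frame[sort_by], errors="coerce")
        if numeric.notna().all():
            frame[sort_by] = numeric
        frame = frame.sort_values(sort_by).reset_index(drop=True)
    return frame
```

Without the numeric conversion, a price column would sort as text and place "12" before "3.5".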

Excel Operations

handle_excel(mode='merge')  # Follow interactive prompts
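
The three modes map naturally onto common pandas operations. The sketch below shows one plausible equivalent for each, working on in-memory DataFrames rather than files; the function names are hypothetical and the package's interactive prompts are not reproduced.

```python
import pandas as pd


def merge_frames(frames):
    """'merge': stack several sheets on top of each other."""
    return pd.concat(frames, ignore_index=True)


def value_statistics(frame, column):
    """'statistics': count how often each value occurs in a column."""
    return frame[column].value_counts()


def drop_duplicate_rows(frame):
    """'duplicate': keep the first copy of each fully identical row."""
    return frame.drop_duplicates().reset_index(drop=True)


# Stand-ins for sheets that would normally be loaded with pd.read_excel(path).
first = pd.DataFrame({"city": ["Oslo", "Rome"], "sales": [1, 2]})
second = pd.DataFrame({"city": ["Rome", "Oslo"], "sales": [2, 5]})

merged = merge_frames([first, second])
print(value_statistics(merged, "city"))
print(drop_duplicate_rows(merged))
```

Reading and writing .xlsx files with pandas additionally requires an Excel engine such as openpyxl, which is not part of the dependency list above.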

Features

  • User-Agent Spoofing: All requests include browser-like headers
  • Pagination Support: Automatically follow "next page" links
  • Flexible Content Handling: Works with various HTML structures
  • Data Processing: Sort and clean extracted data
  • Media Playback: Built-in preview for images, audio and video
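
The first two features can be sketched as follows. This is an illustration under assumptions rather than the package's code: the User-Agent string is an arbitrary example, a "next page" link is assumed to be an `<a>` element that either has `rel="next"` or link text starting with "next", and all function names are hypothetical.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Example browser-like header; any realistic User-Agent string would do.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
}


def make_session():
    """Return a requests session that sends the headers above on every call."""
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    return session


def find_next_url(html_source, page_url):
    """Return the absolute URL of the 'next page' link, or None."""
    soup = BeautifulSoup(html_source, "html.parser")
    for link in soup.find_all("a", href=True):
        rel = link.get("rel") or []
        label = link.get_text(strip=True).lower()
        if "next" in rel or label.startswith("next"):
            return urljoin(page_url, link["href"])
    return None


def crawl_pages(start_url, fetch, max_pages=20):
    """Collect page bodies by following next links; fetch maps URL -> HTML."""
    pages, seen, url = [], set(), start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)  # guards against pagination loops
        html_source = fetch(url)
        pages.append(html_source)
        url = find_next_url(html_source, url)
    return pages


# Hypothetical usage (needs network access):
# session = make_session()
# pages = crawl_pages("https://example.com/list",
#                     lambda u: session.get(u, timeout=30).text)
```

Passing the fetch function in as an argument keeps the pagination logic testable without network access, and the `max_pages` cap prevents runaway crawls.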

Notes

  1. Use this tool responsibly and respect website terms of service.
  2. Some methods may need additional error handling before production use.
  3. Media playback features need the optional dependencies Pillow, audioplayer, and moviepy (already part of the installation command above).

License

This project is provided as-is without warranty. Users are responsible for complying with all applicable laws and website terms of service when using this tool.

Download files

Source Distribution

requests_ss-0.2.0.tar.gz (4.9 kB view details)

Uploaded Source

Built Distribution

requests_ss-0.2.0-py3-none-any.whl (4.9 kB view details)

Uploaded Python 3

File details

Details for the file requests_ss-0.2.0.tar.gz.

File metadata

  • Download URL: requests_ss-0.2.0.tar.gz
  • Upload date:
  • Size: 4.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for requests_ss-0.2.0.tar.gz
Algorithm    Hash digest
SHA256       dcc16c4a3817723279fceaad7a2f5f44bbb3a7348e395edd0ef94142c97546f3
MD5          2741277d13f8f578ba6f54e055cdc385
BLAKE2b-256  c1c05c33c769e6720fe268b2fc5e58fac6ba459d302531d4c2978a9df39dc3d6
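
A downloaded archive can be checked against the published SHA256 digest with the standard library. A minimal sketch, assuming the file has already been saved locally; the helper names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with a published hex string, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Hypothetical usage with the digest listed above:
# matches_published_hash(
#     "requests_ss-0.2.0.tar.gz",
#     "dcc16c4a3817723279fceaad7a2f5f44bbb3a7348e395edd0ef94142c97546f3",
# )
```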

File details

Details for the file requests_ss-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: requests_ss-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 4.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for requests_ss-0.2.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       f8defcc9a3f7ff9fc7ed5e44c5940268fad6d95ad66a0970e8c0062d9e983733
MD5          091c2a60dfd5203793d2629825dfefad
BLAKE2b-256  c2e494ead4c1335854d701cb187028e9fe64f85e3708d084371ac1195ce6bee3
