A comprehensive web content scraping tool for text, images, audio and video

Project description

Web Scraping and Data Processing Toolkit

Overview

This Python module provides a comprehensive set of tools for web scraping, data extraction, and basic data processing. It includes functionality for handling text, images, audio, video, and tabular data from web sources.

Installation

Ensure Python 3.6+ is installed, then install the dependencies (Pillow, audioplayer, and moviepy are only needed for the media playback features):

pip install requests beautifulsoup4 lxml pandas Pillow audioplayer moviepy
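
After installing, it can help to confirm which of these packages are actually importable. The following is a minimal sketch using only the standard library. The helper name `missing_dependencies` is hypothetical and not part of this package, and the mapping assumes the usual import names (`bs4` for beautifulsoup4, `PIL` for Pillow).

```python
from importlib.util import find_spec

# pip distribution name -> import name (these differ for some packages)
DEPENDENCIES = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
    "pandas": "pandas",
    "Pillow": "PIL",
    "audioplayer": "audioplayer",
    "moviepy": "moviepy",
}


def missing_dependencies(deps=DEPENDENCIES):
    """Return the pip names of the dependencies that cannot be imported."""
    return [name for name, module in deps.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All dependencies are importable.")
```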

Module Structure

1. html Class - Web Content Extraction

Methods:

  • txts(): Extract text content from web pages with pagination support
  • txt(): Basic text extraction from paragraphs or entire pages
  • img(): Download images from web pages
  • audio(): Extract audio files from web pages
  • table(): Extract and process HTML tables
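
To illustrate what paragraph-level text extraction involves, here is a minimal sketch that collects the text of every `<p>` element from an HTML string. It is not this module's implementation: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and the names `ParagraphText` and `extract_paragraphs` are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside = False  # True while the parser is inside a <p>
        self._buffer = []

    def _flush(self):
        # Join the collected pieces and normalise runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends when the next one starts
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)


def extract_paragraphs(markup):
    """Return the paragraph texts of an HTML document, in page order."""
    parser = ParagraphText()
    parser.feed(markup)
    parser.close()
    return parser.paragraphs
```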

2. run Class - Direct Content Download

Methods:

  • music(): Download audio files directly
  • video(): Download video content
  • txt(): Download and save text content
  • table(): Extract and process HTML tables

3. show Class - Content Display

Methods:

  • txt(): Display text content from files
  • image(): Display downloaded images
  • music(): Play audio files
  • video(): Preview video files

4. Excel Utilities

  • handle_excel(): Provides three modes:
    • merge: Combine multiple Excel files
    • statistics: Generate value counts for specified data
    • duplicate: Remove duplicates from Excel data
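
The three modes can be illustrated on in-memory rows. This is a sketch of the operations only, not of how `handle_excel()` is implemented: it uses plain Python lists of dictionaries so that it runs without spreadsheet files, and the function names and sample data are hypothetical. With pandas, the corresponding operations would typically be `pandas.concat`, `Series.value_counts`, and `DataFrame.drop_duplicates`.

```python
from collections import Counter


def merge(*tables):
    """Concatenate several tables (lists of row dicts) into one."""
    return [row for table in tables for row in table]


def statistics(rows, column):
    """Count how often each value occurs in one column."""
    return Counter(row[column] for row in rows)


def duplicate(rows):
    """Drop repeated rows, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample data standing in for two Excel sheets.
january = [{"item": "pen", "price": 2}, {"item": "ink", "price": 5}]
february = [{"item": "pen", "price": 2}, {"item": "pad", "price": 3}]

combined = merge(january, february)
print(len(combined))                         # 4
print(statistics(combined, "item")["pen"])   # 2
print(len(duplicate(combined)))              # 3
```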

Usage Examples

Basic Text Extraction

html.txt("https://example.com", mode='p', txt='output')

Image Download

html.img("https://example.com/gallery", img_div_class="gallery")
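
The `img_div_class` argument suggests that images are looked up inside a `<div>` with the given class; that reading is an assumption. A minimal sketch of that idea, using only the standard library and hypothetical names (`ImageLinks`, `image_urls`), collects absolute image URLs from such a container without downloading anything:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinks(HTMLParser):
    """Collect <img src> URLs found inside a <div> with a given class."""

    def __init__(self, base_url, div_class):
        super().__init__()
        self.base_url = base_url
        self.div_class = div_class
        self.urls = []
        self._div_depth = 0  # > 0 while inside the target <div> (counts nesting)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self._div_depth or self.div_class in classes:
                self._div_depth += 1
        elif tag == "img" and self._div_depth and attrs.get("src"):
            # Resolve relative paths against the page URL.
            self.urls.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1


def image_urls(markup, base_url, div_class):
    """Return absolute image URLs inside <div class="div_class">."""
    parser = ImageLinks(base_url, div_class)
    parser.feed(markup)
    parser.close()
    return parser.urls
```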

Table Processing

html.table("https://example.com/data", turn=True, arrange='price')
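
As an illustration of table processing, the sketch below parses the first-row header and the body rows of an HTML table and sorts the rows by one column, which is what `arrange='price'` appears to request. That interpretation is an assumption, and the meaning of `turn` is not covered here. The code uses the standard library only, expects a header row and a numeric sort column, and the names `TableRows` and `table_sorted_by` are hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect table rows as lists of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_sorted_by(markup, column):
    """Return the body rows as dicts, sorted numerically by `column`."""
    parser = TableRows()
    parser.feed(markup)
    parser.close()
    header, *body = parser.rows  # assumes the first row holds the column names
    index = header.index(column)
    body.sort(key=lambda row: float(row[index]))
    return [dict(zip(header, row)) for row in body]
```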

Excel Operations

handle_excel(mode='merge')  # Follow interactive prompts

Features

  • User-Agent Spoofing: All requests include browser-like headers
  • Pagination Support: Automatically follow "next page" links
  • Flexible Content Handling: Works with various HTML structures
  • Data Processing: Sort and clean extracted data
  • Media Playback: Built-in preview for images, audio and video
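
The first two features can be sketched together. The example below sends a browser-like `User-Agent` header and follows links labelled "next page" until none is left, a page repeats, or a page limit is reached. It is a sketch under assumptions, not this module's code: it uses `urllib` from the standard library instead of requests, the header value, the link text, and the names `fetch` and `crawl` are hypothetical, and the regular expression is a simplification that a real HTML parser would handle more robustly.

```python
import re
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Hypothetical browser-like header; any realistic User-Agent string would do.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/0.1"}

# Matches <a ... href="...">next page</a>, ignoring case.
NEXT_LINK = re.compile(
    r'<a\b[^>]*\bhref="([^"]+)"[^>]*>\s*next page\s*</a>', re.IGNORECASE
)


def fetch(url):
    """Download a page, sending the browser-like User-Agent header."""
    request = Request(url, headers=HEADERS)
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch, max_pages=20):
    """Yield (url, html) for each page, following "next page" links."""
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch(url)
        yield url, page
        match = NEXT_LINK.search(page)
        # Resolve relative links against the current page; stop if none found.
        url = urljoin(url, match.group(1)) if match else None
```

Passing `fetch` as a parameter keeps the pagination logic separate from the network call, so the loop can be exercised with canned pages.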

Notes

  1. Use this tool responsibly and respect website terms of service
  2. Some methods may require additional error handling for production use
  3. Media playback features need optional dependencies (Pillow, audioplayer, moviepy)
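
As one possible form of the additional error handling mentioned in note 2, a call can be wrapped in a small retry helper. This is a sketch, not part of the module: the name `with_retries` and its defaults are hypothetical. Network failures from both `urllib` and requests derive from `OSError`, which is why it is the default here; whether the module's methods raise or swallow such errors is not stated, so this only helps for calls that raise.

```python
import time


def with_retries(action, attempts=3, delay=0.5, exceptions=(OSError,)):
    """Call action(); on a listed exception wait and retry.

    The exception from the final attempt is re-raised to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # wait a little longer after each failure
```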

License

This project is provided as-is without warranty. Users are responsible for complying with all applicable laws and website terms of service when using this tool.

Download files

Source Distribution

requests_ss-0.1.8.tar.gz (4.8 kB view details)

Uploaded Source

Built Distribution

requests_ss-0.1.8-py3-none-any.whl (6.6 kB view details)

Uploaded Python 3

File details

Details for the file requests_ss-0.1.8.tar.gz.

File metadata

  • Download URL: requests_ss-0.1.8.tar.gz
  • Upload date:
  • Size: 4.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for requests_ss-0.1.8.tar.gz
Algorithm Hash digest
SHA256 bd749ca587215a57165537692a6c1341d21c0d3e55afc6a113d2749ff04fb899
MD5 c4d41cd86656ec31fc213017b4f05b7c
BLAKE2b-256 2de6293be73f2cd3c3323d317559fc77f0d2ac2757f34a300c830408219edce1

File details

Details for the file requests_ss-0.1.8-py3-none-any.whl.

File metadata

  • Download URL: requests_ss-0.1.8-py3-none-any.whl
  • Upload date:
  • Size: 6.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for requests_ss-0.1.8-py3-none-any.whl
Algorithm Hash digest
SHA256 b4c25d071ee484063387fd3309d0c0f80061c96be4fc22c78d4d429a3869f96c
MD5 be2b1fb23193e64874998aa6bc9d34ff
BLAKE2b-256 de46bef1da9bf0f268610124060153509b5a63a7b5cdf61e6333d301f0b239d7
