A comprehensive web content scraping tool for text, images, audio and video

Web Content Scraper - README

Overview

The Web Content Scraper is a comprehensive Python tool for extracting and processing various types of web content including text, images, audio, video, and tabular data. The package is organized into three main classes with distinct functionalities:

  • html: Web content extraction and parsing
  • run: Direct content downloading
  • show: Content display and playback

Features

1. html Class

Text Extraction

html.txt(mode, url, content_class, next_page_class, content_tag='div', next_page_tag='div', base_url='https:/', link_index=None)
  • mode: Extraction mode ('br' or 'p')
  • url: Starting URL
  • content_class: Class of content container
  • next_page_class: Class of next page link container
  • content_tag: HTML tag of content (default 'div')
  • next_page_tag: HTML tag containing next page link (default 'div')
  • base_url: Base URL for relative links (default 'https:/')
  • link_index: Index of tag if multiple exist (default None)
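To make the pagination parameters concrete, here is a minimal standard-library sketch of how a "next page" link might be located and resolved against base_url. It is not the package's implementation: the names NextLinkFinder and next_page_url are hypothetical, the package itself lists beautifulsoup4 rather than html.parser as a dependency, and treating link_index as the index of the link inside the container is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Collect href values of <a> tags nested inside a given container."""

    def __init__(self, container_tag, container_class):
        super().__init__()
        self.container_tag = container_tag
        self.container_class = container_class
        self.depth = 0  # greater than 0 while inside a matching container
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.depth:
            # Track nested tags of the same name so we know when the container ends.
            if tag == self.container_tag:
                self.depth += 1
            if tag == "a" and attrs.get("href"):
                self.links.append(attrs["href"])
        elif tag == self.container_tag and self.container_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.container_tag:
            self.depth -= 1


def next_page_url(page_html, next_page_class, next_page_tag="div",
                  base_url="https://example.com", link_index=None):
    """Return the absolute URL of the next page, or None if no link is found."""
    finder = NextLinkFinder(next_page_tag, next_page_class)
    finder.feed(page_html)
    index = link_index or 0  # assumption: None means "first link"
    if index >= len(finder.links):
        return None
    # urljoin resolves relative hrefs such as "/page2" against the base URL.
    return urljoin(base_url, finder.links[index])
```

A scraper could call such a helper in a loop, fetching each returned URL until it returns None.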

Image Downloading

html.img(url, container_class=None, url_prefix=None)
  • url: Target webpage URL
  • container_class: Class of image container (optional)
  • url_prefix: URL prefix for relative image paths (optional)
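The role of url_prefix can be illustrated with a short sketch that collects image sources from a page and prepends the prefix to relative paths only. The helper names ImageSrcCollector and image_urls are hypothetical, the sketch ignores container_class for brevity, and it uses only the standard library rather than the package's own parsing code.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(page_html, url_prefix=None):
    """Return image URLs, prefixing relative paths when url_prefix is given."""
    collector = ImageSrcCollector()
    collector.feed(page_html)
    urls = []
    for src in collector.sources:
        is_absolute = bool(urlparse(src).scheme)  # e.g. "https://..."
        if url_prefix and not is_absolute:
            src = url_prefix.rstrip("/") + "/" + src.lstrip("/")
        urls.append(src)
    return urls
```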

Audio Downloading

html.audio(url, container_class=None)
  • url: Target webpage URL
  • container_class: Class of audio container (optional)

Table Extraction

html.table(url, sort_order=None, sort_column='')
  • url: Webpage URL containing table
  • sort_order: None for no sorting, True for ascending, False for descending
  • sort_column: Column name to sort by
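The three-valued sort_order convention can be sketched in plain Python as follows. The function sort_rows is a hypothetical illustration that works on a list of dictionaries; the package lists pandas as a dependency and may well sort a DataFrame instead.

```python
def sort_rows(rows, sort_order=None, sort_column=""):
    """Sort table rows following the None/True/False convention.

    None  -> keep the original order
    True  -> ascending by sort_column
    False -> descending by sort_column
    """
    if sort_order is None or not sort_column:
        return list(rows)
    return sorted(rows, key=lambda row: row[sort_column], reverse=not sort_order)
```

Returning a new list in every case keeps the caller's data untouched, which makes the "no sort" branch behave consistently with the sorted ones.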

2. run Class

Direct Downloads

run.music(url, output_name='1')
run.video(url, output_name='1', url_prefix=None)
run.txt(url, output_name='1')
run.table(url, sort_order=None, sort_column='')
  • url: Direct media URL
  • output_name: Output filename (without extension)
  • url_prefix: URL prefix for video fragments (optional)
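If the video fragments come from an HLS-style playlist (an assumption; this README does not say how fragments are discovered), url_prefix would be needed because such playlists often list fragment paths relative to the playlist location. The hypothetical helper below shows how fragment URLs could be assembled from playlist text.

```python
from urllib.parse import urlparse


def fragment_urls(playlist_text, url_prefix=None):
    """Return the fragment URLs listed in an M3U8-style playlist.

    Lines starting with '#' are tags or comments; every other non-empty
    line names one media fragment.
    """
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Only relative entries need the prefix; absolute URLs are kept as-is.
        if url_prefix and not urlparse(line).scheme:
            line = url_prefix.rstrip("/") + "/" + line.lstrip("/")
        urls.append(line)
    return urls
```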

3. show Class

Content Display

show.txt(mode, filename, start=1, end=1)
show.image(filename)
show.music(filename)
show.video(filename)
  • mode: Display mode ('连续', meaning "continuous", for a sequence of files, or '单个', meaning "single", for one file)
  • filename: Base filename (without extension)
  • start: First file in sequence
  • end: Last file in sequence

Excel Handling Functions

handle_excel(mode='merge')
  • mode: Operation mode ('merge', 'statistics', or 'duplicate')

Dependencies

  • Core:
    • requests>=2.25.0
    • beautifulsoup4>=4.9.0
    • lxml>=4.6.0
    • Pillow>=8.0.0
  • Media:
    • audioplayer>=0.7
    • moviepy>=1.0.0
  • Data:
    • pandas>=1.2.0
    • openpyxl>=3.0.0

Usage Examples

Text Extraction

# Extract text from multiple pages
html.txt('br', 'https://example.com/page1', 'article-content', 'pagination', 'div', 'nav', 'https://example.com', 0)

Image Download

# Download all images from a gallery
html.img('https://example.com/gallery', 'gallery-container', 'https://cdn.example.com')

Play Downloaded Content

# Play the first downloaded audio file
show.music('1')

Notes

  1. Always check a website's terms of service before scraping it.
  2. Consider adding delays between requests to avoid overloading servers.
  3. The User-Agent header mimics the Chrome browser to reduce the chance of being blocked.
  4. Error handling is basic; consider adding more specific exception handling.
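Notes 2 and 4 can be combined in a small wrapper around whatever function performs the request. The sketch below is one possible approach, not part of the package: fetch and fetch_all are hypothetical names, the User-Agent string is a placeholder, and urllib is used instead of requests so the example depends only on the standard library.

```python
import time
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Download a URL and return its body as bytes."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "Mozilla/5.0 (placeholder)"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=fetch, delay=1.0):
    """Fetch URLs one by one, pausing between requests.

    Failed URLs map to None instead of aborting the whole run.
    """
    results = {}
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)  # be polite: wait before every request but the first
        try:
            results[url] = fetch(url)
        except urllib.error.HTTPError as exc:
            # HTTPError subclasses URLError, so it must be caught first.
            print(f"{url}: HTTP {exc.code}")
            results[url] = None
        except (urllib.error.URLError, TimeoutError) as exc:
            print(f"{url}: {exc}")
            results[url] = None
    return results
```

Passing the fetch function as a parameter keeps the delay and error handling separate from the transport, so the same loop could wrap a requests-based downloader.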

For more detailed examples, see the examples/ directory in the package.

Download files

Source Distribution

requests_ss-0.1.6.tar.gz (5.1 kB)

Uploaded Source

Built Distribution

requests_ss-0.1.6-py3-none-any.whl (6.8 kB)

Uploaded Python 3

File details

Details for the file requests_ss-0.1.6.tar.gz.

File metadata

  • Download URL: requests_ss-0.1.6.tar.gz
  • Size: 5.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for requests_ss-0.1.6.tar.gz
Algorithm     Hash digest
SHA256        495aee6faec03afac3fe377a019931d18a68002dc25cdb023f01df7ac171cd27
MD5           ad6dab6d85a520aede07ca3408fd2309
BLAKE2b-256   9015dcf6c1d9d1b1078293bb1e4b80da817f84f3316104880463c15fce121896

File details

Details for the file requests_ss-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: requests_ss-0.1.6-py3-none-any.whl
  • Size: 6.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for requests_ss-0.1.6-py3-none-any.whl
Algorithm     Hash digest
SHA256        018029a43c13ec0482bfd7242ba269c168ec706290cd3f57b57782a77a9c0bf0
MD5           08eedff2b918d90cda0b21b021fc6647
BLAKE2b-256   0ab195e14850dcb4b5a490d4d6cda9f1f49d78a5c16602d368c33842b647d7dc
