A comprehensive web content scraping tool for text, images, audio and video

HTML Content Scraper - README

Overview

This Python script provides a set of tools for scraping and processing different types of content from web pages, including text, images, audio, and video. It is organized into three main classes: html, run, and show, each serving a different purpose in the scraping and display process.

Features

1. html Class

  • Text Extraction: Extracts text content from HTML elements (div, p, etc.) and saves it to text files.

  • Image Downloading: Downloads images from specified HTML elements or all images on a page.

  • Audio Downloading: Extracts and downloads audio files from HTML audio elements.

2. run Class

  • Direct Downloads: Downloads media files (music, video, text) directly from provided URLs.

  • Video Processing: Handles video content with optional URL prefixing.

3. show Class

  • Content Display: Displays downloaded content, including text files, images, audio, and video.

Usage

Text Extraction

python

html.txt('br', url, class_name, next_page_class, tag='div', next_tag='div', base_url='https:/', index=None)

  • 'br' or 'p': Specifies the extraction mode.

  • url: The starting URL to scrape.

  • class_name: The class of the element containing the text.

  • next_page_class: The class of the element containing the link to the next page.
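
As a rough illustration of the extraction step described above, the sketch below collects the text inside the first element with a given class, starting a new line at each br and p tag. It uses only the standard library so that it runs without the dependencies listed later; the package itself relies on requests and BeautifulSoup, and its actual 'br' and 'p' modes may behave differently. The names ClassTextExtractor and extract_text are hypothetical and not part of the package.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "source"}


class ClassTextExtractor(HTMLParser):
    """Collect text inside the first <tag class="class_name"> element (hypothetical helper)."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside the target element
        self.done = False   # set once the target element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag == self.tag and self.class_name in classes:
                self.depth = 1
            return
        if tag in ("br", "p"):
            self.parts.append("\n")  # treat <br> and <p> as line breaks
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth == 0 or self.done or tag in VOID_TAGS:
            return
        if tag == "p":
            self.parts.append("\n")
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0 and not self.done:
            self.parts.append(data)


def extract_text(page_html, class_name, tag="div"):
    """Return the non-empty text lines found inside the matching element."""
    parser = ClassTextExtractor(tag, class_name)
    parser.feed(page_html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).splitlines()]
    return "\n".join(line for line in lines if line)


sample = (
    '<div class="nav">skip</div>'
    '<div class="content">Line one<br>Line two<p>Para</p></div>'
    "<div>after</div>"
)
print(extract_text(sample, "content"))
```

Following the next-page link would repeat the same step on the URL found in the element identified by next_page_class; that part is left out here because the README does not describe how the link is resolved.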

Image Downloading

python

html.img(url, container_class=None, prefix=None)

  • url: The URL of the page containing images.

  • container_class: Optional. The class of the container element holding the images.

  • prefix: Optional. A URL prefix to prepend to image sources.
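
Image src attributes are often relative, which is presumably why a prefix can be supplied. The sketch below shows one way such a value could be turned into a full URL: prepend the prefix when one is given, otherwise resolve the src against the page URL. The function name resolve_src is hypothetical, and the package may combine prefix and src differently.

```python
from urllib.parse import urljoin


def resolve_src(src, page_url, prefix=None):
    """Build an absolute URL for an image `src` (hypothetical helper)."""
    if prefix is not None:
        # Assumption: the prefix is simply prepended, with exactly one slash between.
        return prefix.rstrip("/") + "/" + src.lstrip("/")
    # Without a prefix, fall back to standard relative-URL resolution.
    return urljoin(page_url, src)


print(resolve_src("/img/a.png", "https://example.com/gallery"))
print(resolve_src("/img/a.png", "https://example.com/gallery",
                  prefix="https://cdn.example.com"))
```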

Audio Downloading

python

html.audio(url, container_class=None)

  • url: The URL of the page containing audio files.

  • container_class: Optional. The class of the container element holding the audio files.
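
To make the audio step more concrete, here is a minimal standard-library sketch of how audio URLs might be located in a page: it reads the src attribute of each audio element and of any source element nested inside one. AudioSrcCollector and find_audio_sources are hypothetical names, the container_class filter is omitted, and the package's own lookup (built on BeautifulSoup) may differ.

```python
from html.parser import HTMLParser


class AudioSrcCollector(HTMLParser):
    """Collect src values from <audio> elements and their <source> children."""

    def __init__(self):
        super().__init__()
        self.in_audio = False
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "audio":
            self.in_audio = True
            if attributes.get("src"):
                self.sources.append(attributes["src"])
        elif tag == "source" and self.in_audio and attributes.get("src"):
            self.sources.append(attributes["src"])

    def handle_endtag(self, tag):
        if tag == "audio":
            self.in_audio = False


def find_audio_sources(page_html):
    """Return audio source URLs in document order."""
    collector = AudioSrcCollector()
    collector.feed(page_html)
    collector.close()
    return collector.sources


page = (
    '<audio src="a.mp3"></audio>'
    '<audio controls><source src="b.ogg" type="audio/ogg"></audio>'
    '<video><source src="c.mp4"></video>'
)
print(find_audio_sources(page))
```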

Direct Media Download

python

run.music(url, output_name='1')               # For audio
run.video(url, output_name='1', prefix=None)  # For video
run.txt(url, output_name='1')                 # For text
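
A direct download of this kind usually amounts to streaming the response body into a file. The sketch below shows that pattern with the standard library only. The helper names save_stream and download, the extension parameter, and the '<output_name>.<extension>' naming are assumptions for illustration; the README does not state how the package names its output files.

```python
import shutil
import urllib.request


def save_stream(stream, path, chunk_size=64 * 1024):
    """Copy a binary file-like object to `path` in fixed-size chunks."""
    with open(path, "wb") as out_file:
        shutil.copyfileobj(stream, out_file, chunk_size)
    return path


def download(url, output_name="1", extension="mp3"):
    """Fetch `url` and save it as '<output_name>.<extension>' (assumed naming)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return save_stream(response, f"{output_name}.{extension}")
```

Copying in chunks keeps memory use flat for large video files, which matters more for run.video than for run.txt.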

Display Content

python

show.txt('连续', txt, start=1, end=1)  # Display multiple text files ('连续' means "continuous")
show.txt('单个', txt)  # Display a single text file ('单个' means "single")
show.image(img)  # Display an image
show.music(mp3)  # Play audio
show.video(mp4)  # Play video

Dependencies

  • requests: For making HTTP requests.

  • bs4 (BeautifulSoup): For parsing HTML.

  • lxml: As a parser for BeautifulSoup.

  • PIL (Pillow): For image display.

  • audioplayer: For audio playback.

  • moviepy: For video playback.
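
One quick way to confirm that these packages are installed is to check whether each module can be located, without actually importing it. The sketch below does this with importlib; the function name missing_modules is hypothetical, and the module names are the import names given in the list above.

```python
from importlib.util import find_spec

# Import names for the dependencies listed above (Pillow is imported as PIL).
REQUIRED_MODULES = ["requests", "bs4", "lxml", "PIL", "audioplayer", "moviepy"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```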

Notes

  • Ensure all dependencies are installed before running the script.

  • The script includes error handling with basic try-except blocks.

  • User-Agent is set to mimic a Chrome browser to avoid blocking by some websites.
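
The last two notes can be illustrated together: a request that carries a browser-like User-Agent header, wrapped in a basic try-except so that a failed fetch returns None instead of raising. This is a standard-library sketch, not the package's code. The User-Agent string is a hypothetical Chrome-like value (the README does not show the one the package sends), and build_request and fetch are illustrative names.

```python
import urllib.error
import urllib.request

# Hypothetical Chrome-like User-Agent; the package's actual value is not shown in this README.
CHROME_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_request(url):
    """Create a request object that sends the browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": CHROME_USER_AGENT})


def fetch(url, timeout=30):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, OSError) as exc:
        print(f"Request failed for {url}: {exc}")
        return None
```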

Example

python

# Download images from a webpage
html.img('https://example.com/gallery', 'image-container')

# Display the first downloaded image
show.image('1')

This tool is useful for scraping and organizing content from web pages efficiently. Adjust parameters as needed for specific use cases.
