A comprehensive web content scraping tool for text, images, audio and video

HTML Content Scraper - README

Overview

This Python script provides a set of tools for scraping and processing different types of content from web pages, including text, images, audio, and video. It is organized into three main classes, html, run, and show, each serving a different purpose in the scraping and display process.

Features

1. html Class

  • Text Extraction: Extracts text content from HTML elements (div, p, etc.) and saves it to text files.

  • Image Downloading: Downloads images from specified HTML elements or all images on a page.

  • Audio Downloading: Extracts and downloads audio files from HTML audio elements.

2. run Class

  • Direct Downloads: Downloads media files (music, video, text) directly from provided URLs.

  • Video Processing: Handles video content with optional URL prefixing.

3. show Class

  • Content Display: Displays downloaded content, including text files, images, audio, and video.

Usage

Text Extraction

python

html.txt('br', url, class_name, next_page_class, tag='div', next_tag='div', base_url='https:/', index=None)

  • 'br' or 'p': Specifies the extraction mode.

  • url: The starting URL to scrape.

  • class_name: The class of the element containing the text.

  • next_page_class: The class of the element containing the link to the next page.
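The extract-then-follow pattern described by these parameters can be sketched as below. This is only an illustration, not the package's implementation: it uses the standard library instead of requests and BeautifulSoup so it runs without extra installs, and the names `PageParser` and `extract_page` are hypothetical. It also assumes that the 'br' mode refers to text whose lines are separated by `<br>` tags, which the README does not state explicitly.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "source"}


class PageParser(HTMLParser):
    """Collect text inside elements with `text_class` and the first link
    found inside an element with `next_class` (hypothetical helper)."""

    def __init__(self, text_class, next_class):
        super().__init__()
        self.text_class = text_class
        self.next_class = next_class
        self.text_depth = 0   # > 0 while inside the text container
        self.next_depth = 0   # > 0 while inside the next-page container
        self.chunks = []
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "br":
            if self.text_depth:
                self.chunks.append("\n")  # a <br> becomes a line break
            return
        if tag in VOID_TAGS:
            return
        if self.text_depth:
            self.text_depth += 1
        elif self.text_class in classes:
            self.text_depth = 1
        if self.next_depth:
            self.next_depth += 1
        elif self.next_class in classes:
            self.next_depth = 1
        if self.next_depth and tag == "a" and self.next_href is None:
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.text_depth:
            self.text_depth -= 1
        if self.next_depth:
            self.next_depth -= 1

    def handle_data(self, data):
        if self.text_depth:
            self.chunks.append(data)


def extract_page(html_text, page_url, text_class, next_class):
    """Return (text, next_url); next_url is None when no link is found."""
    parser = PageParser(text_class, next_class)
    parser.feed(html_text)
    raw = "".join(parser.chunks)
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    next_url = urljoin(page_url, parser.next_href) if parser.next_href else None
    return "\n".join(lines), next_url


sample = (
    '<div class="content">Line one<br>Line two<br/>Line three</div>'
    '<div class="pager"><a href="/ch2.html">Next</a></div>'
)
text, next_url = extract_page(sample, "https://example.com/ch1.html", "content", "pager")
print(text)
print(next_url)
```

A scraper built this way would call `extract_page` in a loop, saving each page's text and stopping when `next_url` is None.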

Image Downloading

python

html.img(url, container_class=None, prefix=None)

  • url: The URL of the page containing images.

  • container_class: Optional. The class of the container element holding the images.

  • prefix: Optional. A URL prefix to prepend to image sources.
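How `container_class` and `prefix` might interact can be sketched as follows. The helper names `ImageCollector` and `image_urls` are hypothetical, the parsing uses the standard library rather than BeautifulSoup, and the sketch assumes that `prefix` is simply concatenated in front of each `src` value while relative sources are otherwise resolved against the page URL. The package itself may behave differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "source"}


class ImageCollector(HTMLParser):
    """Collect <img> sources, optionally only inside a container class."""

    def __init__(self, container_class=None):
        super().__init__()
        self.container_class = container_class
        self.depth = 0  # > 0 while inside the container element
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            inside = self.container_class is None or self.depth > 0
            if inside and attrs.get("src"):
                self.sources.append(attrs["src"])
            return
        if tag in VOID_TAGS:
            return
        classes = (attrs.get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.container_class and self.container_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1


def image_urls(html_text, page_url, container_class=None, prefix=None):
    """Return download URLs for the images found in html_text."""
    collector = ImageCollector(container_class)
    collector.feed(html_text)
    if prefix is not None:
        # Assumption: the prefix is prepended verbatim to each source.
        return [prefix + src for src in collector.sources]
    return [urljoin(page_url, src) for src in collector.sources]


page = (
    '<img src="/logo.png">'
    '<div class="gallery"><p><img src="/a.jpg"></p><img src="b.jpg"/></div>'
)
print(image_urls(page, "https://example.com/g/", "gallery"))
```

Omitting `container_class` collects every image on the page, which matches the "all images on a page" behavior listed under Features.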

Audio Downloading

python

html.audio(url, container_class=None)

  • url: The URL of the page containing audio files.

  • container_class: Optional. The class of the container element holding the audio files.

Direct Media Download

python

run.music(url, output_name='1')  # For audio
run.video(url, output_name='1', prefix=None)  # For video
run.txt(url, output_name='1')  # For text
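A direct download of this kind amounts to streaming the response body into a file named after `output_name`. The sketch below illustrates the idea with the standard library. The function name `download`, the `suffix` and `folder` parameters, and the resulting file naming are assumptions for illustration and are not taken from the package.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, output_name="1", suffix=".mp3", folder="."):
    """Stream the resource at `url` to <folder>/<output_name><suffix>."""
    target = Path(folder) / f"{output_name}{suffix}"
    with urllib.request.urlopen(url, timeout=30) as response, open(target, "wb") as out:
        # Copy in chunks so large media files are never held fully in memory.
        shutil.copyfileobj(response, out)
    return target
```

Calling `download('https://example.com/song.mp3', output_name='1')` would write `1.mp3` to the current directory under these assumptions.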

Display Content

python

show.txt('连续', txt, start=1, end=1)  # '连续' ("continuous"): display multiple text files
show.txt('单个', txt)  # '单个' ("single"): display a single text file
show.image(img)  # Display an image
show.music(mp3)  # Play audio
show.video(mp4)  # Play video

Dependencies

  • requests: For making HTTP requests.

  • bs4 (BeautifulSoup): For parsing HTML.

  • lxml: As a parser for BeautifulSoup.

  • PIL (Pillow): For image display.

  • audioplayer: For audio playback.

  • moviepy: For video playback.
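One way to confirm the dependencies are present before running the script is to probe for each module by import name. The helper below is a hypothetical sketch and not part of the package. The mapping pairs each import name with the name it is published under on PyPI (for example, `bs4` is installed as `beautifulsoup4` and `PIL` as `Pillow`).

```python
from importlib.util import find_spec

# Import name -> name to pass to pip.
REQUIRED = {
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "lxml": "lxml",
    "PIL": "Pillow",
    "audioplayer": "audioplayer",
    "moviepy": "moviepy",
}


def missing_packages(modules=REQUIRED):
    """Return the pip names of modules that cannot be imported."""
    return [pip_name for module, pip_name in modules.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All dependencies are available.")
```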

Notes

  • Ensure all dependencies are installed before running the script.

  • The script includes error handling with basic try-except blocks.

  • User-Agent is set to mimic a Chrome browser to avoid blocking by some websites.
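The last two notes can be illustrated together: a request that carries a browser-like User-Agent header and is wrapped in a basic try-except block. This is a sketch using the standard library rather than requests. The function name `fetch` and the exact User-Agent string are assumptions, and the package's own header and error handling may differ.

```python
import urllib.request

# An example Chrome-style User-Agent string (illustrative value).
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
}


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        request = urllib.request.Request(url, headers=HEADERS)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError) as error:
        # URLError and timeouts are OSError subclasses; a malformed URL
        # raises ValueError when the Request is built.
        print(f"Request failed for {url}: {error}")
        return None
```

Returning None instead of raising keeps a multi-page scrape running when a single page fails, at the cost of the caller having to check the result.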

Example

python

# Download images from a webpage
html.img('https://example.com/gallery', 'image-container')

# Display the first downloaded image
show.image('1')

This tool is useful for scraping and organizing content from web pages efficiently. Adjust parameters as needed for specific use cases.


Download files

Source Distribution

requests_ss-0.1.5.tar.gz (5.0 kB view details)

Uploaded Source

Built Distribution

requests_ss-0.1.5-py3-none-any.whl (6.7 kB view details)

Uploaded Python 3

File details

Details for the file requests_ss-0.1.5.tar.gz.

File metadata

  • Download URL: requests_ss-0.1.5.tar.gz
  • Upload date:
  • Size: 5.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for requests_ss-0.1.5.tar.gz
Algorithm Hash digest
SHA256 c462d87e1ea4c80014fa6c9b7752cebe318ff27c693612a98cce72f81726b145
MD5 9c93ab9edc4c2b7a72080be1c48a19b7
BLAKE2b-256 30ab26c6596ef6fe96a9acdc055f3b01bd70941edc06d7bdc38c1d65cbe580af

File details

Details for the file requests_ss-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: requests_ss-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 6.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for requests_ss-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 33e2327c9aad894fbd40e5b2beeef6e365339ce4a6048a538a0cbe80a6093005
MD5 d1af5eef381483e87df40a41a131c449
BLAKE2b-256 546e9a550a22b400dc87b629d8b430878b5efd236e05f3bb17a2e8f3b7c7f84d
