A fast, high-level presentation scraper for Python and the command line

PPTXER: A fast high-level Presentations Scraper for Python and Command Line

This project makes it as easy as possible to scrape presentations (pptx files) from the internet and extract their text (slide body and notes). It can be used from Python or from the command line.

Installation

pip install pptxer

Verify the installation by running:

pptxer --version
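
The same check can be done from Python. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is made up for illustration and is not part of pptxer.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("pptxer")
    print(f"pptxer version: {found}" if found else "pptxer is not installed")
```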

Downloader

This mode scrapes search engines for presentations that contain specific keywords and downloads them to a directory. The text in these files is extracted automatically unless otherwise specified.

  • To download pptx files that contain "COVID-19 Safety" and "Contagious diseases" to directory test_dir
    • Python
    from pptxer.presentations_downloader import scrape_presentations_to_dir
    
    search_keywords = ["COVID-19 Safety", "Contagious diseases"]
    # If download_dir_path is omitted, a directory named after the search keywords, separated by "_", will be created
    paths_to_files = scrape_presentations_to_dir(search_keywords, download_dir_path="test_dir")
    # For this example, a directory with name "test_dir" will be created, and files will be written to it
    
    • Command line
      # This will download presentations to test_dir and extract their text to a JSON file
      pptxer download "COVID-19 Safety" "Contagious diseases" --dst test_dir
      # To only download
      pptxer download "COVID-19 Safety" "Contagious diseases" --dst test_dir --no-extract-text
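
After a download run, it can be useful to check which files actually landed in the target directory. The following is a minimal sketch using only `pathlib`. It assumes the downloaded presentations keep a `.pptx` extension and sit somewhere under the download directory; `list_pptx_files` is a hypothetical helper, not a pptxer function.

```python
from pathlib import Path
from typing import List


def list_pptx_files(directory: str) -> List[Path]:
    """Return all .pptx files under `directory` (recursively), sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        return []
    # rglob("*") also descends into subdirectories; the suffix is compared
    # case-insensitively so that "SLIDES.PPTX" is matched as well.
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pptx"
    )


if __name__ == "__main__":
    # "test_dir" is the download directory used in the example above.
    for path in list_pptx_files("test_dir"):
        print(path, path.stat().st_size, "bytes")
```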
    

Extractor

This mode extracts text from pptx files and returns, for each file, a dict with the body and note text of every slide. From the command line, the result is written to a JSON file.

  • To extract text from presentation files (pptx) or loop through presentation files within a directory
    • Python
    # Single file
    texts = extract_presentations_texts(["directory/test.pptx"])
    
    # Directory. Will scan the directory for pptx files, extract their text, and return it
    texts = extract_presentations_texts(["directory/"])
    
    # Combined file and directory
    texts = extract_presentations_texts(["directory/", "directory2/test.pptx"])
    
    • Command line
    # Single file
    pptxer extract directory/test.pptx
    # Directory
    pptxer extract directory/
    # File and directory
    pptxer extract directory1 directory2/test.pptx
    
    The output will be similar to the following:
    [{
        'path': 'test.pptx',
        'slides': [
            {'noteText': 'Note Line 1\nNote Line 2', 'bodyText': 'Label Test 1Body Line 1\nBody Line 2'},
            {'noteText': 'Note Line 1\nNote Line2', 'bodyText': ''}
        ],
        'bodyTextLengthStats': {'totalLength': 35, 'avgLength': 17.5, 'minLength': 0, 'maxLength': 35, 'medianLength': 17.5},
        'noteTextLengthStats': {'totalLength': 45, 'avgLength': 22.5, 'minLength': 22, 'maxLength': 23, 'medianLength': 22.5}
    }]
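
The length statistics in the sample appear to be character counts per slide. The sketch below recomputes them from the sample's slide texts with the standard `statistics` module. This is inferred from the sample output only, not from pptxer's source, and `length_stats` is a hypothetical helper.

```python
from statistics import mean, median
from typing import Dict, List

# Slide texts copied from the sample output above.
slides = [
    {"noteText": "Note Line 1\nNote Line 2", "bodyText": "Label Test 1Body Line 1\nBody Line 2"},
    {"noteText": "Note Line 1\nNote Line2", "bodyText": ""},
]


def length_stats(texts: List[str]) -> Dict[str, float]:
    """Summarise the character lengths of a non-empty list of texts."""
    lengths = [len(text) for text in texts]
    return {
        "totalLength": sum(lengths),
        "avgLength": mean(lengths),
        "minLength": min(lengths),
        "maxLength": max(lengths),
        "medianLength": median(lengths),
    }


body_stats = length_stats([slide["bodyText"] for slide in slides])
note_stats = length_stats([slide["noteText"] for slide in slides])

if __name__ == "__main__":
    print(body_stats)
    print(note_stats)
```

For the sample slides, both results match the `bodyTextLengthStats` and `noteTextLengthStats` entries shown above.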

Rate Limit

As of now, pptxer uses third-party search engines to look up files, and almost all search engines throttle or soft-ban an IP address when they detect automated queries coming from it. The soft ban usually lasts about a day. You will not be able to use pptxer in the meantime, but you can still use any search engine in your browser normally.
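
One way to be gentler on search engines is to space out queries rather than sending them back to back. The following is a generic sketch, not a pptxer feature: `run_spaced` and its parameters are hypothetical, the 60-second default is an arbitrary choice, and nothing here guarantees that throttling or a soft ban will be avoided.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def run_spaced(
    keywords: Iterable[str],
    action: Callable[[str], T],
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Call `action` once per keyword, pausing between consecutive calls."""
    results: List[T] = []
    for index, keyword in enumerate(keywords):
        if index > 0:
            sleep(delay_seconds)  # pause only between calls, not before the first
        results.append(action(keyword))
    return results


# Possible use with the downloader shown earlier (requires pptxer and network access):
# from pptxer.presentations_downloader import scrape_presentations_to_dir
# run_spaced(
#     ["COVID-19 Safety", "Contagious diseases"],
#     lambda kw: scrape_presentations_to_dir([kw], download_dir_path="test_dir"),
# )
```

The `sleep` parameter is injectable so the helper can be exercised without actually waiting.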

Issues

Feel free to open an issue if you have any problems.

Project details

Version: 0.1

Download files

Source Distribution

pptxer-0.1.tar.gz (9.4 kB)

Built Distribution

pptxer-0.1-py3-none-any.whl (9.7 kB, Python 3)

File details

Details for the file pptxer-0.1.tar.gz.

File metadata

  • Download URL: pptxer-0.1.tar.gz
  • Upload date:
  • Size: 9.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.4.0 pkginfo/1.7.0 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for pptxer-0.1.tar.gz

  • SHA256: 92915301d664508b26478d3617bc14182cb676af8af36c1d947187614c36eb78
  • MD5: d4a6122ed3a99f0e97e9dfb8780a4834
  • BLAKE2b-256: cbbc5c791eed13d9725ccd0efabab5a1253c81dc588742c886b68faa761a3a37

File details

Details for the file pptxer-0.1-py3-none-any.whl.

File metadata

  • Download URL: pptxer-0.1-py3-none-any.whl
  • Upload date:
  • Size: 9.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.4.0 pkginfo/1.7.0 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.9.5

File hashes

Hashes for pptxer-0.1-py3-none-any.whl

  • SHA256: dccc077a762054b42ba0bcd0fcac2040c963cb0eaa06bbbc9ba847e9914363a0
  • MD5: 3109e42705eef8bcb90c307701e9eaa2
  • BLAKE2b-256: 9dd5aa26e48a7fb9a65cc01434133e2afcf3e33176fdec9182e166dc0bb49d2f
