Tool for extracting content from office files.

Project description

extract_office_content

Use

  1. Install extract_office_content.
    $ pip install extract_office_content
    
  2. Run from the CLI.
    • Extract the content of all Office files.
      $ extract_office_content -h
      usage: extract_office_content [-h] [-img_dir SAVE_IMG_DIR] file_path
      
      positional arguments:
      file_path
      
      optional arguments:
      -h, --help            show this help message and exit
      -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR
      
      $ extract_office_content tests/test_files
      
    • Extract Word.
      $ extract_word -h
      usage: extract_word [-h] [-img_dir SAVE_IMG_DIR] word_path
      
      positional arguments:
      word_path
      
      optional arguments:
      -h, --help            show this help message and exit
      -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR
      
      $ extract_word tests/test_files/word_example.docx
      
    • Extract PPT.
      $ extract_ppt -h
      usage: extract_ppt [-h] [-img_dir SAVE_IMG_DIR] ppt_path
      
      positional arguments:
      ppt_path
      
      optional arguments:
      -h, --help            show this help message and exit
      -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR
      
      $ extract_ppt tests/test_files/ppt_example.pptx
      
    • Extract Excel.
      $ extract_excel -h
      usage: extract_excel [-h] [-f {markdown,html,latex,string}] [-o SAVE_IMG_DIR]
                          excel_path
      
      positional arguments:
      excel_path
      
      optional arguments:
      -h, --help            show this help message and exit
      -f {markdown,html,latex,string}, --output_format {markdown,html,latex,string}
      -o SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR
      
      $ extract_excel tests/test_files/excel_example.xlsx
      
  3. Run from a Python script.
    • Extract all.
      from pathlib import Path
      from extract_office_content import ExtractOfficeContent
      
      
      extracter = ExtractOfficeContent()
      
      file_list = list(Path('tests/test_files').iterdir())
      
      for file_path in file_list:
          res = extracter(file_path)
          print(res)
      
    • Extract Word.
      from extract_office_content import ExtractWord
      
      word_extract = ExtractWord()
      
      word_path = 'tests/test_files/word_example.docx'
      text = word_extract(word_path, "outputs/word")
      
      # or bytes
      with open(word_path, 'rb') as f:
          word_content = f.read()
      text = word_extract(word_content, "outputs/word")
      print(text)
      
    • Extract PPT.
      from pathlib import Path
      
      from extract_office_content import ExtractPPT
      
      ppt_extracter = ExtractPPT()
      
      ppt_path = 'tests/test_files/ppt_example.pptx'
      save_dir = 'outputs'
      save_img_dir = Path(save_dir) / Path(ppt_path).stem
      res = ppt_extracter(ppt_path, save_img_dir=str(save_img_dir))
      
      # or bytes
      with open(ppt_path, 'rb') as f:
          ppt_content = f.read()
      res = ppt_extracter(ppt_content, save_img_dir=str(save_img_dir))
      print(res)
      
    • Extract Excel.
      from extract_office_content import ExtractExcel
      
      excel_extract = ExtractExcel()
      
      excel_path = 'tests/test_files/excel_with_image.xlsx'
      res = excel_extract(excel_path, out_format='markdown', save_img_dir='1')
      
      # or bytes
      with open(excel_path, 'rb') as f:
          excel_content = f.read()
      res = excel_extract(excel_content, out_format='markdown', save_img_dir='1')
      print(res)
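
The "Extract all" loop above passes every entry of a directory to the extractor. If the directory may also contain other files, a small pre-filter can help. The sketch below is only an illustration: `iter_office_files` and `OFFICE_SUFFIXES` are hypothetical names, and the suffix set is an assumption taken from the example files in this README, not a list confirmed by the package.

```python
from pathlib import Path
from typing import Iterator

# Assumption: these suffixes mirror the example files used in this README
# (word_example.docx, ppt_example.pptx, excel_example.xlsx).
OFFICE_SUFFIXES = {".docx", ".pptx", ".xlsx"}


def iter_office_files(directory: str) -> Iterator[Path]:
    """Yield Office files in `directory` in sorted order, skipping everything else."""
    for path in sorted(Path(directory).iterdir()):
        # Ignore sub-directories and compare suffixes case-insensitively.
        if path.is_file() and path.suffix.lower() in OFFICE_SUFFIXES:
            yield path
```

With this helper, the loop could iterate over `iter_office_files('tests/test_files')` instead of `Path('tests/test_files').iterdir()` and call `extracter(file_path)` on each result as before.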
      

See details for ExtractOfficeContent.
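
The CLI commands from step 2 can also be driven from Python. The following is a minimal sketch, not part of the package: `build_extract_excel_cmd` and `run_extract_excel` are hypothetical helpers, the flags and format choices are copied from the `extract_excel -h` output above, the `markdown` default belongs to this sketch only, and it is assumed that the command prints its result to standard output.

```python
import shutil
import subprocess
from typing import List, Optional

# Choices copied from the `extract_excel -h` usage text.
EXCEL_FORMATS = ("markdown", "html", "latex", "string")


def build_extract_excel_cmd(
    excel_path: str,
    output_format: str = "markdown",  # default chosen for this sketch only
    save_img_dir: Optional[str] = None,
) -> List[str]:
    """Build the argument list for the `extract_excel` command."""
    if output_format not in EXCEL_FORMATS:
        raise ValueError(f"unsupported format: {output_format!r}")
    cmd = ["extract_excel", "--output_format", output_format]
    if save_img_dir is not None:
        cmd += ["--save_img_dir", save_img_dir]
    cmd.append(excel_path)
    return cmd


def run_extract_excel(excel_path: str, output_format: str = "markdown") -> str:
    """Run the CLI and return whatever it wrote to standard output."""
    cmd = build_extract_excel_cmd(excel_path, output_format)
    # Fail early with a clear message if the CLI is not installed on PATH.
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("extract_excel is not installed or not on PATH")
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return done.stdout
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.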

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

extract_office_content-0.0.7-py3-none-any.whl (10.7 kB)

Uploaded Python 3

Hashes for extract_office_content-0.0.7-py3-none-any.whl
Algorithm      Hash digest
SHA256         8d5bc5f7bb5a8dc90d596d95bdd65aa43787b060345626130f23a24e58b98c85
MD5            a432cd4131525238012b9bea7014a37e
BLAKE2b-256    00a83c3de77223cba5b5bafa9dc8f2fed86b0c2fc994ef493c55f249072c9c44
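
A downloaded wheel can be checked against the SHA256 digest listed above. The snippet below is a small sketch using only the standard library; `sha256_of_file` and `matches_published_hash` are hypothetical helper names, and the local file path is whatever location the wheel was saved to.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For this release, `expected_hex` would be the SHA256 value from the table, and `path` the saved copy of `extract_office_content-0.0.7-py3-none-any.whl`.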
