Tool for extracting content from office files.
Project description
extract_office_content
Use
- Install extract_office_content.

    $ pip install extract_office_content
- Run from the CLI.
- Extract the content of all office files in a directory.

    $ extract_office_content -h
    usage: extract_office_content [-h] [-img_dir SAVE_IMG_DIR] file_path

    positional arguments:
      file_path

    optional arguments:
      -h, --help            show this help message and exit
      -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

    $ extract_office_content tests/test_files
- Extract Word.
    $ extract_word -h
    usage: extract_word [-h] [-img_dir SAVE_IMG_DIR] word_path

    positional arguments:
      word_path

    optional arguments:
      -h, --help            show this help message and exit
      -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

    $ extract_word tests/test_files/word_example.docx
- Extract PPT.
    $ extract_ppt -h
    usage: extract_ppt [-h] [-img_dir SAVE_IMG_DIR] ppt_path

    positional arguments:
      ppt_path

    optional arguments:
      -h, --help            show this help message and exit
      -img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

    $ extract_ppt tests/test_files/ppt_example.pptx
- Extract Excel.
    $ extract_excel -h
    usage: extract_excel [-h] [-f {markdown,html,latex,string}] [-o SAVE_IMG_DIR] excel_path

    positional arguments:
      excel_path

    optional arguments:
      -h, --help            show this help message and exit
      -f {markdown,html,latex,string}, --output_format {markdown,html,latex,string}
      -o SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR

    $ extract_excel tests/test_files/excel_example.xlsx
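The CLI commands above can also be driven from a script. The following is a minimal sketch, assuming the `extract_excel` command is on `PATH` after installation and that it prints its result to standard output (the README does not state this). The helper names `build_extract_excel_cmd` and `run_extract_excel` are hypothetical; only the `-f` and `-o` flags shown in the help text are used.

```python
import shutil
import subprocess
from typing import List, Optional

# Output formats listed in the `extract_excel -h` text above.
OUTPUT_FORMATS = ("markdown", "html", "latex", "string")


def build_extract_excel_cmd(
    excel_path: str,
    output_format: Optional[str] = None,
    save_img_dir: Optional[str] = None,
) -> List[str]:
    """Build the argv list for the extract_excel CLI (hypothetical helper)."""
    cmd = ["extract_excel"]
    if output_format is not None:
        if output_format not in OUTPUT_FORMATS:
            raise ValueError(f"unsupported output format: {output_format!r}")
        cmd += ["-f", output_format]
    if save_img_dir is not None:
        cmd += ["-o", save_img_dir]
    cmd.append(excel_path)
    return cmd


def run_extract_excel(excel_path: str, **options) -> str:
    """Run the CLI and return its stdout; assumes the command is installed."""
    if shutil.which("extract_excel") is None:
        raise RuntimeError("extract_excel not found; install the package first")
    cmd = build_extract_excel_cmd(excel_path, **options)
    # Assumption: the extracted content is written to stdout.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Building the argument list separately from running it keeps the flag handling easy to check without invoking the tool.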
- Run from a Python script.
- Extract all.
    from pathlib import Path

    from extract_office_content import ExtractOfficeContent

    extracter = ExtractOfficeContent()

    # Run the extractor on every entry in the test directory.
    file_list = list(Path('tests/test_files').iterdir())
    for file_path in file_list:
        res = extracter(file_path)
        print(res)
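When a directory also holds files that are not office documents, it can help to filter before calling the extractor. The following is a minimal sketch under stated assumptions: the suffix set is taken from the example files in this README and is not a list of formats the library claims to support, and `iter_office_files` and `extract_directory` are hypothetical helpers. The extractor is passed in as a callable, so any object with the call form shown above should fit.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# Assumption: these suffixes mirror the example files used in this README.
OFFICE_SUFFIXES = {".docx", ".pptx", ".xlsx"}


def iter_office_files(root: Path, suffixes: Iterable[str] = OFFICE_SUFFIXES):
    """Yield files directly under root whose suffix is in suffixes."""
    wanted = {s.lower() for s in suffixes}
    for path in sorted(root.iterdir()):
        if path.is_file() and path.suffix.lower() in wanted:
            yield path


def extract_directory(root: Path, extractor: Callable) -> Dict[str, object]:
    """Call extractor on each matching file; store the result or the exception."""
    results: Dict[str, object] = {}
    for path in iter_office_files(root):
        try:
            results[path.name] = extractor(path)
        except Exception as exc:  # keep going if one file fails
            results[path.name] = exc
    return results
```

With the package installed, a call might look like `extract_directory(Path('tests/test_files'), ExtractOfficeContent())`.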
- Extract Word.
    from extract_office_content import ExtractWord

    word_extract = ExtractWord()

    # Pass a file path
    word_path = 'tests/test_files/word_example.docx'
    text = word_extract(word_path, "outputs/word")

    # or bytes
    with open(word_path, 'rb') as f:
        word_content = f.read()
    text = word_extract(word_content, "outputs/word")
    print(text)
- Extract PPT.
    from pathlib import Path

    from extract_office_content import ExtractPPT

    ppt_extracter = ExtractPPT()

    # Save images under outputs/<file name without extension>
    ppt_path = 'tests/test_files/ppt_example.pptx'
    save_dir = 'outputs'
    save_img_dir = Path(save_dir) / Path(ppt_path).stem

    # Pass a file path
    res = ppt_extracter(ppt_path, save_img_dir=str(save_img_dir))

    # or bytes
    with open(ppt_path, 'rb') as f:
        ppt_content = f.read()
    res = ppt_extracter(ppt_content, save_img_dir=str(save_img_dir))
    print(res)
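The per-file image directory used above (`save_dir` joined with the file stem) can be factored into a small helper when several files are processed. This is only a sketch: `image_dir_for` is a hypothetical name, and creating the directory up front is a precaution, since the README does not say whether the library creates it.

```python
from pathlib import Path


def image_dir_for(file_path: str, save_dir: str = "outputs") -> Path:
    """Return save_dir/<file stem>, creating it if needed (hypothetical helper)."""
    target = Path(save_dir) / Path(file_path).stem
    target.mkdir(parents=True, exist_ok=True)
    return target
```

The returned path can then be passed as `save_img_dir=str(image_dir_for(ppt_path))`.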
- Extract Excel.
    from extract_office_content import ExtractExcel

    excel_extract = ExtractExcel()

    # Pass a file path
    excel_path = 'tests/test_files/excel_with_image.xlsx'
    res = excel_extract(excel_path, out_format='markdown', save_img_dir='1')

    # or bytes
    with open(excel_path, 'rb') as f:
        excel_content = f.read()
    res = excel_extract(excel_content, out_format='markdown', save_img_dir='1')
    print(res)
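To compare output formats side by side, the same workbook can be extracted once per format and written to disk. The sketch below makes several assumptions: that `out_format` accepts the same four values the `extract_excel` CLI lists for `-f`, that the result can be converted with `str()`, and that the file suffixes chosen here are reasonable. `export_all_formats` is a hypothetical helper, and the extractor is passed in as a callable with the call form shown above.

```python
from pathlib import Path
from typing import Callable, Dict

# Format names come from the extract_excel CLI help; the suffixes are an assumption.
FORMAT_SUFFIXES = {
    "markdown": ".md",
    "html": ".html",
    "latex": ".tex",
    "string": ".txt",
}


def export_all_formats(
    excel_path: str,
    extractor: Callable,
    out_dir: str,
    save_img_dir: str,
) -> Dict[str, Path]:
    """Run extractor once per format and write each result into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: Dict[str, Path] = {}
    for fmt, suffix in FORMAT_SUFFIXES.items():
        res = extractor(excel_path, out_format=fmt, save_img_dir=save_img_dir)
        target = out / (Path(excel_path).stem + suffix)
        # Assumption: the result has a useful string representation.
        target.write_text(str(res), encoding="utf-8")
        written[fmt] = target
    return written
```

With the package installed, `export_all_formats(excel_path, ExtractExcel(), 'outputs/excel', 'outputs/excel/img')` would be one way to call it.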
See details for ExtractOfficeContent.
Source Distributions
No source distribution files are available for this release.
Built Distribution
File details
Details for the file extract_office_content-0.0.7-py3-none-any.whl.
File metadata
- Download URL: extract_office_content-0.0.7-py3-none-any.whl
- Upload date:
- Size: 10.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.17
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8d5bc5f7bb5a8dc90d596d95bdd65aa43787b060345626130f23a24e58b98c85
MD5 | a432cd4131525238012b9bea7014a37e
BLAKE2b-256 | 00a83c3de77223cba5b5bafa9dc8f2fed86b0c2fc994ef493c55f249072c9c44
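A downloaded wheel can be checked against the SHA256 digest listed above. The following is a minimal sketch using only the standard library; `sha256_of` and `verify_wheel` are hypothetical helper names, and the local file path is whatever location the wheel was saved to.

```python
import hashlib
from pathlib import Path

# SHA256 digest copied from the table above.
EXPECTED_SHA256 = "8d5bc5f7bb5a8dc90d596d95bdd65aa43787b060345626130f23a24e58b98c85"


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's SHA256 digest matches the expected value."""
    return sha256_of(path) == expected.lower()
```

For example, `verify_wheel(Path('extract_office_content-0.0.7-py3-none-any.whl'))` returns `True` only when the local file matches the published digest.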