A Python library to parse text out of any office file. Currently supports docx, pptx, xlsx, odt, odp, ods, pdf files.

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

officeparserpy

A Python library to parse text out of any office file.

Supported File Types

- docx
- pptx
- xlsx
- odt
- odp
- ods
- pdf

Install via pip

pip install officeparserpy

Library Usage

from officeparserpy import parse_office

# USING FILE BUFFERS
# Instead of a file path, you can also pass the raw bytes (buffer) of any
# supported file to parse_office.

# Read the file contents into a buffer; the with block closes the file afterwards.
with open("/path/to/officeFile", "rb") as office_file:
    file_buffers = office_file.read()

# Get the parsed text from officeparserpy.
# NOTE: Only works with parse_office. Private functions are not supported.
data = parse_office(file_buffers)
print(data)
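
When working with buffers it can help to check what kind of data you are holding before handing it to parse_office. The sketch below is a hypothetical pre-check and is not part of officeparserpy, nor a description of how the library detects file types. It relies only on two well-known facts: docx, pptx, xlsx, odt, odp and ods files are ZIP archives, and PDF files begin with the bytes %PDF-. The name sniff_container is an assumption made for illustration.

```python
def sniff_container(buffer: bytes) -> str:
    """Hypothetical pre-check: classify a buffer by its leading magic bytes."""
    if buffer.startswith(b"PK\x03\x04"):
        # docx, pptx, xlsx, odt, odp and ods are all ZIP archives,
        # so they share the ZIP local-file-header signature.
        return "zip"
    if buffer.startswith(b"%PDF-"):
        # PDF files start with a version header such as %PDF-1.7.
        return "pdf"
    # Anything else is unlikely to be one of the supported formats.
    return "unknown"
```

A buffer classified as "unknown" is probably not worth passing to parse_office, although the library remains the final authority on what it accepts.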

Configuration Object: OfficeParserConfig

Optionally pass a config object as the second argument to parse_office to set any of the following options.

Flag	DataType	Default	Explanation
temp_files_location	string	officeparser_temp	The directory where officeparserpy stores its temp files. The final decompressed data is placed in an officeparser_temp folder within this directory. Please ensure that the directory actually exists.
preserve_temp_files	boolean	False	Flag to keep the internal content files and any duplicate temp files created while unzipping office files. With the default (False), all of those files are deleted.
output_error_to_console	boolean	False	Flag to print all logs to the console when an error occurs.
newline_delimiter	string	'\n'	The delimiter inserted at every new line in formats that allow multiline text, such as Word documents.
ignore_notes	boolean	False	Flag to exclude notes from the parsed text in files like PowerPoint. With the default (False), notes are included.
put_notes_at_last	boolean	False	If True, the parsed text of all notes is collected at the end of the output in files like PowerPoint. With the default (False), each note is placed right after its main slide content. Ignored when ignore_notes is True.
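
The flags in the table can be collected into a plain dict. The following is a minimal sketch of a helper that starts from the documented defaults, rejects misspelled flag names, and creates the temp directory in advance, since the table asks that it exists. build_config and DEFAULT_CONFIG are hypothetical names, not part of officeparserpy, and the sketch assumes the library accepts a dict that spells out every flag.

```python
import os

# Defaults copied from the configuration table.
DEFAULT_CONFIG = {
    "temp_files_location": "officeparser_temp",
    "preserve_temp_files": False,
    "output_error_to_console": False,
    "newline_delimiter": "\n",
    "ignore_notes": False,
    "put_notes_at_last": False,
}


def build_config(**overrides):
    """Hypothetical helper (not part of officeparserpy): merge overrides into the defaults."""
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        # Catch typos such as ignore_note early instead of silently ignoring them.
        raise ValueError(f"Unknown config flags: {sorted(unknown)}")
    config = {**DEFAULT_CONFIG, **overrides}
    # The temp directory is expected to exist, so create it up front.
    os.makedirs(config["temp_files_location"], exist_ok=True)
    return config
```

For example, build_config(ignore_notes=True) returns a dict that differs from the defaults only in ignore_notes and can be passed as the config argument.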

Exception Types

officeparserpy can raise the following exceptions:

Exception Type	Description
FileCorrupted	Raised when the file is corrupted.
ExtensionUnsupported	Raised when the file extension is unsupported.
FileDoesNotExist	Raised when the specified file does not exist.
LocationNotFound	Raised when the specified directory location is not reachable.
ImproperArguments	Raised for improper function arguments.
ImproperBuffers	Raised for errors while reading file buffers.

Example

from officeparserpy import parse_office, FileCorrupted, FileDoesNotExist

config = {
    'newline_delimiter': ' ',  # Separate new lines with a space instead of the default '\n'.
    'ignore_notes': True       # Ignore notes while parsing presentation files like pptx or odp.
}

# Search for a term in the parsed text.
def search_for_term_in_office_file(search_term, file_path):
    data = parse_office(file_path, config)
    return search_term in data

try:
    # A relative path is also fine, e.g. files/myWorkSheet.ods
    data = parse_office("/Users/harsh/Desktop/files/mySlides.pptx", config)
    new_text = data + " look, I can parse a PowerPoint file"
    call_some_other_function(new_text)  # placeholder for your own code

except FileDoesNotExist as file_not_found_error:
    print(f"Error: {file_not_found_error}")
    # Handle the case where the specified file does not exist.

except FileCorrupted as file_corrupted_error:
    print(f"Error: {file_corrupted_error}")
    # Handle the case where the file is corrupted.

except Exception as generic_error:
    print(f"An unexpected error occurred: {generic_error}")
    # Handle other unexpected errors.
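
The same search idea can be extended to several files at once. The sketch below is one possible approach, not something the library provides: search_files is a hypothetical helper, and the parser is passed in as a parameter (any callable with the same shape as parse_office) so that one unreadable file does not stop the whole run.

```python
def search_files(paths, search_term, parse, config=None):
    """Return {path: True/False/None}; None marks a file that could not be parsed.

    `parse` is any callable with the parse_office(path, config) shape.
    """
    results = {}
    for path in paths:
        try:
            # Only pass the config when one was supplied.
            text = parse(path, config) if config is not None else parse(path)
        except Exception:
            # Record the failure and keep going with the remaining files.
            results[path] = None
            continue
        results[path] = search_term in text
    return results
```

With the library installed, a call might look like search_files(["files/a.docx", "files/b.pptx"], "budget", parse_office, config).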

Known Bugs

Footnotes and endnotes are positioned inconsistently: in .docx files they end up at the end of the parsed text, whereas in .odt files they appear right after the referenced word.
Chart and object information in .odt files is not accurate and may show a few NaN values in some cases.

pip https://pypi.org/project/officeparserpy/

github https://github.com/harshankur/officeparserpy

Release history

Version	Release date
1.0.10	Feb 11, 2024 (this version)
1.0.9	Feb 7, 2024
1.0.8	Jan 2, 2024
1.0.7	Dec 23, 2023
1.0.6	Dec 23, 2023

Download files

Source Distribution

officeparserpy-1.0.10.tar.gz (13.5 kB)

Uploaded Feb 11, 2024 Source

Built Distribution

officeparserpy-1.0.10-py3-none-any.whl (12.4 kB)

Uploaded Feb 11, 2024 Python 3

File details

Details for the file officeparserpy-1.0.10.tar.gz.

File metadata

Download URL: officeparserpy-1.0.10.tar.gz
Upload date: Feb 11, 2024
Size: 13.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for officeparserpy-1.0.10.tar.gz
Algorithm	Hash digest
SHA256	`fb1c8aff255069d9062180e50545b591f94566e53ef0a43dde19879e0b9beb89`
MD5	`3b4f1ba5511976e0619e624a8068f7ff`
BLAKE2b-256	`bc378e23ca08534ba840716bf093b7f7fe3542fd1435b543f6635bb69d215f26`

File details

Details for the file officeparserpy-1.0.10-py3-none-any.whl.

File metadata

Download URL: officeparserpy-1.0.10-py3-none-any.whl
Upload date: Feb 11, 2024
Size: 12.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for officeparserpy-1.0.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8bb64daa4c7f0149da0a00bf9fb43a070d189e21529f0a9f17d59e15873f7f27`
MD5	`9f4c35b7ece273c55ff502ceba940917`
BLAKE2b-256	`3b5467eb6b01b60404cebc572ed9c2b612d296de00a2774a5fbbf1a8f0a18f87`
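
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library hashlib module; the SHA256 values are copied from the hash tables above, while sha256_of, matches_published and PUBLISHED_SHA256 are hypothetical names chosen for illustration.

```python
import hashlib

# SHA256 digests as published in the file hash tables above.
PUBLISHED_SHA256 = {
    "officeparserpy-1.0.10.tar.gz":
        "fb1c8aff255069d9062180e50545b591f94566e53ef0a43dde19879e0b9beb89",
    "officeparserpy-1.0.10-py3-none-any.whl":
        "8bb64daa4c7f0149da0a00bf9fb43a070d189e21529f0a9f17d59e15873f7f27",
}


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, filename):
    """Return True if the file at `path` has the digest published for `filename`."""
    return sha256_of(path) == PUBLISHED_SHA256[filename]
```

A result of False from matches_published means the local file differs from the published release and should not be installed.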
