A Python library to parse text out of any office file. Currently supports docx, pptx, xlsx, odt, odp, ods, pdf files.

officeparserpy

A Python library to parse text out of any office file.

Supported File Types

docx, pptx, xlsx, odt, odp, ods, pdf

Install via pip

pip install officeparserpy

Library Usage

from officeparserpy import parse_office

# USING FILE BUFFERS
# Instead of a file path, you can also pass the file buffer (raw bytes) of any
# supported file to the parse_office function.

# Read the file into a buffer; the with block closes the file handle afterwards.
with open("/path/to/officeFile", "rb") as office_file:
    file_buffers = office_file.read()

# Get the parsed text from officeparserpy.
# NOTE: This only works with parse_office. Private functions are not supported.
data = parse_office(file_buffers)
print(data)
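
Because parse_office accepts the raw bytes of a supported file, reading those bytes can be wrapped in a small helper. The following is a minimal sketch: load_office_bytes and parse_from_bytes are hypothetical names, not part of officeparserpy, and the parser is passed in as an argument so the helpers do not depend on the library being installed. It assumes that parse_office returns the extracted text as a string, as the examples here suggest.

```python
from pathlib import Path
from typing import Callable


def load_office_bytes(path: str) -> bytes:
    """Read the whole file into memory; the file handle is closed on return."""
    return Path(path).read_bytes()


def parse_from_bytes(parse: Callable[[bytes], str], path: str) -> str:
    """Hypothetical helper: read `path` as bytes and hand them to `parse`.

    `parse` is expected to be officeparserpy's parse_office
    (assumption: it returns the parsed text as a str).
    """
    return parse(load_office_bytes(path))


# Intended usage, assuming officeparserpy is installed:
#   from officeparserpy import parse_office
#   text = parse_from_bytes(parse_office, "/path/to/officeFile")
```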

Configuration Object: OfficeParserConfig

Optionally pass a config object as the second argument to parse_office to set the following options:

| Flag | DataType | Default | Explanation |
| temp_files_location | string | officeparser_temp | The directory where officeparserpy stores its temp files. The final decompressed data is placed in the officeparser_temp folder inside this directory. Make sure the directory exists. |
| preserve_temp_files | boolean | False | If True, keeps the internal content files and any duplicate temp files created while unzipping office files. By default, all of these files are deleted. |
| output_error_to_console | boolean | False | If True, prints all logs to the console when an error occurs. |
| newline_delimiter | string | '\n' | The delimiter inserted for every new line in formats that allow multiline text, such as Word documents. |
| ignore_notes | boolean | False | If True, excludes notes from the parsed text in files like PowerPoint. By default, notes are included. |
| put_notes_at_last | boolean | False | If True, collects all parsed notes at the end of the text in files like PowerPoint. By default, each note follows its main slide content. Ignored when ignore_notes is True. |
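
To illustrate how these flags fit together, here is a minimal sketch that merges user overrides into the documented defaults and creates the temp directory, since the library expects it to exist. build_config and DEFAULT_CONFIG are hypothetical names, not part of officeparserpy, and the sketch assumes the config can be passed as a plain dict, as in the Example section.

```python
import os

# Defaults as listed in the configuration table above.
DEFAULT_CONFIG = {
    "temp_files_location": "officeparser_temp",
    "preserve_temp_files": False,
    "output_error_to_console": False,
    "newline_delimiter": "\n",
    "ignore_notes": False,
    "put_notes_at_last": False,
}


def build_config(**overrides):
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        # Catch misspelled flags early instead of silently ignoring them.
        raise ValueError(f"Unknown config flags: {sorted(unknown)}")
    config = {**DEFAULT_CONFIG, **overrides}
    # The temp directory is expected to exist, so create it if it is missing.
    os.makedirs(config["temp_files_location"], exist_ok=True)
    return config
```

A call such as build_config(ignore_notes=True) would then return a dict that could be passed as the second argument to parse_office.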

Exception Types

officeparserpy can raise the following exceptions:

| Exception Type | Description |
| FileCorrupted | Raised when the file is corrupted. |
| ExtensionUnsupported | Raised when the file extension is unsupported. |
| FileDoesNotExist | Raised when the specified file does not exist. |
| LocationNotFound | Raised when the specified directory location is not reachable. |
| ImproperArguments | Raised for improper function arguments. |
| ImproperBuffers | Raised for errors while reading file buffers. |
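
As a rough illustration of how these exception types might be turned into user-facing messages, the sketch below looks up a description by the exception's class name. describe_error and EXCEPTION_DESCRIPTIONS are hypothetical names, not part of officeparserpy. Matching on the class name is only a shortcut that keeps the sketch free of imports; in real code, catching the imported exception classes directly is the more robust choice.

```python
# Descriptions copied from the exception table above, keyed by class name.
EXCEPTION_DESCRIPTIONS = {
    "FileCorrupted": "The file is corrupted.",
    "ExtensionUnsupported": "The file extension is unsupported.",
    "FileDoesNotExist": "The specified file does not exist.",
    "LocationNotFound": "The specified directory location is not reachable.",
    "ImproperArguments": "Improper function arguments were passed.",
    "ImproperBuffers": "An error occurred while reading file buffers.",
}


def describe_error(error: Exception) -> str:
    """Hypothetical helper: map an exception to a short readable message."""
    name = type(error).__name__
    # Fall back to the exception's own message for anything undocumented.
    return EXCEPTION_DESCRIPTIONS.get(name, f"Unexpected error: {error}")
```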

Example

from officeparserpy import parse_office, FileCorrupted, FileDoesNotExist

config = {
    'newline_delimiter': ' ',  # Separate new lines with a space instead of the default '\n'.
    'ignore_notes': True       # Ignore notes while parsing presentation files like pptx or odp.
}

# Search for a term in the parsed text.
def search_for_term_in_office_file(search_term, file_path):
    data = parse_office(file_path, config)
    return search_term in data

try:
    # A relative path also works, e.g. files/myWorkSheet.ods
    data = parse_office("/Users/harsh/Desktop/files/mySlides.pptx", config)
    new_text = data + " look, I can parse a PowerPoint file"
    print(new_text)

except FileDoesNotExist as file_not_found_error:
    # Handle the case where the specified file does not exist.
    print(f"Error: {file_not_found_error}")

except FileCorrupted as file_corrupted_error:
    # Handle the case where the file is corrupted.
    print(f"Error: {file_corrupted_error}")

except Exception as generic_error:
    # Handle other unexpected errors.
    print(f"An unexpected error occurred: {generic_error}")
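
The term search from the example can be extended to a whole directory. The following is a minimal sketch under stated assumptions: find_files_containing and SUPPORTED_EXTENSIONS are hypothetical names, not part of officeparserpy, and the parser is passed in as a callable (for instance a small wrapper around parse_office) so the function can be tried without the library. Files that fail to parse are skipped rather than aborting the search.

```python
from pathlib import Path
from typing import Callable, List

# Extensions taken from the supported file types listed for the library.
SUPPORTED_EXTENSIONS = {".docx", ".pptx", ".xlsx", ".odt", ".odp", ".ods", ".pdf"}


def find_files_containing(
    search_term: str,
    directory: str,
    parse: Callable[[str], str],
) -> List[str]:
    """Return paths of supported files under `directory` whose text contains the term."""
    matches = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            continue
        try:
            text = parse(str(path))
        except Exception:
            # Skip files that cannot be parsed (corrupted, unreadable, and so on).
            continue
        if search_term in text:
            matches.append(str(path))
    return matches


# Intended usage, assuming officeparserpy is installed:
#   from officeparserpy import parse_office
#   hits = find_files_containing("budget", "files", parse_office)
```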

Known Bugs

  1. Footnotes and endnotes are positioned inconsistently across formats: in .docx files they end up at the end of the parsed text, whereas in .odt files they appear immediately after the referenced word.
  2. Chart and object information extracted from .odt files is not always accurate and may include a few NaN values.

pip https://pypi.org/project/officeparserpy/

github https://github.com/harshankur/officeparserpy
