GroupDocs.Parser for Python via .NET is a powerful API designed for advanced document parsing, offering extensive features like text extraction, metadata retrieval, and image extraction across various document formats, including PDFs, Word, Excel, and PowerPoint.

Project description

Advanced Document Parsing API for Python via .NET

Product Page | Docs | Demos | API Reference | Blog | Search | Free Support | Temporary License

GroupDocs.Parser for Python via .NET is a powerful on-premise document parsing library that lets you extract text, metadata, images, attachments, barcodes and structured content from dozens of popular formats – including PDF, Word, Excel, PowerPoint, emails, archives, images and more.

You can embed GroupDocs.Parser into your own Python applications without installing any 3rd-party office suites. GroupDocs also provides free online apps built on top of the same APIs that allow users to parse PDF, Office and other documents right in the browser.

Document Parser API Features

GroupDocs.Parser for Python via .NET provides a single, unified API for advanced document parsing and data extraction:

  • Text extraction

    • Extract text from PDF, Word, Excel, PowerPoint, e-books, emails and many other formats.
    • Work in accurate or raw text modes depending on your scenario.
    • Keep track of pages and logical blocks of text.
  • Preserve structure & formatting

    • Retrieve formatted text with font styles, sizes and basic layout information.
    • Analyze document structure – paragraphs, lists, headings, table cells, etc.
  • Text search

    • Search for specific words or phrases in documents.
    • Use advanced search options such as case sensitivity, whole-word matching or regular expressions.
  • OCR text extraction

    • Extract text from scanned PDFs and raster images using OCR options.
    • Combine OCR with spell-checking in supported environments for better recognition quality.
  • Metadata extraction

    • Read common metadata properties like author, title, subject and keywords.
    • Extract creation / modification dates and other technical properties.
    • Retrieve custom fields such as invoice numbers or business IDs.
  • Image & attachment extraction

    • Extract embedded images from Office documents, PDFs, e-books and more.
    • Pull file attachments from PDFs and email messages.
    • Extract barcodes from supported document and image formats.
  • Document structure analysis

    • Parse tables, including rows, columns and individual cells.
    • Detect text areas and content blocks for fine-grained extraction.
    • Extract hyperlinks, bookmarks and table of contents (TOC) where supported.
  • PDF-specific parsing

    • Extract text, images, metadata and attachments from PDFs.
    • Get PDF page count and PDF-specific document information.
    • Work with bookmarks, forms and PDF portfolios.
  • Email parsing

    • Extract sender, recipients, subject and body from emails.
    • Get email metadata and embedded attachments.
    • Work with formats like MSG, EML, EMLX, PST and OST.
  • Spreadsheet parsing

    • Extract text and data from Excel and other spreadsheet formats.
    • Work with specific sheets, ranges or individual cells.
    • Extract spreadsheet metadata and images.
  • Presentation parsing

    • Extract text, notes, images and metadata from PowerPoint files.
    • Work with slide-by-slide content, including shapes and notes.
  • Template-based data extraction

    • Define parsing templates to extract structured fields (e.g. invoices, receipts).
    • Use templates to describe positions of fields, tables and patterns.
    • Apply your own parsing rules for domain-specific scenarios.
  • Advanced & batch features

    • High-performance processing for large documents and document batches.
    • Cross-platform support (Windows, Linux, macOS) via the .NET runtime.
    • Build scalable, secure parsing workflows in your Python applications.
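
The text search options listed above (case sensitivity, whole-word matching, regular expressions) can be illustrated on text that has already been extracted. The sketch below uses only Python's standard `re` module. `find_matches` and its parameters are hypothetical names chosen for illustration and are not part of the GroupDocs.Parser API.

```python
import re


def find_matches(text, query, *, case_sensitive=False, whole_word=False, use_regex=False):
    """Return (start_offset, matched_text) pairs for every hit of `query` in `text`.

    Hypothetical helper for illustration only; it is not a GroupDocs.Parser call.
    """
    # Treat the query as a literal phrase unless regex mode is requested.
    pattern = query if use_regex else re.escape(query)
    if whole_word:
        # \b anchors the match at word boundaries.
        pattern = rf"\b(?:{pattern})\b"
    flags = 0 if case_sensitive else re.IGNORECASE
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, text, flags)]


if __name__ == "__main__":
    sample = "Invoice INV-1042 was paid. The invoice total is 250 USD."
    print(find_matches(sample, "invoice"))
    print(find_matches(sample, "invoice", case_sensitive=True))
    print(find_matches(sample, r"INV-\d+", use_regex=True))
```

Keeping the three options as independent keyword arguments mirrors how the feature list describes them: each one narrows or widens the match without affecting the others.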

Supported Document Formats

GroupDocs.Parser for Python via .NET supports a wide range of document families. Below is an overview of the most important ones.

Word Processing

  • DOC, DOT – Microsoft Word binary documents & templates
  • DOCX, DOCM, DOTX, DOTM – Office Open XML documents & templates
  • RTF – Rich Text Format
  • TXT – Plain text
  • ODT, OTT – OpenDocument text documents & templates

Typical operations: text extraction (accurate & raw), structured text parsing, text areas, metadata, images, attachments, TOC, barcode scanning.

PDF

  • PDF – Portable Document Format

Operations: template-based parsing, accurate & raw text extraction, text areas, metadata, images, attachments/containers, forms, TOC, barcode scanning.

Markup

  • XHTML – Extensible Hypertext Markup Language
  • MHTML – MIME HTML
  • MD – Markdown
  • XML – XML files

Operations: text extraction (including formatted text for supported types) and metadata extraction.

eBook

  • CHM – Compiled HTML Help
  • EPUB – Digital e-book format
  • FB2 – FictionBook 2.0
  • MOBI, AZW3 – Mobile/Kindle formats

Operations: text extraction, structured text, metadata, containers, TOC support for selected formats, barcode scanning for supported types.

Spreadsheets

  • XLS, XLT, XLSX, XLSM, XLSB
  • XLTX, XLTM
  • ODS, OTS – OpenDocument spreadsheets
  • CSV – Comma-Separated Values
  • XLA, XLAM – add-ins
  • NUMBERS – Apple iWork Numbers

Operations: text & data extraction, structured content, text areas, metadata, images, containers/attachments.

Presentations

  • PPT, PPS, POT – binary PowerPoint
  • PPTX, PPTM, PPSX, PPSM, POTX, POTM – Office Open XML
  • ODP, OTP – OpenDocument presentations

Operations: slide text and notes, structured text, text areas, metadata, images, attachments, TOC, barcode scanning.

Email

  • PST, OST – Outlook data files
  • EML, EMLX, MSG – email messages

Operations: email body text, metadata (from/to/subject), attachments, images and containers.

Notes

  • ONE – Microsoft OneNote documents

Operations: text extraction and basic metadata support.

Archives

  • 7Z, ZIP, RAR, TAR, GZ, BZ2

Operations: work with containers – extract inner documents and attachments, including images.

Encrypted 7Z archives are not supported.

Images

  • BMP, GIF, JP2, JPG/JPEG, PNG, TIF/TIFF
  • DICOM, DJVU, EMF, J2K, PS, PSD, SVG, SVGZ, WEBP, WMF

Operations: text extraction (for some formats via OCR), metadata, barcode scanning (where supported).

Databases

  • ADO.NET-based data sources and supported database formats

Operations: text and structured data extraction using database-specific options.


Platform Independence

GroupDocs.Parser for Python via .NET can be used to build 32-bit and 64-bit applications for different operating systems, such as Windows, Linux and macOS, where a supported Python 3.x version is installed.

The parsing engine is powered by the same core technology as the GroupDocs.Parser .NET library, giving you production-ready performance and compatibility in Python environments.


Get Started

Ready to try GroupDocs.Parser for Python via .NET?

You can install the Python package from PyPI and reference it in your project. The exact package name and version may depend on the final distribution, but the flow will be similar to other GroupDocs Python via .NET libraries:

Install GroupDocs.Parser for Python via .NET from PyPI

pip install groupdocs.parser-net

Upgrade to the latest version

pip install --upgrade groupdocs.parser-net
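
After installing or upgrading, it can help to confirm which version is actually present in the active environment. The sketch below relies only on the standard library's `importlib.metadata`. `installed_version` is a hypothetical helper name, and the distribution name is taken from the `pip` commands above.

```python
from importlib import metadata


def installed_version(distribution="groupdocs.parser-net"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(version if version else "groupdocs.parser-net is not installed")
```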

Or

Download Package from Official Website

To download the GroupDocs.Parser package for your operating system, please visit the official GroupDocs Releases website. Currently, four OS-specific packages are available:

  • Windows 64-bit: Package name ends with amd64.whl
  • Windows 32-bit: Package name ends with win32.whl
  • Linux 64-bit: Package name ends with manylinux1_x86_64.whl
  • macOS Intel (x86-64): Package name ends with macosx_10_14_x86_64.whl

Choose the appropriate package based on your system's architecture.
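
If you are unsure which wheel matches your interpreter, the choice can be sketched in a few lines of standard-library Python. `wheel_suffix` is a hypothetical helper, not part of GroupDocs.Parser. The suffixes are copied from the wheel file names listed on this page, including the `macosx_11_0_arm64` wheel that appears in the file list. Any platform not listed there yields `None`.

```python
import platform
import struct


def wheel_suffix(system=None, machine=None, bits=None):
    """Return the wheel file-name suffix for a platform, or None if unlisted.

    Hypothetical helper: it only knows the wheels named on this page.
    """
    system = system or platform.system()
    machine = (machine or platform.machine()).lower()
    # The interpreter's pointer size distinguishes 32-bit from 64-bit Python,
    # which matters more than the OS architecture when picking a wheel.
    bits = bits or struct.calcsize("P") * 8

    if system == "Windows":
        if bits == 32:
            return "win32.whl"
        if machine in ("amd64", "x86_64"):
            return "win_amd64.whl"
    elif system == "Linux":
        if bits == 64 and machine in ("x86_64", "amd64"):
            return "manylinux1_x86_64.whl"
    elif system == "Darwin":
        if machine == "arm64":
            return "macosx_11_0_arm64.whl"
        if machine == "x86_64":
            return "macosx_10_14_x86_64.whl"
    return None


if __name__ == "__main__":
    print(wheel_suffix() or "no matching wheel listed for this platform")
```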

Quick Text Extraction Example

The snippet below demonstrates how a typical usage scenario for extracting text from a PDF document might look in Python.

import groupdocs.parser as gp

def run():
    # Load the PDF document
    with gp.Parser("sample.pdf") as parser:
        # Extract text from the document
        text = parser.GetText()

        # Output the extracted text
        print(text)

if __name__ == "__main__":
    run()

Extract Images from a Word Document

This example shows how to iterate over images embedded in a Word document and save them to disk.

import groupdocs.parser as gp

def run():
    # Load the Word document
    with gp.Parser("sample.docx") as parser:
        # Get images from the document
        images = parser.GetImages()

        # Save each image to a numbered PNG file (image1.png, image2.png, ...)
        for index, image in enumerate(images, start=1):
            image.Save(f"image{index}.png")

if __name__ == "__main__":
    run()

GroupDocs.Parser for Python via .NET is intended for Python applications. For Node.js, Java and .NET, we recommend GroupDocs.Parser for Node.js, GroupDocs.Parser for Java and GroupDocs.Parser for .NET, respectively.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

groupdocs_parser_net-25.12-py3-none-win_amd64.whl (218.6 MB)

Uploaded: Python 3, Windows x86-64

groupdocs_parser_net-25.12-py3-none-win32.whl (213.5 MB)

Uploaded: Python 3, Windows x86

groupdocs_parser_net-25.12-py3-none-macosx_11_0_arm64.whl (228.8 MB)

Uploaded: Python 3, macOS 11.0+ ARM64

groupdocs_parser_net-25.12-py3-none-macosx_10_14_x86_64.whl (234.5 MB)

Uploaded: Python 3, macOS 10.14+ x86-64

File details

Details for the file groupdocs_parser_net-25.12-py3-none-win_amd64.whl.

File hashes

Hashes for groupdocs_parser_net-25.12-py3-none-win_amd64.whl
Algorithm Hash digest
SHA256 d74a780079b4cbc27df1b694903f4801e9acf305ed3e4f3b79a8d002ec1e2f38
MD5 815fa7baa92d978a9222ab7ad02f8833
BLAKE2b-256 a9fe9f06dabeab3bf7bbb50ae86001b229afd44fbb9185d6326c7625049eccb1

See more details on using hashes here.

File details

Details for the file groupdocs_parser_net-25.12-py3-none-win32.whl.

File hashes

Hashes for groupdocs_parser_net-25.12-py3-none-win32.whl
Algorithm Hash digest
SHA256 c13aa5d48266684e238da69e81e5f92fc13b3178176c824b8750f8c84702a6d8
MD5 74891db30797f4f456be2d01075f3993
BLAKE2b-256 5adb8db77dd1c7c18a0cf755f6f7a02b6e9e13131aa6aaffe14bdfa0a6314db5

File details

Details for the file groupdocs_parser_net-25.12-py3-none-manylinux1_x86_64.whl.

File hashes

Hashes for groupdocs_parser_net-25.12-py3-none-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 d4d5bcef5ab6ffdb651c3fbe1ba390be544af302de36b49ebc623bf987af5072
MD5 c82e5b9030e88045b2c0ba2063f9503d
BLAKE2b-256 15beba6dd4207021ff67363d4e7e07d02c28a34ad5dad024d51ab9996271df2a

File details

Details for the file groupdocs_parser_net-25.12-py3-none-macosx_11_0_arm64.whl.

File hashes

Hashes for groupdocs_parser_net-25.12-py3-none-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 6eb970b01500df654ba98d88157f236241bb4c6fc5cc4392d59f2a7111e45283
MD5 822f6b55a420a39cd2e30e39e3aa0fea
BLAKE2b-256 e59a940c05f539d9ed4bc07c6c32818e6584cb675f5acb184217b444c251886c

File details

Details for the file groupdocs_parser_net-25.12-py3-none-macosx_10_14_x86_64.whl.

File hashes

Hashes for groupdocs_parser_net-25.12-py3-none-macosx_10_14_x86_64.whl
Algorithm Hash digest
SHA256 388c40cbb61aebbdb0657ec6db6dc0bc1a69f178467a0288ddef731f757e5e46
MD5 17934f649c9c1ba530feef93ad23d5af
BLAKE2b-256 31a31a7683c41ee42bda4ae7439ff83d4f7435645eeeaffd4e7327e4646231d5
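
The published digests can be used to check a downloaded wheel before installing it. The sketch below computes a file's SHA256 with the standard `hashlib` module and compares it to an expected value. `sha256_of` and `matches_published_hash` are hypothetical helper names, and the local path in the usage section is an assumption about where the wheel was saved. The file name and digest are copied from the win_amd64 entry above.

```python
import hashlib
import hmac
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for wheels of 200+ MB.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published one."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


if __name__ == "__main__":
    wheel = "groupdocs_parser_net-25.12-py3-none-win_amd64.whl"
    expected = "d74a780079b4cbc27df1b694903f4801e9acf305ed3e4f3b79a8d002ec1e2f38"
    if os.path.exists(wheel):
        print("hash OK" if matches_published_hash(wheel, expected) else "hash MISMATCH")
    else:
        print(f"{wheel} not found in the current directory")
```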
