Document parsing and data extraction library that lets you extract text, images, attachments, barcodes and structured content from popular formats such as PDF, Word, Excel, PowerPoint, emails, archives, images and more.

Product Home | Docs | Live Demos | API Reference | Blog | Search | Free Support Forum | Temporary License

About

GroupDocs.Parser for Python via .NET is a document parsing and data extraction library that lets you extract text, images, attachments, barcodes and structured content from popular formats such as PDF, Word, Excel, PowerPoint, emails, archives, images and more.

Quick Example: Extract Text from a PDF

Use a few lines of Python code to extract text from PDF and Office documents:

from groupdocs.parser import Parser

# Create a Parser instance for your document
with Parser("sample.pdf") as parser:
    # Extract text from the document
    text = parser.GetText()
    
    # Print all extracted text to the console
    print(text)

Key Features

GroupDocs.Parser for Python via .NET provides a single, unified API for advanced document parsing and data extraction:

  • Rich text extraction & search – Extract plain or formatted text from PDFs, Office documents, emails, e‑books, archives and more, with page‑level access and advanced search options (case‑sensitive, whole‑word, regex).
  • Structured content & templates – Parse document structure (headings, paragraphs, tables, text areas) and use templates to pull out strongly‑typed fields from invoices, receipts and other business documents.
  • Images, attachments & barcodes – Extract embedded images, file attachments and barcodes from supported document and image formats.
  • OCR for scanned documents – Use OCR to read text from scanned PDFs and raster images, optionally combining it with spell‑checking for better recognition quality.
  • Wide format & platform support – Work with dozens of document, image and archive formats on Windows, Linux and macOS using the .NET‑powered parsing engine from your Python code.
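
The three search options named above (case-sensitive, whole-word, regex) can be illustrated on already-extracted text with Python's standard `re` module. This is only a sketch of what each option means: `find_matches` is a hypothetical helper, not part of the GroupDocs.Parser API, and the library's own search options may behave differently.

```python
import re


def find_matches(text, query, match_case=False, whole_word=False, use_regex=False):
    """Return (start, end) spans of query in text (hypothetical helper)."""
    # Treat the query as a literal string unless regex mode is requested
    pattern = query if use_regex else re.escape(query)
    if whole_word:
        # \b anchors the match at word boundaries on both sides
        pattern = r"\b(?:" + pattern + r")\b"
    flags = 0 if match_case else re.IGNORECASE
    return [m.span() for m in re.finditer(pattern, text, flags)]


sample = "Total amount due. Subtotal: 10. TOTAL: 12."
print(find_matches(sample, "total"))                   # case-insensitive substring search
print(find_matches(sample, "total", whole_word=True))  # skips the match inside "Subtotal"
print(find_matches(sample, "TOTAL", match_case=True))  # exact case only
print(find_matches(sample, r"\d+", use_regex=True))    # regular expression
```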

Supported Document Formats

GroupDocs.Parser for Python via .NET supports a wide range of document families including:

  • Word processing – DOC, DOCX, RTF, TXT, ODT and others
  • PDF & markup – PDF, HTML/MHTML, Markdown, XML
  • Spreadsheets – XLS, XLSX, ODS, CSV and related formats
  • Presentations – PPT, PPTX, ODP and similar formats
  • Email & notes – PST, OST, EML, MSG, ONE
  • eBooks & web content – EPUB, MOBI, AZW3, CHM, FB2
  • Images – JPEG, PNG, TIFF, GIF, BMP, SVG and more
  • Archives & containers – ZIP, RAR, 7Z, TAR, GZ, BZ2

See the complete list of supported document formats.
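
When routing files before parsing, it can help to map a file extension to one of the families above. The sketch below uses only the standard library; `FORMAT_FAMILIES` and `format_family` are hypothetical names, the table covers only the extensions listed above, and the spellings `md`, `jpg` and `tif` are assumed aliases rather than something the list states.

```python
from pathlib import Path

# Extensions copied from the list above; not the complete supported set
FORMAT_FAMILIES = {
    "Word processing": {"doc", "docx", "rtf", "txt", "odt"},
    "PDF & markup": {"pdf", "html", "mhtml", "md", "xml"},  # "md" is assumed for Markdown
    "Spreadsheets": {"xls", "xlsx", "ods", "csv"},
    "Presentations": {"ppt", "pptx", "odp"},
    "Email & notes": {"pst", "ost", "eml", "msg", "one"},
    "eBooks & web content": {"epub", "mobi", "azw3", "chm", "fb2"},
    "Images": {"jpeg", "jpg", "png", "tiff", "tif", "gif", "bmp", "svg"},  # "jpg"/"tif" assumed
    "Archives & containers": {"zip", "rar", "7z", "tar", "gz", "bz2"},
}


def format_family(filename):
    """Return the family name for a file, or None if the extension is not listed."""
    # Only the last suffix is considered, so "data.tar.gz" is looked up as "gz"
    ext = Path(filename).suffix.lower().lstrip(".")
    for family, extensions in FORMAT_FAMILIES.items():
        if ext in extensions:
            return family
    return None


print(format_family("Report.DOCX"))
print(format_family("backup.tar.gz"))
print(format_family("notes.unknown"))
```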

Feature Support Matrix

GroupDocs.Parser for Python via .NET supports different features across document formats. Here's a quick overview:

Feature PDF Word Excel PowerPoint Email eBooks
Text Extraction
Table Extraction ⚠️
Image Extraction
Hyperlink Extraction ⚠️
Barcode Scanning ⚠️
Metadata Extraction
Table of Contents
Template Parsing

✅ Fully Supported | ⚠️ Limited Support | ❌ Not Supported


Getting Started

Prerequisites

  • Python 3.5+
  • Windows, Linux, or macOS

Learn more about system requirements.

Installation

You can install GroupDocs.Parser for Python via .NET from PyPI or download it from the official website.

Install from PyPI

pip install groupdocs-parser-net

Upgrade to the latest version

pip install --upgrade groupdocs-parser-net
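
To confirm which version pip installed, one option is to query the package metadata from Python. This sketch assumes Python 3.8 or later (where `importlib.metadata` is available) and uses the distribution name from the commands above; `installed_version` is a hypothetical helper.

```python
from importlib import metadata


def installed_version(distribution="groupdocs-parser-net"):
    """Return the installed version string, or None if the package is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string after a successful install, otherwise None
print(installed_version())
```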

Download from the official website

To download the GroupDocs.Parser package for your operating system, please visit the official GroupDocs Releases website and choose the appropriate package based on your system's architecture.

Learn more about installation.


Use Cases

Beyond basic text extraction, the most common use cases are text search, image extraction and metadata extraction.

📁 Code Examples: For complete, runnable examples with sample files, check out the GroupDocs.Parser for Python via .NET - Code Examples repository. See how to run code examples for more details.

Search Text in a Document

This example shows how to search for a specific phrase in a PDF document and print where it was found.

from groupdocs.parser import Parser

# Load the PDF document
with Parser("sample.pdf") as parser:
    # Search for a phrase in the document
    for area in parser.Search("Total Amount"):
        # Print page index and rectangle where phrase was found
        print(f"Page {area.PageIndex}, Rectangle: {area.Rectangle}")
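
When a phrase occurs many times, grouping the hits by page can make the output easier to read. The sketch below assumes only what the example above shows, namely that each result exposes `PageIndex` and `Rectangle`; `group_by_page` is a hypothetical helper, and the stand-in objects exist only so the snippet runs without a document.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_page(areas):
    """Map each page index to the list of rectangles found on that page."""
    pages = defaultdict(list)
    for area in areas:
        pages[area.PageIndex].append(area.Rectangle)
    return dict(pages)


# Stand-in results; with the library you would pass the search results instead
fake_areas = [
    SimpleNamespace(PageIndex=0, Rectangle="rect-a"),
    SimpleNamespace(PageIndex=2, Rectangle="rect-b"),
    SimpleNamespace(PageIndex=0, Rectangle="rect-c"),
]

for page, rectangles in sorted(group_by_page(fake_areas).items()):
    print(f"Page {page}: {len(rectangles)} match(es)")
```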

Extract Document Images

This example shows how to iterate over images embedded in a Word document and save them to disk.

from groupdocs.parser import Parser

# Load the Word document
with Parser("sample.docx") as parser:
    # Get images from the document
    images = parser.GetImages()

    # Save each image to a numbered PNG file
    for index, image in enumerate(images, start=1):
        image.Save(f"image_{index}.png")
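
To keep extracted images out of the working directory, the saving step can be wrapped in a small function. This is a sketch under stated assumptions: `save_images` is a hypothetical helper, it relies only on the `Save(path)` call used above, and the zero-padded naming scheme is an arbitrary choice. `FakeImage` is a stand-in so the snippet runs without a document.

```python
import tempfile
from pathlib import Path


def save_images(images, out_dir, prefix="image"):
    """Save each image into out_dir and return the list of paths written."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, image in enumerate(images, start=1):
        # Zero-padded numbers keep the files sorted in directory listings
        target = out_path / f"{prefix}_{index:03d}.png"
        image.Save(str(target))
        saved.append(target)
    return saved


class FakeImage:
    """Stand-in exposing the same Save(path) method as in the example above."""

    def Save(self, path):
        Path(path).write_bytes(b"")


with tempfile.TemporaryDirectory() as tmp:
    for path in save_images([FakeImage(), FakeImage()], tmp):
        print(path.name)
```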

Extract Document Metadata

This example shows how to read basic metadata such as author, creation date and other properties from a document.

from groupdocs.parser import Parser

# Load the document
with Parser("sample.pdf") as parser:
    # Get document metadata
    metadata = parser.GetMetadata()

    # Print all metadata items
    for item in metadata:
        print(f"{item.Name}: {item.Value}")
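
If you need to look up individual properties rather than print them all, the items can be collected into a dictionary. The sketch assumes only the `Name` and `Value` attributes shown above; `metadata_to_dict` is a hypothetical helper, and the stand-in items replace a real document.

```python
from types import SimpleNamespace


def metadata_to_dict(items):
    """Collect items exposing Name and Value into a plain dict."""
    result = {}
    for item in items:
        # Keep the first value if a name appears more than once
        result.setdefault(item.Name, item.Value)
    return result


# Stand-in items; with the library you would pass the metadata collection instead
fake_items = [
    SimpleNamespace(Name="author", Value="Jane Doe"),
    SimpleNamespace(Name="title", Value="Invoice"),
]

properties = metadata_to_dict(fake_items)
print(properties.get("author", "unknown"))
```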

Licensing

For testing without trial limitations, you can request a 30-day Temporary License:

  • Visit the Get a Temporary License page
  • Follow the instructions to request your temporary license
  • Copy the license file and apply it using the code example below:

import os
from groupdocs.parser import License

# Get absolute path to license file
license_path = os.path.abspath("./GroupDocs.Parser.lic")

# Instantiate License and set the license
license = License()
license.set_license(license_path)
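
Before applying a license, it may be worth checking that the file is actually present. The sketch below only locates the file; `first_existing` is a hypothetical helper and the candidate locations are examples, not paths the library searches on its own.

```python
import os


def first_existing(paths):
    """Return the absolute path of the first file that exists, or None."""
    for path in paths:
        full = os.path.abspath(os.path.expanduser(path))
        if os.path.isfile(full):
            return full
    return None


# Example locations only; adjust to wherever you keep the license file
candidates = ["./GroupDocs.Parser.lic", "~/GroupDocs.Parser.lic"]
license_path = first_existing(candidates)

if license_path is None:
    print("License file not found; continuing in evaluation mode")
else:
    # Pass license_path to set_license as in the example above
    print(f"Using license file: {license_path}")
```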

This product is licensed under the GroupDocs End User License Agreement (EULA). For pricing information, visit the GroupDocs.Parser for Python via .NET pricing page.


Support

GroupDocs provides unlimited free technical support for all of its products. Support is available to all users, including those evaluating the product, and is provided through the Free Support Forum, the Paid Support Helpdesk and Paid Consulting.

Free Support Forum

The GroupDocs Free Support Forum is available to all users and provides:

  • Direct access to the GroupDocs development team
  • Community-driven support and knowledge sharing
  • No time limitations on support requests
  • Access to historical solutions and discussions

Paid Support Helpdesk

The Paid Support Helpdesk offers:

  • Higher priority response times
  • Dedicated support team
  • Extended support hours
  • Priority issue resolution

Paid Consulting

We can work with you on your project and develop part of an application or the complete application. If you need new features in an existing GroupDocs product, or an API for new file formats, send us a request at consulting.groupdocs.com/contact.

