Document parsing and data extraction library that lets you extract text, images, attachments, barcodes and structured content from popular formats such as PDF, Word, Excel, PowerPoint, emails, archives, images and more.
Product Home | Docs | Live Demos | API Reference | Blog | Search | Free Support Forum | Temporary License
Table of Contents
- About
- Quick Example
- Key Features
- Supported Document Formats
- Feature Support Matrix
- Getting Started
- Licensing
- Support
- Additional Resources
About
GroupDocs.Parser for Python via .NET is a document parsing and data extraction library that lets you extract text, images, attachments, barcodes and structured content from popular formats such as PDF, Word, Excel, PowerPoint, emails, archives, images and more.
Quick Example: Extract Text from a PDF
Use a few lines of Python code to extract text from PDF and Office documents:
from groupdocs.parser import Parser
# Create a Parser instance for your document
with Parser("sample.pdf") as parser:
    # Extract text from the document
    text = parser.GetText()
    # Print all extracted text to the console
    print(text)
Key Features
GroupDocs.Parser for Python via .NET provides a single, unified API for advanced document parsing and data extraction:
- Rich text extraction & search – Extract plain or formatted text from PDFs, Office documents, emails, e‑books, archives and more, with page‑level access and advanced search options (case‑sensitive, whole‑word, regex).
- Structured content & templates – Parse document structure (headings, paragraphs, tables, text areas) and use templates to pull out strongly‑typed fields from invoices, receipts and other business documents.
- Images, attachments & barcodes – Extract embedded images, file attachments and barcodes from supported document and image formats.
- OCR for scanned documents – Use OCR to read text from scanned PDFs and raster images, optionally combining it with spell‑checking for better recognition quality.
- Wide format & platform support – Work with dozens of document, image and archive formats on Windows, Linux and macOS using the .NET‑powered parsing engine from your Python code.
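To make the three search options named above concrete, here is a small standard-library sketch of what case-sensitive, whole-word and regex matching mean for a plain string. It uses only Python's `re` module and is not the GroupDocs.Parser search API; the function `find_matches` and its parameters are hypothetical names chosen for illustration.

```python
import re


def find_matches(text, keyword, match_case=False, whole_word=False, use_regex=False):
    """Return (start_offset, matched_text) pairs for keyword in text."""
    # Treat the keyword literally unless a regex search was requested
    pattern = keyword if use_regex else re.escape(keyword)
    if whole_word:
        # \b anchors the match at word boundaries on both sides
        pattern = rf"\b(?:{pattern})\b"
    flags = 0 if match_case else re.IGNORECASE
    return [(m.start(), m.group()) for m in re.finditer(pattern, text, flags)]


sample = "Total amount due. The total is 42. Subtotal: 40."

print(find_matches(sample, "total"))                   # ignores case, matches inside "Subtotal"
print(find_matches(sample, "total", match_case=True))  # skips the capitalized "Total"
print(find_matches(sample, "total", whole_word=True))  # skips the match inside "Subtotal"
print(find_matches(sample, r"\d+", use_regex=True))    # keyword is a regular expression
```

The options compose: whole-word matching is layered on top of either a literal or a regex keyword, and case sensitivity is a separate flag.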
Supported Document Formats
GroupDocs.Parser for Python via .NET supports a wide range of document families including:
- Word processing – DOC, DOCX, RTF, TXT, ODT and others
- PDF & markup – PDF, HTML/MHTML, Markdown, XML
- Spreadsheets – XLS, XLSX, ODS, CSV and related formats
- Presentations – PPT, PPTX, ODP and similar formats
- Email & notes – PST, OST, EML, MSG, ONE
- eBooks & web content – EPUB, MOBI, AZW3, CHM, FB2
- Images – JPEG, PNG, TIFF, GIF, BMP, SVG and more
- Archives & containers – ZIP, RAR, 7Z, TAR, GZ, BZ2
See the complete list of supported document formats.
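When processing a folder of mixed files, it can help to know which family a file belongs to before opening it. The sketch below is a plain-Python lookup built from the list above; it does not call the library. The table `FORMAT_FAMILIES` and the function `family_of` are hypothetical names, the aliases `md`, `jpg` and `tif` are assumed common extensions for Markdown, JPEG and TIFF, and formats covered only by "and others" are not included.

```python
from pathlib import Path

# Extensions taken from the list above; "md", "jpg" and "tif" are assumed aliases
FORMAT_FAMILIES = {
    "Word processing": {"doc", "docx", "rtf", "txt", "odt"},
    "PDF & markup": {"pdf", "html", "mhtml", "md", "xml"},
    "Spreadsheets": {"xls", "xlsx", "ods", "csv"},
    "Presentations": {"ppt", "pptx", "odp"},
    "Email & notes": {"pst", "ost", "eml", "msg", "one"},
    "eBooks & web content": {"epub", "mobi", "azw3", "chm", "fb2"},
    "Images": {"jpeg", "jpg", "png", "tiff", "tif", "gif", "bmp", "svg"},
    "Archives & containers": {"zip", "rar", "7z", "tar", "gz", "bz2"},
}


def family_of(filename):
    """Return the format family for a file name, or None if it is not listed."""
    # Only the last suffix is considered, so "backup.tar.gz" is looked up as "gz"
    extension = Path(filename).suffix.lower().lstrip(".")
    for family, extensions in FORMAT_FAMILIES.items():
        if extension in extensions:
            return family
    return None


for name in ["report.DOCX", "invoice.pdf", "backup.tar.gz", "notes.xyz"]:
    print(name, "->", family_of(name))
```

A `None` result only means the extension is missing from this short table, not that the library cannot open the file.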
Feature Support Matrix
GroupDocs.Parser for Python via .NET supports different features across document formats. Here's a quick overview:
| Feature | Word | Excel | PowerPoint | eBooks | ||
|---|---|---|---|---|---|---|
| Text Extraction | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Table Extraction | ✅ | ✅ | ✅ | ✅ | ❌ | ⚠️ |
| Image Extraction | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Hyperlink Extraction | ✅ | ✅ | ❌ | ❌ | ❌ | ⚠️ |
| Barcode Scanning | ✅ | ✅ | ✅ | ✅ | ❌ | ⚠️ |
| Metadata Extraction | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Table of Contents | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
| Template Parsing | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
✅ Fully Supported | ⚠️ Limited Support | ❌ Not Supported
Getting Started
Prerequisites
- Python 3.5+
- Windows, Linux, or macOS
Learn more about system requirements.
Installation
You can install GroupDocs.Parser for Python via .NET from PyPI or download it from the official website.
Install from PyPI
pip install groupdocs-parser-net
Upgrade to the latest version
pip install --upgrade groupdocs-parser-net
Download from the official website
To download the GroupDocs.Parser package for your operating system, please visit the official GroupDocs Releases website and choose the appropriate package based on your system's architecture.
Learn more about installation.
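After installing, you may want to confirm that the package is visible to the interpreter you are actually running. A minimal sketch using the standard library's `importlib.metadata` follows; note that this module requires Python 3.8 or later, which is newer than the minimum listed under Prerequisites. The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution name as used in the pip commands above
found = installed_version("groupdocs-parser-net")
if found is None:
    print("groupdocs-parser-net is not installed in this environment")
else:
    print(f"groupdocs-parser-net {found} is installed")
```

Running this with the same `python` executable that will run your scripts helps catch cases where `pip` installed into a different environment.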
Use Cases
Beyond basic text extraction, the examples below cover other common tasks: searching for text, extracting images and reading metadata.
📁 Code Examples: For complete, runnable examples with sample files, check out the GroupDocs.Parser for Python via .NET - Code Examples repository. See how to run code examples for more details.
Search Text in a Document
This example shows how to search for a specific phrase in a PDF document and print where it was found.
from groupdocs.parser import Parser
# Load the PDF document
with Parser("sample.pdf") as parser:
    # Search for a phrase in the document
    for area in parser.Search("Total Amount"):
        # Print page index and rectangle where phrase was found
        print(f"Page {area.PageIndex}, Rectangle: {area.Rectangle}")
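If a phrase occurs many times, grouping the hits by page can be easier to read than a flat list. The sketch below assumes only what the example above shows, namely that each search result exposes `PageIndex` and `Rectangle` attributes. The helper `group_by_page` is a hypothetical name, and the demo uses `SimpleNamespace` stand-ins instead of real search results so it runs without the library or a sample file.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_page(areas):
    """Group search results into {page_index: [rectangle, ...]}, sorted by page."""
    pages = defaultdict(list)
    for area in areas:
        # PageIndex and Rectangle are the attribute names used in the example above
        pages[area.PageIndex].append(area.Rectangle)
    return dict(sorted(pages.items()))


# Stand-in results for demonstration; real code would pass parser.Search(...)
demo_areas = [
    SimpleNamespace(PageIndex=2, Rectangle="rect-a"),
    SimpleNamespace(PageIndex=0, Rectangle="rect-b"),
    SimpleNamespace(PageIndex=2, Rectangle="rect-c"),
]

for page, rectangles in group_by_page(demo_areas).items():
    print(f"Page {page}: {len(rectangles)} match(es) -> {rectangles}")
```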
Extract Document Images
This example shows how to iterate over images embedded in a Word document and save them to disk.
from groupdocs.parser import Parser
# Load the Word document
with Parser("sample.docx") as parser:
    # Get images from the document
    images = parser.GetImages()
    # Save each image to a PNG file, numbering the files from 1
    for index, image in enumerate(images, start=1):
        image.Save(f"image_{index}.png")
Extract Document Metadata
This example shows how to read basic metadata such as author, creation date and other properties from a document.
from groupdocs.parser import Parser
# Load the document
with Parser("sample.pdf") as parser:
    # Get document metadata
    metadata = parser.GetMetadata()
    # Print all metadata items
    for item in metadata:
        print(f"{item.Name}: {item.Value}")
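For further processing it is often handier to hold metadata in a dictionary than to print it. The following sketch relies only on the `Name` and `Value` attributes used in the example above. The helper `metadata_to_dict` is a hypothetical name, and the demo items are `SimpleNamespace` stand-ins with made-up values, not output from a real document.

```python
from types import SimpleNamespace


def metadata_to_dict(items):
    """Collect metadata items into a dict keyed by item name.

    If two items share a name, the later one overwrites the earlier one.
    """
    return {item.Name: item.Value for item in items}


# Stand-in items for demonstration; real code would pass parser.GetMetadata()
demo_items = [
    SimpleNamespace(Name="author", Value="Jane Doe"),
    SimpleNamespace(Name="title", Value="Quarterly Report"),
]

properties = metadata_to_dict(demo_items)
# .get() avoids a KeyError when a document lacks a given property
print(properties.get("author", "unknown"))
print(properties.get("subject", "unknown"))
```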
Licensing
For testing without trial limitations, you can request a 30-day Temporary License:
- Visit the Get a Temporary License page
- Follow the instructions to request your temporary license
- Copy the license file into your project and apply it as shown in the code example below
import os
from groupdocs.parser import License
# Get absolute path to license file
license_path = os.path.abspath("./GroupDocs.Parser.lic")
# Instantiate License and set the license
license = License()
license.set_license(license_path)
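A missing license file is easy to overlook, so you may prefer to check for it before applying the license. This is a minimal standard-library sketch; the helper `find_license` is a hypothetical name, and the default path simply mirrors the one in the example above.

```python
import os


def find_license(path="./GroupDocs.Parser.lic"):
    """Return the absolute path of the license file, or None if it does not exist."""
    full_path = os.path.abspath(path)
    return full_path if os.path.isfile(full_path) else None


license_path = find_license()
if license_path is None:
    print("License file not found; the library would run with trial limitations")
else:
    # Apply it as in the example above:
    #   License().set_license(license_path)
    print(f"License file found at {license_path}")
```

Failing early with a clear message makes it obvious whether trial limitations come from a missing file or from some other cause.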
This product is licensed under the GroupDocs End User License Agreement (EULA). For pricing information, visit the GroupDocs.Parser for Python via .NET pricing page.
Support
GroupDocs provides unlimited free technical support for all of its products. Support is available to all users, including those evaluating the product, and is provided through the Free Support Forum, the Paid Support Helpdesk and Paid Consulting.
Free Support Forum
The GroupDocs Free Support Forum is available to all users and provides:
- Direct access to the GroupDocs development team
- Community-driven support and knowledge sharing
- No time limitations on support requests
- Access to historical solutions and discussions
Paid Support Helpdesk
The Paid Support Helpdesk offers:
- Higher priority response times
- Dedicated support team
- Extended support hours
- Priority issue resolution
Paid Consulting
We can work with you on your project and develop part of an application or a complete one. If you need new features in an existing GroupDocs product, or an API for new file formats, send us a request at consulting.groupdocs.com/contact.
Additional Resources
- Documentation – Complete API documentation and guides
- API Reference – Detailed API reference documentation
- Live Demos – Interactive online demos
- Code Examples – Working code examples with sample files
- Blog – Latest updates and tutorials