GroupDocs.Parser for Python via .NET is a powerful API designed for advanced document parsing, offering extensive features like text extraction, metadata retrieval, and image extraction across various document formats, including PDFs, Word, Excel, and PowerPoint.
Advanced Document Parsing API for Python via .NET
Product Page | Docs | Demos | API Reference | Blog | Search | Free Support | Temporary License
GroupDocs.Parser for Python via .NET is an on-premise document parsing library that lets you extract text, metadata, images, attachments, barcodes and structured content from dozens of popular formats, including PDF, Word, Excel, PowerPoint, emails, archives and images.
You can embed GroupDocs.Parser into your own Python applications without installing any 3rd-party office suites. GroupDocs also provides free online apps built on top of the same APIs that allow users to parse PDF, Office and other documents right in the browser.
Document Parser API Features
GroupDocs.Parser for Python via .NET provides a single, unified API for advanced document parsing and data extraction:
- Text extraction
  - Extract text from PDF, Word, Excel, PowerPoint, e-books, emails and many other formats.
  - Work in accurate or raw text modes depending on your scenario.
  - Keep track of pages and logical blocks of text.
- Preserve structure & formatting
  - Retrieve formatted text with font styles, sizes and basic layout information.
  - Analyze document structure: paragraphs, lists, headings, table cells, etc.
- Text search
  - Search for specific words or phrases in documents.
  - Use advanced search options such as case sensitivity, whole-word matching or regular expressions.
- OCR text extraction
  - Extract text from scanned PDFs and raster images using OCR options.
  - Combine OCR with spell-checking in supported environments for better recognition quality.
- Metadata extraction
  - Read common metadata properties like author, title, subject and keywords.
  - Extract creation / modification dates and other technical properties.
  - Retrieve custom fields such as invoice numbers or business IDs.
- Image & attachment extraction
  - Extract embedded images from Office documents, PDFs, e-books and more.
  - Pull file attachments from PDFs and email messages.
  - Extract barcodes from supported document and image formats.
- Document structure analysis
  - Parse tables, including rows, columns and individual cells.
  - Detect text areas and content blocks for fine-grained extraction.
  - Extract hyperlinks, bookmarks and table of contents (TOC) where supported.
- PDF-specific parsing
  - Extract text, images, metadata and attachments from PDFs.
  - Get PDF page count and PDF-specific document information.
  - Work with bookmarks, forms and PDF portfolios.
- Email parsing
  - Extract sender, recipients, subject and body from emails.
  - Get email metadata and embedded attachments.
  - Work with formats like MSG, EML, EMLX, PST and OST.
- Spreadsheet parsing
  - Extract text and data from Excel and other spreadsheet formats.
  - Work with specific sheets, ranges or individual cells.
  - Extract spreadsheet metadata and images.
- Presentation parsing
  - Extract text, notes, images and metadata from PowerPoint files.
  - Work with slide-by-slide content, including shapes and notes.
- Template-based data extraction
  - Define parsing templates to extract structured fields (e.g. invoices, receipts).
  - Use templates to describe positions of fields, tables and patterns.
  - Apply your own parsing rules for domain-specific scenarios.
- Advanced & batch features
  - High-performance processing for large documents and document batches.
  - Cross-platform support (Windows, Linux, macOS) via .NET runtime.
  - Build scalable, secure parsing workflows in your Python applications.
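The text search options listed above (case sensitivity, whole-word matching, regular expressions) can be illustrated on text that has already been extracted. The sketch below uses only Python's standard `re` module and does not call the GroupDocs.Parser search API; the `find_matches` helper and its parameter names are hypothetical and exist only for this illustration.

```python
import re


def find_matches(text, query, *, case_sensitive=False, whole_word=False, use_regex=False):
    """Return (start, matched_text) pairs for every occurrence of query in text."""
    # Treat the query literally unless the caller asked for a regular expression.
    pattern = query if use_regex else re.escape(query)
    if whole_word:
        # \b anchors the match to word boundaries on both sides.
        pattern = rf"\b(?:{pattern})\b"
    flags = 0 if case_sensitive else re.IGNORECASE
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, text, flags)]


sample = "Invoice total: 120 USD. The invoiced amount matches Invoice INV-0042."

print(find_matches(sample, "invoice"))                    # substring, any case
print(find_matches(sample, "invoice", whole_word=True))   # skips "invoiced"
print(find_matches(sample, r"INV-\d+", use_regex=True))   # pattern search
```

Keeping the three options as independent keyword arguments makes it easy to see how each one narrows or widens the result set.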
Supported Document Formats
GroupDocs.Parser for Python via .NET supports a wide range of document families. Below is an overview of the most important ones.
Word Processing
- DOC, DOT – Microsoft Word binary documents & templates
- DOCX, DOCM, DOTX, DOTM – Office Open XML documents & templates
- RTF – Rich Text Format
- TXT – Plain text
- ODT, OTT – OpenDocument text documents & templates
Typical operations: text extraction (accurate & raw), structured text parsing, text areas, metadata, images, attachments, TOC, barcode scanning.
PDF
- PDF – Portable Document Format
Operations: template-based parsing, accurate & raw text extraction, text areas, metadata, images, attachments/containers, forms, TOC, barcode scanning.
Markup
- XHTML – Extensible Hypertext Markup Language
- MHTML – MIME HTML
- MD – Markdown
- XML – XML files
Operations: text extraction (including formatted text for supported types) and metadata extraction.
eBook
- CHM – Compiled HTML Help
- EPUB – Digital e-book format
- FB2 – FictionBook 2.0
- MOBI, AZW3 – Mobile/Kindle formats
Operations: text extraction, structured text, metadata, containers, TOC support for selected formats, barcode scanning for supported types.
Spreadsheets
- XLS, XLT, XLSX, XLSM, XLSB
- XLTX, XLTM
- ODS, OTS – OpenDocument spreadsheets
- CSV – Comma-Separated Values
- XLA, XLAM – add-ins
- NUMBERS – Apple iWork Numbers
Operations: text & data extraction, structured content, text areas, metadata, images, containers/attachments.
Presentations
- PPT, PPS, POT – binary PowerPoint
- PPTX, PPTM, PPSX, PPSM, POTX, POTM – Office Open XML
- ODP, OTP – OpenDocument presentations
Operations: slide text and notes, structured text, text areas, metadata, images, attachments, TOC, barcode scanning.
Email
- PST, OST – Outlook data files
- EML, EMLX, MSG – email messages
Operations: email body text, metadata (from/to/subject), attachments, images and containers.
Notes
- ONE – Microsoft OneNote documents
Operations: text extraction and basic metadata support.
Archives
- 7Z, ZIP, RAR, TAR, GZ, BZ2
Operations: work with containers – extract inner documents and attachments, including images.
Encrypted 7Z archives are not supported.
Images
- BMP, GIF, JP2, JPG/JPEG, PNG, TIF/TIFF
- DICOM, DJVU, EMF, J2K, PS, PSD, SVG, SVGZ, WEBP, WMF
Operations: text extraction (for some formats via OCR), metadata, barcode scanning (where supported).
Databases
- ADO.NET-based data sources and supported database formats
Operations: text and structured data extraction using database-specific options.
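When routing a mixed batch of files, it can help to map a file extension to one of the format families listed above before deciding which operations to attempt. The following is a minimal sketch under stated assumptions: the table holds only a subset of the extensions from this section, the family labels and the `format_family` helper are hypothetical names chosen for the illustration, and nothing here is part of the GroupDocs.Parser API.

```python
from pathlib import Path

# Subset of the extensions listed in this section, grouped by family.
# The family labels are illustrative names, not library constants.
FORMAT_FAMILIES = {
    "Word Processing": {"doc", "dot", "docx", "docm", "dotx", "dotm", "rtf", "txt", "odt", "ott"},
    "PDF": {"pdf"},
    "Spreadsheets": {"xls", "xlsx", "xlsm", "xlsb", "ods", "csv"},
    "Presentations": {"ppt", "pptx", "odp"},
    "Email": {"pst", "ost", "eml", "emlx", "msg"},
    "Archives": {"7z", "zip", "rar", "tar", "gz", "bz2"},
}


def format_family(filename):
    """Return the family label for a file name, or None if the extension is not in the table."""
    # Only the last suffix is considered, so "backup.tar.gz" is looked up as "gz".
    ext = Path(filename).suffix.lower().lstrip(".")
    for family, extensions in FORMAT_FAMILIES.items():
        if ext in extensions:
            return family
    return None


for name in ["report.PDF", "mail.msg", "backup.tar.gz", "photo.heic"]:
    print(name, "->", format_family(name))
```

A `None` result only means the extension is missing from this abbreviated table; consult the full lists above before treating a file as unsupported.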
Platform Independence
GroupDocs.Parser for Python via .NET can be used to build 32-bit and 64-bit applications for different operating systems, such as Windows, Linux and macOS, where a supported Python 3.x version is installed.
The parsing engine is powered by the same core technology as the GroupDocs.Parser .NET library, giving you production-ready performance and compatibility in Python environments.
Get Started
Ready to try GroupDocs.Parser for Python via .NET?
You can install the Python package from PyPI and reference it in your project. The exact package name and version may depend on the final distribution, but the flow will be similar to other GroupDocs Python via .NET libraries:
Install GroupDocs.Parser for Python via .NET from PyPI
pip install groupdocs.parser-net
Upgrade to the latest version
pip install --upgrade groupdocs.parser-net
Or
Download Package from Official Website
To download the GroupDocs.Parser package for your operating system, please visit the official GroupDocs Releases website. Currently, the following OS-specific packages are available:
- Windows 64-bit: Package name ends with amd64.whl
- Windows 32-bit: Package name ends with win32.whl
- Linux 64-bit: Package name ends with linux1_x86_64.whl
- macOS (Intel): Package name ends with macosx_10_14_x86_64.whl
- macOS (Apple Silicon): Package name ends with macosx_11_0_arm64.whl
Choose the appropriate package based on your system's architecture.
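If you are unsure which package matches your interpreter, the operating system, CPU architecture and pointer width reported by Python are enough to decide. The sketch below is one possible way to do that with the standard library; the `wheel_suffix` helper is hypothetical, and the suffixes it returns are the ones quoted on this page (the Apple Silicon suffix is taken from the file names listed under the download files).

```python
import platform
import struct

X86_64_NAMES = {"x86_64", "amd64"}


def wheel_suffix(system, machine, bits):
    """Map an OS name, CPU architecture and pointer width to a package file-name suffix."""
    machine = machine.lower()
    if system == "Windows":
        if bits == 32:
            return "win32.whl"
        if machine in X86_64_NAMES:
            return "amd64.whl"
    if system == "Linux" and bits == 64 and machine in X86_64_NAMES:
        return "linux1_x86_64.whl"
    if system == "Darwin":
        if machine == "x86_64":
            return "macosx_10_14_x86_64.whl"
        if machine == "arm64":
            return "macosx_11_0_arm64.whl"
    # No package on this page matches the combination.
    return None


# The size of a C pointer tells whether the running interpreter is 32- or 64-bit.
bits = struct.calcsize("P") * 8
print(wheel_suffix(platform.system(), platform.machine(), bits))
```

The check uses the interpreter's bitness rather than the operating system's, because a 32-bit Python on 64-bit Windows still needs the 32-bit package.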
Quick Text Extraction Example
The snippet below demonstrates how a typical usage scenario for extracting text from a PDF document might look in Python.
import groupdocs.parser as gp

def run():
    # Load the PDF document
    with gp.Parser("sample.pdf") as parser:
        # Extract text from the document
        text = parser.GetText()
        # Output the extracted text
        print(text)
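The same extraction step can be applied to a batch of documents by looping over the paths and collecting failures separately, so that one unreadable file does not stop the run. This is a minimal sketch: `extract_many` is a hypothetical helper, and the extractor is passed in as a plain callable so the example runs without the library installed. In real use the callable would wrap the parser call shown above.

```python
from pathlib import Path


def extract_many(paths, extract_text):
    """Run extract_text on each path; return (results, errors) keyed by file name."""
    results, errors = {}, {}
    for path in map(Path, paths):
        try:
            results[path.name] = extract_text(str(path))
        except Exception as exc:
            # Record the failure and keep going with the remaining files.
            errors[path.name] = str(exc)
    return results, errors


def fake_extract(path):
    """Stand-in extractor used only to make this sketch runnable."""
    if path.endswith(".bad"):
        raise ValueError("unsupported format")
    return f"text of {path}"


results, errors = extract_many(["a.pdf", "b.docx", "c.bad"], fake_extract)
print(results)
print(errors)
```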
Extract Images from a Word Document
This example shows how to iterate over images embedded in a Word document and save them to disk.
import groupdocs.parser as gp

def run():
    # Load the Word document
    with gp.Parser("sample.docx") as parser:
        # Get images from the document
        images = parser.GetImages()
        # Save each image to a PNG file, numbering from 1
        for index, image in enumerate(images, start=1):
            image.Save(f"image{index}.png")
GroupDocs.Parser for Python via .NET is intended for use from the Python programming language. For Node.js, Java and .NET, we recommend GroupDocs.Parser for Node.js, GroupDocs.Parser for Java and GroupDocs.Parser for .NET, respectively.
Download files
File details
Details for the file groupdocs_parser_net-25.12-py3-none-win_amd64.whl.
File metadata
- Download URL: groupdocs_parser_net-25.12-py3-none-win_amd64.whl
- Upload date:
- Size: 218.6 MB
- Tags: Python 3, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d74a780079b4cbc27df1b694903f4801e9acf305ed3e4f3b79a8d002ec1e2f38 |
| MD5 | 815fa7baa92d978a9222ab7ad02f8833 |
| BLAKE2b-256 | a9fe9f06dabeab3bf7bbb50ae86001b229afd44fbb9185d6326c7625049eccb1 |
File details
Details for the file groupdocs_parser_net-25.12-py3-none-win32.whl.
File metadata
- Download URL: groupdocs_parser_net-25.12-py3-none-win32.whl
- Upload date:
- Size: 213.5 MB
- Tags: Python 3, Windows x86
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c13aa5d48266684e238da69e81e5f92fc13b3178176c824b8750f8c84702a6d8 |
| MD5 | 74891db30797f4f456be2d01075f3993 |
| BLAKE2b-256 | 5adb8db77dd1c7c18a0cf755f6f7a02b6e9e13131aa6aaffe14bdfa0a6314db5 |
File details
Details for the file groupdocs_parser_net-25.12-py3-none-manylinux1_x86_64.whl.
File metadata
- Download URL: groupdocs_parser_net-25.12-py3-none-manylinux1_x86_64.whl
- Upload date:
- Size: 231.5 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d4d5bcef5ab6ffdb651c3fbe1ba390be544af302de36b49ebc623bf987af5072 |
| MD5 | c82e5b9030e88045b2c0ba2063f9503d |
| BLAKE2b-256 | 15beba6dd4207021ff67363d4e7e07d02c28a34ad5dad024d51ab9996271df2a |
File details
Details for the file groupdocs_parser_net-25.12-py3-none-macosx_11_0_arm64.whl.
File metadata
- Download URL: groupdocs_parser_net-25.12-py3-none-macosx_11_0_arm64.whl
- Upload date:
- Size: 228.8 MB
- Tags: Python 3, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6eb970b01500df654ba98d88157f236241bb4c6fc5cc4392d59f2a7111e45283 |
| MD5 | 822f6b55a420a39cd2e30e39e3aa0fea |
| BLAKE2b-256 | e59a940c05f539d9ed4bc07c6c32818e6584cb675f5acb184217b444c251886c |
File details
Details for the file groupdocs_parser_net-25.12-py3-none-macosx_10_14_x86_64.whl.
File metadata
- Download URL: groupdocs_parser_net-25.12-py3-none-macosx_10_14_x86_64.whl
- Upload date:
- Size: 234.5 MB
- Tags: Python 3, macOS 10.14+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 388c40cbb61aebbdb0657ec6db6dc0bc1a69f178467a0288ddef731f757e5e46 |
| MD5 | 17934f649c9c1ba530feef93ad23d5af |
| BLAKE2b-256 | 31a31a7683c41ee42bda4ae7439ff83d4f7435645eeeaffd4e7327e4646231d5 |
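The SHA256 digests published for each file can be used to check a manually downloaded package before installing it. The sketch below uses only Python's standard `hashlib` module; the helper names `sha256_of` and `matches_published_digest` are hypothetical, and the file is read in chunks because the packages are larger than 200 MB.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 with a published digest, ignoring case and surrounding whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("groupdocs_parser_net-25.12-py3-none-win_amd64.whl", "<SHA256 value from the file's table>")` should return `True` for an intact download of that file.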