# sharepoint-to-text

Text extraction library for typical file formats found in SharePoint repositories.
A pure-Python library for extracting text and structured content from files commonly found in SharePoint ecosystems:
- Microsoft Office (modern and legacy)
- OpenDocument
- Email formats
- Plain text and config formats
- HTML/EPUB/MHTML
- Archives containing supported files
It also includes an optional SharePoint Graph client (sharepoint_io) for listing/downloading files before extraction.
## Table of Contents
- Why Use This Library
- Install
- Quick Start
- Core Interface
- CLI
- Optional SharePoint Integration
- Supported Formats
- Archive Processing and Security
- Limitations and Caveats
- API Cheat Sheet
- Exceptions
- License
- Disclaimer
- More Usage Examples
## Why Use This Library
- Pure Python (no Java runtime, no LibreOffice subprocesses)
- Unified extraction interface across many file types
- Works with file paths and in-memory bytes
- Suitable for RAG/indexing pipelines where chunking and metadata matter
- Handles both modern and legacy Office formats in one API
## Install

```shell
uv add sharepoint-to-text
```

Optional PDF crypto acceleration:

```shell
uv add "sharepoint-to-text[pdf-crypto]"
```

From source:

```shell
git clone https://github.com/Horsmann/sharepoint-to-text.git
cd sharepoint-to-text
uv sync --all-groups
```
## Quick Start

### 1) Read any supported local file

```python
import sharepoint2text

result = next(sharepoint2text.read_file("document.docx"))
print(result.get_full_text())
```

`read_file(...)` returns a generator. Most files produce one result, but archives and `.mbox` can produce multiple.
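That generator behavior matters for multi-result inputs: `next(...)` consumes only the first result. A stand-in sketch (`fake_read_file` is hypothetical, not part of the library) illustrates why looping or `list(...)` is the safer pattern for archives and mailboxes:

```python
# "fake_read_file" is a hypothetical stand-in for sharepoint2text.read_file
# on an archive: a generator that yields one extraction result per
# contained file.
def fake_read_file(path):
    yield f"{path}: first document"
    yield f"{path}: second document"

first_only = next(fake_read_file("bundle.zip"))   # consumes only the first result
all_results = list(fake_read_file("bundle.zip"))  # archives/.mbox may yield several
print(first_only)
print(len(all_results))  # 2
```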
### 2) Read bytes already in memory

```python
import sharepoint2text

payload = b"hello from memory"
result = next(sharepoint2text.read_bytes(payload, extension="txt"))
print(result.get_full_text())
```
### 3) Choose chunking strategy

```python
import sharepoint2text

result = next(sharepoint2text.read_file("report.pdf"))

# Single text blob
full_text = result.get_full_text()

# Structured chunks (page/slide/sheet depending on format)
for unit in result.iterate_units():
    print(unit.get_text())
    print(unit.get_metadata())
```
### 4) Serialize results

```python
import json

import sharepoint2text

result = next(sharepoint2text.read_file("document.docx"))
print(json.dumps(result.to_json()))
```

Restore from JSON:

```python
from sharepoint2text.parsing.extractors.data_types import ExtractionInterface

restored = ExtractionInterface.from_json(result.to_json())
```
## Core Interface

All extracted results implement a common interface (`ExtractionInterface`):

- `get_full_text()`
- `iterate_units()`
- `iterate_images()`
- `iterate_tables()`
- `get_metadata()`
- `to_json()` / `from_json(...)`

Use this interface when you want one pipeline that works across formats.
### Which text method should you use?

| Goal | Method |
|---|---|
| One string per document | `get_full_text()` |
| Chunk by structure (RAG/citations) | `iterate_units()` |
| All images in a file | `iterate_images()` |
| All tables in a file | `iterate_tables()` |
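The value of the shared interface is that downstream code never needs to branch on file type. A minimal sketch with hypothetical stand-in classes (`FakeUnit`/`FakeResult` merely mimic the `ExtractionInterface` method names; they are not library classes):

```python
# Hypothetical stand-ins that mimic the ExtractionInterface method names,
# to illustrate that one pipeline can stay format-agnostic.
class FakeUnit:
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class FakeResult:
    def __init__(self, texts):
        self._units = [FakeUnit(t) for t in texts]

    def get_full_text(self):
        return "\n".join(u.get_text() for u in self._units)

    def iterate_units(self):
        yield from self._units


def to_chunks(result):
    # Works for any result exposing iterate_units(), regardless of format.
    return [u.get_text() for u in result.iterate_units() if u.get_text().strip()]


doc = FakeResult(["page one", "", "page two"])
print(to_chunks(doc))  # ['page one', 'page two']
```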
### What `iterate_units()` means by format

| Format family | Units yielded |
|---|---|
| Word / text docs (.docx, .doc, .odt, plain text, config files) | Usually one unit |
| Spreadsheets (.xlsx, .xls, .ods) | One unit per sheet |
| Presentations (.pptx, .ppt, .odp) | One unit per slide |
| PDF (.pdf) | One unit per page |
| Email (.eml, .msg) | One unit per email |
| Mailbox (.mbox) | Multiple extraction results (one per email) |

Notes:

- Word formats do not store reliable page boundaries, so units are document-level.
- `iterate_units(ignore_images=True)` skips image payloads in unit objects for better performance.
## CLI

After installation, a `sharepoint2text` command is available.

Plain text output:

```shell
sharepoint2text --file /path/to/file.docx > extraction.txt
```

JSON output:

```shell
sharepoint2text --file /path/to/file.docx --json > extraction.json
```

### Options

| Option | Description |
|---|---|
| `--file FILE`, `-f FILE` | Required input file |
| `--output FILE`, `-o FILE` | Write output to file (default: stdout) |
| `--json`, `-j` | Emit `list[extraction_object]` |
| `--json-unit`, `-u` | Emit `list[unit_object]` |
| `--include-images`, `-i` | Include binary image payloads as base64 in JSON output |
| `--no-attachments`, `-n` | Exclude email attachments from CLI extraction output |
| `--max-file-size-mb`, `-m` | Maximum input size in MiB (default: 100, use 0 to disable) |
| `--version`, `-v` | Print CLI version |

Rules:

- `--json` and `--json-unit` are mutually exclusive.
- `--include-images` requires `--json` or `--json-unit`.
- The CLI enforces a configurable input file size limit (default 100 MiB; override with `--max-file-size-mb` / `-m`).
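Rules like these are straightforward to enforce with `argparse`. The sketch below is illustrative only (it is not the project's actual CLI code), showing how the mutual-exclusion and dependency rules above could be expressed:

```python
import argparse

# Sketch of how the CLI rules above could be enforced with argparse.
# Illustrative only -- not the project's actual CLI implementation.
parser = argparse.ArgumentParser(prog="sharepoint2text")
parser.add_argument("--file", "-f", required=True)
mode = parser.add_mutually_exclusive_group()  # --json / --json-unit exclusivity
mode.add_argument("--json", "-j", action="store_true")
mode.add_argument("--json-unit", "-u", action="store_true")
parser.add_argument("--include-images", "-i", action="store_true")

args = parser.parse_args(["-f", "report.docx", "--json"])
# --include-images only makes sense when a JSON mode is selected:
if args.include_images and not (args.json or args.json_unit):
    parser.error("--include-images requires --json or --json-unit")
print(args.json)  # True
```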
## Optional SharePoint Integration

`sharepoint_io` is optional. It helps list and download files from SharePoint, while extraction still runs through `sharepoint2text`.

```python
import io

import sharepoint2text
from sharepoint2text.sharepoint_io import (
    EntraIDAppCredentials,
    SharePointRestClient,
)

credentials = EntraIDAppCredentials(
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    client_secret="your-client-secret",
)
client = SharePointRestClient(
    site_url="https://contoso.sharepoint.com/sites/Documents",
    credentials=credentials,
)

for file_meta in client.list_all_files():
    data = client.download_file(file_meta.id)
    extractor = sharepoint2text.get_extractor(file_meta.name)
    for result in extractor(io.BytesIO(data), path=file_meta.name):
        print(result.get_full_text()[:200])
```

Setup details: `sharepoint2text/sharepoint_io/SETUP.md`
## Supported Formats

### Microsoft Office

- Modern: `.docx`, `.docm`, `.xlsx`, `.xlsm`, `.xlsb`, `.pptx`, `.pptm`
- Legacy: `.doc`, `.dot`, `.xls`, `.xlt`, `.ppt`, `.pot`, `.pps`, `.rtf`
- Template/show aliases are auto-mapped (for example `.dotx` -> `.docx`, `.ppsx` -> `.pptx`)

### OpenDocument

- `.odt`, `.ods`, `.odp`, `.odg`, `.odf`
- Template aliases supported (`.ott`, `.ots`, `.otp`)

### Email

- `.eml`, `.msg`, `.mbox`
- Email extraction includes sender/recipient metadata, subject, and body (`body_plain` / `body_html`).
- `.eml` and `.msg` parse attachments and store them on `EmailContent.attachments`.
- `.mbox` extraction currently focuses on message headers/body and does not parse/store attachments.
- Parsed supported attachments can be extracted via `EmailContent.iterate_supported_attachments()`.
- If supported-attachment extraction fails, the default behavior is to raise; use `skip_failed=True` to continue.

### Plain text and config/data

- `.txt`, `.md`, `.csv`, `.tsv`, `.json`
- `.yaml`, `.yml`, `.xml`, `.log`, `.ini`, `.cfg`, `.conf`, `.properties`

### Web and ebook

- `.html`, `.htm`, `.mhtml`, `.mht`, `.epub`

### PDF

- `.pdf`

### Archives

- `.zip`, `.tar`, `.7z`
- Compressed tar aliases: `.tar.gz` / `.tgz`, `.tar.bz2` / `.tbz2`, `.tar.xz` / `.txz`
- `.gz`, `.bz2`, `.xz` are routed as compressed tar variants
## Archive Processing and Security

Archives are processed one level deep. Supported non-archive files inside an archive can yield extraction results. Nested archives are intentionally skipped as a safety guard.

Built-in safeguards include zip-bomb protections and file size limits. For 7z, extraction is limited to archives of 100 MB. Archive entries may also be skipped when they exceed internal per-entry size limits or fail extraction.
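To make the zip-bomb safeguard concrete, one common heuristic is to compare an archive's declared uncompressed size against its compressed size and reject suspicious ratios. The sketch below illustrates that idea with the standard library; the ratio limit is hypothetical and this is not the library's actual implementation:

```python
import io
import zipfile

# Heuristic sketch: flag archives whose declared uncompressed size vastly
# exceeds their compressed size. The ratio limit of 100 is hypothetical.
def looks_like_zip_bomb(data: bytes, max_ratio: int = 100) -> bool:
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()
        compressed = sum(i.compress_size for i in infos) or 1
        uncompressed = sum(i.file_size for i in infos)
        return uncompressed / compressed > max_ratio

# An archive holding 10 MB of zeros compresses extremely well,
# so its decompression ratio is far above the limit.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("zeros.bin", b"\x00" * 10_000_000)

print(looks_like_zip_bomb(buf.getvalue()))  # True
```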
## Limitations and Caveats

### PDF

- No OCR. Scanned-image PDFs may return empty text.
- Structured table extraction is not implemented for PDF (`iterate_tables()` is empty).
- Password-protected PDFs (non-empty password) raise `ExtractionFileEncryptedError`.
- Some JBIG2 images need `jbig2dec` installed for image decoding.

### General

- Inputs are expected to be already decrypted. If a file has encryption, DRM, password protection, or similar security controls, remove/unlock those before calling `sharepoint2text`.
- Very large or highly compressed files may hit protection limits.
- Raise limits only for trusted inputs.
## API Cheat Sheet

### Main entry points

```python
import sharepoint2text

sharepoint2text.read_file(
    path,
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
)

sharepoint2text.read_bytes(
    data,
    extension="pdf",  # or ".pdf"
    mime_type=None,  # e.g. "application/pdf"
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
)

sharepoint2text.is_supported_file(path)
sharepoint2text.get_extractor(path)
```

### Format-specific extractors (selected)

- Office/OpenDocument: `read_docx`, `read_doc`, `read_xlsx`, `read_xls`, `read_pptx`, `read_ppt`, `read_rtf`, `read_odt`, `read_ods`, `read_odp`, `read_odg`, `read_odf`
- Other documents: `read_pdf`, `read_html`, `read_epub`, `read_mhtml`, `read_plain_text`
- Email: `read_eml_email`, `read_msg_email`, `read_mbox_email`

All extractor functions accept a binary stream plus an optional path and return generators.

Email helper API:

- `EmailContent.iterate_supported_attachments(skip_failed=False)` extracts supported parsed attachments on demand (primarily from `.eml` / `.msg`).
## Exceptions

Common exceptions:

- `ExtractionFileFormatNotSupportedError`
- `ExtractionFileEncryptedError`
- `ExtractionFileTooLargeError`
- `ExtractionLegacyMicrosoftParsingError`
- `ExtractionZipBombError`
- `ExtractionFailedError`
## License

Apache 2.0. See LICENSE.

## Disclaimer

This project is not affiliated with, endorsed by, or sponsored by Microsoft.
## More Usage Examples

### Extract email body plus supported attachments

```python
import sharepoint2text

email = next(sharepoint2text.read_file("message-with-attachments.eml"))
print(email.subject)
print(email.get_full_text())  # plain body if available, otherwise HTML body
print(f"Attachment count: {len(email.attachments)}")

# Extract supported attachment types (pdf, docx, pptx, etc.)
for attachment_result in email.iterate_supported_attachments():
    print(type(attachment_result).__name__)
    print(attachment_result.get_full_text()[:200])
```
### Continue even if a supported attachment fails to extract

```python
import sharepoint2text

email = next(sharepoint2text.read_file("message-with-attachments.msg"))
for attachment_result in email.iterate_supported_attachments(skip_failed=True):
    print(attachment_result.get_metadata().filename)
```
### Process a mailbox (.mbox) and read message bodies

```python
import sharepoint2text

for email in sharepoint2text.read_file("team-archive.mbox"):
    print(f"Subject: {email.subject}")
    print(email.get_full_text()[:200])
```
### Batch-extract units for RAG-style chunking

```python
from pathlib import Path

import sharepoint2text

for path in Path("docs").rglob("*"):
    if not path.is_file() or not sharepoint2text.is_supported_file(path):
        continue
    for result in sharepoint2text.read_file(path):
        meta = result.get_metadata()
        for unit in result.iterate_units(ignore_images=True):
            chunk = unit.get_text().strip()
            if chunk:
                payload = {
                    "text": chunk,
                    "source": str(path),
                    "filename": meta.filename,
                    "unit_number": getattr(unit.get_metadata(), "unit_number", None),
                }
                # store payload in your index/vector DB
```
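One lightweight sink for such payloads is a JSONL file (one JSON object per line), a common staging format before loading a vector database. A self-contained sketch with dummy payloads (`write_jsonl` and the sample data are hypothetical, for illustration):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sink: write one JSON object per line (JSONL).
def write_jsonl(payloads, out_path):
    with open(out_path, "w", encoding="utf-8") as fh:
        for payload in payloads:
            fh.write(json.dumps(payload, ensure_ascii=False) + "\n")

# Dummy payloads shaped like the ones built in the loop above.
payloads = [
    {"text": "chunk one", "source": "docs/a.docx", "filename": "a.docx", "unit_number": 1},
    {"text": "chunk two", "source": "docs/a.docx", "filename": "a.docx", "unit_number": 2},
]
out = Path(tempfile.mkdtemp()) / "chunks.jsonl"
write_jsonl(payloads, out)
print(len(out.read_text(encoding="utf-8").splitlines()))  # 2
```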
### Extract from API bytes when you only know the MIME type

```python
import sharepoint2text

# Example: bytes from an HTTP response
data = get_file_bytes_somehow()
result = next(
    sharepoint2text.read_bytes(
        data,
        mime_type="application/pdf",
        ignore_images=True,
    )
)
print(result.get_full_text()[:500])
```