Text extraction library for typical file formats found in SharePoint repositories

sharepoint-to-text

A pure-Python library for extracting text and structured content from files commonly found in SharePoint ecosystems:

  • Microsoft Office (modern and legacy)
  • OpenDocument
  • PDF
  • Email formats
  • Plain text and config formats
  • HTML/EPUB/MHTML
  • Archives containing supported files

It also includes an optional SharePoint Graph client (sharepoint_io) for listing/downloading files before extraction.

Why Use This Library

  • Pure Python (no Java runtime, no LibreOffice subprocesses)
  • Unified extraction interface across many file types
  • Works with file paths and in-memory bytes
  • Suitable for RAG/indexing pipelines where chunking and metadata matter
  • Handles both modern and legacy Office formats in one API

Install

uv add sharepoint-to-text

Optional PDF crypto acceleration:

uv add "sharepoint-to-text[pdf-crypto]"

From source:

git clone https://github.com/Horsmann/sharepoint-to-text.git
cd sharepoint-to-text
uv sync --all-groups

Quick Start

1) Read any supported local file

import sharepoint2text

result = next(sharepoint2text.read_file("document.docx"))
print(result.get_full_text())

read_file(...) returns a generator. Most files produce one result, but archives and .mbox can produce multiple.
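
Because the result is a generator, a bare next(...) raises StopIteration when nothing is yielded and silently drops any results after the first. The sketch below shows one way to handle both cases. fake_read_file is a hypothetical stand-in so the snippet runs without the library installed; it is not the real read_file and makes no claim about how the real function treats unknown files.

```python
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def first_or_none(results: Iterable[T]) -> Optional[T]:
    """Return the first yielded result, or None if the generator is empty."""
    return next(iter(results), None)


def fake_read_file(path: str) -> Iterator[str]:
    """Hypothetical stand-in for sharepoint2text.read_file (toy behavior only)."""
    if path.endswith(".mbox"):
        # A mailbox yields one result per email.
        yield from ("email 1", "email 2", "email 3")
    elif path.endswith(".docx"):
        yield "document text"
    # Any other path yields nothing in this toy example.


single = first_or_none(fake_read_file("document.docx"))
all_emails = list(fake_read_file("team-archive.mbox"))
nothing = first_or_none(fake_read_file("unknown.bin"))
```

With the real library, the same pattern applies: use a for loop or list(...) whenever the input may be an archive or an .mbox file.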

2) Read bytes already in memory

import sharepoint2text

payload = b"hello from memory"
result = next(sharepoint2text.read_bytes(payload, extension="txt"))
print(result.get_full_text())

3) Choose chunking strategy

import sharepoint2text

result = next(sharepoint2text.read_file("report.pdf"))

# Single text blob
full_text = result.get_full_text()

# Structured chunks (page/slide/sheet depending on format)
for unit in result.iterate_units():
    print(unit.get_text())
    print(unit.get_metadata())

4) Serialize results

import json
import sharepoint2text

result = next(sharepoint2text.read_file("document.docx"))
print(json.dumps(result.to_json()))

Restore from JSON:

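import sharepoint2text
from sharepoint2text.parsing.extractors.data_types import ExtractionInterface

# Round trip: serialize an extraction result, then rebuild it from the JSON data
result = next(sharepoint2text.read_file("document.docx"))
restored = ExtractionInterface.from_json(result.to_json())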

Core Interface

All extracted results implement a common interface (ExtractionInterface):

  • get_full_text()
  • iterate_units()
  • iterate_images()
  • iterate_tables()
  • get_metadata()
  • to_json() / from_json(...)

Use this interface when you want one pipeline that works across formats.
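
As an illustration of writing one pipeline against the shared interface, the sketch below types a helper against two of the listed methods using typing.Protocol. The return types are assumptions, and FakeUnit / FakeResult are hypothetical stand-ins so the snippet runs without the library; real extraction results would be passed in their place.

```python
from typing import Dict, Iterable, List, Protocol


class Unit(Protocol):
    def get_text(self) -> str: ...


class Extraction(Protocol):
    # Only two interface methods are modelled; return types are assumed.
    def get_full_text(self) -> str: ...
    def iterate_units(self) -> Iterable[Unit]: ...


def summarize(result: Extraction) -> Dict[str, int]:
    """Format-agnostic statistics computed through the shared interface only."""
    units = list(result.iterate_units())
    return {
        "chars": len(result.get_full_text()),
        "units": len(units),
        "empty_units": sum(1 for unit in units if not unit.get_text().strip()),
    }


class FakeUnit:
    """Hypothetical stand-in for a unit object."""

    def __init__(self, text: str) -> None:
        self._text = text

    def get_text(self) -> str:
        return self._text


class FakeResult:
    """Hypothetical stand-in for an extraction result."""

    def __init__(self, texts: List[str]) -> None:
        self._texts = texts

    def get_full_text(self) -> str:
        return "\n".join(self._texts)

    def iterate_units(self) -> Iterable[FakeUnit]:
        return (FakeUnit(text) for text in self._texts)


stats = summarize(FakeResult(["a", " ", "bc"]))
```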

Which text method should you use?

| Goal | Method |
| --- | --- |
| One string per document | get_full_text() |
| Chunk by structure (RAG/citations) | iterate_units() |
| All images in a file | iterate_images() |
| All tables in a file | iterate_tables() |

What iterate_units() means by format

| Format family | Units yielded |
| --- | --- |
| Word / text docs (.docx, .doc, .odt, plain text, config files) | Usually one unit |
| Spreadsheets (.xlsx, .xls, .ods) | One unit per sheet |
| Presentations (.pptx, .ppt, .odp) | One unit per slide |
| PDF | One unit per page |
| Email (.eml, .msg) | One unit per email |
| Mailbox (.mbox) | Multiple extraction results (one per email) |

Notes:

  • Word formats do not store reliable page boundaries, so units are document-level.
  • iterate_units(ignore_images=True) skips image payloads in unit objects for better performance.
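
When units feed citations, it can help to label each chunk with the kind of unit it came from. The sketch below transcribes the format table into a lookup keyed by file extension. The helper names and the "document" fallback are illustrative choices, not part of the library, and the unit index is supplied by the caller.

```python
from pathlib import Path

# Extension -> unit kind, transcribed from the format table above.
UNIT_KIND = {
    ".xlsx": "sheet", ".xls": "sheet", ".ods": "sheet",
    ".pptx": "slide", ".ppt": "slide", ".odp": "slide",
    ".pdf": "page",
    ".eml": "email", ".msg": "email",
}


def unit_kind(filename: str) -> str:
    """Return the unit kind for a file name, defaulting to 'document'."""
    return UNIT_KIND.get(Path(filename).suffix.lower(), "document")


def citation(filename: str, unit_index: int) -> str:
    """Build a readable citation such as 'report.pdf, page 3'."""
    return f"{Path(filename).name}, {unit_kind(filename)} {unit_index}"


label = citation("docs/report.pdf", 3)
```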

CLI

After installation, the sharepoint2text command is available on your PATH.

Plain text output:

sharepoint2text --file /path/to/file.docx > extraction.txt

JSON output:

sharepoint2text --file /path/to/file.docx --json > extraction.json

Options

| Option | Description |
| --- | --- |
| --file FILE, -f FILE | Required input file |
| --output FILE, -o FILE | Write output to file (default: stdout) |
| --json, -j | Emit list[extraction_object] |
| --json-unit, -u | Emit list[unit_object] |
| --include-images, -i | Include binary image payloads as base64 in JSON output |
| --no-attachments, -n | Exclude email attachments from CLI extraction output |
| --max-file-size-mb, -m | Maximum input size in MiB (default: 100, use 0 to disable) |
| --version, -v | Print CLI version |

Rules:

  • --json and --json-unit are mutually exclusive.
  • --include-images requires --json or --json-unit.
  • CLI enforces a configurable input file limit (default 100 MiB; override with --max-file-size-mb / -m).
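
To make the rules concrete, here is a minimal argparse sketch that enforces the same constraints. It is not the library's actual CLI code: the program name sharepoint2text-sketch and the helper names are hypothetical, and only a subset of the options is modelled.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sharepoint2text-sketch")
    parser.add_argument("--file", "-f", required=True)
    # Rule 1: --json and --json-unit are mutually exclusive.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--json", "-j", action="store_true")
    mode.add_argument("--json-unit", "-u", action="store_true")
    parser.add_argument("--include-images", "-i", action="store_true")
    parser.add_argument("--max-file-size-mb", "-m", type=int, default=100)
    return parser


def parse(argv: List[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Rule 2: --include-images only makes sense for JSON output.
    if args.include_images and not (args.json or args.json_unit):
        parser.error("--include-images requires --json or --json-unit")
    return args


def is_valid(argv: List[str]) -> bool:
    """Return True if the arguments satisfy all rules (argparse exits otherwise)."""
    try:
        parse(argv)
    except SystemExit:
        return False
    return True


def size_limit_bytes(args: argparse.Namespace) -> Optional[int]:
    """Rule 3: convert the MiB limit to bytes; 0 disables the limit."""
    if args.max_file_size_mb == 0:
        return None
    return args.max_file_size_mb * 1024 * 1024
```

For example, is_valid(["-f", "a.docx", "-i"]) is False because --include-images is given without a JSON mode, and the default limit converts to 100 * 1024 * 1024 bytes.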

Optional SharePoint Integration

sharepoint_io is optional. It helps list/download files from SharePoint, while extraction still runs through sharepoint2text.

import io
import sharepoint2text
from sharepoint2text.sharepoint_io import (
    EntraIDAppCredentials,
    SharePointRestClient,
)

credentials = EntraIDAppCredentials(
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    client_secret="your-client-secret",
)

client = SharePointRestClient(
    site_url="https://contoso.sharepoint.com/sites/Documents",
    credentials=credentials,
)

for file_meta in client.list_all_files():
    data = client.download_file(file_meta.id)
    extractor = sharepoint2text.get_extractor(file_meta.name)
    for result in extractor(io.BytesIO(data), path=file_meta.name):
        print(result.get_full_text()[:200])

Setup details: sharepoint2text/sharepoint_io/SETUP.md

Supported Formats

Microsoft Office

  • Modern: .docx, .docm, .xlsx, .xlsm, .xlsb, .pptx, .pptm
  • Legacy: .doc, .dot, .xls, .xlt, .ppt, .pot, .pps, .rtf
  • Template/show aliases are auto-mapped (for example .dotx -> .docx, .ppsx -> .pptx)
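
Alias mapping of this kind can be pictured as a small lookup applied to the file extension before an extractor is chosen. The sketch below includes only the two aliases named above; the library's full mapping and its internal implementation are not shown in this document, so treat the function as an illustration.

```python
from pathlib import Path

# Only the two aliases named in the text; the complete mapping is not listed here.
ALIASES = {".dotx": ".docx", ".ppsx": ".pptx"}


def canonical_extension(filename: str) -> str:
    """Lower-case the extension and resolve a known alias to its base format."""
    extension = Path(filename).suffix.lower()
    return ALIASES.get(extension, extension)


resolved = canonical_extension("Template.DOTX")
```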

OpenDocument

  • .odt, .ods, .odp, .odg, .odf
  • Template aliases supported (.ott, .ots, .otp)

Email

  • .eml, .msg, .mbox
  • Email extraction includes sender/recipient metadata, subject, and body (body_plain / body_html).
  • .eml and .msg parse attachments and store them on EmailContent.attachments.
  • .mbox extraction currently focuses on message headers/body and does not parse/store attachments.
  • Parsed supported attachments can be extracted via EmailContent.iterate_supported_attachments().
  • If supported-attachment extraction fails, the default behavior is to raise; use skip_failed=True to continue.

Plain text and config/data

  • .txt, .md, .csv, .tsv, .json
  • .yaml, .yml, .xml, .log, .ini, .cfg, .conf, .properties

Web and ebook

  • .html, .htm, .mhtml, .mht, .epub

PDF

  • .pdf

Archives

  • .zip, .tar, .7z
  • Compressed tar aliases: .tar.gz/.tgz, .tar.bz2/.tbz2, .tar.xz/.txz
  • .gz, .bz2, .xz are routed as compressed tar variants

Archive Processing and Security

Archives are processed one level deep. Supported non-archive files inside the archive can yield extraction results. Nested archives are intentionally skipped as a safety guard.

Built-in safeguards include zip-bomb protections and file size limits. For .7z, extraction is limited to archives of up to 100 MB. Individual archive entries may also be skipped when they exceed internal per-entry size limits or fail extraction.
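
The following sketch shows what checks of this kind can look like for a zip file, using only the standard library. It is not the library's implementation: the suffix list, the per-entry limit, and the compression-ratio ceiling are assumptions chosen for the example. Sizes declared in zip headers can be forged, so a production guard would also cap the bytes actually read.

```python
import io
import zipfile
from typing import List

# All three values are assumptions chosen for this sketch.
ARCHIVE_SUFFIXES = (".zip", ".tar", ".7z", ".tgz", ".tbz2", ".txz", ".gz", ".bz2", ".xz")
MAX_ENTRY_BYTES = 10 * 1024 * 1024  # per-entry uncompressed size limit
MAX_RATIO = 100  # uncompressed/compressed ratio treated as suspicious


def safe_entries(data: bytes) -> List[str]:
    """Return zip members that pass the nested-archive, size, and ratio checks."""
    accepted: List[str] = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            if info.filename.lower().endswith(ARCHIVE_SUFFIXES):
                continue  # nested archive: process one level only
            if info.file_size > MAX_ENTRY_BYTES:
                continue  # declared size exceeds the per-entry limit
            if info.compress_size and info.file_size / info.compress_size > MAX_RATIO:
                continue  # implausibly high compression ratio
            accepted.append(info.filename)
    return accepted


def build_demo_zip() -> bytes:
    """Create a small in-memory archive that exercises each check."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("notes.txt", "hello")
        archive.writestr("inner.zip", b"not a real archive")
        archive.writestr("padding.txt", "0" * 5_000_000)
    return buffer.getvalue()


accepted_names = safe_entries(build_demo_zip())
```

In the demo archive, inner.zip is skipped by name and padding.txt is skipped because five million identical bytes compress far beyond the assumed ratio ceiling.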

Limitations and Caveats

PDF

  • No OCR. Scanned-image PDFs may return empty text.
  • Structured table extraction is not implemented for PDF (iterate_tables() is empty).
  • Password-protected PDFs (non-empty password) raise ExtractionFileEncryptedError.
  • Some JBIG2 images need jbig2dec installed for image decoding.

General

  • Inputs are expected to be already decrypted. If a file has encryption, DRM, password protection, or similar security controls, remove/unlock those before calling sharepoint2text.
  • Very large or highly compressed files may hit protection limits.
  • Raise limits only for trusted inputs.

API Cheat Sheet

Main entry points

import sharepoint2text

sharepoint2text.read_file(
    path,
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
)

sharepoint2text.read_bytes(
    data,
    extension="pdf",      # or ".pdf"
    mime_type=None,        # e.g. "application/pdf"
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
)

sharepoint2text.is_supported_file(path)
sharepoint2text.get_extractor(path)

Format-specific extractors (selected)

  • Office/OpenDocument: read_docx, read_doc, read_xlsx, read_xls, read_pptx, read_ppt, read_rtf, read_odt, read_ods, read_odp, read_odg, read_odf
  • Other documents: read_pdf, read_html, read_epub, read_mhtml, read_plain_text
  • Email: read_eml_email, read_msg_email, read_mbox_email

All extractor functions accept a binary stream plus optional path and return generators.

Email helper API:

  • EmailContent.iterate_supported_attachments(skip_failed=False) extracts supported parsed attachments on demand (primarily from .eml/.msg).

Exceptions

Common exceptions:

  • ExtractionFileFormatNotSupportedError
  • ExtractionFileEncryptedError
  • ExtractionFileTooLargeError
  • ExtractionLegacyMicrosoftParsingError
  • ExtractionZipBombError
  • ExtractionFailedError
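
In a batch pipeline it is often useful to decide per exception whether to skip, quarantine, or re-raise. The sketch below maps the exception names listed above to actions. The actions themselves are illustrative assumptions, and matching is done by class name because the import path of these exceptions is not shown in this document. The locally defined ExtractionFileEncryptedError is a stand-in for demonstration only; with the library installed, importing the real classes and using ordinary except clauses is the more idiomatic choice.

```python
# Exception names from the list above, mapped to a pipeline action.
# The actions are assumptions for this sketch, not library behavior.
ACTIONS = {
    "ExtractionFileFormatNotSupportedError": "skip",
    "ExtractionFileEncryptedError": "needs-unlock",
    "ExtractionFileTooLargeError": "quarantine",
    "ExtractionZipBombError": "quarantine",
    "ExtractionLegacyMicrosoftParsingError": "log-and-skip",
    "ExtractionFailedError": "log-and-skip",
}


def action_for(exc: BaseException) -> str:
    """Pick an action by walking the exception's class hierarchy by name."""
    for cls in type(exc).__mro__:
        if cls.__name__ in ACTIONS:
            return ACTIONS[cls.__name__]
    return "reraise"


class ExtractionFileEncryptedError(Exception):
    """Local stand-in with the same name as the library exception."""


decision = action_for(ExtractionFileEncryptedError("locked.pdf"))
fallback = action_for(ValueError("unrelated"))
```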

License

Apache 2.0. See LICENSE.

Disclaimer

This project is not affiliated with, endorsed by, or sponsored by Microsoft.

More Usage Examples

Extract email body plus supported attachments

import sharepoint2text

email = next(sharepoint2text.read_file("message-with-attachments.eml"))

print(email.subject)
print(email.get_full_text())  # plain body if available, otherwise HTML body
print(f"Attachment count: {len(email.attachments)}")

# Extract supported attachment types (pdf, docx, pptx, etc.)
for attachment_result in email.iterate_supported_attachments():
    print(type(attachment_result).__name__)
    print(attachment_result.get_full_text()[:200])

Continue even if a supported attachment fails to extract

import sharepoint2text

email = next(sharepoint2text.read_file("message-with-attachments.msg"))

for attachment_result in email.iterate_supported_attachments(skip_failed=True):
    print(attachment_result.get_metadata().filename)

Process a mailbox (.mbox) and read message bodies

import sharepoint2text

for email in sharepoint2text.read_file("team-archive.mbox"):
    print(f"Subject: {email.subject}")
    print(email.get_full_text()[:200])

Batch-extract units for RAG-style chunking

from pathlib import Path
import sharepoint2text

for path in Path("docs").rglob("*"):
    if not path.is_file() or not sharepoint2text.is_supported_file(path):
        continue
    for result in sharepoint2text.read_file(path):
        meta = result.get_metadata()
        for unit in result.iterate_units(ignore_images=True):
            chunk = unit.get_text().strip()
            if chunk:
                payload = {
                    "text": chunk,
                    "source": str(path),
                    "filename": meta.filename,
                    "unit_number": getattr(unit.get_metadata(), "unit_number", None),
                }
                # store payload in your index/vector DB

Extract from API bytes when you only know MIME type

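from urllib.request import urlopen

import sharepoint2text

# Example: bytes from an HTTP response (placeholder URL; replace with your own source)
with urlopen("https://example.com/report.pdf") as response:
    data = response.read()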

result = next(
    sharepoint2text.read_bytes(
        data,
        mime_type="application/pdf",
        ignore_images=True,
    )
)
print(result.get_full_text()[:500])

Release files

Version 1.0.0 is available as a source distribution and a built distribution (wheel).

| File | Type | Size |
| --- | --- | --- |
| sharepoint_to_text-1.0.0.tar.gz | Source distribution | 183.2 kB |
| sharepoint_to_text-1.0.0-py3-none-any.whl | Built distribution (Python 3) | 223.5 kB |

Hashes for sharepoint_to_text-1.0.0.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 043731c89deaefa60065bf612b4e1c696dd9830f1cef7ab6b5180883c390cf93 |
| MD5 | aa0f6ade41434ca06c347dafc28b8a7b |
| BLAKE2b-256 | 91b6fff69d134a9c78b52c03db385ba4bebaf808720a400b7f1296a93e93ab2d |

Hashes for sharepoint_to_text-1.0.0-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 0a05d0b67a02805d0685d007977b1f9f1cdd26d92198095eb45f31bc163340ce |
| MD5 | 7b82d2d970eade74508defb84f0bcf8a |
| BLAKE2b-256 | b4a75696b7ed8d713ffa5d7bdd3392fe049834b378bbfb28b592a29b4486618b |

Provenance

Attestation bundles were made for both files. The source distribution was uploaded using Trusted Publishing via twine/6.1.0 CPython/3.13.7.

Publisher: publish.yml on Horsmann/sharepoint-to-text

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
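
To check a downloaded release file against its published SHA256 digest, a short standard-library routine is enough. The sketch below streams the file in chunks so large downloads are not loaded into memory at once. The function names and the chunk size are illustrative choices.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published value, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Usage: call matches(Path("sharepoint_to_text-1.0.0.tar.gz"), "<SHA256 digest listed for that file>") after downloading, and refuse to install if it returns False.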
