Text extraction library for typical file formats found in SharePoint repositories

sharepoint-to-text

sharepoint-to-text is a typed, pure-Python library for extracting text and structured content from file types commonly found in SharePoint and document-management workflows.

It is built for software engineers who need one extraction interface across modern Microsoft Office files, legacy Office files, OpenDocument, PDF, email, HTML-like content, plain-text formats, and archives.

Why This Package Exists

Document ingestion pipelines usually fail in one of two ways:

  • they only support a narrow set of office formats
  • they require heavyweight external runtimes such as LibreOffice, Java, or platform-specific tooling

sharepoint-to-text takes a different approach:

  • Pure Python library API and CLI
  • One routing layer for many file types
  • Works with file paths and in-memory bytes
  • Typed extraction objects with metadata, units, images, and tables
  • Suitable for indexing, RAG, ETL, compliance review, and migration tooling

At A Glance

| Item | Details |
| --- | --- |
| Python | >=3.10 |
| Install | uv add sharepoint-to-text |
| Runtime model | Pure Python |
| Primary interfaces | read_file(...), read_bytes(...), read_many(...), CLI |
| Output model | Generator of typed extraction objects |
| SharePoint access | Optional sharepoint_io helper for Graph-backed listing/download |

Who This Is For

This package is a good fit if you need to:

  • normalize text extraction across many enterprise document formats
  • process documents from disk, APIs, queues, or object storage
  • preserve some document structure for downstream chunking or citations
  • run extraction in Python-only environments such as services, workers, or serverless jobs

It is not a full document rendering engine, OCR system, or layout-preserving conversion tool.

Installation

Package install

uv add sharepoint-to-text

With pip:

pip install sharepoint-to-text

Development install

git clone https://github.com/Horsmann/sharepoint-to-text.git
cd sharepoint-to-text
uv sync --all-groups

Quick Start

Extract from a local file

import sharepoint2text

result = next(sharepoint2text.read_file("document.docx"))
print(result.get_full_text())

Extract from in-memory bytes

import sharepoint2text

payload = b"hello from memory"
result = next(sharepoint2text.read_bytes(payload, extension="txt"))
print(result.get_full_text())

Use structural units for chunking

import sharepoint2text

result = next(sharepoint2text.read_file("report.pdf", ignore_images=True))

for unit in result.iterate_units():
    print(unit.get_text())
    print(unit.get_metadata())

Batch extraction from a folder

import sharepoint2text

# Extract only Word and PDF files from a folder
for result in sharepoint2text.read_many("docs", suffixes=[".docx", ".pdf"]):
    print(result.get_metadata().file_path)
    print(result.get_full_text()[:200])

Extract all supported files from a folder

import sharepoint2text

# Extract all supported file formats recursively
for result in sharepoint2text.read_many("docs", extract_all_supported=True):
    for unit in result.iterate_units(ignore_images=True):
        text = unit.get_text().strip()
        if text:
            print(result.get_metadata().file_path, text[:120])

Core API

Main entry points

import sharepoint2text

sharepoint2text.read_file(
    path,
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
    include_attachments=True,
)

sharepoint2text.read_bytes(
    data,
    extension="pdf",      # or ".pdf"
    mime_type=None,        # for example "application/pdf"
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
    include_attachments=True,
)

sharepoint2text.read_many(
    folder_path,
    suffixes=[".docx", ".pdf"],  # list of extensions to extract
    extract_all_supported=False,  # or True to extract all supported formats
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
    include_attachments=True,
    recursive=True,               # traverse subdirectories
)

sharepoint2text.is_supported_file(path)
sharepoint2text.get_extractor(path)

Batch extraction with read_many

The read_many function extracts content from multiple files in a folder:

| Parameter | Description |
| --- | --- |
| folder_path | Path to the folder to traverse |
| suffixes | List of file extensions to extract (e.g., [".docx", ".pdf"]) |
| extract_all_supported | If True, extract all supported formats (mutually exclusive with suffixes) |
| recursive | If True (default), traverse subdirectories |

Configuration rules:

  • You must specify either suffixes or extract_all_supported=True
  • Specifying both raises InvalidConfigurationError
  • Suffixes are normalized (with or without leading dot)
  • Extraction continues on errors, logging warnings for failed files
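
The configuration rules above can be sketched in a few lines. This is an illustration only, not the library's implementation: `normalize_suffixes`, `check_selection`, and `ConfigError` are hypothetical names (the last one stands in for InvalidConfigurationError), and raising the same error when neither option is given is an assumption.

```python
from typing import Optional, Sequence


class ConfigError(ValueError):
    """Stand-in for the library's InvalidConfigurationError."""


def normalize_suffixes(suffixes: Sequence[str]) -> list[str]:
    """Return each suffix with exactly one leading dot ("pdf" -> ".pdf")."""
    return ["." + s.strip().lstrip(".") for s in suffixes]


def check_selection(
    suffixes: Optional[Sequence[str]], extract_all_supported: bool
) -> None:
    """Require exactly one of: a suffix list, or extract_all_supported=True."""
    if bool(suffixes) == bool(extract_all_supported):
        raise ConfigError(
            "pass either suffixes or extract_all_supported=True, not both or neither"
        )


normalized = normalize_suffixes(["pdf", ".docx", ".tar.gz"])
check_selection(normalized, extract_all_supported=False)  # valid: suffixes only
```

Validating the selection before traversing the folder keeps a misconfigured call from silently extracting nothing.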

Result model

All extracted results implement a common interface:

  • get_full_text()
  • iterate_units()
  • iterate_images()
  • iterate_tables()
  • get_metadata()
  • to_json() / from_json(...)

Use get_full_text() when you want one string per extraction result.

Use iterate_units() when you want coarse structural chunks such as:

  • one page per PDF unit
  • one slide per presentation unit
  • one sheet per spreadsheet unit
  • one document-level unit for most text-document formats
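
For downstream chunking, unit texts can be packed into size-bounded chunks. The following is a minimal sketch under stated assumptions: `pack_units` and `max_chars` are hypothetical names that are not part of the library, and the function only expects an iterable of strings such as the values returned by unit.get_text().

```python
from typing import Iterable, Iterator


def pack_units(texts: Iterable[str], max_chars: int = 1000) -> Iterator[str]:
    """Greedily join consecutive unit texts into chunks of at most max_chars.

    A single unit longer than max_chars is emitted on its own, unsplit.
    """
    buffer: list[str] = []
    size = 0
    for text in texts:
        text = text.strip()
        if not text:
            continue  # skip empty units, for example image-only pages
        # +1 accounts for the newline used to join units in one chunk
        added = len(text) + (1 if buffer else 0)
        if buffer and size + added > max_chars:
            yield "\n".join(buffer)
            buffer, size = [], 0
            added = len(text)
        buffer.append(text)
        size += added
    if buffer:
        yield "\n".join(buffer)


# Stand-in strings; in practice these would come from unit.get_text()
pages = ["page one", "", "page two", "x" * 30]
chunks = list(pack_units(pages, max_chars=20))
```

With a real result, the input could be `(u.get_text() for u in result.iterate_units())`. Packing whole units keeps page or slide boundaries intact, which helps when chunks need to be cited back to a location.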

Generator semantics matter

The API returns generators because some inputs can produce multiple results:

  • archives can yield one result per supported member file
  • .mbox can yield one result per email
  • email extraction can recursively expose supported attachments

For single-document formats, next(...) is usually the simplest call pattern.
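
The difference between the two call patterns can be shown with a stand-in generator. `fake_read` below is hypothetical and only mimics the multi-result behaviour described above; it is not the library's reader.

```python
from typing import Iterator


def fake_read(name: str) -> Iterator[str]:
    """Stand-in reader: archives yield several results, other files yield one."""
    if name.endswith(".zip"):
        yield from ("member-1 text", "member-2 text")
    else:
        yield f"text of {name}"


# Single-document input: take the first (and only) result.
# The default None avoids StopIteration if nothing was yielded.
single = next(fake_read("report.docx"), None)

# Multi-result input: iterate over the generator to see every result.
all_members = list(fake_read("bundle.zip"))

# Calling next() on a multi-result input silently drops the remaining results.
only_first = next(fake_read("bundle.zip"), None)
```

When the input type is not known in advance, iterating with a `for` loop is the safer default.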

CLI

The package installs a sharepoint2text command.

Single file extraction

Plain text output:

sharepoint2text --file /path/to/file.docx

Full extraction objects as JSON:

sharepoint2text --file /path/to/file.docx --json

Per-unit JSON:

sharepoint2text --file /path/to/file.pdf --json-unit

Folder extraction

Extract all supported files from a folder:

sharepoint2text --folder /path/to/folder

Extract only specific file types:

sharepoint2text --folder /path/to/folder --suffixes .docx,.pdf,.txt

Non-recursive (top-level only):

sharepoint2text --folder /path/to/folder --no-recursive

Folder output (mirrored structure)

When extracting from a folder, output to another folder to preserve the directory structure:

# Write each file separately to output folder
sharepoint2text --folder /input/docs --output /output/extracted/

# The output structure mirrors the input:
# /input/docs/report.docx      -> /output/extracted/report.txt
# /input/docs/sub/data.xlsx    -> /output/extracted/sub/data.txt

Output path behavior:

  • If --output is an existing directory, files are written separately
  • If --output is a new path without extension, it's created as a directory
  • If --output has a file extension, all results are combined into that file
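
The mirroring and the three output-path rules can be sketched with pathlib. This is an illustration of the documented behaviour, not the CLI's actual code: `mirrored_output_path` and `output_mode` are hypothetical helpers, and the `.txt` target extension is taken from the plain-text example above.

```python
from pathlib import Path, PurePosixPath


def mirrored_output_path(src: str, input_root: str, output_root: str) -> PurePosixPath:
    """Map an input file to its .txt counterpart under output_root."""
    relative = PurePosixPath(src).relative_to(input_root)
    return PurePosixPath(output_root) / relative.with_suffix(".txt")


def output_mode(output: Path) -> str:
    """Classify an --output value following the three rules above."""
    if output.is_dir():
        return "directory"  # existing directory: one output file per input
    if output.suffix == "":
        return "directory"  # new path without extension: created as a directory
    return "combined"  # path with a file extension: single combined file


mapped = mirrored_output_path(
    "/input/docs/sub/data.xlsx", "/input/docs", "/output/extracted"
)
```

Checking for an existing directory first matters: a directory whose name contains a dot would otherwise be mistaken for a combined output file.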

CLI options

| Option | Description |
| --- | --- |
| --file FILE, -f FILE | Path to a single file to extract |
| --folder FOLDER, -d FOLDER | Path to a folder to extract files from (recursive by default) |
| --suffixes SUFFIXES, -s SUFFIXES | Comma-separated file suffixes to filter (e.g., .docx,.pdf). Only with --folder. If omitted, extracts all supported types. |
| --no-recursive | Only extract top-level files (no subdirectories). Only with --folder. |
| --output PATH, -o PATH | Output path: file (combined) or folder (separate files mirroring input structure) |
| --json, -j | Emit list[extraction_object] |
| --json-unit, -u | Emit list[unit_object] |
| --include-images, -i | Include base64 image payloads in JSON output |
| --no-attachments, -n | Skip expanding supported email attachments |
| --max-file-size-mb, -m | Maximum input size in MiB, default 100, use 0 to disable |
| --version, -v | Print CLI version |

Important CLI rules:

  • --file and --folder are mutually exclusive (one is required)
  • --suffixes and --no-recursive only work with --folder
  • --json and --json-unit are mutually exclusive
  • --include-images requires --json or --json-unit
  • the CLI enforces the same file-size guard as the Python API
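
For readers wrapping the CLI or building a similar front end, the option rules above can be expressed with argparse. This is a sketch of the rules only, not the package's parser: the program name `extract-sketch` and the functions `build_parser` and `parse` are hypothetical, and only a subset of the options is modelled.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="extract-sketch")
    # Exactly one input source is required
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--file", "-f")
    source.add_argument("--folder", "-d")
    # At most one JSON output flavour
    fmt = parser.add_mutually_exclusive_group()
    fmt.add_argument("--json", "-j", action="store_true")
    fmt.add_argument("--json-unit", "-u", action="store_true")
    parser.add_argument("--suffixes", "-s")
    parser.add_argument("--no-recursive", action="store_true")
    parser.add_argument("--include-images", "-i", action="store_true")
    return parser


def parse(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Cross-option rules that argparse groups cannot express directly
    if args.file and (args.suffixes or args.no_recursive):
        parser.error("--suffixes and --no-recursive only work with --folder")
    if args.include_images and not (args.json or args.json_unit):
        parser.error("--include-images requires --json or --json-unit")
    return args


args = parse(["--folder", "docs", "--suffixes", ".docx,.pdf", "--json"])
```

Invalid combinations end in `parser.error(...)`, which prints a usage message and exits with status 2.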

Supported Formats

Microsoft Office

  • Modern: .docx, .docm, .xlsx, .xlsm, .xlsb, .pptx, .pptm
  • Legacy: .doc, .dot, .xls, .xlt, .ppt, .pot, .pps, .rtf
  • Alias mapping: .dotx, .dotm, .xltx, .xltm, .potx, .potm, .ppsx, .ppsm

OpenDocument

  • .odt, .ods, .odp, .odg, .odf
  • Alias mapping: .ott, .ots, .otp

Email

  • .eml, .msg, .mbox
  • .eml and .msg can parse and expose supported attachments
  • .mbox yields one result per message

Plain text and data-like formats

  • .txt, .md, .csv, .tsv, .json
  • .yaml, .yml, .xml, .log, .ini, .cfg, .conf, .properties

Web and ebook

  • .html, .htm, .mhtml, .mht, .epub

PDF

  • .pdf

Archives

  • .zip, .tar, .7z
  • .tar.gz, .tgz, .tar.bz2, .tbz2, .tar.xz, .txz

For a behavior-focused view of units, attachments, and caveats by format, see doc/format-matrix.md.

SharePoint Integration

The extraction library works independently of SharePoint. The optional sharepoint_io module is a separate helper layer for listing and downloading files through Microsoft Graph before extraction.

import io

import sharepoint2text
from sharepoint2text.sharepoint_io import (
    EntraIDAppCredentials,
    SharePointRestClient,
)

credentials = EntraIDAppCredentials(
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    client_secret="your-client-secret",
)

client = SharePointRestClient(
    site_url="https://contoso.sharepoint.com/sites/Documents",
    credentials=credentials,
)

for file_meta in client.list_all_files():
    data = client.download_file(file_meta.id)
    extractor = sharepoint2text.get_extractor(file_meta.name)
    for result in extractor(io.BytesIO(data), path=file_meta.name):
        print(result.get_full_text()[:200])

Setup details live in sharepoint2text/sharepoint_io/SETUP.md.

Operational Constraints

These are the points an engineering team usually needs before adopting the package:

  • No OCR: scanned-image PDFs will often produce little or no text
  • No external office renderer: output is extraction-oriented, not fidelity-oriented
  • Word-like formats do not expose reliable page boundaries
  • Nested archives are intentionally skipped
  • Password-protected or encrypted inputs raise extraction errors
  • Large files and highly compressed archives are guarded by size limits and zip-bomb protections

Archive behavior

  • archives are processed one level deep
  • supported non-archive files inside an archive can yield extraction results
  • nested archives are skipped as a safety measure
  • 7z extraction is capped at 100 MB internally
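
The one-level-deep policy can be illustrated with the standard library's zipfile module. This is a sketch of the policy, not the library's implementation: `iter_members`, `ARCHIVE_SUFFIXES`, and the `max_member_bytes` guard are hypothetical, and only zip input is handled.

```python
import io
import zipfile
from typing import Iterator

ARCHIVE_SUFFIXES = (
    ".zip", ".tar", ".7z", ".tar.gz", ".tgz", ".tar.bz2", ".tbz2", ".tar.xz", ".txz",
)


def iter_members(
    data: bytes, max_member_bytes: int = 100 * 1024 * 1024
) -> Iterator[tuple[str, bytes]]:
    """Yield (name, content) for non-archive members of a zip, one level deep."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            if info.filename.lower().endswith(ARCHIVE_SUFFIXES):
                continue  # nested archive: skipped, never opened
            if info.file_size > max_member_bytes:
                continue  # declared uncompressed size exceeds the guard
            yield info.filename, archive.read(info)


def _zip_bytes(files: dict[str, bytes]) -> bytes:
    """Build an in-memory zip for the demonstration below."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = _zip_bytes({"hidden.txt": b"nested"})
outer = _zip_bytes({"a.txt": b"hello", "inner.zip": inner})
members = dict(iter_members(outer))
```

Here `members` contains only `a.txt`; the content of `inner.zip` is never read, which is the property that bounds the work done per archive.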

Performance guidance

  • set ignore_images=True when image payloads are not needed
  • use iterate_units() for chunk-wise downstream processing instead of materializing one large string when structure matters
  • keep size limits enabled unless you trust the input source

Failure Modes and Exceptions

Common exceptions:

  • ExtractionFileFormatNotSupportedError
  • ExtractionFileEncryptedError
  • ExtractionFileTooLargeError
  • ExtractionLegacyMicrosoftParsingError
  • ExtractionZipBombError
  • ExtractionPathTraversalError
  • ExtractionFailedError
  • InvalidConfigurationError (for read_many with conflicting options)
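
Because the API returns generators, errors may surface while iterating rather than at call time, so the `try` block should cover the iteration. The sketch below uses stand-ins throughout: `texts_or_skip`, `flaky_read`, and `SketchExtractionError` are hypothetical, and the import location of the real exception classes is not assumed here.

```python
import logging
from typing import Callable, Iterable, Iterator

logger = logging.getLogger("extraction")


class SketchExtractionError(Exception):
    """Stand-in for the library's extraction exceptions."""


def texts_or_skip(
    read: Callable[[str], Iterable[str]],
    path: str,
    skip_on: tuple[type[BaseException], ...],
) -> list[str]:
    """Collect all results for path; log and return [] on an expected failure."""
    try:
        # list(...) forces the generator, so errors surface inside this try
        return list(read(path))
    except skip_on as exc:
        logger.warning("skipping %s: %s", path, exc)
        return []


def flaky_read(path: str) -> Iterator[str]:
    """Stand-in reader that fails lazily, on first iteration."""
    if path.endswith(".locked"):
        raise SketchExtractionError("encrypted input")
    yield f"text of {path}"


ok = texts_or_skip(flaky_read, "a.txt", (SketchExtractionError,))
skipped = texts_or_skip(flaky_read, "b.locked", (SketchExtractionError,))
```

In a service, `skip_on` would list the exceptions that should be tolerated per file, while anything else propagates and fails the job.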

If you are integrating this into a service, see doc/integration-guide.md and doc/troubleshooting.md.

Serialization

import json

import sharepoint2text

result = next(sharepoint2text.read_file("document.docx"))
payload = result.to_json()

print(json.dumps(payload))

Restore from JSON:

from sharepoint2text.parsing.extractors.data_types import ExtractionInterface

restored = ExtractionInterface.from_json(payload)
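
For batch pipelines, serialized results are often stored one per line. The following sketch writes payloads as JSON Lines and reads them back; `write_jsonl` and `read_jsonl` are hypothetical helpers, and the only assumption about a payload is that `json.dumps` accepts it, as in the example above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Iterable


def write_jsonl(payloads: Iterable[Any], path: Path) -> int:
    """Write one JSON document per line and return the number written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for payload in payloads:
            handle.write(json.dumps(payload, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path: Path) -> list[Any]:
    """Load every non-empty line as one JSON document."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Stand-in payloads; real ones would come from result.to_json()
sample = [{"text": "héllo"}, {"text": "world"}]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results.jsonl"
    written = write_jsonl(sample, target)
    loaded = read_jsonl(target)
```

Each loaded payload could then be passed to `ExtractionInterface.from_json(...)` to rebuild the extraction object.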

License

Apache 2.0. See LICENSE.

Disclaimer

This project is not affiliated with, endorsed by, or sponsored by Microsoft.
