sharepoint-to-text

Text extraction library for typical file formats found in SharePoint repositories.
sharepoint-to-text is a typed, pure-Python library for extracting text and structured content from file types commonly found in SharePoint and document-management workflows.
It is built for software engineers who need one extraction interface across modern Microsoft Office files, legacy Office files, OpenDocument, PDF, email, HTML-like content, plain-text formats, and archives.
Why This Package Exists
Document ingestion pipelines usually fail in one of two ways:
- they only support a narrow set of office formats
- they require heavyweight external runtimes such as LibreOffice, Java, or platform-specific tooling
sharepoint-to-text takes a different approach:
- Pure Python library API and CLI
- One routing layer for many file types
- Works with file paths and in-memory bytes
- Typed extraction objects with metadata, units, images, and tables
- Suitable for indexing, RAG, ETL, compliance review, and migration tooling
At A Glance
| Item | Details |
|---|---|
| Python | >=3.10 |
| Install | uv add sharepoint-to-text |
| Runtime model | Pure Python |
| Primary interfaces | read_file(...), read_bytes(...), read_many(...), CLI |
| Output model | Generator of typed extraction objects |
| SharePoint access | Optional sharepoint_io helper for Graph-backed listing/download |
Who This Is For
This package is a good fit if you need to:
- normalize text extraction across many enterprise document formats
- process documents from disk, APIs, queues, or object storage
- preserve some document structure for downstream chunking or citations
- run extraction in Python-only environments such as services, workers, or serverless jobs
It is not a full document rendering engine, OCR system, or layout-preserving conversion tool.
Installation
Package install
uv add sharepoint-to-text
With pip:
pip install sharepoint-to-text
Development install
git clone https://github.com/Horsmann/sharepoint-to-text.git
cd sharepoint-to-text
uv sync --all-groups
Quick Start
Extract from a local file
import sharepoint2text
result = next(sharepoint2text.read_file("document.docx"))
print(result.get_full_text())
Extract from in-memory bytes
import sharepoint2text
payload = b"hello from memory"
result = next(sharepoint2text.read_bytes(payload, extension="txt"))
print(result.get_full_text())
Use structural units for chunking
import sharepoint2text
result = next(sharepoint2text.read_file("report.pdf", ignore_images=True))
for unit in result.iterate_units():
    print(unit.get_text())
    print(unit.get_metadata())
Batch extraction from a folder
import sharepoint2text
# Extract only Word and PDF files from a folder
for result in sharepoint2text.read_many("docs", suffixes=[".docx", ".pdf"]):
    print(result.get_metadata().file_path)
    print(result.get_full_text()[:200])
Extract all supported files from a folder
import sharepoint2text
# Extract all supported file formats recursively
for result in sharepoint2text.read_many("docs", extract_all_supported=True):
    for unit in result.iterate_units(ignore_images=True):
        text = unit.get_text().strip()
        if text:
            print(result.get_metadata().file_path, text[:120])
Core API
Main entry points
import sharepoint2text
sharepoint2text.read_file(
    path,
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
    include_attachments=True,
)

sharepoint2text.read_bytes(
    data,
    extension="pdf",  # or ".pdf"
    mime_type=None,  # for example "application/pdf"
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
    include_attachments=True,
)

sharepoint2text.read_many(
    folder_path,
    suffixes=[".docx", ".pdf"],  # list of extensions to extract
    extract_all_supported=False,  # or True to extract all supported formats
    max_file_size=100 * 1024 * 1024,
    ignore_images=False,
    force_plain_text=False,
    include_attachments=True,
    recursive=True,  # traverse subdirectories
)
sharepoint2text.is_supported_file(path)
sharepoint2text.get_extractor(path)
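The routing idea behind get_extractor can be illustrated with a minimal extension-to-handler registry. This is purely a sketch of the dispatch pattern, not the library's actual registry; the names `REGISTRY` and `get_handler` are invented for illustration:

```python
from typing import Callable, Iterator

Handler = Callable[[bytes], Iterator[str]]

def _read_txt(data: bytes) -> Iterator[str]:
    """Trivial handler: decode bytes as UTF-8 text."""
    yield data.decode("utf-8", errors="replace")

# Map file suffixes to handler callables (illustrative subset).
REGISTRY: dict[str, Handler] = {".txt": _read_txt, ".md": _read_txt}

def get_handler(path: str) -> Handler:
    """Pick a handler by suffix, case-insensitively; fail for unknown types."""
    for suffix, handler in REGISTRY.items():
        if path.lower().endswith(suffix):
            return handler
    raise KeyError(f"unsupported file type: {path}")

print(next(get_handler("notes.TXT")(b"hello")))  # hello
```

The real library routes on both extension and, for read_bytes, an optional MIME type, but the lookup-then-dispatch shape is the same.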
Batch extraction with read_many
The read_many function extracts content from multiple files in a folder:
| Parameter | Description |
|---|---|
| `folder_path` | Path to the folder to traverse |
| `suffixes` | List of file extensions to extract (e.g., `[".docx", ".pdf"]`) |
| `extract_all_supported` | If `True`, extract all supported formats (mutually exclusive with `suffixes`) |
| `recursive` | If `True` (default), traverse subdirectories |
Configuration rules:
- You must specify either `suffixes` or `extract_all_supported=True`
- Specifying both raises `InvalidConfigurationError`
- Suffixes are normalized (with or without leading dot)
- Extraction continues on errors, logging warnings for failed files
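The suffix normalization described above can be sketched as follows; this is an illustrative stand-in, not the library's actual implementation:

```python
def normalize_suffixes(suffixes):
    """Normalize user-supplied suffixes: lowercase, with a leading dot."""
    normalized = set()
    for suffix in suffixes:
        suffix = suffix.strip().lower()
        if not suffix:
            continue
        if not suffix.startswith("."):
            suffix = "." + suffix
        normalized.add(suffix)
    return normalized

print(sorted(normalize_suffixes(["docx", ".PDF", "pdf"])))  # ['.docx', '.pdf']
```

Normalizing to a canonical set also deduplicates entries supplied both with and without the dot, as shown above.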
Result model
All extracted results implement a common interface:
- `get_full_text()`
- `iterate_units()`
- `iterate_images()`
- `iterate_tables()`
- `get_metadata()`
- `to_json()` / `from_json(...)`
Use get_full_text() when you want one string per extraction result.
Use iterate_units() when you want coarse structural chunks such as:
- one page per PDF unit
- one slide per presentation unit
- one sheet per spreadsheet unit
- one document-level unit for most text-document formats
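Unit-based chunking for downstream indexing can look like the following sketch. `Unit` here is a stub stand-in for the library's unit objects (only its `get_text()` method matters), and `build_chunks` is an invented helper, not part of the package:

```python
from dataclasses import dataclass

@dataclass
class Unit:
    """Stand-in for the library's unit objects; only text content matters here."""
    text: str

    def get_text(self) -> str:
        return self.text

def build_chunks(units, max_chars=1000):
    """Greedily pack unit texts into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for unit in units:
        text = unit.get_text().strip()
        if not text:
            continue
        if current and len(current) + len(text) + 1 > max_chars:
            chunks.append(current)
            current = text
        else:
            current = f"{current}\n{text}" if current else text
    if current:
        chunks.append(current)
    return chunks

print(build_chunks([Unit("page one"), Unit("page two")], max_chars=8))
# ['page one', 'page two']
```

Because units map to pages, slides, or sheets, chunk boundaries produced this way are also usable as citation anchors.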
Generator semantics matter
The API returns generators because some inputs can produce multiple results:
- archives can yield one result per supported member file
- `.mbox` can yield one result per email
- email extraction can recursively expose supported attachments
For single-document formats, next(...) is usually the simplest call pattern.
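The two consumption patterns can be shown with a stub generator (`fake_read` stands in for sharepoint2text.read_file; the real function yields extraction objects, not strings):

```python
def fake_read(path):
    """Stand-in reader: multi-result inputs yield several items, others one."""
    count = 3 if path.endswith(".mbox") else 1
    for i in range(count):
        yield f"{path}#result-{i}"

# Single-document formats: take the first (and only) result.
first = next(fake_read("report.docx"))

# Multi-result inputs (archives, .mbox, emails with attachments):
# iterate the whole generator.
results = list(fake_read("inbox.mbox"))

print(first, len(results))  # report.docx#result-0 3
```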
CLI
The package installs a sharepoint2text command.
Single file extraction
Plain text output:
sharepoint2text --file /path/to/file.docx
Full extraction objects as JSON:
sharepoint2text --file /path/to/file.docx --json
Per-unit JSON:
sharepoint2text --file /path/to/file.pdf --json-unit
Folder extraction
Extract all supported files from a folder:
sharepoint2text --folder /path/to/folder
Extract only specific file types:
sharepoint2text --folder /path/to/folder --suffixes .docx,.pdf,.txt
Non-recursive (top-level only):
sharepoint2text --folder /path/to/folder --no-recursive
Folder output (mirrored structure)
When extracting from a folder, output to another folder to preserve the directory structure:
# Write each file separately to output folder
sharepoint2text --folder /input/docs --output /output/extracted/
# The output structure mirrors the input:
# /input/docs/report.docx -> /output/extracted/report.txt
# /input/docs/sub/data.xlsx -> /output/extracted/sub/data.txt
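The mirrored-path mapping shown in the comments above can be reproduced with pathlib; `mirrored_output_path` is an illustrative helper, not a function exported by the CLI:

```python
from pathlib import PurePosixPath

def mirrored_output_path(input_root, output_root, file_path):
    """Map an input file to its mirrored .txt path under the output folder."""
    relative = PurePosixPath(file_path).relative_to(input_root)
    return PurePosixPath(output_root) / relative.with_suffix(".txt")

print(mirrored_output_path("/input/docs", "/output/extracted", "/input/docs/sub/data.xlsx"))
# /output/extracted/sub/data.txt
```

Only the path computation is shown; the CLI additionally creates the intermediate directories before writing.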
Output path behavior:
- If `--output` is an existing directory, files are written separately
- If `--output` is a new path without extension, it's created as a directory
- If `--output` has a file extension, all results are combined into that file
CLI options
| Option | Description |
|---|---|
| `--file FILE, -f FILE` | Path to a single file to extract |
| `--folder FOLDER, -d FOLDER` | Path to a folder to extract files from (recursive by default) |
| `--suffixes SUFFIXES, -s SUFFIXES` | Comma-separated file suffixes to filter (e.g., `.docx,.pdf`). Only with `--folder`. If omitted, extracts all supported types. |
| `--no-recursive` | Only extract top-level files (no subdirectories). Only with `--folder`. |
| `--output PATH, -o PATH` | Output path: file (combined) or folder (separate files mirroring input structure) |
| `--json, -j` | Emit `list[extraction_object]` |
| `--json-unit, -u` | Emit `list[unit_object]` |
| `--include-images, -i` | Include base64 image payloads in JSON output |
| `--no-attachments, -n` | Skip expanding supported email attachments |
| `--max-file-size-mb, -m` | Maximum input size in MiB; default 100; use 0 to disable |
| `--version, -v` | Print CLI version |
Important CLI rules:
- `--file` and `--folder` are mutually exclusive (one is required)
- `--suffixes` and `--no-recursive` only work with `--folder`
- `--json` and `--json-unit` are mutually exclusive
- `--include-images` requires `--json` or `--json-unit`
- the CLI enforces the same file-size guard as the Python API
Supported Formats
Microsoft Office
- Modern: `.docx`, `.docm`, `.xlsx`, `.xlsm`, `.xlsb`, `.pptx`, `.pptm`
- Legacy: `.doc`, `.dot`, `.xls`, `.xlt`, `.ppt`, `.pot`, `.pps`, `.rtf`
- Alias mapping: `.dotx`, `.dotm`, `.xltx`, `.xltm`, `.potx`, `.potm`, `.ppsx`, `.ppsm`
OpenDocument
- `.odt`, `.ods`, `.odp`, `.odg`, `.odf`
- Alias mapping: `.ott`, `.ots`, `.otp`
Email
- `.eml`, `.msg`, `.mbox`
- `.eml` and `.msg` can parse and expose supported attachments
- `.mbox` yields one result per message
Plain text and data-like formats
- `.txt`, `.md`, `.csv`, `.tsv`, `.json`
- `.yaml`, `.yml`, `.xml`, `.log`, `.ini`, `.cfg`, `.conf`, `.properties`
Web and ebook
- `.html`, `.htm`, `.mhtml`, `.mht`, `.epub`
PDF
- `.pdf`
Archives
- `.zip`, `.tar`, `.7z`
- `.tar.gz`, `.tgz`, `.tar.bz2`, `.tbz2`, `.tar.xz`, `.txz`
For a behavior-focused view of units, attachments, and caveats by format, see doc/format-matrix.md.
SharePoint Integration
The extraction library works independently of SharePoint. The optional sharepoint_io module is a separate helper layer for listing and downloading files through Microsoft Graph before extraction.
import io
import sharepoint2text
from sharepoint2text.sharepoint_io import (
    EntraIDAppCredentials,
    SharePointRestClient,
)

credentials = EntraIDAppCredentials(
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    client_secret="your-client-secret",
)

client = SharePointRestClient(
    site_url="https://contoso.sharepoint.com/sites/Documents",
    credentials=credentials,
)

for file_meta in client.list_all_files():
    data = client.download_file(file_meta.id)
    extractor = sharepoint2text.get_extractor(file_meta.name)
    for result in extractor(io.BytesIO(data), path=file_meta.name):
        print(result.get_full_text()[:200])
Setup details live in sharepoint2text/sharepoint_io/SETUP.md.
Operational Constraints
These are the points an engineering team usually needs before adopting the package:
- No OCR: scanned-image PDFs will often produce little or no text
- No external office renderer: output is extraction-oriented, not fidelity-oriented
- Word-like formats do not expose reliable page boundaries
- Nested archives are intentionally skipped
- Password-protected or encrypted inputs raise extraction errors
- Large files and highly compressed archives are guarded by size limits and zip-bomb protections
Archive behavior
- archives are processed one level deep
- supported non-archive files inside an archive can yield extraction results
- nested archives are skipped as a safety measure
- 7z extraction is capped at 100 MB internally
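The one-level-deep rule can be sketched for zip input using only the standard library. This is an illustration of the skipping behavior described above, not the package's internal code, and it omits the size and zip-bomb guards the library applies:

```python
import io
import zipfile

# Illustrative subset of archive suffixes to skip when nested.
ARCHIVE_SUFFIXES = (".zip", ".tar", ".7z", ".tar.gz", ".tgz")

def iter_zip_members(data: bytes):
    """Yield (name, payload) for non-archive members, one level deep."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            if info.filename.lower().endswith(ARCHIVE_SUFFIXES):
                continue  # nested archives are skipped as a safety measure
            yield info.filename, zf.read(info)
```

Members that survive the filter would then be routed to per-format extractors by suffix.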
Performance guidance
- set `ignore_images=True` when image payloads are not needed
- use `iterate_units()` for chunk-wise downstream processing instead of materializing one large string when structure matters
- keep size limits enabled unless you trust the input source
Failure Modes and Exceptions
Common exceptions:
- `ExtractionFileFormatNotSupportedError`
- `ExtractionFileEncryptedError`
- `ExtractionFileTooLargeError`
- `ExtractionLegacyMicrosoftParsingError`
- `ExtractionZipBombError`
- `ExtractionPathTraversalError`
- `ExtractionFailedError`
- `InvalidConfigurationError` (for `read_many` with conflicting options)
If you are integrating this into a service, see doc/integration-guide.md and doc/troubleshooting.md.
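The continue-on-error behavior that read_many provides can be emulated in custom pipelines with a generic wrapper; this sketch uses a caller-supplied reader and a stub (`flaky_reader`), not the library's internals:

```python
import logging

def extract_all(paths, read_file):
    """Run extraction per file; log and skip failures instead of aborting."""
    for path in paths:
        try:
            yield from read_file(path)
        except Exception:
            logging.warning("extraction failed for %s", path, exc_info=True)

def flaky_reader(path):
    """Stand-in reader for demonstration: fails on .bad inputs."""
    if path.endswith(".bad"):
        raise ValueError("simulated parse failure")
    yield f"text of {path}"

print(list(extract_all(["a.docx", "b.bad", "c.pdf"], flaky_reader)))
# ['text of a.docx', 'text of c.pdf']
```

In a real service you would catch the specific exception classes listed above rather than bare `Exception`, and decide per class whether to skip, retry, or alert.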
Serialization
import json
import sharepoint2text
result = next(sharepoint2text.read_file("document.docx"))
payload = result.to_json()
print(json.dumps(payload))
Restore from JSON:
from sharepoint2text.parsing.extractors.data_types import ExtractionInterface
restored = ExtractionInterface.from_json(payload)
Additional Documentation
- doc/cli.md: complete CLI reference with examples
- doc/direct-extractors.md: call format-specific extractors directly and work with concrete result attributes
- doc/format-matrix.md: per-format behavior, units, and caveats
- doc/improvements.md: roadmap and improvement ideas
- CONTRIBUTING.md: contributor workflow
- CHANGELOG.md: release history
License
Apache 2.0. See LICENSE.
Disclaimer
This project is not affiliated with, endorsed by, or sponsored by Microsoft.