
A text-extraction application that facilitates string consumption.

TextSpitter

Transforming documents into insights, effortlessly and efficiently.

Built with the tools and technologies:

TOML, Pytest, Python, GitHub Actions, uv



Overview

TextSpitter is a lightweight Python library that extracts text from documents and source-code files with a single call. It normalises diverse input types (file paths, BytesIO streams, SpooledTemporaryFile objects, and raw bytes) into plain strings, making it well suited to pipelines that feed text into LLMs, search engines, or data-processing workflows.
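
As a rough illustration of what normalising these input types can involve, the sketch below coerces a path, raw bytes, or a binary stream into a single rewound BytesIO. The helper name to_stream is hypothetical and is not part of TextSpitter's API; the library's own handling may differ.

```python
from io import BytesIO
from pathlib import Path
from tempfile import SpooledTemporaryFile
from typing import BinaryIO, Union

Source = Union[str, Path, bytes, bytearray, BinaryIO]


def to_stream(source: Source) -> BytesIO:
    """Coerce a path, raw bytes, or binary stream into a rewound BytesIO.

    Hypothetical helper for illustration; not TextSpitter's actual code.
    """
    if isinstance(source, (bytes, bytearray)):
        return BytesIO(bytes(source))
    if isinstance(source, (str, Path)):
        return BytesIO(Path(source).read_bytes())
    # Anything else is treated as a binary file-like object
    # (BytesIO, SpooledTemporaryFile, an open file handle, ...).
    source.seek(0)  # rewind in case the caller already read from it
    return BytesIO(source.read())


# Demo: a spooled temp file and raw bytes both end up as BytesIO.
with SpooledTemporaryFile() as spooled:
    spooled.write(b"hello from a spooled file")
    print(to_stream(spooled).read().decode("utf-8"))
print(to_stream(b"raw bytes").read().decode("utf-8"))
```

Funnelling every input through one stream type means the downstream readers only need to handle a single interface.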

Why TextSpitter?

  • 📄 Multi-format extraction: PDF (PyMuPDF + PyPDF fallback), DOCX, TXT, CSV, and 50+ programming-language file types.
  • 🔌 Stream-first API: accepts file paths, BytesIO, SpooledTemporaryFile, or raw bytes; no temp files required.
  • 🛠️ Optional structured logging: install textspitter[logging] to add loguru; falls back to stdlib logging transparently.
  • 🖥️ CLI included: uv tool install textspitter gives you a textspitter command for quick one-off extractions.
  • 🚀 Automated CI/CD: GitHub Actions run the test matrix (Python 3.12–3.14) and publish docs to GitHub Pages on every push.
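
The optional-logging behaviour can be pictured with a minimal sketch like the one below: try to import loguru and fall back to the standard library when it is missing. The function name get_logger is an assumption made for this example; the project's logger.py may be organised differently.

```python
import logging


def get_logger(name: str = "textspitter"):
    """Return loguru's logger if installed, else a stdlib logger.

    Hypothetical helper; the real logger.py may be organised differently.
    """
    try:
        from loguru import logger  # optional dependency

        return logger
    except ImportError:
        # loguru is absent, so configure and return a stdlib logger.
        logging.basicConfig(level=logging.INFO)
        return logging.getLogger(name)


log = get_logger()
log.info("extraction started")  # both backends expose .info()
```

Because both loguru and the stdlib logger expose info, warning, and error methods, calling code does not need to know which backend it received.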

Features

Component Details
⚙️ Architecture
  • Three-layer design: TextSpitter convenience function → WordLoader dispatcher → FileExtractor low-level reader
  • OOP design enables straightforward subclassing and extension
🔩 Code Quality
  • Strict PEP 8 / ruff linting with black formatting
  • Full type hints; ships a py.typed PEP 561 marker
📄 Documentation
  • API docs auto-published to GitHub Pages via pdoc
  • Quick-start guide, tutorial, use-case examples, and recipes
🔌 Integrations
  • CI/CD with GitHub Actions (tests + docs + PyPI publish)
  • Package management via uv; installable via pip or uv tool install
🧩 Modularity
  • Core FileExtractor separated from dispatch logic in WordLoader
  • Logging abstraction in logger.py isolates the optional loguru dependency
🧪 Testing
  • ~70 pytest tests covering all readers and input types
  • Dual-mode log capture fixture works with or without loguru
⚡️ Performance
  • Class-level frozenset / dict constants avoid per-call allocation
  • Stream rewind avoids re-reading large files
📦 Dependencies
  • Core: pymupdf, pypdf, python-docx
  • Optional logging: loguru (pip install textspitter[logging])
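
To make the performance notes above concrete, here is a toy dispatcher that keeps its extension sets as class-level frozenset constants and rewinds the stream before reading. MiniLoader, its attribute names, and the extension lists are illustrative assumptions, not the library's actual WordLoader or FileExtractor.

```python
from io import BytesIO
from pathlib import Path


class MiniLoader:
    """Toy dispatcher; names and extension sets are illustrative only."""

    # Built once at class-definition time and shared by every instance,
    # so no new set is allocated on each call.
    TEXT_EXTS = frozenset({".txt", ".md", ".py", ".rs"})
    CSV_EXTS = frozenset({".csv"})

    def __init__(self, file_obj: BytesIO, filename: str) -> None:
        self.file_obj = file_obj
        self.ext = Path(filename).suffix.lower()

    def _read_text(self) -> str:
        self.file_obj.seek(0)  # rewind instead of re-opening the source
        return self.file_obj.read().decode("utf-8", errors="replace")

    def load(self) -> str:
        # Dispatch on the file extension taken from the filename.
        if self.ext in self.TEXT_EXTS or self.ext in self.CSV_EXTS:
            return self._read_text()
        raise ValueError(f"unsupported extension: {self.ext!r}")


stream = BytesIO(b"print('hi')\n")
stream.read()  # simulate an earlier read that left the cursor at the end
print(MiniLoader(stream, "script.py").load())
```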

Project Structure

TextSpitter/
├── .github/
│   └── workflows/
│       ├── docs.yml             # pdoc → GitHub Pages
│       ├── python-publish.yml   # PyPI release
│       └── tests.yml            # pytest matrix (3.12–3.14)
├── TextSpitter/
│   ├── __init__.py              # TextSpitter() + WordLoader public API
│   ├── cli.py                   # argparse CLI entry point
│   ├── core.py                  # FileExtractor class
│   ├── logger.py                # Optional loguru / stdlib fallback
│   ├── main.py                  # WordLoader dispatcher
│   ├── py.typed                 # PEP 561 marker
│   └── guide/                   # pdoc documentation pages (subpackage)
├── tests/
│   ├── conftest.py              # shared fixtures (log_capture)
│   ├── test_cli.py
│   ├── test_file_extractor.py
│   ├── test_txt.py
│   └── ...
├── CHANGELOG.md
├── CONTRIBUTING.md
├── pyproject.toml
└── uv.lock

Getting Started

Prerequisites

  • Python ≥ 3.12
  • uv (recommended) or pip

Installation

From PyPI:

pip install textspitter

# With optional loguru logging
pip install "textspitter[logging]"

Using uv:

uv add textspitter

# With optional loguru logging
uv add "textspitter[logging]"

As a standalone CLI tool:

uv tool install textspitter

From source:

git clone https://github.com/fsecada01/TextSpitter.git
cd TextSpitter
uv sync --all-extras --dev

Usage

As a library (one-liner):

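from io import BytesIO
from pathlib import Path

from TextSpitter import TextSpitter

# From a file path
text = TextSpitter(filename="report.pdf")
print(text)

# From a BytesIO stream (the filename is passed alongside the stream)
pdf_bytes = Path("report.pdf").read_bytes()
text = TextSpitter(file_obj=BytesIO(pdf_bytes), filename="report.pdf")

# From raw bytes
docx_bytes = Path("contract.docx").read_bytes()
text = TextSpitter(file_obj=docx_bytes, filename="contract.docx")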
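
When several documents need extracting in one go, a small wrapper can collect results and failures separately. The sketch below is a hypothetical helper (extract_many is not part of TextSpitter); it takes the extraction function as a parameter and uses a stand-in extractor so that it runs without any documents on disk.

```python
from typing import Callable, Iterable


def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], str],
) -> tuple[dict[str, str], dict[str, Exception]]:
    """Run `extract` over each path, separating results from failures."""
    texts: dict[str, str] = {}
    errors: dict[str, Exception] = {}
    for path in paths:
        try:
            texts[path] = extract(path)
        except Exception as exc:  # keep going; report failures at the end
            errors[path] = exc
    return texts, errors


# Stand-in extractor (an assumption for this demo, not the real library).
def fake_extract(path: str) -> str:
    if path.endswith(".bad"):
        raise ValueError(f"cannot read {path}")
    return f"text from {path}"


texts, errors = extract_many(["a.pdf", "b.bad"], fake_extract)
print(sorted(texts), sorted(errors))
```

With the library installed, the same helper could presumably be called as extract_many(paths, lambda p: TextSpitter(filename=p)).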

Using the WordLoader class directly:

from TextSpitter.main import WordLoader

loader = WordLoader(filename="data.csv")
text = loader.file_load()

As a CLI tool:

# Extract a single file to stdout
textspitter report.pdf

# Extract multiple files and write to a combined output file
textspitter file1.pdf file2.docx notes.txt -o combined.txt
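
For readers curious how a command line of this shape can be wired up, here is a minimal argparse sketch that accepts several files and an optional -o target. It is not the project's cli.py: the program name, the --output long option, and the injected extract callable are all assumptions made for illustration.

```python
import argparse
import sys
from typing import Callable, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="textspitter-sketch")
    parser.add_argument("files", nargs="+", help="one or more input files")
    # Only -o appears in the README; --output is an assumed long form.
    parser.add_argument("-o", "--output", help="write combined text to this file")
    return parser


def run(argv: Sequence[str], extract: Callable[[str], str]) -> str:
    """Extract every file, then write to -o if given, else to stdout."""
    args = build_parser().parse_args(argv)
    combined = "\n".join(extract(path) for path in args.files)
    if args.output:
        with open(args.output, "w", encoding="utf-8") as fh:
            fh.write(combined)
    else:
        sys.stdout.write(combined + "\n")
    return combined


# Demo with a stand-in extractor instead of real document parsing.
run(["file1.pdf", "notes.txt"], extract=lambda p: f"[text of {p}]")
```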

Testing

uv run pytest tests/

# With coverage
uv run pytest tests/ --cov=TextSpitter --cov-report=term-missing

Roadmap

v1.x (current)

  • Stream-based API (BytesIO, SpooledTemporaryFile, raw bytes)
  • CLI entry point (uv tool install textspitter)
  • Optional loguru logging with stdlib fallback
  • Programming-language file support (50+ extensions)
  • CI matrix (Python 3.12–3.14) + GitHub Pages docs
  • Async extraction API
  • CSV → structured output (list of dicts)
  • PPTX support
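
As one possible reading of the "CSV → structured output (list of dicts)" item, the standard library already offers a direct route. The function name csv_to_records is hypothetical and says nothing about how the library exposes or will expose this feature.

```python
import csv
from io import StringIO


def csv_to_records(text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per row, keyed by the header row."""
    return list(csv.DictReader(StringIO(text)))


rows = csv_to_records("name,qty\nwidget,3\ngadget,5\n")
print(rows)
```

Note that csv.DictReader leaves every value as a string; any type conversion would be a separate step.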

v2.0: Rust backend (full roadmap)

  • Rust splitting core via PyO3 + Maturin, targeting 10x–40x batch throughput
  • Graceful Python fallback when Rust extension is unavailable
  • manylinux wheels on PyPI for zero-compile installs on Linux
  • Memory-mapped file processing for very large PDFs (memmap2)
  • SIMD-accelerated string search for separator detection
  • Streaming iterator API (yield chunks instead of collecting all)
  • Optional SIMD feature flag (pip install "textspitter[simd]")
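
The "streaming iterator API" idea can be sketched in pure Python as a generator that yields separator-delimited chunks as soon as they are complete, rather than collecting everything first. The name iter_chunks, the default separator, and the read size are assumptions for this example and do not describe the planned Rust-backed interface.

```python
from io import StringIO
from typing import Iterator, TextIO


def iter_chunks(
    stream: TextIO,
    separator: str = "\n\n",
    read_size: int = 4096,
) -> Iterator[str]:
    """Yield separator-delimited chunks without loading the whole stream."""
    buffer = ""
    while True:
        block = stream.read(read_size)
        if not block:
            break
        buffer += block
        # Everything before the last separator is complete; the tail may
        # still grow (or hold half a separator), so keep it buffered.
        *complete, buffer = buffer.split(separator)
        for chunk in complete:
            if chunk:
                yield chunk
    if buffer:
        yield buffer


for chunk in iter_chunks(StringIO("one\n\ntwo\n\nthree"), read_size=3):
    print(repr(chunk))
```

Keeping the unfinished tail in the buffer is what lets a separator that straddles two reads still be detected.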

Contributing

Contributing Guidelines
  1. Fork the Repository: Fork the project to your GitHub account.
  2. Clone Locally: Clone the forked repository.
    git clone https://github.com/fsecada01/TextSpitter.git
    
  3. Create a New Branch: Always work on a new branch.
    git checkout -b new-feature-x
    
  4. Make Your Changes: Develop and test your changes locally.
  5. Commit Your Changes: Commit with a clear message.
    git commit -m 'Add new feature x.'
    
  6. Push to GitHub: Push the changes to your fork.
    git push origin new-feature-x
    
  7. Submit a Pull Request: Create a PR against main. Describe the changes and motivation clearly.
  8. Review: Once approved, your PR will be merged. Thanks for contributing!

License

TextSpitter is released under the MIT License.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • textspitter-2.0.0b1.tar.gz (180.3 kB): Source

Built Distributions

  • textspitter-2.0.0b1-cp310-abi3-win_amd64.whl (4.4 MB): CPython 3.10+, Windows x86-64
  • textspitter-2.0.0b1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB): CPython 3.10+, manylinux: glibc 2.17+ x86-64
  • textspitter-2.0.0b1-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (4.5 MB): CPython 3.10+, manylinux: glibc 2.17+ ARM64
  • textspitter-2.0.0b1-cp310-abi3-macosx_11_0_arm64.whl (4.5 MB): CPython 3.10+, macOS 11.0+ ARM64
  • textspitter-2.0.0b1-cp310-abi3-macosx_10_12_x86_64.whl (4.5 MB): CPython 3.10+, macOS 10.12+ x86-64

File details

Details for the file textspitter-2.0.0b1.tar.gz.

File metadata

  • Download URL: textspitter-2.0.0b1.tar.gz
  • Size: 180.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for textspitter-2.0.0b1.tar.gz
Algorithm Hash digest
SHA256 fce76e778d25be28172408220c1dcaae7b7f6c2500e1fe1077af31ca27348cdc
MD5 a12941b631e7b06811543df180a81570
BLAKE2b-256 fad0d2190762bdde67786306daf20ad783a685e4737cf2219e201ac3a6aeb748
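
A downloaded file can be checked against a published SHA256 digest with the standard library. The sketch below is a generic example; the function names and the block size are choices made here, not part of TextSpitter or PyPI tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, block_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large files stay out of memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare the file's SHA256 with an expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, verify("textspitter-2.0.0b1.tar.gz", "<SHA256 digest listed above>") should return True for an intact download of the source distribution.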


File details

Details for the file textspitter-2.0.0b1-cp310-abi3-win_amd64.whl.

File hashes

Hashes for textspitter-2.0.0b1-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 7ee41bdd88a65b03a53ea456af97750ac6f1919a4b70335d52b026dffa11dd34
MD5 ad3414edfca2b9c77fb8f3ddd29fa1bd
BLAKE2b-256 3a30c75d5d310482fb193d2f9aac15d2a4ea37f9c618b0a5d19d567a48709ec3


File details

Details for the file textspitter-2.0.0b1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for textspitter-2.0.0b1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 14c9d17368ccddbfb44b9b571c11d4d218d21be899433decc98ed89df341b5f0
MD5 6dd627c9175214707501f068a4000829
BLAKE2b-256 0c28ee9c97746f83bfe8b703917c76145683aa3905bcfc011fc863cc0d7baa99


File details

Details for the file textspitter-2.0.0b1-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for textspitter-2.0.0b1-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 98bd9cf35d13a965e85902f47e25077e73120258b4543d2e63d6647ed473feac
MD5 c89fa9020223c2f7cc2686b0c50767ff
BLAKE2b-256 ad1cfd4543ad60a64b342d98ff6950e5b2b2f5d75ee8c9326f608e0ea294bc0b


File details

Details for the file textspitter-2.0.0b1-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for textspitter-2.0.0b1-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 58f886c0cc08ecb52199a09231ce9b00c2314679b16eda5b9cdd9197744b7832
MD5 df71da57a2ed0e165ed19be55da95daa
BLAKE2b-256 be476099f0c5f3a13d458c0db7c3de4eed0437b7d2d0bdf7b8aae13e0171cd8c


File details

Details for the file textspitter-2.0.0b1-cp310-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for textspitter-2.0.0b1-cp310-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 00987161e6ee7d24cfcb0465b2461bf121fe85247517b1f775580b7982c4f3d5
MD5 784a345dc56955b0190ce5eae71aab5b
BLAKE2b-256 693f0657127ce697ff329107da6010a68eb63b5420aa2bcd7e49d3f327b2387c
