A text-extraction application that facilitates string consumption.
TextSpitter
Transforming documents into insights, effortlessly and efficiently.
Overview
TextSpitter is a lightweight Python library that extracts text from documents and source-code files with a single call. It normalises diverse input types (file paths, BytesIO streams, SpooledTemporaryFile objects, and raw bytes) into plain strings, making it ideal for pipelines that feed text into LLMs, search engines, or data-processing workflows.
Why TextSpitter?
- Multi-format extraction: PDF (PyMuPDF + PyPDF fallback), DOCX, TXT, CSV, and 50+ programming-language file types.
- Stream-first API: accepts file paths, BytesIO, SpooledTemporaryFile, or raw bytes; no temp files required.
- Optional structured logging: install textspitter[logging] to add loguru; falls back to stdlib logging transparently.
- CLI included: uv tool install textspitter gives you a textspitter command for quick one-off extractions.
- Automated CI/CD: GitHub Actions run the test matrix (Python 3.12–3.14) and publish docs to GitHub Pages on every push.
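The stream-first idea can be illustrated with a small standard-library sketch. This is not TextSpitter's internal code: read_all_bytes and to_text are hypothetical helper names, and the sketch only assumes that every accepted input can be reduced to bytes before any format-specific parsing happens.

```python
from io import BytesIO
from pathlib import Path
from tempfile import SpooledTemporaryFile


def read_all_bytes(source):
    """Return the full contents of `source` as bytes.

    Hypothetical helper (not part of TextSpitter): accepts a path
    (str or Path), raw bytes, or a binary file-like object such as
    BytesIO or SpooledTemporaryFile.
    """
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()
    # File-like object: rewind first so a partially read stream is read in full.
    source.seek(0)
    return source.read()


def to_text(source, encoding="utf-8"):
    """Decode the normalised bytes; only meaningful for plain-text inputs."""
    return read_all_bytes(source).decode(encoding, errors="replace")
```

Because every input type is reduced to bytes in one place, nothing needs to be written to disk just to hand a stream to a parser.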
Features
| Component | Details |
|---|---|
| Architecture | |
| Code Quality | |
| Documentation | |
| Integrations | |
| Modularity | |
| Testing | |
| Performance | |
| Dependencies | |
Project Structure
TextSpitter/
├── .github/
│   └── workflows/
│       ├── docs.yml             # pdoc → GitHub Pages
│       ├── python-publish.yml   # PyPI release
│       └── tests.yml            # pytest matrix (3.12–3.14)
├── TextSpitter/
│   ├── __init__.py              # TextSpitter() + WordLoader public API
│   ├── cli.py                   # argparse CLI entry point
│   ├── core.py                  # FileExtractor class
│   ├── logger.py                # Optional loguru / stdlib fallback
│   ├── main.py                  # WordLoader dispatcher
│   ├── py.typed                 # PEP 561 marker
│   └── guide/                   # pdoc documentation pages (subpackage)
├── tests/
│   ├── conftest.py              # shared fixtures (log_capture)
│   ├── test_cli.py
│   ├── test_file_extractor.py
│   ├── test_txt.py
│   └── ...
├── CHANGELOG.md
├── CONTRIBUTING.md
├── pyproject.toml
└── uv.lock
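The tree describes logger.py as an optional loguru / stdlib fallback. A common way to write such a fallback is sketched below; the function name get_logger and the logger name are assumptions for illustration, and the actual module may be organised differently.

```python
import logging


def get_logger(name="textspitter_demo"):
    """Return loguru's logger if it is installed, else a stdlib logger.

    Hypothetical helper: handler and level configuration is left to the
    calling application in both cases.
    """
    try:
        # Only importable when the optional logging extra is installed.
        from loguru import logger
        return logger
    except ImportError:
        return logging.getLogger(name)


log = get_logger()
log.info("extraction started")
```

Both objects expose info, warning, and error methods, so calling code does not need to know which backend it received.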
Getting Started
Prerequisites
- Python ≥ 3.12
- uv (recommended) or pip
Installation
From PyPI:
pip install textspitter
# With optional loguru logging
pip install "textspitter[logging]"
Using uv:
uv add textspitter
# With optional loguru logging
uv add "textspitter[logging]"
As a standalone CLI tool:
uv tool install textspitter
From source:
git clone https://github.com/fsecada01/TextSpitter.git
cd TextSpitter
uv sync --all-extras --dev
Usage
As a library (one-liner):
from TextSpitter import TextSpitter
# From a file path
text = TextSpitter(filename="report.pdf")
print(text)
# From a BytesIO stream
from io import BytesIO
text = TextSpitter(file_obj=BytesIO(pdf_bytes), filename="report.pdf")
# From raw bytes
text = TextSpitter(file_obj=docx_bytes, filename="contract.docx")
Using the WordLoader class directly:
from TextSpitter.main import WordLoader
loader = WordLoader(filename="data.csv")
text = loader.file_load()
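To give a feel for what a dispatcher of this kind does, here is a minimal sketch that picks a handler from the file suffix. It is not WordLoader's implementation: HANDLERS, load_text, and the two handler functions are hypothetical names, and only TXT and CSV are covered so that the example needs nothing beyond the standard library.

```python
import csv
import io
from pathlib import Path


def _load_txt(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")


def _load_csv(data: bytes) -> str:
    # Join each row's cells with a space so the result is one plain string.
    rows = csv.reader(io.StringIO(data.decode("utf-8", errors="replace")))
    return "\n".join(" ".join(row) for row in rows)


# Hypothetical registry mapping a lowercase file suffix to its handler.
HANDLERS = {".txt": _load_txt, ".csv": _load_csv}


def load_text(filename: str, data: bytes) -> str:
    """Pick a handler from the file suffix and return the extracted text."""
    suffix = Path(filename).suffix.lower()
    try:
        handler = HANDLERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
    return handler(data)
```

Keeping the mapping in a single dictionary means a new format only requires one handler function and one registry entry.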
As a CLI tool:
# Extract a single file to stdout
textspitter report.pdf
# Extract multiple files and write to a combined output file
textspitter file1.pdf file2.docx notes.txt -o combined.txt
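The command shape above (one or more input files plus an optional -o target) maps naturally onto argparse, which the project tree names as the basis of cli.py. The sketch below is an illustration only: the long option --output, the program name, and the extract stand-in (which reads plain text rather than calling the library) are all assumptions.

```python
import argparse
import sys
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="textspitter-sketch")
    parser.add_argument("files", nargs="+", help="input files to extract")
    # "-o" appears in the usage above; "--output" is an assumed long form.
    parser.add_argument("-o", "--output", help="write combined text to this file")
    return parser


def extract(path: str) -> str:
    # Stand-in for the real extraction call; handles plain text only.
    return Path(path).read_text(encoding="utf-8", errors="replace")


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    combined = "\n".join(extract(name) for name in args.files)
    if args.output:
        Path(args.output).write_text(combined, encoding="utf-8")
    else:
        sys.stdout.write(combined)
    return 0
```

Accepting argv as a parameter keeps main testable without touching sys.argv.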
Testing
uv run pytest tests/
# With coverage
uv run pytest tests/ --cov=TextSpitter --cov-report=term-missing
Roadmap
- Stream-based API (BytesIO, SpooledTemporaryFile, raw bytes)
- CLI entry point (uv tool install textspitter)
- Optional loguru logging with stdlib fallback
- Programming-language file support (50+ extensions)
- CI matrix (Python 3.12–3.14) + GitHub Pages docs
- Async extraction API
- CSV → structured output (list of dicts)
- PPTX support
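For the CSV roadmap item, one possible shape of "list of dicts" output is sketched below with the standard library's csv.DictReader. The function name csv_to_records is hypothetical, and nothing here reflects a committed design for the library.

```python
import csv
import io


def csv_to_records(data: bytes, encoding: str = "utf-8") -> list[dict[str, str]]:
    """Parse CSV bytes into a list of dicts keyed by the header row."""
    text = data.decode(encoding)
    # newline="" leaves line endings untouched, as the csv module recommends.
    return list(csv.DictReader(io.StringIO(text, newline="")))


records = csv_to_records(b"name,qty\nwidget,3\ngadget,5\n")
```

All values stay strings in this sketch; type conversion would be a separate decision.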
Contributing
- Join the Discussions: Share insights, give feedback, or ask questions.
- Report Issues: Submit bugs or log feature requests.
- Submit Pull Requests: Review open PRs or submit your own.
Contributing Guidelines
- Fork the Repository: Fork the project to your GitHub account.
- Clone Locally: Clone the forked repository.
git clone https://github.com/fsecada01/TextSpitter.git
- Create a New Branch: Always work on a new branch.
git checkout -b new-feature-x
- Make Your Changes: Develop and test your changes locally.
- Commit Your Changes: Commit with a clear message.
git commit -m 'Add new feature x.'
- Push to GitHub: Push the changes to your fork.
git push origin new-feature-x
- Submit a Pull Request: Create a PR against main. Describe the changes and motivation clearly.
- Review: Once approved, your PR will be merged. Thanks for contributing!
License
TextSpitter is released under the MIT License.
Download files
File details
Details for the file textspitter-1.0.0.tar.gz.
File metadata
- Download URL: textspitter-1.0.0.tar.gz
- Upload date:
- Size: 29.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.25
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7d6702839cf2ba62b24480795f2b26fe5f92986b4a230991d5c8034534607632 |
| MD5 | cb92f1a72f0d74204c246468ce0bb07d |
| BLAKE2b-256 | 55099506e7cea71d1aa591b1ab898fdf522f1dd2c01ac37f4fed451c19cfd187 |
File details
Details for the file textspitter-1.0.0-py3-none-any.whl.
File metadata
- Download URL: textspitter-1.0.0-py3-none-any.whl
- Upload date:
- Size: 21.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.25
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7d5ab80d59f6bf2253b302693594cbf1f44158cd9b4427d542ba7370d445cad4 |
| MD5 | 73473ad936473580a42b0098c8f49c91 |
| BLAKE2b-256 | 13391099bed4c75f6fc2e0952f701c081f3b9eeb11a135dad1fbcf532bcac335 |