A text-extraction application that facilitates string consumption.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

TEXTSPITTER.GIT

Transforming documents into insights, effortlessly and efficiently.

Table of Contents

Overview

Features

Project Structure

Project Index

Getting Started

Prerequisites

Installation

Usage

Testing

Roadmap

Contributing

License

Acknowledgments

Overview

TextSpitter is a developer tool that extracts text from plain-text, CSV, DOCX, and PDF files through a single, consistent interface.

Why TextSpitter?

This project streamlines the way developers interact with documents, ensuring a robust and efficient development experience. The core features include:

📦 Robust Dependency Management: Ensures a stable development environment with essential libraries for seamless functionality.

📄 File Extraction Capabilities: Standardizes handling of text, CSV, DOCX, and PDF files for smooth integration.

🛠️ Enhanced Logging: Utilizes loguru for sophisticated error tracking, improving debugging and maintenance.

🚀 Automated Publishing: Streamlines the release process with GitHub Actions for continuous delivery.

🖥️ Code Quality Tools: Integrates black and ruff for consistent code formatting and linting.
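
The move from print statements to loguru-based error tracking listed above can be sketched as follows. This is an illustration, not the contents of TextSpitter's logger.py: the helper names `get_logger` and `safe_decode`, and the fallback to the standard `logging` module when loguru is not installed, are assumptions made for the example.

```python
import logging


def get_logger():
    """Return loguru's logger if it is installed, else a stdlib logger.

    Hypothetical helper: the fallback behaviour is an assumption of this sketch.
    """
    try:
        from loguru import logger  # third-party, optional here
        return logger
    except ImportError:
        logging.basicConfig(level=logging.INFO)
        return logging.getLogger("textspitter-sketch")


log = get_logger()


def safe_decode(data: bytes) -> str:
    """Decode bytes as UTF-8, logging a failure instead of printing it."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        log.error(f"could not decode input: {exc}")
        return ""
```

Both loguru's logger and a standard-library logger expose an `error` method, so calling code does not need to know which one it received.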

Features

Component Details

⚙️ Architecture
Modular design for text processing
Utilizes a pipeline approach for data flow

🔩 Code Quality
Adheres to PEP 8 style guidelines
Includes type hints for better readability

📄 Documentation
Basic README file present
Inline comments for complex functions

🔌 Integrations
CI/CD with GitHub Actions
Package management via pip

🧩 Modularity
Core functionalities separated into modules
Reusable components for text manipulation

🧪 Testing
Unit tests using pytest
Mocking capabilities with pytest-mock

⚡️ Performance
Efficient handling of large text files
Optimized algorithms for text parsing

🛡️ Security
Input validation to prevent injection attacks
Dependencies regularly updated for security patches

📦 Dependencies
Core libraries: pymupdf, lxml, python-docx
Development tools: pytest, loguru

🚀 Scalability
Designed to handle increasing text data volumes
Supports multi-threading for concurrent processing

Project Structure

```sh
└── TextSpitter.git/
    ├── .github
    │   └── workflows
    ├── _config.yml
    ├── core_requirements.in
    ├── core_requirements.txt
    ├── dev_requirements.in
    ├── dev_requirements.txt
    ├── LICENSE
    ├── pyproject.toml
    ├── readme-ai.md
    ├── README.md
    ├── requirements.txt
    ├── setup_py.backup
    ├── TextSpitter
    │   ├── __init__.py
    │   ├── core.py
    │   ├── logger.py
    │   └── main.py
    └── uv.lock
```

Project Index

TEXTSPITTER.GIT/

⦿ __root__

File Name Summary

core_requirements.in - Defines essential dependencies for the project, ensuring a robust environment for document processing and testing
- By incorporating libraries such as loguru for logging, PyMuPDF and pypdf for PDF manipulation, and python-docx for Word document handling, it streamlines development and enhances functionality
- Additionally, it includes testing frameworks like pytest to facilitate effective testing practices, contributing to overall code quality and reliability.

core_requirements.txt - Defines essential dependencies for the project, ensuring that all necessary libraries are available for seamless functionality and testing
- By managing package versions, it facilitates a consistent development environment, supporting various components like logging, document processing, and testing frameworks
- This contributes to the overall stability and reliability of the codebase architecture, enabling efficient development and maintenance processes.

dev_requirements.in - Defines development dependencies for the project, ensuring a consistent environment for contributors
- By referencing core requirements and including essential tools like black for code formatting and ruff for linting, it streamlines the setup process
- This facilitates collaboration and enhances code quality across the codebase, ultimately supporting efficient development practices within the overall architecture.

dev_requirements.txt - Facilitates the management of development dependencies for the project by specifying required packages and their versions
- This ensures a consistent environment for developers, enhancing collaboration and reducing setup issues
- By automating the generation of this requirements file, it streamlines the process of maintaining and updating dependencies, ultimately supporting the overall architecture of the codebase focused on Jupyter-related functionalities.

LICENSE - MIT License facilitates the free use, modification, and distribution of the software, ensuring that users can leverage the codebase without restrictions
- It establishes the legal framework that protects both the authors and users, promoting collaboration and innovation within the project
- By providing this license, the project encourages community engagement while limiting liability for the authors.

pyproject.toml - Configuration settings streamline the linting, formatting, and packaging processes for the text-extraction application, TextSpitter
- By defining rules for code quality and style, it ensures consistency and maintainability across the codebase
- Additionally, it specifies project metadata, dependencies, and development tools, facilitating a smooth development experience and enhancing collaboration among contributors.

requirements.txt - Manages project dependencies for a Python application by specifying required libraries and their versions
- Ensures compatibility and stability within the codebase, facilitating the installation of essential packages such as lxml, pymupdf, pypdf2, and python-docx
- This structure supports document processing and manipulation functionalities, contributing to the overall architecture's efficiency and reliability.

_config.yml - Configures the Jekyll site to utilize the Cayman theme, enhancing the visual presentation and user experience of the project
- This setup defines the overall look and layout of the website, keeping the design consistent with the project's branding and purpose.

⦿ TextSpitter

File Name Summary

core.py - FileExtractor serves as a core component for extracting and processing content from various file types, including text, CSV, DOCX, and PDF formats
- It standardizes file handling by providing methods to read and decode file contents while managing different input types
- This functionality enhances the overall architecture by enabling seamless integration of file processing capabilities within the broader application ecosystem.

logger.py - Enhancing application reliability through robust logging capabilities, the logger module facilitates a transition from basic print statements to a more sophisticated error capturing mechanism
- By integrating the loguru library, it ensures that error tracking is efficient and organized, ultimately contributing to improved debugging and maintenance across the entire codebase architecture.

main.py - WordLoader serves as a central component in the application, facilitating the loading and processing of various file types through its integration with the FileExtractor
- By determining the appropriate extraction method based on file extensions and MIME types, it enhances the system's capability to handle diverse text formats, ensuring a seamless user experience while adhering to object-oriented design principles for future scalability.
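
The extension and MIME-type dispatch described for main.py, together with the handling of different input types described for core.py, can be illustrated with a minimal standard-library sketch. It is not TextSpitter's actual implementation: `read_text`, `read_csv`, `pick_extractor`, `extract`, and the two lookup tables are hypothetical names, and only plain-text and CSV handling is shown so the example runs without PyMuPDF or python-docx.

```python
import csv
import io
import mimetypes
from pathlib import Path


def read_text(data: bytes) -> str:
    """Decode raw bytes as UTF-8, replacing undecodable bytes."""
    return data.decode("utf-8", errors="replace")


def read_csv(data: bytes) -> str:
    """Flatten CSV rows into lines of space-separated cell values."""
    rows = csv.reader(io.StringIO(read_text(data)))
    return "\n".join(" ".join(row) for row in rows)


# Hypothetical lookup tables; a fuller version would add PDF and DOCX entries.
BY_SUFFIX = {".txt": read_text, ".csv": read_csv}
BY_MIME = {"text/plain": read_text, "text/csv": read_csv}


def pick_extractor(filename: str):
    """Pick an extractor by file extension first, then by guessed MIME type."""
    suffix = Path(filename).suffix.lower()
    if suffix in BY_SUFFIX:
        return BY_SUFFIX[suffix]
    mime, _ = mimetypes.guess_type(filename)
    if mime in BY_MIME:
        return BY_MIME[mime]
    raise ValueError(f"unsupported file type: {filename!r} (MIME: {mime})")


def extract(source, filename: str = "") -> str:
    """Accept a path or raw bytes and return the extracted text."""
    if isinstance(source, (str, Path)):
        filename = filename or str(source)
        data = Path(source).read_bytes()
    else:
        data = bytes(source)
    return pick_extractor(filename)(data)
```

Checking the extension before the MIME type keeps the result predictable on systems whose MIME databases map the same extension differently.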

⦿ .github

⦿ .github.workflows

File Name Summary

python-publish.yml - Automates the process of publishing a Python package to a package registry upon the creation of a release
- By leveraging GitHub Actions, it ensures that the package is built and uploaded seamlessly, enhancing the overall workflow efficiency within the project
- This integration supports continuous delivery practices, allowing for streamlined updates and distribution of the software.

Getting Started

Prerequisites

This project requires the following dependencies:

Programming Language: Python 3

Package Manager: pip or uv

Installation

Build TextSpitter from source and install its dependencies:

Clone the repository:

git clone https://github.com/fsecada01/TextSpitter.git

Navigate to the project directory:

cd TextSpitter

Install the dependencies:

Using pip:

pip install -r core_requirements.txt -r dev_requirements.txt

Using uv:

uv sync --all-extras --dev

Usage

Run the project with:

Using pip:

python {entrypoint}

Using uv:

uv run python {entrypoint}

Testing

TextSpitter uses the pytest framework. Run the test suite with:

Using pip:

pytest

Using uv:

uv run pytest tests/
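
A test that pytest would collect might look like the sketch below. The helper `read_plain_text` is a hypothetical stand-in defined inline so the example runs on its own; it does not represent TextSpitter's actual API, and the test name is likewise only illustrative.

```python
import tempfile
from pathlib import Path


def read_plain_text(path) -> str:
    """Hypothetical stand-in for an extractor: read a UTF-8 text file."""
    return Path(path).read_text(encoding="utf-8")


def test_read_plain_text_round_trip():
    """pytest collects functions named test_*; plain asserts are enough."""
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.txt"
        sample.write_text("hello, world\n", encoding="utf-8")
        assert read_plain_text(sample) == "hello, world\n"
```

Saved in a file whose name starts with test_, this function is discovered and run by either of the commands above.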

Roadmap

Spruce up documentation

Add stream functionality for S3-based file reading

Expand functionality to other file types (e.g., code files, improved CSV handling)

TBD

Contributing

💬 Join the Discussions: Share your insights, provide feedback, or ask questions.

🐛 Report Issues: Submit bugs found or log feature requests for the TextSpitter.git project.

💡 Submit Pull Requests: Review open PRs, and submit your own PRs.

Contributing Guidelines

Fork the Repository: Start by forking the project repository to your GitHub account.

Clone Locally: Clone the forked repository to your local machine using a git client.
git clone https://github.com/fsecada01/TextSpitter.git

Create a New Branch: Always work on a new branch, giving it a descriptive name.
git checkout -b new-feature-x

Make Your Changes: Develop and test your changes locally.

Commit Your Changes: Commit with a clear message describing your updates.
git commit -m 'Implemented new feature x.'

Push to GitHub: Push the changes to your forked repository.
git push origin new-feature-x

Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.

Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!

License

TextSpitter is released under the MIT License. For more details, refer to the LICENSE file.

Acknowledgments

Credit contributors, inspiration, references, etc.

Release history

This version

0.4.0

Jun 30, 2025

0.4.0b2 pre-release

Jun 11, 2025

0.4.0b1 pre-release yanked

Jun 11, 2025

Reason this release was yanked:

dependency version conflicts

0.4.0b0 pre-release yanked

Jun 11, 2025

Reason this release was yanked:

version clashing

0.3.7rc4 pre-release

Feb 11, 2025

0.3.7rc3 pre-release

Feb 11, 2025

0.3.7rc2 pre-release

Feb 11, 2025

0.3.7rc1 pre-release yanked

Feb 11, 2025

Reason this release was yanked:

bug that breaks implementation of `file_attr` key

0.3.7b2 pre-release

Dec 14, 2024

0.3.7b1 pre-release

Nov 26, 2024

0.3.7b0 pre-release

Sep 23, 2024

0.3.6

Nov 10, 2021

0.3.5

Oct 10, 2021

0.3.5a5 pre-release

Oct 10, 2021

0.3.5a4 pre-release

Oct 10, 2021

0.3.5a3 pre-release

Oct 10, 2021

0.3.5a2 pre-release

Oct 5, 2021

0.3.5a1 pre-release

Oct 5, 2021

0.3.5a0 pre-release

Oct 10, 2021

0.3.4

Sep 26, 2021

0.3.3.post0

Apr 20, 2021

0.3.3

Apr 20, 2021

0.3.2

Nov 24, 2019

0.3.1

Dec 4, 2018

0.3

Nov 23, 2018

0.2.post0

Nov 12, 2018

0.2

Nov 12, 2018

0.1

Nov 8, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

textspitter-0.4.0.tar.gz (19.9 kB)

Uploaded Jun 30, 2025 Source

Built Distribution

textspitter-0.4.0-py3-none-any.whl (14.0 kB)

Uploaded Jun 30, 2025 Python 3

File details

Details for the file textspitter-0.4.0.tar.gz.

File metadata

Download URL: textspitter-0.4.0.tar.gz
Upload date: Jun 30, 2025
Size: 19.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.23

File hashes

Hashes for textspitter-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`4177c96ba970c1a1815b144762ebf498267f5fe721a70b86e40688642dcebea3`
MD5	`8d41d1fbd493377b274f34ccf66feeb3`
BLAKE2b-256	`c51e418502f9a4520422eab8a7a95ef0e4efcfc514882b773c22de8990b603f8`

See more details on using hashes here.
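
To check a downloaded archive against the SHA256 digest listed above, a short standard-library script along these lines should be enough. The function names `sha256_of` and `matches` and the local file path are assumptions; the expected digest is the one published for textspitter-0.4.0.tar.gz.

```python
import hashlib
import hmac
from pathlib import Path

# SHA256 digest published for textspitter-0.4.0.tar.gz
EXPECTED_SHA256 = "4177c96ba970c1a1815b144762ebf498267f5fe721a70b86e40688642dcebea3"


def sha256_of(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file's digest with the expected hex digest."""
    return hmac.compare_digest(sha256_of(path), expected.lower())


if __name__ == "__main__":
    archive = Path("textspitter-0.4.0.tar.gz")  # assumed download location
    if archive.exists():
        print("OK" if matches(archive) else "MISMATCH")
```

Reading in chunks keeps memory use flat regardless of archive size.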

File details

Details for the file textspitter-0.4.0-py3-none-any.whl.

File metadata

Download URL: textspitter-0.4.0-py3-none-any.whl
Upload date: Jun 30, 2025
Size: 14.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.23

File hashes

Hashes for textspitter-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8b568ca4a437ec9cf00d28fe1350389f40a013104de630e094f453aa098bbbed`
MD5	`9c1efadf93d51bf65421dd26a9614db5`
BLAKE2b-256	`1e86b4bfdae0bc23ef85848a1da6c4e33c4548fb058c8e2275ec33521375b2e2`

See more details on using hashes here.
