
A text-extraction application that facilitates string consumption.

Project description

TEXTSPITTER.GIT

Transforming documents into insights, effortlessly and efficiently.

Built with the tools and technologies:

TOML, Pytest, Python, GitHub Actions, uv



Overview

TextSpitter is a developer tool that extracts text from documents in several formats, including plain text, CSV, DOCX, and PDF.

Why TextSpitter?

TextSpitter gives developers one consistent way to read text out of different document types. The core features include:

  • 📦 Robust Dependency Management: Ensures a stable development environment with essential libraries for seamless functionality.
  • 📄 File Extraction Capabilities: Standardizes handling of text, CSV, DOCX, and PDF files for smooth integration.
  • 🛠️ Enhanced Logging: Utilizes loguru for sophisticated error tracking, improving debugging and maintenance.
  • 🚀 Automated Publishing: Streamlines the release process with GitHub Actions for continuous delivery.
  • 🖥️ Code Quality Tools: Integrates black and ruff for consistent code formatting and linting.

Features

Component Details
⚙️ Architecture
  • Modular design for text processing
  • Utilizes a pipeline approach for data flow
🔩 Code Quality
  • Adheres to PEP 8 style guidelines
  • Includes type hints for better readability
📄 Documentation
  • Basic README file present
  • Inline comments for complex functions
🔌 Integrations
  • CI/CD with GitHub Actions
  • Package management via pip
🧩 Modularity
  • Core functionalities separated into modules
  • Reusable components for text manipulation
🧪 Testing
  • Unit tests using pytest
  • Mocking capabilities with pytest-mock
⚡️ Performance
  • Efficient handling of large text files
  • Optimized algorithms for text parsing
🛡️ Security
  • Input validation to prevent injection attacks
  • Dependencies regularly updated for security patches
📦 Dependencies
  • Core libraries: pymupdf, lxml, python-docx
  • Development tools: pytest, loguru
🚀 Scalability
  • Designed to handle increasing text data volumes
  • Supports multi-threading for concurrent processing

---

## Project Structure

```sh
└── TextSpitter.git/
    ├── .github
    │   └── workflows
    ├── _config.yml
    ├── core_requirements.in
    ├── core_requirements.txt
    ├── dev_requirements.in
    ├── dev_requirements.txt
    ├── LICENSE
    ├── pyproject.toml
    ├── readme-ai.md
    ├── README.md
    ├── requirements.txt
    ├── setup_py.backup
    ├── TextSpitter
    │   ├── __init__.py
    │   ├── core.py
    │   ├── logger.py
    │   └── main.py
    └── uv.lock
```

Project Index

TEXTSPITTER.GIT/
⦿ __root__
File Name Summary
core_requirements.in - Defines the core dependencies for document processing and testing. It lists loguru for logging, PyMuPDF and pypdf for PDF manipulation, and python-docx for Word document handling, along with pytest as the testing framework.
core_requirements.txt - Specifies the core dependencies and their versions, so that every environment installs the same libraries for logging, document processing, and testing.
dev_requirements.in - Defines the development dependencies. It references the core requirements and adds black for code formatting and ruff for linting, giving contributors a consistent setup.
dev_requirements.txt - Specifies the development packages and their versions, giving developers a consistent environment and reducing setup issues. The file is generated automatically, which simplifies maintaining and updating dependencies.
LICENSE - The MIT License permits free use, modification, and distribution of the software, provided the copyright and license notice are retained. It also limits the authors' liability.
pyproject.toml - Configures linting, formatting, and packaging for the text-extraction application, TextSpitter. It defines rules for code quality and style, and specifies the project metadata, dependencies, and development tools.
requirements.txt - Specifies the required libraries and their versions, including lxml, pymupdf, pypdf2, and python-docx, which support document processing and manipulation.
_config.yml - Configures the project's Jekyll site to use the Cayman theme, which defines the site's layout and appearance.
⦿ TextSpitter
File Name Summary
core.py - FileExtractor is the core component for extracting and processing content from text, CSV, DOCX, and PDF files. It standardizes file handling by providing methods that read and decode file contents while managing different input types.
logger.py - Replaces basic print statements with a more structured error-capturing mechanism. It integrates the loguru library to keep error tracking organized, which improves debugging and maintenance across the codebase.
main.py - WordLoader is the central component that loads and processes files through its integration with FileExtractor. It determines the appropriate extraction method from the file's extension and MIME type, following an object-oriented design.
⦿ .github.workflows
File Name Summary
python-publish.yml - A GitHub Actions workflow that builds the Python package and uploads it to a package registry when a release is created. This supports continuous delivery of new versions.
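
The summary for main.py describes choosing an extraction method from a file's extension and MIME type. A minimal sketch of that kind of dispatch, using only the standard library, might look like the following. The names here (read_plain_text, read_csv_text, pick_handler, extract_text) are hypothetical and chosen for illustration; they are not TextSpitter's actual API. DOCX and PDF handling is left out because it depends on third-party libraries.

```python
import csv
import mimetypes
from pathlib import Path


# Hypothetical handlers for illustration only; not TextSpitter's real methods.
def read_plain_text(path: Path) -> str:
    """Read a text file, replacing bytes that are not valid UTF-8."""
    return path.read_text(encoding="utf-8", errors="replace")


def read_csv_text(path: Path) -> str:
    """Flatten a CSV file into one line of comma-separated cells per row."""
    with path.open(newline="", encoding="utf-8", errors="replace") as handle:
        return "\n".join(", ".join(row) for row in csv.reader(handle))


# Lookup tables: try the file extension first, then the guessed MIME type.
HANDLERS_BY_SUFFIX = {".txt": read_plain_text, ".csv": read_csv_text}
HANDLERS_BY_MIME = {"text/plain": read_plain_text, "text/csv": read_csv_text}


def pick_handler(path: Path):
    """Return the handler for this file, or raise ValueError if none fits."""
    handler = HANDLERS_BY_SUFFIX.get(path.suffix.lower())
    if handler is None:
        mime_type, _encoding = mimetypes.guess_type(path.name)
        handler = HANDLERS_BY_MIME.get(mime_type)
    if handler is None:
        raise ValueError(f"Unsupported file type: {path.name}")
    return handler


def extract_text(path) -> str:
    """Extract text from a file using the handler chosen for its type."""
    path = Path(path)
    return pick_handler(path)(path)
```

Keeping the lookup tables separate from the handlers means a new format can be supported by adding one function and one table entry, without touching the dispatch logic.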

Getting Started

Prerequisites

This project requires the following dependencies:

  • Programming Language: Python
  • Package Manager: pip or uv

Installation

Build TextSpitter from source and install its dependencies:

  1. Clone the repository:

    git clone https://github.com/fsecada01/TextSpitter.git
    
  2. Navigate to the project directory:

    cd TextSpitter
    
  3. Install the dependencies:

    Using pip:

    pip install -r core_requirements.txt -r dev_requirements.txt

    Using uv:

    uv sync --all-extras --dev
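
After installing, it can be useful to confirm that the core libraries are actually present in the active environment. The sketch below is one way to do that with the standard library's importlib.metadata. The distribution names are taken from the Dependencies entry in the Features section; the function name installed_versions is an assumption made for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as listed under Dependencies in this README.
# Adjust the tuple if your requirements files differ.
CORE_PACKAGES = ("pymupdf", "lxml", "python-docx", "loguru")


def installed_versions(packages=CORE_PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```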
    

Usage

Run the project with the commands below, replacing {entrypoint} with the script you want to run:

Using pip:

python {entrypoint}

Using uv:

uv run python {entrypoint}

Testing

TextSpitter uses the pytest test framework. Run the test suite with:

Using pip:

pytest

Using uv:

uv run pytest tests/
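
For contributors adding tests, a pytest test is a plain function whose name starts with test_ and which uses bare assert statements. The sketch below shows the general shape using pytest's built-in tmp_path fixture. The function read_text_file is a hypothetical stand-in for the code under test and is not part of TextSpitter's API.

```python
from pathlib import Path


def read_text_file(path) -> str:
    """Hypothetical stand-in for the code under test; not TextSpitter's API."""
    return Path(path).read_text(encoding="utf-8")


def test_read_text_file_returns_contents(tmp_path):
    # tmp_path is a pytest fixture that supplies a fresh temporary directory.
    sample = tmp_path / "sample.txt"
    sample.write_text("hello world", encoding="utf-8")
    assert read_text_file(sample) == "hello world"


def test_read_text_file_keeps_non_ascii(tmp_path):
    # Non-ASCII characters should survive a UTF-8 write and read.
    sample = tmp_path / "accents.txt"
    sample.write_text("café", encoding="utf-8")
    assert read_text_file(sample) == "café"
```

Saved in a file whose name starts with test_, these functions are collected automatically when pytest runs.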

Roadmap

  • Spruce up documentation
  • Add stream functionality for S3-based file reading
  • Expand functionality to other file types (e.g., code files, improved CSV handling)
  • TBD

Contributing

  • 💬 Join the Discussions: Share your insights, provide feedback, or ask questions.
  • 🐛 Report Issues: Submit bugs found or log feature requests for the TextSpitter project.
  • 💡 Submit Pull Requests: Review open PRs, and submit your own PRs.
Contributing Guidelines
  1. Fork the Repository: Start by forking the project repository to your GitHub account.
  2. Clone Locally: Clone the forked repository to your local machine using a git client, replacing <your-username> with your GitHub username.
    git clone https://github.com/<your-username>/TextSpitter.git
    
  3. Create a New Branch: Always work on a new branch, giving it a descriptive name.
    git checkout -b new-feature-x
    
  4. Make Your Changes: Develop and test your changes locally.
  5. Commit Your Changes: Commit with a clear message describing your updates.
    git commit -m 'Implemented new feature x.'
    
  6. Push to GitHub: Push the changes to your forked repository.
    git push origin new-feature-x
    
  7. Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
  8. Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!

License

TextSpitter is released under the MIT License. For more details, refer to the LICENSE file.



Download files

Source Distribution

textspitter-0.4.0.tar.gz (19.9 kB)

Uploaded Source

Built Distribution

textspitter-0.4.0-py3-none-any.whl (14.0 kB)

Uploaded Python 3

File details

Details for the file textspitter-0.4.0.tar.gz.

File metadata

  • Download URL: textspitter-0.4.0.tar.gz
  • Size: 19.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.23

File hashes

Hashes for textspitter-0.4.0.tar.gz
Algorithm Hash digest
SHA256 4177c96ba970c1a1815b144762ebf498267f5fe721a70b86e40688642dcebea3
MD5 8d41d1fbd493377b274f34ccf66feeb3
BLAKE2b-256 c51e418502f9a4520422eab8a7a95ef0e4efcfc514882b773c22de8990b603f8

File details

Details for the file textspitter-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: textspitter-0.4.0-py3-none-any.whl
  • Size: 14.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.23

File hashes

Hashes for textspitter-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8b568ca4a437ec9cf00d28fe1350389f40a013104de630e094f453aa098bbbed
MD5 9c1efadf93d51bf65421dd26a9614db5
BLAKE2b-256 1e86b4bfdae0bc23ef85848a1da6c4e33c4548fb058c8e2275ec33521375b2e2
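
The SHA256 digests listed above can be used to check that a downloaded file has not been corrupted or altered. The following is a minimal sketch of that check with Python's hashlib. The function names sha256_of and matches_published_digest are assumptions made for this example, and the file is read in chunks so that large archives do not need to fit in memory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example call, assuming the source distribution was saved to the
# current directory under its published name:
#
# matches_published_digest(
#     "textspitter-0.4.0.tar.gz",
#     "4177c96ba970c1a1815b144762ebf498267f5fe721a70b86e40688642dcebea3",
# )
```

A result of False means the file differs from the published one and should be downloaded again rather than installed.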
