A text-extraction application that facilitates string consumption.
TEXTSPITTER.GIT
Transforming documents into insights, effortlessly and efficiently.
Table of Contents
- Table of Contents
- Overview
- Features
- Project Structure
- Getting Started
- Roadmap
- Contributing
- License
- Acknowledgments
Overview
TextSpitter is a developer tool that extracts text from plain-text, CSV, DOCX, and PDF files through a single, consistent interface.
Why TextSpitter?
This project gives developers one standardized way to read the contents of documents in different formats. The core features include:
- Robust Dependency Management: Ensures a stable development environment with essential libraries for seamless functionality.
- File Extraction Capabilities: Standardizes handling of text, CSV, DOCX, and PDF files for smooth integration.
- Enhanced Logging: Utilizes loguru for sophisticated error tracking, improving debugging and maintenance.
- Automated Publishing: Streamlines the release process with GitHub Actions for continuous delivery.
- Code Quality Tools: Integrates black and ruff for consistent code formatting and linting.
Features
| Component | Details |
|---|---|
| Architecture | |
| Code Quality | |
| Documentation | |
| Integrations | |
| Modularity | |
| Testing | |
| Performance | |
| Security | |
| Dependencies | |
| Scalability | |
---
## Project Structure
```sh
└── TextSpitter.git/
    ├── .github
    │   └── workflows
    ├── _config.yml
    ├── core_requirements.in
    ├── core_requirements.txt
    ├── dev_requirements.in
    ├── dev_requirements.txt
    ├── LICENSE
    ├── pyproject.toml
    ├── readme-ai.md
    ├── README.md
    ├── requirements.txt
    ├── setup_py.backup
    ├── TextSpitter
    │   ├── __init__.py
    │   ├── core.py
    │   ├── logger.py
    │   └── main.py
    └── uv.lock
```
Project Index
TEXTSPITTER.GIT/
__root__

| File Name | Summary |
|---|---|
| core_requirements.in | Defines the project's core dependencies: loguru for logging, PyMuPDF and pypdf for PDF manipulation, python-docx for Word document handling, and pytest for testing. |
| core_requirements.txt | Pins the versions of the core dependencies so that logging, document processing, and testing behave consistently across environments. |
| dev_requirements.in | Defines development dependencies for contributors. It references the core requirements and adds black for code formatting and ruff for linting. |
| dev_requirements.txt | Specifies the development packages and their versions. The file is generated automatically, which keeps contributor environments consistent and reduces setup issues. |
| LICENSE | MIT License. Permits free use, modification, and distribution of the software while limiting liability for the authors. |
| pyproject.toml | Configures linting, formatting, and packaging for TextSpitter, and specifies project metadata, dependencies, and development tools. |
| requirements.txt | Specifies required libraries and their versions, including lxml, pymupdf, pypdf2, and python-docx, which support document processing and manipulation. |
| _config.yml | Configures the Jekyll site to use the Cayman theme, which defines the layout and appearance of the project website. |
TextSpitter

| File Name | Summary |
|---|---|
| core.py | FileExtractor is the core component for extracting content from text, CSV, DOCX, and PDF files. It standardizes file handling by providing methods that read and decode file contents while managing different input types. |
| logger.py | Replaces basic print statements with a more structured error-capturing mechanism built on the loguru library, which improves debugging and maintenance. |
| main.py | WordLoader is the central component for loading and processing files. It works through FileExtractor and determines the appropriate extraction method from the file extension and MIME type. |
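The extension and MIME-type dispatch described for main.py, together with the input handling described for core.py, can be sketched using only the standard library. This is a minimal illustration and not TextSpitter's actual code: the names `to_bytes`, `pick_extractor`, `to_text`, `BY_EXTENSION`, and `BY_MIME` are hypothetical, and only the plain-text and CSV branches are shown because DOCX and PDF would need python-docx and PyMuPDF or pypdf.

```python
import csv
import io
import mimetypes
from pathlib import Path
from typing import BinaryIO, Callable, Dict, Union

# Hypothetical input type: a path, raw bytes, or an open binary file object.
Source = Union[str, Path, bytes, BinaryIO]


def to_bytes(source: Source) -> bytes:
    """Normalize a path, raw bytes, or binary file object to bytes."""
    if isinstance(source, bytes):
        return source
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()
    return source.read()  # assumed to be a binary file-like object


def extract_plain(data: bytes) -> str:
    # Decode as UTF-8, replacing undecodable bytes instead of failing.
    return data.decode("utf-8", errors="replace")


def extract_csv(data: bytes) -> str:
    # Join each row's cells with single spaces, one row per line.
    rows = csv.reader(io.StringIO(extract_plain(data)))
    return "\n".join(" ".join(row) for row in rows)


# Hypothetical registries: the extension is checked first, then the MIME type.
BY_EXTENSION: Dict[str, Callable[[bytes], str]] = {
    ".txt": extract_plain,
    ".csv": extract_csv,
}
BY_MIME: Dict[str, Callable[[bytes], str]] = {
    "text/plain": extract_plain,
    "text/csv": extract_csv,
}


def pick_extractor(filename: str) -> Callable[[bytes], str]:
    """Choose an extractor from the file extension, then the guessed MIME type."""
    ext = Path(filename).suffix.lower()
    if ext in BY_EXTENSION:
        return BY_EXTENSION[ext]
    mime, _ = mimetypes.guess_type(filename)
    if mime in BY_MIME:
        return BY_MIME[mime]
    raise ValueError(f"unsupported file type: {filename!r} ({mime})")


def to_text(source: Source, filename: str) -> str:
    """Read the source and run the extractor selected by its file name."""
    return pick_extractor(filename)(to_bytes(source))
```

Keeping the lookup tables separate from the extractor functions means a new format could be supported by registering one more entry, without touching the dispatch logic.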
.github
workflows

| File Name | Summary |
|---|---|
| python-publish.yml | GitHub Actions workflow that builds the Python package and uploads it to a package registry when a release is created, supporting continuous delivery. |
Getting Started
Prerequisites
This project requires the following dependencies:
- Programming Language: Python
- Package Manager: pip or uv
Installation
Build TextSpitter from source and install its dependencies:
- Clone the repository:
git clone https://github.com/fsecada01/TextSpitter.git
- Navigate to the project directory:
cd TextSpitter
- Install the dependencies:
Using pip:
pip install -r core_requirements.txt -r dev_requirements.txt
Using uv:
uv sync --all-extras --dev
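After installation, a quick check that the expected libraries are importable can save debugging time. The following is a small sketch, not part of TextSpitter: the helper `missing_modules` and the list `EXPECTED_MODULES` are hypothetical, and the list is assumed from the libraries named for core_requirements.in, so adjust it to match your environment.

```python
from importlib.util import find_spec

# Import names for the libraries named in the requirements files (assumed list).
# PyMuPDF is imported as "fitz"; python-docx is imported as "docx".
EXPECTED_MODULES = ["loguru", "fitz", "pypdf", "docx", "pytest"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

`find_spec` only locates a module without importing it, so the check stays fast and has no side effects.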
Usage
Run the project with:
Using pip:
python {entrypoint}
Using uv:
uv run python {entrypoint}
Testing
TextSpitter uses the pytest framework. Run the test suite with:
Using pip:
pytest
Using uv:
uv run pytest tests/
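For contributors new to pytest, a test in this style might look like the sketch below. It is illustrative only: `read_text_file` is a hypothetical stand-in for a text extractor, not a function from TextSpitter, and the test name is likewise an assumption.

```python
from pathlib import Path


def read_text_file(path: Path, encoding: str = "utf-8") -> str:
    """Hypothetical stand-in for a text extractor; not part of TextSpitter."""
    return Path(path).read_text(encoding=encoding)


def test_read_text_file_round_trip(tmp_path):
    # tmp_path is pytest's built-in per-test temporary directory fixture.
    sample = tmp_path / "sample.txt"
    sample.write_text("hello, world\n", encoding="utf-8")
    assert read_text_file(sample) == "hello, world\n"
```

pytest collects any function whose name starts with `test_` in files named `test_*.py`, so a file like this placed under a tests directory would be picked up automatically.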
Roadmap
- Spruce up documentation
- Add stream functionality for S3-based file reading
- Expand functionality to other file types (e.g., code files, improved CSV handling)
- TBD
Contributing
- Join the Discussions: Share your insights, provide feedback, or ask questions.
- Report Issues: Submit bugs found or log feature requests for the TextSpitter project.
- Submit Pull Requests: Review open PRs, and submit your own PRs.
Contributing Guidelines
- Fork the Repository: Start by forking the project repository to your GitHub account.
- Clone Locally: Clone the forked repository to your local machine using a git client, replacing <your-username> with your GitHub user name.
git clone https://github.com/<your-username>/TextSpitter.git
- Create a New Branch: Always work on a new branch, giving it a descriptive name.
git checkout -b new-feature-x
- Make Your Changes: Develop and test your changes locally.
- Commit Your Changes: Commit with a clear message describing your updates.
git commit -m 'Implemented new feature x.'
- Push to GitHub: Push the changes to your forked repository.
git push origin new-feature-x
- Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
- Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!
License
TextSpitter is released under the MIT License. For more details, refer to the LICENSE file.
Acknowledgments
- Credit contributors, inspiration, references, etc.
File details
Details for the file textspitter-0.4.0.tar.gz.
File metadata
- Download URL: textspitter-0.4.0.tar.gz
- Upload date:
- Size: 19.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4177c96ba970c1a1815b144762ebf498267f5fe721a70b86e40688642dcebea3 |
| MD5 | 8d41d1fbd493377b274f34ccf66feeb3 |
| BLAKE2b-256 | c51e418502f9a4520422eab8a7a95ef0e4efcfc514882b773c22de8990b603f8 |
File details
Details for the file textspitter-0.4.0-py3-none-any.whl.
File metadata
- Download URL: textspitter-0.4.0-py3-none-any.whl
- Upload date:
- Size: 14.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8b568ca4a437ec9cf00d28fe1350389f40a013104de630e094f453aa098bbbed |
| MD5 | 9c1efadf93d51bf65421dd26a9614db5 |
| BLAKE2b-256 | 1e86b4bfdae0bc23ef85848a1da6c4e33c4548fb058c8e2275ec33521375b2e2 |