A Python package for parsing and transforming various document formats into LLM-ready data with element classification capabilities
NeuraDoc
NeuraDoc is a Python package for parsing and transforming various document formats into LLM-ready data with element classification capabilities. The library intelligently extracts and classifies content from documents for AI/ML workflows.
Features
- Multi-format Support: Parse 11 document formats, including PDF, Word, and plain text
- Element Extraction: Extract text, images, tables, and diagrams from documents
- Classification: Classify document elements by type
- Smart Positioning: Position and organize extracted elements intelligently
- LLM Integration: Convert extracted data to LLM-ready formats (tokenized structures)
- Memory Efficiency: Optimized for processing large documents
- Configurable Parsing: Control extraction behavior with custom configurations
- Parsing Profiles: Use predefined profiles for different extraction needs (fast, detailed, etc.)
- Batch Processing: Process multiple documents with consistent settings
- Performance Metrics: Get detailed processing statistics and timing information
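The batch processing and performance metrics features can be pictured with a small sketch. The helper below is hypothetical and is not part of NeuraDoc: `process_batch` and its result keys are names chosen for illustration. It applies one loader callable to several files and records how long each file took, assuming the loader takes a path and returns a parsed document.

```python
import time
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List


def process_batch(paths: Iterable[str], loader: Callable[[str], Any]) -> List[Dict[str, Any]]:
    """Apply one loader to many files and record per-file status and timing.

    Hypothetical helper for illustration; not a NeuraDoc API.
    """
    results = []
    for path in paths:
        start = time.perf_counter()
        try:
            document = loader(path)
            error = None
        except Exception as exc:
            # Record the failure and continue so one bad file does not stop the batch.
            document = None
            error = str(exc)
        results.append({
            "path": str(Path(path)),
            "document": document,
            "error": error,
            "seconds": time.perf_counter() - start,
        })
    return results
```

Passing `neuradoc.load_document` as the `loader` argument would run every file through the same entry point used in the Quick Start; any configuration or profile arguments would need to be bound beforehand, for example with `functools.partial`.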
Supported Document Formats
NeuraDoc supports the following document formats:
- PDF (.pdf)
- Microsoft Word (.docx, .doc)
- Plain Text (.txt)
- Microsoft Excel (.xlsx, .xls)
- HTML (.html, .htm)
- XML (.xml)
- Images (.jpg, .jpeg, .png, .gif)
- Microsoft PowerPoint (.pptx, .ppt)
- CSV (.csv)
- JSON (.json)
- Markdown (.md)
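To check whether a file is likely to be supported before loading it, the format list above can be turned into a lookup table. This is a minimal sketch that only inspects the file extension; `SUPPORTED_EXTENSIONS` and `detect_format` are hypothetical names, not part of the NeuraDoc API, and the library itself may detect formats differently.

```python
from pathlib import Path
from typing import Optional

# Extension-to-format table transcribed from the list of supported formats.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Microsoft Word", ".doc": "Microsoft Word",
    ".txt": "Plain Text",
    ".xlsx": "Microsoft Excel", ".xls": "Microsoft Excel",
    ".html": "HTML", ".htm": "HTML",
    ".xml": "XML",
    ".jpg": "Images", ".jpeg": "Images", ".png": "Images", ".gif": "Images",
    ".pptx": "Microsoft PowerPoint", ".ppt": "Microsoft PowerPoint",
    ".csv": "CSV",
    ".json": "JSON",
    ".md": "Markdown",
}


def detect_format(path: str) -> Optional[str]:
    """Return the format name for a path, or None if the extension is not listed."""
    return SUPPORTED_EXTENSIONS.get(Path(path).suffix.lower())
```

For example, `detect_format("Report.PDF")` returns `"PDF"` because the extension is lowercased before the lookup, while `detect_format("archive.zip")` returns `None`.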
Installation
Basic Installation
pip install neuradoc
Installation with Optional Dependencies
# Install with OCR support
pip install neuradoc[ocr]
# Install with advanced table extraction
pip install neuradoc[tables]
# Install with NLP capabilities
pip install neuradoc[nlp]
# Install with transformer model support
pip install neuradoc[transformers]
# Install with web interface
pip install neuradoc[web]
# Install with all optional dependencies
pip install neuradoc[ocr,tables,nlp,transformers,web]
Quick Start
Basic Usage
import neuradoc
# Load and parse a document
doc = neuradoc.load_document("path/to/your/document.pdf")
# Get all text content
text = doc.get_text_content()
# Get tables
tables = doc.get_tables()
# Get images
images = doc.get_images()
# Save extracted content in different formats
doc.save("output.json", format="json")
doc.save("output.md", format="markdown")
doc.save("output.txt", format="text")
Advanced Usage
import neuradoc
from neuradoc.models.element import ElementType
from neuradoc.transformers.llm_transformer import chunk_document
# Load document
doc = neuradoc.load_document("document.docx")
# Filter elements by type
headings = doc.get_elements_by_type(ElementType.HEADING)
code_blocks = doc.get_elements_by_type(ElementType.CODE)
# Transform document into chunks for LLM processing
chunks = chunk_document(doc, max_chunk_size=1000, overlap=100)
# Process chunks with your LLM
for chunk in chunks:
    # Process each chunk with your LLM implementation
    print(f"Chunk: {len(chunk)} characters")
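The `max_chunk_size` and `overlap` arguments suggest a sliding-window split. The sketch below shows that general technique on a plain string. It is not NeuraDoc's implementation of `chunk_document`, which may split on element boundaries or count tokens rather than characters; `chunk_text` is a hypothetical name that reuses the same parameter names for readability.

```python
from typing import List


def chunk_text(text: str, max_chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into character windows that share `overlap` characters.

    Illustrative sliding-window chunker; not the library's chunk_document.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be non-negative and smaller than max_chunk_size")

    step = max_chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

With the values from the example above, each chunk holds at most 1000 characters and repeats the last 100 characters of the previous chunk, so a sentence cut at a boundary still appears whole in one of the two neighbouring chunks.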
Web Interface
NeuraDoc includes a web interface for document processing:
# Install web dependencies
pip install neuradoc[web]
# Run the web server
python -m neuradoc.web.app
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
File details
Details for the file neuradoc-0.1.2.tar.gz.
File metadata
- Download URL: neuradoc-0.1.2.tar.gz
- Upload date:
- Size: 46.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b441f7a559b3f661bf3dba7cd807d32e1010a4692e0bba8613de0967cf9e6133 |
| MD5 | dcd37b431c6b6d08bf7d41c8ca95b933 |
| BLAKE2b-256 | 5ba2f6f337a6f1e8b8045249b4f02ae5a2a25bafc161fab05c1ab5cd3c784d1d |
File details
Details for the file neuradoc-0.1.2-py3-none-any.whl.
File metadata
- Download URL: neuradoc-0.1.2-py3-none-any.whl
- Upload date:
- Size: 54.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | be33fcb5d48536066bf721156632e28ff586ff888ddf2e118ab1f65b3ec35245 |
| MD5 | 7ac565c86caf8e1a44abad4ec557bdd0 |
| BLAKE2b-256 | 2273b0eeacc636507bb80131ffd88a9a59cf959a0166a4713d0cae6d90f8380f |