A Python package for parsing and transforming various document formats into LLM-ready data with element classification capabilities

NeuraDoc

NeuraDoc is a Python package for parsing and transforming various document formats into LLM-ready data with element classification capabilities. The library intelligently extracts and classifies content from documents for AI/ML workflows.

Features

Multi-format Support: Parse 11 document formats (PDF, Word, TXT, and more; see Supported Document Formats)
Element Extraction: Extract text, images, tables, and diagrams from documents
Classification: Classify document elements by type
Smart Positioning: Position and organize extracted elements intelligently
LLM Integration: Convert extracted data to LLM-ready formats (tokenized structures)
Memory Efficiency: Optimized for processing large documents
Configurable Parsing: Control extraction behavior with custom configurations
Parsing Profiles: Use predefined profiles for different extraction needs (fast, detailed, etc.)
Batch Processing: Process multiple documents with consistent settings
Performance Metrics: Get detailed processing statistics and timing information

Supported Document Formats

NeuraDoc supports the following document formats:

PDF (.pdf)
Microsoft Word (.docx, .doc)
Plain Text (.txt)
Microsoft Excel (.xlsx, .xls)
HTML (.html, .htm)
XML (.xml)
Images (.jpg, .jpeg, .png, .gif)
Microsoft PowerPoint (.pptx, .ppt)
CSV (.csv)
JSON (.json)
Markdown (.md)
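
To check up front whether a file falls into one of the formats listed above, the extensions can be collected into a lookup table. This is only a sketch: `SUPPORTED_EXTENSIONS` and `detect_format` are hypothetical names used for illustration and are not part of NeuraDoc's API.

```python
from pathlib import Path

# Extensions copied from the list above. The mapping itself is an
# illustrative assumption, not something NeuraDoc is known to export.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Microsoft Word",
    ".doc": "Microsoft Word",
    ".txt": "Plain Text",
    ".xlsx": "Microsoft Excel",
    ".xls": "Microsoft Excel",
    ".html": "HTML",
    ".htm": "HTML",
    ".xml": "XML",
    ".jpg": "Images",
    ".jpeg": "Images",
    ".png": "Images",
    ".gif": "Images",
    ".pptx": "Microsoft PowerPoint",
    ".ppt": "Microsoft PowerPoint",
    ".csv": "CSV",
    ".json": "JSON",
    ".md": "Markdown",
}


def detect_format(path):
    """Return the format name for a path, or None if the extension is not listed."""
    return SUPPORTED_EXTENSIONS.get(Path(path).suffix.lower())
```

The lookup is case-insensitive on the extension, so `report.PDF` and `report.pdf` are treated the same way.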

Installation

Basic Installation

pip install neuradoc

Installation with Optional Dependencies

# Install with OCR support
pip install neuradoc[ocr]

# Install with advanced table extraction
pip install neuradoc[tables]

# Install with NLP capabilities
pip install neuradoc[nlp]

# Install with transformer model support
pip install neuradoc[transformers]

# Install with web interface
pip install neuradoc[web]

# Install with all optional dependencies
pip install neuradoc[ocr,tables,nlp,transformers,web]

Quick Start

Basic Usage

import neuradoc

# Load and parse a document
doc = neuradoc.load_document("path/to/your/document.pdf")

# Get all text content
text = doc.get_text_content()

# Get tables
tables = doc.get_tables()

# Get images
images = doc.get_images()

# Save extracted content in different formats
doc.save("output.json", format="json")
doc.save("output.md", format="markdown")
doc.save("output.txt", format="text")
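
The same calls can be applied to a whole folder. The sketch below relies only on `load_document` and `get_text_content` as shown above; `extract_text_batch` is a hypothetical helper, not a NeuraDoc function, and the library may offer its own batch API that this does not use. The loader is passed in as an argument so the helper can be tried without NeuraDoc installed.

```python
from pathlib import Path


def extract_text_batch(folder, loader, pattern="*.pdf"):
    """Map each matching file name in `folder` to its extracted text.

    `loader` is any callable that takes a path string and returns an object
    with a get_text_content() method, for example neuradoc.load_document.
    """
    results = {}
    # sorted() keeps the processing order stable across runs
    for path in sorted(Path(folder).glob(pattern)):
        doc = loader(str(path))
        results[path.name] = doc.get_text_content()
    return results


# Usage (assumes neuradoc is installed and "docs/" contains PDF files):
# import neuradoc
# texts = extract_text_batch("docs/", neuradoc.load_document)
```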

Advanced Usage

import neuradoc
from neuradoc.models.element import ElementType
from neuradoc.transformers.llm_transformer import chunk_document

# Load document
doc = neuradoc.load_document("document.docx")

# Filter elements by type
headings = doc.get_elements_by_type(ElementType.HEADING)
code_blocks = doc.get_elements_by_type(ElementType.CODE)

# Transform document into chunks for LLM processing
chunks = chunk_document(doc, max_chunk_size=1000, overlap=100)

# Process chunks with your LLM
for chunk in chunks:
    # Replace this print with a call to your LLM implementation
    print(f"Chunk: {len(chunk)} characters")
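
To make the `max_chunk_size` and `overlap` parameters concrete, here is a minimal sketch of overlapping chunking on a plain string. It is not NeuraDoc's `chunk_document` implementation, which may split on elements or tokens rather than characters; `chunk_text` is a hypothetical function written only to illustrate the idea.

```python
def chunk_text(text, max_chunk_size=1000, overlap=100):
    """Split text into chunks of at most max_chunk_size characters.

    Consecutive chunks share `overlap` characters, so context that spans a
    chunk boundary appears in both neighbours.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be in the range [0, max_chunk_size)")

    step = max_chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chunk_size])
        # Stop once a chunk reaches the end of the text
        if start + max_chunk_size >= len(text):
            break
    return chunks
```

With the default settings, each chunk starts 900 characters after the previous one, so a 2,500-character input yields three chunks.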

Web Interface

NeuraDoc includes a web interface for document processing:

# Install web dependencies
pip install neuradoc[web]

# Run the web server
python -m neuradoc.web.app

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Release history

0.1.2 (Apr 6, 2025, this version)
0.1.1 (Apr 5, 2025)
0.1.0 (Apr 5, 2025)

Download files

Source Distribution

neuradoc-0.1.2.tar.gz (46.8 kB), uploaded Apr 6, 2025

Built Distribution

neuradoc-0.1.2-py3-none-any.whl (54.4 kB), uploaded Apr 6, 2025, Python 3

File details

Details for the file neuradoc-0.1.2.tar.gz.

File metadata

Download URL: neuradoc-0.1.2.tar.gz
Upload date: Apr 6, 2025
Size: 46.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.10

File hashes

Hashes for neuradoc-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`b441f7a559b3f661bf3dba7cd807d32e1010a4692e0bba8613de0967cf9e6133`
MD5	`dcd37b431c6b6d08bf7d41c8ca95b933`
BLAKE2b-256	`5ba2f6f337a6f1e8b8045249b4f02ae5a2a25bafc161fab05c1ab5cd3c784d1d`


File details

Details for the file neuradoc-0.1.2-py3-none-any.whl.

File metadata

Download URL: neuradoc-0.1.2-py3-none-any.whl
Upload date: Apr 6, 2025
Size: 54.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.10

File hashes

Hashes for neuradoc-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`be33fcb5d48536066bf721156632e28ff586ff888ddf2e118ab1f65b3ec35245`
MD5	`7ac565c86caf8e1a44abad4ec557bdd0`
BLAKE2b-256	`2273b0eeacc636507bb80131ffd88a9a59cf959a0166a4713d0cae6d90f8380f`
