A Python package for parsing and transforming various document formats into LLM-ready data with element classification capabilities

NeuraDoc

NeuraDoc is a Python package that parses documents in a range of formats, classifies the elements it extracts, and converts the result into LLM-ready data for AI/ML workflows.

Features

  • Multi-format Support: Parse 11 document format families, including PDF, Word, and plain text (full list under Supported Document Formats)
  • Element Extraction: Extract text, images, tables, and diagrams from documents
  • Classification: Classify document elements by type
  • Smart Positioning: Position and organize extracted elements intelligently
  • LLM Integration: Convert extracted data to LLM-ready formats (tokenized structures)
  • Memory Efficiency: Optimized for processing large documents

Supported Document Formats

NeuraDoc supports the following document formats:

  • PDF (.pdf)
  • Microsoft Word (.docx, .doc)
  • Plain Text (.txt)
  • Microsoft Excel (.xlsx, .xls)
  • HTML (.html, .htm)
  • XML (.xml)
  • Images (.jpg, .jpeg, .png, .gif)
  • Microsoft PowerPoint (.pptx, .ppt)
  • CSV (.csv)
  • JSON (.json)
  • Markdown (.md)
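
When processing a directory of mixed files, it can help to filter by extension before handing anything to the parser. The following is a minimal sketch built only from the extension list above. The names `SUPPORTED_EXTENSIONS`, `is_supported`, and `partition_by_support` are hypothetical helpers for illustration and are not part of the NeuraDoc API.

```python
from pathlib import Path

# Extensions copied from the list above. This set is maintained by hand here;
# it is an assumption that the installed version accepts exactly these.
SUPPORTED_EXTENSIONS = frozenset({
    ".pdf", ".docx", ".doc", ".txt", ".xlsx", ".xls",
    ".html", ".htm", ".xml", ".jpg", ".jpeg", ".png", ".gif",
    ".pptx", ".ppt", ".csv", ".json", ".md",
})


def is_supported(path):
    """Return True if the file extension appears in the documented list."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def partition_by_support(paths):
    """Split paths into (supported, skipped), preserving input order."""
    supported, skipped = [], []
    for path in paths:
        (supported if is_supported(path) else skipped).append(path)
    return supported, skipped
```

The check is case-insensitive, so `Report.PDF` passes, and files without an extension are treated as unsupported.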

Installation

Basic Installation

pip install neuradoc

Installation with Optional Dependencies

# Quote the requirement so shells such as zsh do not expand the square brackets

# Install with OCR support
pip install "neuradoc[ocr]"

# Install with advanced table extraction
pip install "neuradoc[tables]"

# Install with NLP capabilities
pip install "neuradoc[nlp]"

# Install with transformer model support
pip install "neuradoc[transformers]"

# Install with web interface
pip install "neuradoc[web]"

# Install with all optional dependencies
pip install "neuradoc[ocr,tables,nlp,transformers,web]"
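
To see which packages each extra pulls in, the standard library's `importlib.metadata` can read the requirement strings of an installed distribution. The sketch below groups those strings by their `extra == "..."` marker. The helper names `group_by_extra` and `extras_of` are hypothetical, and the requirement names in the usage comment are placeholders, not NeuraDoc's real dependencies.

```python
import re
from importlib import metadata

# Matches the marker fragment  extra == "name"  (single or double quotes).
_EXTRA_RE = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")


def group_by_extra(requirements):
    """Group requirement strings by extra name; core requirements use ''."""
    groups = {}
    for req in requirements:
        name, _, marker = req.partition(";")
        match = _EXTRA_RE.search(marker)
        key = match.group(1) if match else ""
        groups.setdefault(key, []).append(name.strip())
    return groups


def extras_of(dist_name):
    """Return grouped requirements, or None if the distribution is absent."""
    try:
        reqs = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return None
    return group_by_extra(reqs)


# Usage with placeholder requirement strings:
#   group_by_extra(['core-lib>=1.0', 'ocr-lib; extra == "ocr"'])
# With the package installed, extras_of("neuradoc") lists its actual extras.
```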

Quick Start

Basic Usage

import neuradoc

# Load and parse a document
doc = neuradoc.load_document("path/to/your/document.pdf")

# Get all text content
text = doc.get_text_content()

# Get tables
tables = doc.get_tables()

# Get images
images = doc.get_images()

# Save extracted content in different formats
doc.save("output.json", format="json")
doc.save("output.md", format="markdown")
doc.save("output.txt", format="text")
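
Because `doc.save` takes the format as a separate argument, a small helper can derive it from the output file name and keep the two in sync. This is a sketch: the format names `json`, `markdown`, and `text` come from the example above, while the extension mapping and the `infer_format` helper are assumptions made for illustration.

```python
from pathlib import Path

# Extension -> value for the format= argument. The three format names are the
# ones used in the Basic Usage example; the mapping itself is an assumption.
_FORMAT_BY_SUFFIX = {
    ".json": "json",
    ".md": "markdown",
    ".txt": "text",
}


def infer_format(output_path):
    """Return the format name matching the output file's extension."""
    suffix = Path(output_path).suffix.lower()
    try:
        return _FORMAT_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"no known output format for {suffix!r}") from None


# Intended use with a loaded document (not executed here):
#   for out in ("output.json", "output.md", "output.txt"):
#       doc.save(out, format=infer_format(out))
```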

Advanced Usage

import neuradoc
from neuradoc.models.element import ElementType
from neuradoc.transformers.llm_transformer import chunk_document

# Load document
doc = neuradoc.load_document("document.docx")

# Filter elements by type
headings = doc.get_elements_by_type(ElementType.HEADING)
code_blocks = doc.get_elements_by_type(ElementType.CODE)

# Transform document into chunks for LLM processing
chunks = chunk_document(doc, max_chunk_size=1000, overlap=100)

# Process each chunk with your LLM implementation
for chunk in chunks:
    print(f"Chunk: {len(chunk)} characters")
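
To make the `max_chunk_size` and `overlap` parameters concrete, here is a minimal sketch of fixed-size chunking with overlap over a plain string. It is not NeuraDoc's implementation; `chunk_document` may split on element or token boundaries instead. The function `chunk_text` is hypothetical and measures size in characters as a simplifying assumption.

```python
def chunk_text(text, max_chunk_size=1000, overlap=100):
    """Split text into chunks of at most max_chunk_size characters.

    Each chunk after the first starts `overlap` characters before the end
    of the previous one, so context carries across chunk boundaries.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chunk_size")

    step = max_chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks
```

With the defaults, a 2,500-character string yields three chunks starting at offsets 0, 900, and 1,800. A larger overlap preserves more context per chunk at the cost of sending more repeated text to the model.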

Web Interface

NeuraDoc includes a web interface for document processing:

# Install web dependencies
pip install "neuradoc[web]"

# Run the web server
python -m neuradoc.web.app
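
If the server is started from a script, it may be worth checking that the web module is importable first, so a missing `web` extra produces a clear message instead of a traceback. The sketch below uses only the standard library; `module_available` and `run_web_app` are hypothetical helpers, and the module path is taken from the command above.

```python
import importlib.util
import subprocess
import sys


def module_available(name):
    """Return True if the dotted module name can be located.

    Note: for dotted names, find_spec imports the parent packages.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. the top-level one) is missing.
        return False


def run_web_app(module="neuradoc.web.app"):
    """Launch the module with the current interpreter; return its exit code."""
    if not module_available(module):
        print(f'{module} not found; try: pip install "neuradoc[web]"',
              file=sys.stderr)
        return 1
    # Equivalent to running:  python -m neuradoc.web.app
    return subprocess.run([sys.executable, "-m", module]).returncode
```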

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License; see the LICENSE file for details.
