NeuraDoc

NeuraDoc is a Python package for parsing and transforming various document formats into LLM-ready data with element classification capabilities. The library intelligently extracts and classifies content from documents for AI/ML workflows.

Features

  • Multi-format Support: Parse 11 document format families (PDF, Word, plain text, and more; see Supported Document Formats)
  • Element Extraction: Extract text, images, tables, and diagrams from documents
  • Classification: Classify document elements by type
  • Smart Positioning: Position and organize extracted elements intelligently
  • LLM Integration: Convert extracted data to LLM-ready formats (tokenized structures)
  • Memory Efficiency: Optimized for processing large documents
  • Configurable Parsing: Control extraction behavior with custom configurations
  • Parsing Profiles: Use predefined profiles for different extraction needs (fast, detailed, etc.)
  • Batch Processing: Process multiple documents with consistent settings
  • Performance Metrics: Get detailed processing statistics and timing information

Supported Document Formats

NeuraDoc supports the following document formats:

  • PDF (.pdf)
  • Microsoft Word (.docx, .doc)
  • Plain Text (.txt)
  • Microsoft Excel (.xlsx, .xls)
  • HTML (.html, .htm)
  • XML (.xml)
  • Images (.jpg, .jpeg, .png, .gif)
  • Microsoft PowerPoint (.pptx, .ppt)
  • CSV (.csv)
  • JSON (.json)
  • Markdown (.md)
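
Before handing a file to the parser, it can help to check its extension against the list above. The following is a minimal sketch of such a check. The `SUPPORTED_EXTENSIONS` mapping and the `format_family` helper are hypothetical names introduced for illustration and are not part of NeuraDoc's API; only the extensions themselves come from this README.

```python
from pathlib import Path

# Extensions copied from the list above. The mapping itself is an
# illustrative helper (an assumption), not something NeuraDoc exports.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Microsoft Word",
    ".doc": "Microsoft Word",
    ".txt": "Plain Text",
    ".xlsx": "Microsoft Excel",
    ".xls": "Microsoft Excel",
    ".html": "HTML",
    ".htm": "HTML",
    ".xml": "XML",
    ".jpg": "Image",
    ".jpeg": "Image",
    ".png": "Image",
    ".gif": "Image",
    ".pptx": "Microsoft PowerPoint",
    ".ppt": "Microsoft PowerPoint",
    ".csv": "CSV",
    ".json": "JSON",
    ".md": "Markdown",
}


def format_family(path):
    """Return the format family for a path, or None if the extension is not listed."""
    return SUPPORTED_EXTENSIONS.get(Path(path).suffix.lower())
```

The lookup is case-insensitive and uses only the final suffix, so a name such as `archive.tar.gz` is treated as `.gz` and reported as unsupported.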

Installation

Basic Installation

pip install neuradoc

Installation with Optional Dependencies

# Quote the extras so that shells such as zsh do not expand the brackets

# Install with OCR support
pip install "neuradoc[ocr]"

# Install with advanced table extraction
pip install "neuradoc[tables]"

# Install with NLP capabilities
pip install "neuradoc[nlp]"

# Install with transformer model support
pip install "neuradoc[transformers]"

# Install with web interface
pip install "neuradoc[web]"

# Install with all optional dependencies
pip install "neuradoc[ocr,tables,nlp,transformers,web]"

Quick Start

Basic Usage

import neuradoc

# Load and parse a document
doc = neuradoc.load_document("path/to/your/document.pdf")

# Get all text content
text = doc.get_text_content()

# Get tables
tables = doc.get_tables()

# Get images
images = doc.get_images()

# Save extracted content in different formats
doc.save("output.json", format="json")
doc.save("output.md", format="markdown")
doc.save("output.txt", format="text")
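
The same calls can be applied to several files in a loop. The sketch below is one possible way to do that, assuming only the `load_document` and `get_text_content` calls shown above. The `extract_texts` function is a hypothetical helper, not NeuraDoc's own batch API, and it takes the loader as a parameter so that it does not depend on how the package is installed.

```python
def extract_texts(paths, loader):
    """Map each path to its extracted text, or to None if loading fails.

    `loader` is any callable shaped like neuradoc.load_document: it takes
    a path string and returns an object with a get_text_content() method.
    """
    results = {}
    for path in paths:
        key = str(path)
        try:
            doc = loader(key)
            results[key] = doc.get_text_content()
        except Exception as exc:
            # This README does not document exception types, so the
            # sketch catches broadly and records the failure as None.
            print(f"Skipping {key}: {exc}")
            results[key] = None
    return results


# Usage with NeuraDoc installed (assumed, not executed here):
#   from pathlib import Path
#   import neuradoc
#   texts = extract_texts(sorted(Path("docs").glob("*.pdf")), neuradoc.load_document)
```

Recording failures as `None` keeps one unreadable file from stopping the whole run; a stricter pipeline might prefer to re-raise instead.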

Advanced Usage

import neuradoc
from neuradoc.models.element import ElementType
from neuradoc.transformers.llm_transformer import chunk_document

# Load document
doc = neuradoc.load_document("document.docx")

# Filter elements by type
headings = doc.get_elements_by_type(ElementType.HEADING)
code_blocks = doc.get_elements_by_type(ElementType.CODE)

# Transform document into chunks for LLM processing
chunks = chunk_document(doc, max_chunk_size=1000, overlap=100)

# Hand each chunk to your LLM implementation; this example only reports its size
for chunk in chunks:
    print(f"Chunk: {len(chunk)} characters")
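
To make the `max_chunk_size` and `overlap` parameters concrete, here is a plain-Python sliding-window splitter over a string. It is an illustration of the general technique only, not NeuraDoc's implementation of `chunk_document`: whether the library counts characters or tokens, and how it treats element boundaries, is not stated in this README. The `chunk_text` name is hypothetical.

```python
def chunk_text(text, max_chunk_size=1000, overlap=100):
    """Split text into windows of at most max_chunk_size characters.

    Each window shares `overlap` characters with the one before it.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chunk_size")

    step = max_chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With `max_chunk_size=4` and `overlap=1`, the string `"abcdefghij"` splits into `"abcd"`, `"defg"`, and `"ghij"`; the repeated `d` and `g` are the overlap that gives each chunk some context from its predecessor.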

Web Interface

NeuraDoc includes a web interface for document processing:

# Install web dependencies
pip install "neuradoc[web]"

# Run the web server
python -m neuradoc.web.app

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License; see the LICENSE file for details.
