A powerful and automated document parser built with LangChain for intelligent document processing

Automated Document Parser

Automated Document Parser is a document parsing library built on LangChain. It detects each file's type from its extension and uses the matching loader to parse the file into LangChain-compatible documents (objects with page_content and metadata).

Features

Automatic file type detection based on file extensions
Multiple PDF loading methods - 9 different PDF loaders for various use cases
Modular architecture - Clean separation with file_load/ and pdf_load/ modules
Support for multiple document formats: PDF, TXT, CSV, JSON, DOCX, HTML, Markdown
Built on LangChain for seamless integration with RAG applications
Type-safe implementation with comprehensive error handling
Batch processing support for multiple documents

Supported File Types

Text Files

.txt - Plain text files
.md - Markdown files

Structured Data

.csv - CSV files with encoding support
.json - JSON files with jq schema filtering

Documents

.docx - Microsoft Word documents
.html - HTML files

PDF Files (9 loading methods)

pypdf - Basic PDF text extraction (default, no extra dependencies)
unstructured - Advanced OCR and layout detection
amazon_textract - AWS Textract for high-accuracy OCR
mathpix - Specialized for mathematical formulas
pdfplumber - High accuracy text and table extraction
pypdfium2 - Google PDFium library
pymupdf - PyMuPDF (fitz) backend
pymupdf4llm - LLM-optimized extraction
opendataloader - Advanced multi-format parsing
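
The extension-based detection described above can be pictured as a lookup table. The following is a minimal sketch, not the library's actual code: `EXTENSION_TYPES`, `PDF_METHODS`, and `detect_file_type` are hypothetical names chosen for illustration, and only the extensions and method names come from the lists above. Lowercasing the extension is also an assumption of this sketch.

```python
from pathlib import Path

# Extensions and PDF method names are taken from the lists above;
# the table layout and function name are assumptions for illustration.
EXTENSION_TYPES = {
    ".txt": "text",
    ".md": "text",
    ".csv": "csv",
    ".json": "json",
    ".docx": "docx",
    ".html": "html",
    ".pdf": "pdf",
}

PDF_METHODS = (
    "pypdf", "unstructured", "amazon_textract", "mathpix", "pdfplumber",
    "pypdfium2", "pymupdf", "pymupdf4llm", "opendataloader",
)


def detect_file_type(file_path: str) -> str:
    """Return the file type for a path, based only on its extension."""
    suffix = Path(file_path).suffix.lower()
    try:
        return EXTENSION_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None
```

Because the lookup uses only the file name, a mislabeled file (for example a PDF saved as `.txt`) would be routed to the wrong loader in a scheme like this one.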

Installation

Install from PyPI:

pip install automated-document-parser

Or using uv:

uv add automated-document-parser
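
To confirm that the install succeeded, the standard library can report the installed version. This is a small sketch using `importlib.metadata`; the helper name `installed_version` is made up for this example and is not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "automated-document-parser") -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "automated-document-parser is not installed")
```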

Quick Start

Basic Usage - Automatic File Type Detection

The primary feature is automatic file type detection. Just point to any supported file and the parser handles the rest:

from automated_document_parser import DocumentParser

# Initialize the parser
parser = DocumentParser()

# Parse any single file - automatically detects type and uses the right loader
documents = parser.parse("document.pdf")        # Auto-detects PDF
documents = parser.parse("data.csv")            # Auto-detects CSV
documents = parser.parse("notes.txt")           # Auto-detects text file

# Parse multiple files of different types - all formats handled automatically
file_paths = ["report.pdf", "data.csv", "notes.txt", "info.docx"]
all_docs = parser.parse_multiple(file_paths)  # Each file auto-detected and loaded

# Access parsed content
for file_path, docs in all_docs.items():
    print(f"File: {file_path}")
    for doc in docs:
        print(f"  Content: {doc.page_content[:100]}...")
        print(f"  Metadata: {doc.metadata}")
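
For RAG pipelines it is often convenient to merge the per-file results into one flat list. The sketch below assumes, based on the loop above, that `parse_multiple` returns a dictionary mapping each path to a list of objects with `page_content` and `metadata`. `StubDocument` and `flatten_results` are hypothetical names; the stub only stands in for a LangChain document so the example runs without the library installed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StubDocument:
    """Stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def flatten_results(all_docs: Dict[str, List[Any]]) -> List[Any]:
    """Merge per-file results into one list, recording each source path."""
    flat = []
    for file_path, docs in all_docs.items():
        for doc in docs:
            # Keep any source the loader already set; otherwise add ours.
            doc.metadata.setdefault("source", file_path)
            flat.append(doc)
    return flat


sample = {
    "notes.txt": [StubDocument("first note"), StubDocument("second note")],
    "data.csv": [StubDocument("a,b,c", {"row": 0})],
}
flat = flatten_results(sample)
print(len(flat))         # 3
print(flat[2].metadata)  # {'row': 0, 'source': 'data.csv'}
```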

Specify Loading Methods

Pass the PDF loading method and other loader settings to parse_multiple; they apply to every matching file in the batch:

from automated_document_parser import DocumentParser

parser = DocumentParser()

# Choose one loading method for all PDFs in the batch.
# File types are still detected automatically from each extension.
file_paths = ["report1.pdf", "report2.pdf", "data.csv", "notes.txt"]
all_docs = parser.parse_multiple(
    file_paths,
    pdf_loader_method="pdfplumber",  # All PDFs will use pdfplumber
    encoding="utf-8"                  # All text files will use UTF-8 encoding
)

# Each file is automatically detected and loaded with the specified settings
for file_path, docs in all_docs.items():
    print(f"Loaded {file_path}: {len(docs)} documents")
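
Some PDF methods need extra dependencies or credentials, so one common pattern is to try a preferred method and fall back to the default. The helper below is a sketch of that pattern, not a feature of the library: `parse_pdf_with_fallback` is a hypothetical name, and it assumes the parse callable accepts `pdf_loader_method` as a keyword, which the examples above show only for `parse_multiple`.

```python
from typing import Any, Callable, List, Sequence


def parse_pdf_with_fallback(
    parse: Callable[..., List[Any]],
    file_path: str,
    methods: Sequence[str] = ("pdfplumber", "pypdf"),
) -> List[Any]:
    """Try each PDF loading method in order and return the first success."""
    errors = []
    for method in methods:
        try:
            # Assumption: the callable accepts pdf_loader_method as a keyword.
            return parse(file_path, pdf_loader_method=method)
        except Exception as exc:  # broad on purpose: loader errors vary
            errors.append(f"{method}: {exc}")
    raise RuntimeError(f"All PDF methods failed for {file_path}: {errors}")
```

With the real library this might be called as `parse_pdf_with_fallback(parser.parse, "report.pdf")`, provided `parse` takes the same keyword as `parse_multiple`.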

Documentation

Full documentation is available at: https://pulkit12dhingra.github.io/automated-document-parser/

Architecture

The library uses a modular architecture:

automated_document_parser/
├── loaders/
│   ├── file_load/          # File loaders module
│   │   ├── base.py         # Base file loader class
│   │   ├── text_loader.py  # Text file loader
│   │   ├── csv_loader.py   # CSV loader
│   │   ├── json_loader.py  # JSON loader
│   │   ├── docx_loader.py  # DOCX loader
│   │   └── html_loader.py  # HTML loader
│   ├── pdf_load/           # PDF loaders module
│   │   ├── base.py         # Base PDF loader class
│   │   ├── pypdf_loader.py
│   │   ├── mathpix_loader.py
│   │   ├── pdfplumber_loader.py
│   │   └── ... (9 PDF loaders total)
│   └── file_loaders.py     # Main orchestrator
└── core.py                 # DocumentParser class
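
The layout above suggests a base class per loader family plus an orchestrator that picks a loader. The following sketch illustrates that pattern only. Every class and function name in it (`BaseFileLoader`, `TextFileLoader`, `LOADERS`, `load_file`) is hypothetical, and it returns plain dictionaries instead of LangChain documents so that it runs with the standard library alone.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List, Type


class BaseFileLoader(ABC):
    """Hypothetical base class: one loader instance per file."""

    def __init__(self, file_path: str, encoding: str = "utf-8") -> None:
        self.file_path = Path(file_path)
        self.encoding = encoding

    @abstractmethod
    def load(self) -> List[dict]:
        """Return the parsed content as a list of document-like dicts."""


class TextFileLoader(BaseFileLoader):
    """Reads the whole file as a single document."""

    def load(self) -> List[dict]:
        text = self.file_path.read_text(encoding=self.encoding)
        return [{"page_content": text, "metadata": {"source": str(self.file_path)}}]


# The orchestrator only needs a registry from extension to loader class.
LOADERS: Dict[str, Type[BaseFileLoader]] = {
    ".txt": TextFileLoader,
    ".md": TextFileLoader,
}


def load_file(file_path: str, **kwargs) -> List[dict]:
    """Pick a loader class by extension and run it."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"No loader registered for {suffix!r}")
    return LOADERS[suffix](file_path, **kwargs).load()
```

Keeping each format in its own class means that supporting a new format only requires a new subclass and one registry entry.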

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Release history

0.1.6 - Nov 9, 2025 (this version)
0.1.5 - Nov 9, 2025
0.1.4 - Nov 7, 2025
0.1.3 - Nov 7, 2025
0.1.2 - Nov 7, 2025

Download files

Source Distribution

automated_document_parser-0.1.6.tar.gz (35.6 kB) - uploaded Nov 9, 2025, Source

Built Distribution

automated_document_parser-0.1.6-py3-none-any.whl (29.3 kB) - uploaded Nov 9, 2025, Python 3

File details

Details for the file automated_document_parser-0.1.6.tar.gz.

File metadata

Download URL: automated_document_parser-0.1.6.tar.gz
Upload date: Nov 9, 2025
Size: 35.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for automated_document_parser-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`815785c41cae0cc5a4a0e58130fcb5b450f2bc30f97e5d4de4774d2f86093e55`
MD5	`bf200122c9123b0b605cbb7537f04ad6`
BLAKE2b-256	`df5f7be4ea078f423d0faf675c794bb30c4e8fe29aeaaa33964a82cd866b6df8`

File details

Details for the file automated_document_parser-0.1.6-py3-none-any.whl.

File metadata

Download URL: automated_document_parser-0.1.6-py3-none-any.whl
Upload date: Nov 9, 2025
Size: 29.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for automated_document_parser-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1140d82b84ddaf151beae759d7ea243c77bf00b828d45837d7dcfd98c13f9c26`
MD5	`086dd334d6eb83a97e55a636f53893fd`
BLAKE2b-256	`6860fc8353aab2a1e5df398638f1867b7cd5c3e89882c4e0d1aada1e8dfd3e44`
