Automated Document Parser

An automated document parser built on LangChain for intelligent document processing. The library detects each file's type from its extension and uses the matching loader to parse the file into LangChain-compatible documents, each carrying page_content and metadata.

Features

  • Automatic file type detection based on file extensions
  • Multiple PDF loading methods - 9 different PDF loaders for various use cases
  • Modular architecture - Clean separation with file_load/ and pdf_load/ modules
  • Support for multiple document formats: PDF, TXT, CSV, JSON, DOCX, HTML, Markdown
  • Built on LangChain for seamless integration with RAG applications
  • Type-safe implementation with comprehensive error handling
  • Batch processing support for multiple documents
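
As a rough picture of the extension-based detection listed above, the sketch below maps a file suffix to a loader kind. The names EXTENSION_TO_KIND and detect_kind are hypothetical and are not taken from the library; its real lookup table and error type may differ.

```python
from pathlib import Path

# Hypothetical mapping for illustration only; the library's own table may differ.
EXTENSION_TO_KIND = {
    ".pdf": "pdf",
    ".txt": "text",
    ".md": "text",
    ".csv": "csv",
    ".json": "json",
    ".docx": "docx",
    ".html": "html",
}


def detect_kind(file_path: str) -> str:
    """Return a loader kind for file_path, looking only at its extension."""
    suffix = Path(file_path).suffix.lower()  # ".PDF" and ".pdf" are treated alike
    try:
        return EXTENSION_TO_KIND[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None


print(detect_kind("report.PDF"))
print(detect_kind("notes.md"))
```

Detection of this kind never opens the file, so a misnamed file (for example a PDF saved as .txt) would be routed to the wrong loader.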

Supported File Types

Text Files

  • .txt - Plain text files
  • .md - Markdown files

Structured Data

  • .csv - CSV files with encoding support
  • .json - JSON files with jq schema filtering

Documents

  • .docx - Microsoft Word documents
  • .html - HTML files

PDF Files (9 loading methods)

  • pypdf - Basic PDF text extraction (default, no extra dependencies)
  • unstructured - Advanced OCR and layout detection
  • amazon_textract - AWS Textract for high-accuracy OCR
  • mathpix - Specialized for mathematical formulas
  • pdfplumber - High accuracy text and table extraction
  • pypdfium2 - Google PDFium library
  • pymupdf - PyMuPDF (fitz) backend
  • pymupdf4llm - LLM-optimized extraction
  • opendataloader - Advanced multi-format parsing
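
Several of these methods rely on optional third-party packages, so it can help to check which backends are importable before picking one. The sketch below uses only the standard library. The mapping from method name to import name is an assumption, not something the library exposes, and hosted services such as amazon_textract and mathpix are left out because a local import check says nothing about credentials.

```python
from importlib.util import find_spec

# Assumed method -> import-name pairs; adjust to the packages you actually use.
ASSUMED_IMPORT_NAMES = {
    "pypdf": "pypdf",
    "pdfplumber": "pdfplumber",
    "pypdfium2": "pypdfium2",
    "pymupdf": "pymupdf",  # older PyMuPDF releases are imported as "fitz"
    "pymupdf4llm": "pymupdf4llm",
    "unstructured": "unstructured",
}


def available_methods(import_names: dict[str, str]) -> list[str]:
    """Return the method names whose backing module can be located."""
    return [
        method
        for method, module in import_names.items()
        if find_spec(module) is not None  # None means the module was not found
    ]


print(available_methods(ASSUMED_IMPORT_NAMES))
```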

Installation

Install from PyPI:

pip install automated-document-parser

Or using uv:

uv add automated-document-parser
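
To confirm that the install succeeded, one option is to ask Python for the installed distribution's version. This is a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("automated-document-parser"))
```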

Quick Start

Basic Usage - Automatic File Type Detection

The primary feature is automatic file type detection. Just point to any supported file and the parser handles the rest:

from automated_document_parser import DocumentParser

# Initialize the parser
parser = DocumentParser()

# Parse any single file - automatically detects type and uses the right loader
documents = parser.parse("document.pdf")        # Auto-detects PDF
documents = parser.parse("data.csv")            # Auto-detects CSV
documents = parser.parse("notes.txt")           # Auto-detects text file

# Parse multiple files of different types - all formats handled automatically
file_paths = ["report.pdf", "data.csv", "notes.txt", "info.docx"]
all_docs = parser.parse_multiple(file_paths)  # Each file auto-detected and loaded

# Access parsed content
for file_path, docs in all_docs.items():
    print(f"File: {file_path}")
    for doc in docs:
        print(f"  Content: {doc.page_content[:100]}...")
        print(f"  Metadata: {doc.metadata}")
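
Since parse_multiple returns a mapping from file path to a list of documents, a common next step in a RAG pipeline is to merge everything into one list. The sketch below shows one way to do that. Doc is a stand-in dataclass with the same two attributes used above, not LangChain's Document class, and the source_file metadata key is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in exposing page_content and metadata, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def flatten(results: dict[str, list[Doc]]) -> list[Doc]:
    """Merge per-file lists into one list, tagging each document with its file."""
    merged = []
    for file_path, docs in results.items():
        for doc in docs:
            # Mutates the document's metadata; an existing key is left untouched.
            doc.metadata.setdefault("source_file", file_path)
            merged.append(doc)
    return merged


sample = {
    "report.pdf": [Doc("page one"), Doc("page two")],
    "notes.txt": [Doc("a note")],
}
flat = flatten(sample)
print(len(flat))
print(flat[0].metadata)
```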

Specify Loading Methods

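Pass loader options to parse_multiple to apply them across a whole batch: pdf_loader_method selects the loader used for every PDF, and encoding sets the encoding used for every text file: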

from automated_document_parser import DocumentParser

parser = DocumentParser()

# Options are given once for the whole batch; the parser still detects
# each file's type and loads it with the settings that apply to that type
file_paths = ["report1.pdf", "report2.pdf", "data.csv", "notes.txt"]
all_docs = parser.parse_multiple(
    file_paths,
    pdf_loader_method="pdfplumber",  # All PDFs will use pdfplumber
    encoding="utf-8"                  # All text files will use UTF-8 encoding
)

# Each file is automatically detected and loaded with the specified settings
for file_path, docs in all_docs.items():
    print(f"Loaded {file_path}: {len(docs)} documents")
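
One way to picture how batch-wide options could reach only the files they concern is the sketch below. It is not the library's implementation: options_for is a hypothetical helper, and the choice of which extensions receive encoding is an assumption based on the file types listed earlier.

```python
from pathlib import Path


def options_for(
    file_path: str,
    pdf_loader_method: str = "pypdf",
    encoding: str = "utf-8",
) -> dict:
    """Pick the subset of batch-wide options relevant to one file."""
    suffix = Path(file_path).suffix.lower()
    if suffix == ".pdf":
        return {"method": pdf_loader_method}
    if suffix in {".txt", ".md", ".csv"}:
        return {"encoding": encoding}
    return {}  # other formats take no options in this sketch


for name in ["report1.pdf", "data.csv", "info.docx"]:
    print(name, options_for(name, pdf_loader_method="pdfplumber"))
```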

Documentation

Full documentation is available at: https://pulkit12dhingra.github.io/automated-document-parser/

Architecture

The library uses a modular architecture:

automated_document_parser/
├── loaders/
│   ├── file_load/          # File loaders module
│   │   ├── base.py         # Base file loader class
│   │   ├── text_loader.py  # Text file loader
│   │   ├── csv_loader.py   # CSV loader
│   │   ├── json_loader.py  # JSON loader
│   │   ├── docx_loader.py  # DOCX loader
│   │   └── html_loader.py  # HTML loader
│   ├── pdf_load/           # PDF loaders module
│   │   ├── base.py         # Base PDF loader class
│   │   ├── pypdf_loader.py
│   │   ├── mathpix_loader.py
│   │   ├── pdfplumber_loader.py
│   │   └── ... (9 PDF loaders total)
│   └── file_loaders.py     # Main orchestrator
└── core.py                 # DocumentParser class
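
The layout above suggests a base class per loader family plus an orchestrator that picks a loader. The sketch below illustrates that pattern in general terms. BaseLoader, TextLoader, and Orchestrator are hypothetical names, and plain dictionaries stand in for LangChain documents, so none of this should be read as the contents of base.py or file_loaders.py.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class BaseLoader(ABC):
    """Hypothetical base class: a loader turns one file into a list of records."""

    @abstractmethod
    def load(self, path: Path) -> list[dict]:
        ...


class TextLoader(BaseLoader):
    def __init__(self, encoding: str = "utf-8") -> None:
        self.encoding = encoding

    def load(self, path: Path) -> list[dict]:
        text = path.read_text(encoding=self.encoding)
        return [{"page_content": text, "metadata": {"source": str(path)}}]


class Orchestrator:
    """Keeps a suffix -> loader registry and delegates each file to its loader."""

    def __init__(self) -> None:
        self._loaders: dict[str, BaseLoader] = {}

    def register(self, suffix: str, loader: BaseLoader) -> None:
        self._loaders[suffix.lower()] = loader

    def load(self, file_path: str) -> list[dict]:
        path = Path(file_path)
        loader = self._loaders.get(path.suffix.lower())
        if loader is None:
            raise ValueError(f"No loader registered for {path.suffix!r}")
        return loader.load(path)


# Small demonstration using a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text("hello", encoding="utf-8")

    orchestrator = Orchestrator()
    orchestrator.register(".txt", TextLoader())
    print(orchestrator.load(str(sample))[0]["page_content"])
```

Keeping the registry separate from the loaders means a new format only needs a new BaseLoader subclass and one register call.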

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
