Automated Document Parser
A powerful and automated document parser built with LangChain for intelligent document processing. This library automatically detects file types and uses the appropriate loader to parse documents into LangChain-compatible formats.
Features
- Automatic file type detection based on file extensions
- Multiple PDF loading methods: 9 different PDF loaders for various use cases
- Modular architecture: clean separation with file_load/ and pdf_load/ modules
- Support for multiple document formats: PDF, TXT, CSV, JSON, DOCX, HTML, Markdown
- Built on LangChain for seamless integration with RAG applications
- Type-safe implementation with comprehensive error handling
- Batch processing support for multiple documents
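Extension-based detection can be pictured roughly as follows. This is a minimal sketch, not the library's actual code: the `EXTENSION_TO_KIND` table, the category names, and the `detect_kind` function are hypothetical, and only the list of extensions is taken from this README.

```python
from pathlib import Path

# Hypothetical lookup table: the extensions come from the supported formats,
# the category names on the right are illustrative only.
EXTENSION_TO_KIND = {
    ".pdf": "pdf",
    ".txt": "text",
    ".md": "text",
    ".csv": "csv",
    ".json": "json",
    ".docx": "docx",
    ".html": "html",
}


def detect_kind(file_path: str) -> str:
    """Return a loader category for a path, based only on its extension."""
    suffix = Path(file_path).suffix.lower()  # treat ".PDF" and ".pdf" alike
    if suffix not in EXTENSION_TO_KIND:
        raise ValueError(f"Unsupported file type: {suffix or '(no extension)'}")
    return EXTENSION_TO_KIND[suffix]
```

Because only the suffix is inspected, a file with a wrong or missing extension would not be recognized under this scheme, whatever its contents.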
Supported File Types
Text Files
- .txt - Plain text files
- .md - Markdown files
Structured Data
- .csv - CSV files with encoding support
- .json - JSON files with jq schema filtering
Documents
- .docx - Microsoft Word documents
- .html - HTML files
PDF Files (9 loading methods)
- pypdf - Basic PDF text extraction (default, no extra dependencies)
- unstructured - Advanced OCR and layout detection
- amazon_textract - AWS Textract for high-accuracy OCR
- mathpix - Specialized for mathematical formulas
- pdfplumber - High accuracy text and table extraction
- pypdfium2 - Google PDFium library
- pymupdf - PyMuPDF (fitz) backend
- pymupdf4llm - LLM-optimized extraction
- opendataloader - Advanced multi-format parsing
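Since most of these methods rely on optional packages, it can help to check which backends are importable before choosing one. The sketch below is an assumption-laden illustration: the `METHOD_TO_MODULE` mapping and the `available_methods` helper are hypothetical and not part of this library, and the module names are the usual import names of the respective packages (older PyMuPDF releases expose `fitz` instead of `pymupdf`). Service-backed methods such as amazon_textract and mathpix are left out because they depend on credentials rather than on an importable module alone.

```python
from importlib.util import find_spec

# Assumed mapping from loader method name to the top-level module it needs.
METHOD_TO_MODULE = {
    "pypdf": "pypdf",
    "pdfplumber": "pdfplumber",
    "pypdfium2": "pypdfium2",
    "pymupdf": "pymupdf",  # older PyMuPDF releases use "fitz"
    "pymupdf4llm": "pymupdf4llm",
    "unstructured": "unstructured",
}


def available_methods(preferred: list[str]) -> list[str]:
    """Keep the methods from `preferred`, in order, whose module can be found."""
    return [
        method
        for method in preferred
        if method in METHOD_TO_MODULE
        and find_spec(METHOD_TO_MODULE[method]) is not None
    ]
```

A caller could, for example, take the first entry of `available_methods(["pdfplumber", "pymupdf", "pypdf"])` as the value for `pdf_loader_method`.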
Installation
Install from PyPI:
pip install automated-document-parser
Or using uv:
uv add automated-document-parser
Quick Start
Basic Usage - Automatic File Type Detection
The primary feature is automatic file type detection. Just point to any supported file and the parser handles the rest:
from automated_document_parser import DocumentParser
# Initialize the parser
parser = DocumentParser()
# Parse any single file - automatically detects type and uses the right loader
documents = parser.parse("document.pdf") # Auto-detects PDF
documents = parser.parse("data.csv") # Auto-detects CSV
documents = parser.parse("notes.txt") # Auto-detects text file
# Parse multiple files of different types - all formats handled automatically
file_paths = ["report.pdf", "data.csv", "notes.txt", "info.docx"]
all_docs = parser.parse_multiple(file_paths) # Each file auto-detected and loaded
# Access parsed content
for file_path, docs in all_docs.items():
    print(f"File: {file_path}")
    for doc in docs:
        print(f"  Content: {doc.page_content[:100]}...")
        print(f"  Metadata: {doc.metadata}")
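For RAG pipelines it is often convenient to merge the per-file results into one flat list. The following is a minimal sketch under stated assumptions: `Doc` is a stand-in for a LangChain document (anything with `page_content` and `metadata`), and `flatten` is a hypothetical helper, not part of this library. It relies only on the dictionary shape shown above, where each file path maps to a list of documents.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Doc:
    """Stand-in for a LangChain document, used only in this sketch."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def flatten(all_docs: dict[str, list[Doc]]) -> list[Doc]:
    """Merge per-file results into one list, recording the originating path."""
    merged: list[Doc] = []
    for file_path, docs in all_docs.items():
        for doc in docs:
            # setdefault keeps any "source" value a loader has already set
            doc.metadata.setdefault("source", file_path)
            merged.append(doc)
    return merged
```

Note that `flatten` updates each document's metadata in place rather than copying it.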
Specify Loading Methods
Specify the PDF loading method and other parameters to apply to all files:
from automated_document_parser import DocumentParser
parser = DocumentParser()
# Step 1: Specify the method for PDFs
# Step 2: Parser automatically detects file types and loads them with specified method
file_paths = ["report1.pdf", "report2.pdf", "data.csv", "notes.txt"]
all_docs = parser.parse_multiple(
    file_paths,
    pdf_loader_method="pdfplumber",  # All PDFs will use pdfplumber
    encoding="utf-8"  # All text files will use UTF-8 encoding
)
# Each file is automatically detected and loaded with the specified settings
for file_path, docs in all_docs.items():
    print(f"Loaded {file_path}: {len(docs)} documents")
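When the files live in a directory tree, the list passed to `parse_multiple` can be built by filtering on the supported extensions first. This is a sketch only: `collect_supported_files` and `SUPPORTED_EXTENSIONS` are hypothetical names introduced here, with the extension set taken from the supported file types in this README.

```python
from pathlib import Path

# Extensions listed under "Supported File Types"; the constant name is illustrative.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md", ".csv", ".json", ".docx", ".html"}


def collect_supported_files(root: str) -> list[str]:
    """Recursively list files under `root` whose extension is supported."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

The returned list can then be used as `file_paths` in the call shown above, so unsupported files never reach the parser.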
Documentation
Full documentation is available at: https://pulkit12dhingra.github.io/automated-document-parser/
Architecture
The library uses a modular architecture:
automated_document_parser/
├── loaders/
│ ├── file_load/ # File loaders module
│ │ ├── base.py # Base file loader class
│ │ ├── text_loader.py # Text file loader
│ │ ├── csv_loader.py # CSV loader
│ │ ├── json_loader.py # JSON loader
│ │ ├── docx_loader.py # DOCX loader
│ │ └── html_loader.py # HTML loader
│ ├── pdf_load/ # PDF loaders module
│ │ ├── base.py # Base PDF loader class
│ │ ├── pypdf_loader.py
│ │ ├── mathpix_loader.py
│ │ ├── pdfplumber_loader.py
│ │ └── ... (9 PDF loaders total)
│ └── file_loaders.py # Main orchestrator
└── core.py # DocumentParser class
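The layout above suggests a common base class per loader family plus an orchestrator that picks the right loader. The sketch below illustrates that pattern in general terms; it is not the library's code. The names `BaseLoader`, `TextLoader`, `CsvRowLoader`, `REGISTRY`, and `load_file` are hypothetical, and plain strings stand in for LangChain documents.

```python
import csv
from abc import ABC, abstractmethod
from pathlib import Path


class BaseLoader(ABC):
    """Hypothetical shared interface, loosely mirroring the role of base.py."""

    def __init__(self, file_path: str) -> None:
        self.file_path = Path(file_path)

    @abstractmethod
    def load(self) -> list[str]:
        """Return the file's content as a list of text chunks."""


class TextLoader(BaseLoader):
    def load(self) -> list[str]:
        # One chunk holding the whole file
        return [self.file_path.read_text(encoding="utf-8")]


class CsvRowLoader(BaseLoader):
    def load(self) -> list[str]:
        # One chunk per CSV row, cells joined with ", "
        with self.file_path.open(newline="", encoding="utf-8") as handle:
            return [", ".join(row) for row in csv.reader(handle)]


# Orchestrator side: map extensions to loader classes (illustrative subset).
REGISTRY: dict[str, type[BaseLoader]] = {
    ".txt": TextLoader,
    ".md": TextLoader,
    ".csv": CsvRowLoader,
}


def load_file(file_path: str) -> list[str]:
    """Pick a loader class by extension and delegate to its load() method."""
    loader_cls = REGISTRY.get(Path(file_path).suffix.lower())
    if loader_cls is None:
        raise ValueError(f"No loader registered for {file_path}")
    return loader_cls(file_path).load()
```

With this shape, supporting a new format means adding one subclass and one registry entry, leaving the orchestrator unchanged.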
License
MIT License - see LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Download files
File details
Details for the file automated_document_parser-0.1.6.tar.gz.
File metadata
- Download URL: automated_document_parser-0.1.6.tar.gz
- Upload date:
- Size: 35.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 815785c41cae0cc5a4a0e58130fcb5b450f2bc30f97e5d4de4774d2f86093e55 |
| MD5 | bf200122c9123b0b605cbb7537f04ad6 |
| BLAKE2b-256 | df5f7be4ea078f423d0faf675c794bb30c4e8fe29aeaaa33964a82cd866b6df8 |
File details
Details for the file automated_document_parser-0.1.6-py3-none-any.whl.
File metadata
- Download URL: automated_document_parser-0.1.6-py3-none-any.whl
- Upload date:
- Size: 29.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1140d82b84ddaf151beae759d7ea243c77bf00b828d45837d7dcfd98c13f9c26 |
| MD5 | 086dd334d6eb83a97e55a636f53893fd |
| BLAKE2b-256 | 6860fc8353aab2a1e5df398638f1867b7cd5c3e89882c4e0d1aada1e8dfd3e44 |