A LangChain integration for OpenDataLoader PDF
langchain-opendataloader-pdf
This package integrates the OpenDataLoader PDF engine with LangChain by providing a document loader which parses PDFs into structured Document objects.
Requirements
- Python >= 3.10
- Java 11 or newer available on the system PATH
- opendataloader-pdf >= 1.3.0
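The first two requirements can be checked before installing. The following is a minimal sketch, not part of this package: `check_requirements` is a hypothetical helper that uses only the standard library. It confirms the Python version and that some `java` executable is on PATH, but it does not verify that the Java found is version 11 or newer.

```python
import shutil
import sys


def check_requirements(min_python=(3, 10)):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ is required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # shutil.which returns None when no "java" executable is found on PATH.
    # This does not check the Java version itself.
    if shutil.which("java") is None:
        problems.append("no 'java' executable found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("requirement not met:", problem)
```

To confirm the Java version, run `java -version` manually.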
Installation
pip install -U langchain-opendataloader-pdf
Quick start
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path=["path/to/document.pdf", "path/to/folder"],
    format="text",
)
documents = loader.load()
for doc in documents:
    print(doc.metadata, doc.page_content[:80])
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| file_path | List[str] | ✅ Yes | - | One or more PDF file paths or directories to process. |
| format | str | No | None | Output formats (e.g. "json", "html", "markdown", "text"). |
| quiet | bool | No | False | Suppresses CLI logging output when True. |
| content_safety_off | Optional[List[str]] | No | None | List of content safety filters to disable (e.g. "all", "hidden-text", "off-page", "tiny", "hidden-ocg"). |
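The optional parameters can be combined in a single call. The sketch below is one way to do that, assuming the constructor accepts the keyword arguments exactly as named in the table. `loader_kwargs` and `load_quietly` are hypothetical helpers introduced for illustration and are not part of the package.

```python
from typing import Any, Dict, List, Optional


def loader_kwargs(
    file_path: List[str],
    fmt: Optional[str] = None,
    quiet: bool = False,
    content_safety_off: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Collect the documented parameters, omitting optional ones left at None."""
    if not file_path:
        raise ValueError("file_path needs at least one PDF path or directory")
    kwargs: Dict[str, Any] = {"file_path": list(file_path), "quiet": quiet}
    if fmt is not None:
        kwargs["format"] = fmt
    if content_safety_off is not None:
        kwargs["content_safety_off"] = list(content_safety_off)
    return kwargs


def load_quietly(paths: List[str]):
    """Load PDFs as markdown with CLI logging suppressed and one filter disabled."""
    # Imported lazily so this module can be imported without the package installed.
    from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

    kwargs = loader_kwargs(
        paths,
        fmt="markdown",
        quiet=True,
        content_safety_off=["hidden-text"],
    )
    return OpenDataLoaderPDFLoader(**kwargs).load()
```

Calling `load_quietly(["path/to/document.pdf"])` requires the package and Java to be installed. Disabling a content safety filter is a deliberate choice, so pass `content_safety_off` only for inputs you trust.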
Development workflow
This repository uses Poetry for dependency management. If you don't have Poetry installed, please follow the official installation guide.
Once Poetry is installed, you can install the project dependencies:
poetry install --with dev
Common tasks are mirrored in the Makefile so you can run them with or without Poetry.
Quality checks
make lint # ruff + mypy
make test # unit test suite (network disabled)
make integration_tests # runs tests that may touch the network
You can also call the underlying Poetry commands directly (e.g., poetry run pytest).
Note for Windows Users:
If the make command is not available on your system, you can run the quality checks using the following commands directly:
- Linting:
poetry run ruff check .
poetry run mypy .
- Unit Tests:
poetry run pytest --disable-socket --allow-unix-socket
- Integration Tests:
poetry run pytest
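Where `make` is unavailable, the same commands could also be driven from a small Python script. This is a sketch under assumptions: the script and its `CHECKS` mapping are hypothetical and do not exist in the repository. The target names mirror the Makefile targets, and the commands are the ones listed above.

```python
import shutil
import subprocess
from typing import Dict, List

# Hypothetical mapping from Makefile target names to the Poetry commands listed above.
CHECKS: Dict[str, List[List[str]]] = {
    "lint": [
        ["poetry", "run", "ruff", "check", "."],
        ["poetry", "run", "mypy", "."],
    ],
    "test": [
        ["poetry", "run", "pytest", "--disable-socket", "--allow-unix-socket"],
    ],
    "integration_tests": [
        ["poetry", "run", "pytest"],
    ],
}


def run_target(target: str) -> int:
    """Run each command for a target in order and stop at the first failure."""
    for command in CHECKS[target]:
        print("$", " ".join(command))
        # Resolve the executable via PATH (and PATHEXT on Windows) when possible.
        executable = shutil.which(command[0]) or command[0]
        returncode = subprocess.run([executable, *command[1:]]).returncode
        if returncode != 0:
            return returncode
    return 0
```

For example, `run_target("lint")` runs ruff and then mypy, and returns the first non-zero exit code it sees.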
Publishing notes
Run poetry check and poetry build to verify the package metadata before uploading to PyPI. Confirm that langchain_opendataloader_pdf/py.typed is present in the wheel so consumers benefit from typing information.
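The `py.typed` check can be scripted, since a wheel is a zip archive. The following is a minimal sketch using only the standard library. `wheel_has_member` is a hypothetical helper, and the wheel path you pass to it depends on what `poetry build` writes to your `dist` directory.

```python
import zipfile


def wheel_has_member(wheel_path: str, member: str) -> bool:
    """Return True if the wheel (a zip archive) contains the given member path."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return member in wheel.namelist()


# The marker file named in the publishing notes above.
PY_TYPED = "langchain_opendataloader_pdf/py.typed"
```

After a build, `wheel_has_member(path_to_wheel, PY_TYPED)` should return `True`. If it returns `False`, the typing marker was left out of the package.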
License
Distributed under the MIT License. See LICENSE for full text.