A LangChain integration for OpenDataLoader PDF

langchain-opendataloader-pdf

This package integrates the OpenDataLoader PDF engine with LangChain. It provides a document loader that parses PDFs into structured Document objects.

Requirements

  • Python >= 3.10
  • Java 11 or newer available on the system PATH
  • opendataloader-pdf >= 1.3.0
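
The first two requirements can be checked before installing. The following is a minimal sketch, not part of this package: find_problems is a hypothetical helper name, and it only checks that a java executable is on the PATH, not that it is Java 11 or newer.

```python
import shutil
import sys
from typing import List, Optional, Sequence


def find_problems(python_version: Sequence[int], java_path: Optional[str]) -> List[str]:
    """Return a human-readable message for each unmet requirement."""
    problems: List[str] = []
    # The package requires Python >= 3.10.
    if tuple(python_version[:2]) < (3, 10):
        problems.append("Python 3.10 or newer is required")
    # Only presence on PATH is checked here; the Java version is not inspected.
    if java_path is None:
        problems.append("Java was not found on the system PATH")
    return problems


# Check the interpreter running this script and the current PATH.
for problem in find_problems(sys.version_info[:2], shutil.which("java")):
    print(problem)
```

If nothing is printed, both checks passed. Confirming the Java version itself still means running java -version by hand.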

Installation

pip install -U langchain-opendataloader-pdf

Quick start

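from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# file_path takes a list; each entry may be a PDF file or a directory.
loader = OpenDataLoaderPDFLoader(
    file_path=["path/to/document.pdf", "path/to/folder"],
    format="text",
)
documents = loader.load()

# Print each document's metadata and the first 80 characters of its content.
for doc in documents:
    print(doc.metadata, doc.page_content[:80])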

Parameters

Parameter | Type | Required | Default | Description
file_path | List[str] | ✅ Yes | - | One or more PDF file paths or directories to process.
format | str | No | None | Output formats (e.g. "json", "html", "markdown", "text").
quiet | bool | No | False | Suppresses CLI logging output when True.
content_safety_off | Optional[List[str]] | No | None | List of content safety filters to disable (e.g. "all", "hidden-text", "off-page", "tiny", "hidden-ocg").
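
The parameters above can be assembled in one place before constructing the loader. The sketch below is illustrative only: loader_kwargs and load_documents are hypothetical helpers, not part of this package, and wrapping a single path in a list is an assumption based on the documented List[str] type.

```python
from typing import Any, Dict, List, Optional, Union


def loader_kwargs(
    paths: Union[str, List[str]],
    format: Optional[str] = None,
    quiet: bool = False,
    content_safety_off: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments that match the parameter table."""
    # file_path is documented as List[str], so wrap a single path in a list.
    file_path = [paths] if isinstance(paths, str) else list(paths)
    if not file_path:
        raise ValueError("file_path needs at least one PDF file or directory")
    kwargs: Dict[str, Any] = {"file_path": file_path, "quiet": quiet}
    # Leave optional parameters out so the loader's own defaults apply.
    if format is not None:
        kwargs["format"] = format
    if content_safety_off is not None:
        kwargs["content_safety_off"] = list(content_safety_off)
    return kwargs


def load_documents(paths: Union[str, List[str]], **options: Any) -> list:
    """Load documents; needs the package and Java to be installed."""
    # Imported lazily so loader_kwargs can be used without the package.
    from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

    return OpenDataLoaderPDFLoader(**loader_kwargs(paths, **options)).load()
```

For example, load_documents("path/to/folder", format="markdown", quiet=True, content_safety_off=["hidden-text"]) would parse a folder quietly with the hidden-text filter disabled.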

Development workflow

This repository uses Poetry for dependency management. If you don't have Poetry installed, please follow the official installation guide.

Once Poetry is installed, you can install the project dependencies:

poetry install --with dev

Common tasks are mirrored in the Makefile so you can run them with or without Poetry.

Quality checks

make lint      # ruff + mypy
make test      # unit test suite (network disabled)
make integration_tests  # runs tests that may touch the network

You can also call the underlying Poetry commands directly (e.g., poetry run pytest).

Note for Windows Users:

If the make command is not available on your system, you can run the quality checks using the following commands directly:

  • Linting:
    poetry run ruff check .
    poetry run mypy .
    
  • Unit Tests:
    poetry run pytest --disable-socket --allow-unix-socket
    
  • Integration Tests:
    poetry run pytest
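
Where make is unavailable, the commands listed above can also be wrapped in a small Python script. This is a sketch only: TASKS and run_task are hypothetical names, the task names are borrowed from the make targets, the commands are copied from this note rather than from the Makefile, and poetry is assumed to be an executable on the PATH.

```python
import subprocess
from typing import Callable, Dict, List

# Commands copied from the Windows note; each task runs its commands in order.
TASKS: Dict[str, List[List[str]]] = {
    "lint": [
        ["poetry", "run", "ruff", "check", "."],
        ["poetry", "run", "mypy", "."],
    ],
    "test": [["poetry", "run", "pytest", "--disable-socket", "--allow-unix-socket"]],
    "integration_tests": [["poetry", "run", "pytest"]],
}


def run_task(name: str, runner: Callable = subprocess.run) -> int:
    """Run every command of a task, stopping at the first failure."""
    for command in TASKS[name]:
        result = runner(command)
        if result.returncode != 0:
            # Return the failing exit code, as make would.
            return result.returncode
    return 0
```

Calling sys.exit(run_task("lint")) from a script then behaves roughly like make lint. The runner argument exists only so the function can be exercised without invoking Poetry.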
    

Publishing notes

Run poetry check and poetry build to verify the package metadata before uploading to PyPI. Confirm that langchain_opendataloader_pdf/py.typed is present in the wheel so consumers benefit from typing information.
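
The py.typed check can be scripted, since a wheel is a zip archive. A minimal sketch follows, assuming the wheel was produced by poetry build; wheel_has_py_typed is a hypothetical helper name, not something this repository ships.

```python
import zipfile

# Path of the typing marker inside the wheel, as named in the note above.
PY_TYPED = "langchain_opendataloader_pdf/py.typed"


def wheel_has_py_typed(wheel_path: str, marker: str = PY_TYPED) -> bool:
    """Return True if the wheel archive contains the py.typed marker."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return marker in wheel.namelist()
```

Poetry writes build artifacts to the dist directory by default, so the function would typically be pointed at the .whl file found there.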

License

Distributed under the MIT License. See LICENSE for full text.
