A LangChain integration for OpenDataLoader PDF

langchain-opendataloader-pdf

This package integrates the OpenDataLoader PDF engine with LangChain through a document loader that parses PDFs into structured Document objects.

Requirements

  • Python >= 3.10
  • Java 11 or newer available on the system PATH
  • opendataloader-pdf >= 1.3.0
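
Because the loader needs a Java runtime that is discoverable on the PATH, a quick preflight check can catch a missing runtime before any PDF is processed. The sketch below uses only the standard library. The helper name java_on_path is hypothetical and not part of this package, and it only confirms that a java executable can be found, not that its version is 11 or newer.

```python
import shutil


def java_on_path() -> bool:
    """Return True if a `java` executable can be found on the system PATH."""
    # shutil.which resolves the command name the way a shell would.
    return shutil.which("java") is not None


if __name__ == "__main__":
    if java_on_path():
        print("java found on PATH")
    else:
        print("java not found; install Java 11 or newer and add it to PATH")
```

Running java -version in a terminal afterwards shows which version was picked up.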

Installation

pip install -U langchain-opendataloader-pdf

Quick start

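from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# file_path accepts individual PDF files as well as directories.
loader = OpenDataLoaderPDFLoader(
    file_path=["path/to/document.pdf", "path/to/folder"],
    format="text",
)
documents = loader.load()

# Print each document's metadata and the first 80 characters of its content.
for doc in documents:
    print(doc.metadata, doc.page_content[:80])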

Parameters

Parameter           Type                 Required  Default  Description
file_path           List[str]            ✅ Yes    -        One or more PDF file paths or directories to process.
format              str                  No        None     Output formats (e.g. "json", "html", "markdown", "text").
quiet               bool                 No        False    Suppresses CLI logging output when True.
content_safety_off  Optional[List[str]]  No        None     List of content safety filters to disable (e.g. "all", "hidden-text", "off-page", "tiny", "hidden-ocg").
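
As an illustration of how the optional parameters combine, here is a minimal sketch that keeps a reusable set of options in a dictionary and passes them to the loader. The parameter names and example values come from the table above; the helper load_pdfs and the constant QUIET_MARKDOWN are hypothetical names introduced only for this example. The import sits inside the function so the module can be imported without the package installed.

```python
from typing import Any, Dict, List


def load_pdfs(paths: List[str], **options: Any):
    """Load the given PDFs or directories, forwarding options to the loader."""
    # Imported lazily: calling this function requires the package and Java.
    from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

    loader = OpenDataLoaderPDFLoader(file_path=paths, **options)
    return loader.load()


# Option names match the parameter table; the values are illustrative.
QUIET_MARKDOWN: Dict[str, Any] = {
    "format": "markdown",
    "quiet": True,
    "content_safety_off": ["hidden-text", "off-page"],
}
```

A call such as load_pdfs(["path/to/folder"], **QUIET_MARKDOWN) would then request Markdown output with CLI logging suppressed and two of the content safety filters disabled.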

Development workflow

This repository uses Poetry for dependency management. If you don't have Poetry installed, please follow the official installation guide.

Once Poetry is installed, you can install the project dependencies:

poetry install --with dev

Common tasks are mirrored in the Makefile so you can run them with or without Poetry.

Quality checks

make lint      # ruff + mypy
make test      # unit test suite (network disabled)
make integration_tests  # runs tests that may touch the network

You can also call the underlying Poetry commands directly (e.g., poetry run pytest).

Note for Windows users:

If the make command is not available on your system, run the quality checks directly with the following commands:

  • Linting:
    poetry run ruff check .
    poetry run mypy .
    
  • Unit Tests:
    poetry run pytest --disable-socket --allow-unix-socket
    
  • Integration Tests:
    poetry run pytest
    

Publishing notes

Run poetry check and poetry build to verify the package metadata before uploading to PyPI. Confirm that langchain_opendataloader_pdf/py.typed is present in the wheel so consumers benefit from typing information.
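
The py.typed check can be scripted, since a wheel is an ordinary zip archive. The following is a minimal sketch using the standard library; the function name wheel_has_py_typed is hypothetical and not part of this repository's tooling.

```python
import zipfile

MARKER = "langchain_opendataloader_pdf/py.typed"


def wheel_has_py_typed(wheel_path, marker: str = MARKER) -> bool:
    """Return True if the wheel archive contains the py.typed marker file."""
    # A wheel is a zip archive, so its member list can be read directly.
    with zipfile.ZipFile(wheel_path) as wheel:
        return marker in wheel.namelist()
```

After poetry build, the wheel is written to the dist directory, so the function can be pointed at the .whl file there.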

License

Distributed under the MIT License. See LICENSE for full text.
